`DataFrame.collect()` calls `SparkPlan.executeCollect()`, which consists of a single line:
```scala
execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
```
The problem is that, `QueryPlan.schema` is a function. And since 1.3.0, `convertRowToScala` starts returning a `GenericRowWithSchema`. Thus, every `GenericRowWithSchema` instance holds a separate copy of the schema object. Also, YJP profiling result of the following simple micro benchmark (executed in Spark shell) shows that constructing the schema object takes up to ~35% CPU time.
```scala
sc.parallelize(1 to 10000000).
map(i => (i, s"val_$i")).
toDF("key", "value").
saveAsParquetFile("file:///tmp/src.parquet")
// Profiling started from this line
sqlContext.parquetFile("file:///tmp/src.parquet").collect()
```
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5398)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#5398 from liancheng/spark-6748 and squashes the following commits:
3159469 [Cheng Lian] Makes QueryPlan.schema a lazy val
Now trait `StringComparison` is a `BinaryExpression`. In fact, it should be a `BinaryPredicate`.
By making `StringComparison` as `BinaryPredicate`, we can throw error when a `expressions.Predicate` can't translate to a data source `Filter` in function `selectFilters`.
Without this modification, because we will wrap a `Filter` outside the scanned results in `pruneFilterProjectRaw`, we can't detect about something is wrong in translating predicates to filters in `selectFilters`.
The unit test of #5285 demonstrates such problem. In that pr, even `expressions.Contains` is not properly translated to `sources.StringContains`, the filtering is still performed by the `Filter` and so the test passes.
Of course, by doing this modification, all `expressions.Predicate` classes need to have its data source `Filter` correspondingly.
There is a small bug in `FilteredScanSuite` for doing `StringEndsWith` filter. This pr also fixes it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5309 from viirya/translate_predicate and squashes the following commits:
b176385 [Liang-Chi Hsieh] Address comment.
275a493 [Liang-Chi Hsieh] More properly test for StringStartsWith, StringEndsWith and StringContains.
caf2347 [Liang-Chi Hsieh] Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter.
When union non-decimal types with decimals, we use the following rules:
- FIRST `intTypeToFixed`, then fixed union decimals with precision/scale p1/s2 and p2/s2 will be promoted to
DecimalType(max(p1, p2), max(s1, s2))
- FLOAT and DOUBLE cause fixed-length decimals to turn into DOUBLE (this is the same as Hive,
but note that unlimited decimals are considered bigger than doubles in WidenTypes)
Author: guowei2 <guowei2@asiainfo.com>
Closes#4004 from guowei2/SPARK-5203 and squashes the following commits:
ff50f5f [guowei2] fix code style
11df1bf [guowei2] fix decimal union with double, double->Decimal(15,15)
0f345f9 [guowei2] fix structType merge with decimal
101ed4d [guowei2] fix build error after rebase
0b196e4 [guowei2] code style
fe2c2ca [guowei2] handle union decimal precision in 'DecimalPrecision'
421d840 [guowei2] fix union types for decimal precision
ef2c661 [guowei2] fix union with different decimal type
This builds on my earlier pull requests and turns on the explicit type checking in scalastyle.
Author: Reynold Xin <rxin@databricks.com>
Closes#5342 from rxin/SPARK-6428 and squashes the following commits:
7b531ab [Reynold Xin] import ordering
2d9a8a5 [Reynold Xin] jl
e668b1c [Reynold Xin] override
9b9e119 [Reynold Xin] Parenthesis.
82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.
It did not conside that order.dataType does not match NativeType. So i add "case other => ..." for other cenarios.
Author: DoingDone9 <799203320@qq.com>
Closes#4959 from DoingDone9/case_ and squashes the following commits:
6278846 [DoingDone9] Update rows.scala
cb1852d [DoingDone9] Merge pull request #2 from apache/master
c3f046f [DoingDone9] Merge pull request #1 from apache/master
We assume that `RDD[Row]` contains Scala types. So we need to convert them into catalyst types in createDataFrame. liancheng
Author: Xiangrui Meng <meng@databricks.com>
Closes#5329 from mengxr/SPARK-6672 and squashes the following commits:
2d52644 [Xiangrui Meng] set needsConversion = false in jsonRDD
06896e4 [Xiangrui Meng] add createDataFrame without conversion
4a3767b [Xiangrui Meng] convert Row to catalyst
In order to do inbound checking and type conversion, we should use Literal.create() instead of constructor.
Author: Davies Liu <davies@databricks.com>
Closes#5320 from davies/literal and squashes the following commits:
1667604 [Davies Liu] fix style and add comment
5f8c0fd [Davies Liu] use Literal.create instread of constructor
Before it was possible for a query to flip back and forth from a resolved state, allowing resolution to propagate up before coercion had stabilized. The issue was that `ResolvedReferences` would run after `FunctionArgumentConversion`, but before `PropagateTypes` had run. This PR ensures we correctly `PropagateTypes` after any coercion has applied.
Author: Michael Armbrust <michael@databricks.com>
Closes#5278 from marmbrus/unionNull and squashes the following commits:
dc3581a [Michael Armbrust] [SPARK-5371][SQL] Propogate types after function conversion / before futher resolution
This PR is based on work by cloud-fan in #4904, but with two differences:
- We isolate the logic for Sort's special handling into `ResolveSortReferences`
- We avoid creating UnresolvedGetField expressions during resolution. Instead we either resolve GetField or we return None. This avoids us going down the wrong path early on.
Author: Michael Armbrust <michael@databricks.com>
Closes#5189 from marmbrus/nestedOrderBy and squashes the following commits:
b8cae45 [Michael Armbrust] fix another test
0f36a11 [Michael Armbrust] WIP
91820cd [Michael Armbrust] Fix bug.
Similar to `CreateArray`, we can add `CreateStruct` to create nested columns. marmbrus
Author: Xiangrui Meng <meng@databricks.com>
Closes#5195 from mengxr/SPARK-6542 and squashes the following commits:
3795c57 [Xiangrui Meng] update error message
ae7ac3e [Xiangrui Meng] move unit test to a separate suite
85dd559 [Xiangrui Meng] use NamedExpr
c78e31a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-6542
85f3106 [Xiangrui Meng] add CreateStruct
This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API.
Author: Reynold Xin <rxin@databricks.com>
Closes#5274 from rxin/df-missing-value and squashes the following commits:
4ee1b98 [Reynold Xin] Improve error reporting in Python.
33a330c [Reynold Xin] Remove replace for now.
bc4fdbb [Reynold Xin] Added documentation for replace.
d56f5a5 [Reynold Xin] Added replace for Scala/Java.
2385d00 [Reynold Xin] Feedback from Xiangrui on "how".
914a374 [Reynold Xin] fill with map.
185c67e [Reynold Xin] Allow specifying column subsets in fill.
749eb47 [Reynold Xin] fillna
249b94e [Reynold Xin] Removing undefined functions.
6a73c68 [Reynold Xin] Missing file.
67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)
https://issues.apache.org/jira/browse/SPARK-6592
The current impl in SparkBuild.scala filter all classes under catalyst directory, however, we have a corner case that Row class is a public API under that directory
we need to include Row into the scaladoc while still excluding other classes of catalyst project
Thanks for the help on this patch from rxin and liancheng
Author: CodingCat <zhunansjtu@gmail.com>
Closes#5252 from CodingCat/SPARK-6592 and squashes the following commits:
02098a4 [CodingCat] ignore collection, enable types (except those protected classes)
f7af2cb [CodingCat] commit
3ab4403 [CodingCat] fix filter for scaladoc to generate API doc for Row.scala under catalyst directory
Now that we have `DataFrame`s it is possible to have multiple copies in a single query plan. As such, it needs to inherit from `MultiInstanceRelation` or self joins will break. I also add better debugging errors when our self join handling fails in case there are future bugs.
Author: Michael Armbrust <michael@databricks.com>
Closes#5251 from marmbrus/multiMetaStore and squashes the following commits:
4272f6d [Michael Armbrust] [SPARK-6595][SQL] MetastoreRelation should be MuliInstanceRelation
Currently if trying to register an RDD (or DataFrame in 1.3) as a table that has types that have no supported Schema representation (e.g. type "Any") - it would throw a match error. e.g. scala.MatchError: Any (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)
This fix is just to have a nicer error message than a MatchError
Author: Eran Medan <ehrann.mehdan@gmail.com>
Closes#5235 from eranation/patch-2 and squashes the following commits:
af4b1a2 [Eran Medan] Line should be under 100 chars
0c69e9d [Eran Medan] Change from sys.error UnsupportedOperationException
524be86 [Eran Medan] better exception than scala.MatchError: Any
Author: Reynold Xin <rxin@databricks.com>
Closes#5226 from rxin/empty-df and squashes the following commits:
1306d88 [Reynold Xin] Proper fix.
e135bb9 [Reynold Xin] [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 rows, not 1 row.
Author: Michael Armbrust <michael@databricks.com>
Closes#5191 from marmbrus/kryoRowsWithSchema and squashes the following commits:
bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using kryo
f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema
Previously this could result in sets compare equals when in fact the right was a subset of the left.
Based on #5133 by sisihj
Author: sisihj <jun.hejun@huawei.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#5194 from marmbrus/pr/5133 and squashes the following commits:
5ed4615 [Michael Armbrust] fix imports
d4cbbc0 [Michael Armbrust] Add test cases
0a0834f [sisihj] AttributeSet.equal should compare size
Current `castStruct` should be very slow. This pr slightly improves it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#5017 from viirya/faster_caststruct and squashes the following commits:
385d5b0 [Liang-Chi Hsieh] Further improved.
746fcfb [Liang-Chi Hsieh] Make castStruct faster.
As issue [SPARK-6483](https://issues.apache.org/jira/browse/SPARK-6483) description, ScalaUdf is low performance because of calling *asInstanceOf* to convert per record.
With this, the performance of ScalaUdf is the same as other case.
thank lianhuiwang for telling me how to resolve this problem.
Author: zzcclp <xm_zzc@sina.com>
Closes#5154 from zzcclp/SPARK-6483 and squashes the following commits:
5ac6e09 [zzcclp] Add a newline at the end of source file
cc6868e [zzcclp] Fix for fail on unit test.
0a8cdc3 [zzcclp] indention issue
b73836a [zzcclp] Access Seq[Expression] element by :: operator, and update the code gen script.
7763848 [zzcclp] rebase from master
I think after this PR, we can finally turn the rule on. There are still some smaller ones that need to be fixed, but those are easier.
Author: Reynold Xin <rxin@databricks.com>
Closes#5162 from rxin/catalyst-explicit-types and squashes the following commits:
e7eac03 [Reynold Xin] [SPARK-6428][SQL] Added explicit types for all public methods in catalyst.
Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again. However, with eager analysis in `DataFrame`s this can cause errors for queries such as:
```scala
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()
```
As a result, in this PR we defer the elimination of subqueries until the optimization phase.
Author: Michael Armbrust <michael@databricks.com>
Closes#5160 from marmbrus/subqueriesInDfs and squashes the following commits:
a9bb262 [Michael Armbrust] Update Optimizer.scala
27d25bf [Michael Armbrust] fix hive tests
9137e03 [Michael Armbrust] add type
81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization
Author: Michael Armbrust <michael@databricks.com>
Closes#5155 from marmbrus/errorMessages and squashes the following commits:
b898188 [Michael Armbrust] Fix formatting of error messages.
Due to a recent change that made `StructType` a `Seq` we started inadvertently turning `StructType`s into generic `Traversable` when attempting nested tree transformations. In this PR we explicitly avoid descending into `DataType`s to avoid this bug.
Author: Michael Armbrust <michael@databricks.com>
Closes#5157 from marmbrus/udfFix and squashes the following commits:
26f7087 [Michael Armbrust] Fix transformations of TreeNodes that hold StructTypes
This is used by ML pipelines to embed ML attributes in columns created by ML transformers/estimators. marmbrus
Author: Xiangrui Meng <meng@databricks.com>
Closes#5151 from mengxr/SPARK-6361 and squashes the following commits:
bb30de3 [Xiangrui Meng] support adding a column with metadata in DF
In `CheckAnalysis`, `Filter` and `Aggregate` are checked in separate case clauses, thus never hit those clauses for unresolved operators and missing input attributes.
This PR also removes the `prettyString` call when generating error message for missing input attributes. Because result of `prettyString` doesn't contain expression ID, and may give confusing messages like
> resolved attributes a missing from a
cc rxin
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5129)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#5129 from liancheng/spark-6452 and squashes the following commits:
52cdc69 [Cheng Lian] Addresses comments
029f9bd [Cheng Lian] Checks for missing attributes and unresolved operator for all types of operator
https://github.com/apache/spark/pull/5082
/cc liancheng
Author: Yadong Qi <qiyadong2010@gmail.com>
Closes#5132 from watermen/sql-missingInput-new and squashes the following commits:
1e5bdc5 [Yadong Qi] Check the missingInput simply
Author: q00251598 <qiyadong@huawei.com>
Closes#5082 from watermen/sql-missingInput and squashes the following commits:
25766b9 [q00251598] Check the missingInput simply
This PR creates a trait `DataTypeParser` used to parse data types. This trait aims to be single place to provide the functionality of parsing data types' string representation. It is currently mixed in with `DDLParser` and `SqlParser`. It is also used to parse the data type for `DataFrame.cast` and to convert Hive metastore's data type string back to a `DataType`.
JIRA: https://issues.apache.org/jira/browse/SPARK-6250
Author: Yin Huai <yhuai@databricks.com>
Closes#5078 from yhuai/ddlKeywords and squashes the following commits:
0e66097 [Yin Huai] Special handle struct<>.
fea6012 [Yin Huai] Style.
c9733fb [Yin Huai] Create a trait to parse data types.
SELECT sum('a'), avg('a'), variance('a'), std('a') FROM src;
Should give output as
0.0 NULL NULL NULL
This fixes hive udaf_number_format.q
Author: Venkata Ramana G <ramana.gollamudihuawei.com>
Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com>
Closes#4466 from gvramana/sum_fix and squashes the following commits:
42e14d1 [Venkata Ramana Gollamudi] Added comments
39415c0 [Venkata Ramana Gollamudi] Handled the partitioned Sum expression scenario
df66515 [Venkata Ramana Gollamudi] code style fix
4be2606 [Venkata Ramana Gollamudi] Add udaf_number_format to whitelist and golden answer
330fd64 [Venkata Ramana Gollamudi] fix sum function for all null data
Because of no statistics override, in spute of super class say 'LeafNode must override'.
fix issue
[SPARK-5320: Joins on simple table created using select gives error](https://issues.apache.org/jira/browse/SPARK-5320)
Author: x1- <viva008@gmail.com>
Closes#5105 from x1-/SPARK-5320 and squashes the following commits:
e561aac [x1-] Add statistics method at NoRelation (override super).
Also implemented equals/hashCode when they are missing.
This is done in order to enable automatic public method type checking.
Author: Reynold Xin <rxin@databricks.com>
Closes#5104 from rxin/sql-hashcode-explicittype and squashes the following commits:
ffce6f3 [Reynold Xin] Code review feedback.
8b36733 [Reynold Xin] [SPARK-6428][SQL] Added explicit type for all public methods.
Use `Utils.createTempDir()` to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
Author: Sean Owen <sowen@cloudera.com>
Closes#5029 from srowen/SPARK-6338 and squashes the following commits:
27b740a [Sean Owen] Fix hive-thriftserver tests that don't expect an existing dir
4a212fa [Sean Owen] Standardize a bit more temp dir management
9004081 [Sean Owen] Revert some added recursive-delete calls
57609e4 [Sean Owen] Use Utils.createTempDir() to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
We need to handle ambiguous `exprId`s that are produced by new aliases as well as those caused by leaf nodes (`MultiInstanceRelation`).
Attempting to fix this revealed a bug in `equals` for `Alias` as these objects were comparing equal even when the expression ids did not match. Additionally, `LocalRelation` did not correctly provide statistics, and some tests in `catalyst` and `hive` were not using the helper functions for comparing plans.
Based on #4991 by chenghao-intel
Author: Michael Armbrust <michael@databricks.com>
Closes#5062 from marmbrus/selfJoins and squashes the following commits:
8e9b84b [Michael Armbrust] check qualifier too
8038a36 [Michael Armbrust] handle aggs too
0b9c687 [Michael Armbrust] fix more tests
c3c574b [Michael Armbrust] revert change.
725f1ab [Michael Armbrust] add statistics
a925d08 [Michael Armbrust] check for conflicting attributes in join resolution
b022ef7 [Michael Armbrust] Handle project aliases.
d8caa40 [Michael Armbrust] test case: SPARK-6247
f9c67c2 [Michael Armbrust] Check for duplicate attributes in join resolution.
898af73 [Michael Armbrust] Fix Alias equality.
By default, the statistic for logical plan with multiple children is quite aggressive, and those statistic are quite critical for the join optimization, hence we need to estimate the statistics as accurate as possible.
For `Union`, which has 2 children, and overwrite the default implementation by `adding` its children `byteInSize` instead of `multiplying`.
For `Expand`, which only has a single child, but it will grows the size, and we need to multiply its inflating factor.
Author: Cheng Hao <hao.cheng@intel.com>
Closes#4914 from chenghao-intel/statistic and squashes the following commits:
d466bbc [Cheng Hao] Update the default statistic
use prettyString instead of toString() (which include id of expression) as column name in agg()
Author: Davies Liu <davies@databricks.com>
Closes#5006 from davies/prettystring and squashes the following commits:
cb1fdcf [Davies Liu] use prettyString as column name in agg()
Removed an repeated "from" in the comments.
Author: Hongbo Liu <liuhb86@gmail.com>
Closes#4976 from liuhb86/mine and squashes the following commits:
e280e7c [Hongbo Liu] [SQL][Minor] fix typo in comments
The extra blank line is preventing the first lines from showing up in the package summary page.
Author: Reynold Xin <rxin@databricks.com>
Closes#4955 from rxin/datatype-docs and squashes the following commits:
1621114 [Reynold Xin] Minor doc: Remove the extra blank line in data types javadoc.
Based on #4904 with style errors fixed.
`LogicalPlan#resolve` will not only produce `Attribute`, but also "`GetField` chain".
So in `ResolveSortReferences`, after resolve the ordering expressions, we should not just collect the `Attribute` results, but also `Attribute` at the bottom of "`GetField` chain".
Author: Wenchen Fan <cloud0fan@outlook.com>
Author: Michael Armbrust <michael@databricks.com>
Closes#4918 from marmbrus/pr/4904 and squashes the following commits:
997f84e [Michael Armbrust] fix style
3eedbfc [Wenchen Fan] fix 6145
In `CodeGenerator`, the casting on `FloatType` should use `FloatType` instead of `IntegerType`.
Besides, `defaultPrimitive` for `LongType` should be `-1L` instead of `1L`.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4870 from viirya/codegen_type and squashes the following commits:
76311dd [Liang-Chi Hsieh] Fix wrong datatype for casting on FloatType. Fix the wrong value for LongType in defaultPrimitive.
This PR contains the following changes:
1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` as well as if the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
3. For our data source write path, when appending data, we always use the schema of existing table to write the data. This is important for parquet, since nullability direct impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
5. Update the equality check of JSON relation. Since JSON does not really cares nullability, `equalsIgnoreNullability` seems a better choice to compare schemata from to JSON tables.
JIRA: https://issues.apache.org/jira/browse/SPARK-5950
Thanks viirya for the initial work in #4729.
cc marmbrus liancheng
Author: Yin Huai <yhuai@databricks.com>
Closes#4826 from yhuai/insertNullabilityCheck and squashes the following commits:
3b61a04 [Yin Huai] Revert change on equals.
80e487e [Yin Huai] asNullable in UDT.
587d88b [Yin Huai] Make methods private.
0cb7ea2 [Yin Huai] marmbrus's comments.
3cec464 [Yin Huai] Cheng's comments.
486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
d3747d1 [Yin Huai] Remove unnecessary change.
8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
0eb5578 [Yin Huai] Fix tests.
f6ed813 [Yin Huai] Update old parquet path.
e4f397c [Yin Huai] Unit tests.
b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
9a26611 [Yin Huai] Make InsertIntoTable happy.
8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
4ec17fd [Yin Huai] Failed test.
It should be `true` instead of `false`?
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4762 from viirya/doc_fix and squashes the following commits:
2e37482 [Liang-Chi Hsieh] Fix doc.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#4760 from viirya/dup_literal and squashes the following commits:
06e7516 [Liang-Chi Hsieh] Remove duplicate Literal matching block.
Author: Michael Armbrust <michael@databricks.com>
Closes#4736 from marmbrus/asExprs and squashes the following commits:
5ba97e4 [Michael Armbrust] [SPARK-5910][SQL] Support for as in selectExpr
https://issues.apache.org/jira/browse/SPARK-5875 has a case to reproduce the bug and explain the root cause.
Author: Yin Huai <yhuai@databricks.com>
Closes#4663 from yhuai/projectResolved and squashes the following commits:
472f7b6 [Yin Huai] If a logical.Project has any AggregateExpression or Generator, it's resolved field should be false.
Author: Reynold Xin <rxin@databricks.com>
Closes#4640 from rxin/SPARK-5853 and squashes the following commits:
9c6f569 [Reynold Xin] [SPARK-5853][SQL] Schema support in Row.
Existing implementation of arithmetic operators and BinaryComparison operators have redundant type checking codes, e.g.:
Expression.n2 is used by Add/Subtract/Multiply.
(1) n2 always checks left.dataType == right.dataType. However, this checking should be done once when we resolve expression types;
(2) n2 requires dataType is a NumericType. This can be done once.
This PR optimizes arithmetic and predicate operators by removing such redundant type-checking codes.
Some preliminary benchmarking on 10G TPC-H data over 5 r3.2xlarge EC2 machines shows that this PR can reduce the query time by 5.5% to 11%.
The benchmark queries follow the template below, where OP is plus/minus/times/divide/remainder/bitwise and/bitwise or/bitwise xor.
SELECT l_returnflag, l_linestatus, SUM(l_quantity OP cnt1), SUM(l_quantity OP cnt2), ...., SUM(l_quantity OP cnt700)
FROM (
SELECT l_returnflag, l_linestatus, l_quantity, 1 AS cnt1, 2 AS cnt2, ..., 700 AS cnt700
FROM lineitem
WHERE l_shipdate <= '1998-09-01'
)
GROUP BY l_returnflag, l_linestatus;
Author: kai <kaizeng@eecs.berkeley.edu>
Closes#4472 from kai-zeng/arithmetic-optimize and squashes the following commits:
fef0cf1 [kai] Merge branch 'master' of github.com:apache/spark into arithmetic-optimize
4b3a1bb [kai] chmod a-x
5a41e49 [kai] chmod a-x Expression.scala
cb37c94 [kai] rebase onto spark master
7f6e968 [kai] chmod 100755 -> 100644
6cddb46 [kai] format
7490dbc [kai] fix unresolved-expression exception for EqualTo
9c40bc0 [kai] fix bitwisenot
3cbd363 [kai] clean up test code
ca47801 [kai] override evalInternal for bitwise ops
8fa84a1 [kai] add bitwise or and xor
6892fc4 [kai] revert override evalInternal
f8eba24 [kai] override evalInternal
31ccdd4 [kai] rewrite all bitwise op and remove evalInternal
86297e2 [kai] generalized
cb92ae1 [kai] bitwise-and: override eval
97a7d6c [kai] bitwise-and: override evalInternal using and func
0906c39 [kai] add bitwise test
62abbbc [kai] clean up predicate and arithmetic
b34d58d [kai] add caching and benmark option
12c5b32 [kai] override eval
1cd7571 [kai] fix sqrt and maxof
03fd0c3 [kai] fix predicate
16fd84c [kai] optimize + - * / % -(unary) abs < > <= >=
fd95823 [kai] remove unnecessary type checking
24d062f [kai] test suite
JIRA: https://issues.apache.org/jira/browse/SPARK-5746
liancheng marmbrus
Author: Yin Huai <yhuai@databricks.com>
Closes#4617 from yhuai/insertOverwrite and squashes the following commits:
8e3019d [Yin Huai] Fix compilation error.
499e8e7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
e76e85a [Yin Huai] Address comments.
ac31b3c [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
f30bdad [Yin Huai] Use toDF.
99da57e [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertOverwrite
6b7545c [Yin Huai] Add a pre write check to the data source API.
a88c516 [Yin Huai] DDLParser will take a parsering function to take care CTAS statements.
Lifts `HiveMetastoreCatalog.refreshTable` to `Catalog`. Adds `RefreshTable` command to refresh (possibly cached) metadata in external data sources tables.
<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4624)
<!-- Reviewable:end -->
Author: Cheng Lian <lian@databricks.com>
Closes#4624 from liancheng/refresh-table and squashes the following commits:
8d1aa4c [Cheng Lian] Adds REFRESH TABLE command