Author: Michael Armbrust <michael@databricks.com>
Closes #6363 from marmbrus/windowErrors and squashes the following commits:
516b02d [Michael Armbrust] [SPARK-7834] [SQL] Better window error messages
(cherry picked from commit 3c1305107a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Santiago M. Mola <santi@mola.io>
Closes #6327 from smola/feature/catalyst-dsl-set-ops and squashes the following commits:
11db778 [Santiago M. Mola] [SPARK-7724] [SQL] Support Intersect/Except in Catalyst DSL.
(cherry picked from commit e4aef91fe7)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Michael Armbrust <michael@databricks.com>
Closes #6165 from marmbrus/wrongColumn and squashes the following commits:
4fad158 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into wrongColumn
aad7eab [Michael Armbrust] rxins comments
f1e8df1 [Michael Armbrust] [SPARK-6743][SQL] Fix empty projections of cached data
(cherry picked from commit 3b68cb0430)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Follow-up for #5806.
Author: scwf <wangfei1@huawei.com>
Closes #6164 from scwf/FunctionRegistry and squashes the following commits:
15e6697 [scwf] use catalogconf in FunctionRegistry
(cherry picked from commit 60336e3bc0)
Signed-off-by: Michael Armbrust <michael@databricks.com>
```
select explode(map(value, key)) from src;
```
The query above throws the following exception:
```
org.apache.spark.sql.AnalysisException: The number of aliases supplied in the AS clause does not match the number of columns output by the UDTF expected 2 aliases but got _c0 ;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:38)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:43)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveGenerate$$makeGeneratorOutput(Analyzer.scala:605)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:562)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16$$anonfun$22.apply(Analyzer.scala:548)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:548)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate$$anonfun$apply$16.applyOrElse(Analyzer.scala:538)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:222)
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #6178 from chenghao-intel/explode and squashes the following commits:
916fbe9 [Cheng Hao] add more strict rules for TGF alias
5c3f2c5 [Cheng Hao] fix bug in unit test
e1d93ab [Cheng Hao] Add more unit test
19db09e [Cheng Hao] resolve names for generator in projection
(cherry picked from commit bcb1ff8146)
Signed-off-by: Michael Armbrust <michael@databricks.com>
A modified version of https://github.com/apache/spark/pull/6110 that uses `semanticEquals` to make it more efficient.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #6173 from cloud-fan/7269 and squashes the following commits:
e4a3cc7 [Wenchen Fan] address comments
cc02045 [Wenchen Fan] consider elements length equal
d7ff8f4 [Wenchen Fan] fix 7269
(cherry picked from commit 103c863c2e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
spark-sql>
> explain extended
> select * from (
> select key from src union all
> select key from src) t;
Currently, the Spark plan prints its children in argString:
```
== Physical Plan ==
Union[ HiveTableScan key#1, (MetastoreRelation default, src, None), None,
HiveTableScan key#3, (MetastoreRelation default, src, None), None]
HiveTableScan key#1, (MetastoreRelation default, src, None), None
HiveTableScan key#3, (MetastoreRelation default, src, None), None
```
After this patch:
```
== Physical Plan ==
Union
HiveTableScan [key#1], (MetastoreRelation default, src, None), None
HiveTableScan [key#3], (MetastoreRelation default, src, None), None
```
I have tested this locally
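To illustrate the intent of the argString change, here is a minimal Python sketch of a tree printer whose argument string skips child nodes, so each child is printed only once on its own indented line. The `Node` class, its methods, and the `scan` helper are hypothetical stand-ins for illustration, not Spark's TreeNode API.

```python
from typing import Any, List


class Node:
    """Toy plan node: `args` are constructor arguments, some of which are children."""

    def __init__(self, name: str, args: List[Any]) -> None:
        self.name = name
        self.args = args

    def children(self) -> List["Node"]:
        # Children are Node arguments, either direct or inside a list argument.
        found: List["Node"] = []
        for arg in self.args:
            if isinstance(arg, Node):
                found.append(arg)
            elif isinstance(arg, list):
                found.extend(a for a in arg if isinstance(a, Node))
        return found

    def arg_string(self) -> str:
        # Render only non-child arguments; children get their own lines.
        parts: List[str] = []
        for arg in self.args:
            if isinstance(arg, Node):
                continue
            if isinstance(arg, list):
                if arg and all(isinstance(a, Node) for a in arg):
                    continue  # a sequence of children, e.g. the inputs of a Union
                parts.append("[" + ",".join(str(a) for a in arg) + "]")
            else:
                parts.append(str(arg))
        return ", ".join(parts)

    def tree_string(self, depth: int = 0) -> str:
        line = " " * depth + (self.name + " " + self.arg_string()).rstrip()
        return "\n".join([line] + [c.tree_string(depth + 1) for c in self.children()])


def scan(attr: str) -> Node:
    # Hypothetical helper that builds a leaf resembling the scans shown above.
    return Node("HiveTableScan", [[attr], "(MetastoreRelation default, src, None)", None])


plan_text = Node("Union", [[scan("key#1"), scan("key#3")]]).tree_string()
print(plan_text)
```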
Author: scwf <wangfei1@huawei.com>
Closes #6144 from scwf/fix-argString and squashes the following commits:
1a642e0 [scwf] fix treenode argString
(cherry picked from commit fc2480ed13)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #6235 from cloud-fan/tmp and squashes the following commits:
8f16367 [Wenchen Fan] use private[this]
(cherry picked from commit 56ede88485)
Signed-off-by: Michael Armbrust <michael@databricks.com>
This is a follow-up of https://github.com/apache/spark/pull/5154. We can speed up Scala UDF evaluation by creating the type converters in advance.
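The idea of creating converters in advance can be shown with a small Python sketch. It is not the ScalaUdf code from this patch; `make_converter`, `Udf`, and the type names are hypothetical and only show the converters being resolved once at construction instead of once per row.

```python
from typing import Any, Callable, List


def make_converter(type_name: str) -> Callable[[Any], Any]:
    """Hypothetical type-directed lookup; assumed to be the costly step."""
    converters = {"int": int, "string": str, "double": float}
    return converters[type_name]


class Udf:
    """Toy UDF wrapper: input converters are built once, not per row."""

    def __init__(self, func: Callable[..., Any], arg_types: List[str]) -> None:
        self.func = func
        # Created in advance, so eval() does no per-row type dispatch.
        self.converters = [make_converter(t) for t in arg_types]

    def eval(self, row: List[Any]) -> Any:
        args = [conv(value) for conv, value in zip(self.converters, row)]
        return self.func(*args)


add = Udf(lambda a, b: a + b, ["int", "int"])
results = [add.eval(row) for row in [["1", "2"], ["10", "5"]]]
print(results)
```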
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #6182 from cloud-fan/tmp and squashes the following commits:
241cfe9 [Wenchen Fan] use converter in ScalaUdf
(cherry picked from commit 2f22424e9f)
Signed-off-by: Yin Huai <yhuai@databricks.com>
JIRA: https://issues.apache.org/jira/browse/SPARK-7098
A WHERE clause with a timestamp shows inconsistent results. This PR fixes it.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5682 from viirya/consistent_timestamp and squashes the following commits:
171445a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into consistent_timestamp
4e98520 [Liang-Chi Hsieh] Make the WHERE clause with timestamp show consistent result.
(cherry picked from commit f9705d4613)
Signed-off-by: Reynold Xin <rxin@databricks.com>
Add an `explode` function for dataframes and modify the analyzer so that single table generating functions can be present in a select clause along with other expressions. There are currently the following restrictions:
- only top level TGFs are allowed (i.e. no `select(explode('list) + 1)`)
- only one may be present in a single select to avoid potentially confusing implicit Cartesian products.
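As a rough illustration of the semantics described above, the following plain-Python sketch models a select that mixes ordinary columns with a single top-level generator. It is a toy model, not the DataFrame API; `select_with_explode` and its parameters are hypothetical names, and dropping rows with an empty list is an assumption of the sketch.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def select_with_explode(
    rows: List[Row], keep: List[str], generators: List[str], out_name: str
) -> List[Row]:
    """Project the `keep` columns alongside one exploded list column."""
    if len(generators) != 1:
        # Mirrors the stated restriction: only one generator per select,
        # which avoids an implicit Cartesian product between generators.
        raise ValueError("only one generator is allowed per select")
    list_col = generators[0]
    out: List[Row] = []
    for row in rows:
        for item in row[list_col]:
            new_row = {name: row[name] for name in keep}
            new_row[out_name] = item
            out.append(new_row)
    return out


data = [{"id": 1, "list": [10, 20]}, {"id": 2, "list": []}, {"id": 3, "list": [30]}]
exploded = select_with_explode(data, keep=["id"], generators=["list"], out_name="item")
print(exploded)
```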
TODO:
- [ ] Python
Author: Michael Armbrust <michael@databricks.com>
Closes #6107 from marmbrus/explodeFunction and squashes the following commits:
7ee2c87 [Michael Armbrust] whitespace
6f80ba3 [Michael Armbrust] Update dataframe.py
c176c89 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
81b5da3 [Michael Armbrust] style
d3faa05 [Michael Armbrust] fix self join case
f9e1e3e [Michael Armbrust] fix python, add since
4f0d0a9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
e710fe4 [Michael Armbrust] add java and python
52ca0dc [Michael Armbrust] [SPARK-7548][SQL] Add explode function for dataframes.
(cherry picked from commit 6d0633e3ec)
Signed-off-by: Michael Armbrust <michael@databricks.com>
A follow-up of https://github.com/apache/spark/pull/5624
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #6142 from cloud-fan/tmp and squashes the following commits:
971a92b [Wenchen Fan] use plan instead of execute
24c5ffe [Wenchen Fan] rename apply
(cherry picked from commit f2cd00be35)
Signed-off-by: Reynold Xin <rxin@databricks.com>
for example:
table: src(key string, value string)
sql: with v1 as(select key, count(value) over (partition by key) cnt_val from src), v2 as(select v1.key, v1_lag.cnt_val from v1, v1 v1_lag where v1.key = v1_lag.key) select * from v2 limit 5;
Analysis then fails when resolving conflicting references in the Join:
'Limit 5
'Project [*]
'Subquery v2
'Project ['v1.key,'v1_lag.cnt_val]
'Filter ('v1.key = 'v1_lag.key)
'Join Inner, None
Subquery v1
Project [key#95,cnt_val#94L]
Window [key#95,value#96], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#96) WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#95], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Project [key#95,value#96]
MetastoreRelation default, src, None
Subquery v1_lag
Subquery v1
Project [key#97,cnt_val#94L]
Window [key#97,value#98], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCount(value#98) WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS cnt_val#94L], WindowSpecDefinition [key#97], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Project [key#97,value#98]
MetastoreRelation default, src, None
Conflicting attributes: cnt_val#94L
Author: linweizhong <linweizhong@huawei.com>
Closes #6114 from Sephiroth-Lin/spark-7595 and squashes the following commits:
f8f2637 [linweizhong] Add unit test
dfe9169 [linweizhong] Handle windowExpression with self join
(cherry picked from commit 13e652b61a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
JavaTypeInference into catalyst
types.DateUtils into catalyst
CacheManager into execution
DefaultParserDialect into catalyst
Author: Reynold Xin <rxin@databricks.com>
Closes #6108 from rxin/sql-rename and squashes the following commits:
3fc9613 [Reynold Xin] Fixed import ordering.
83d9ff4 [Reynold Xin] Fixed codegen tests.
e271e86 [Reynold Xin] mima
f4e24a6 [Reynold Xin] [SQL] Move some classes into packages that are more appropriate.
(cherry picked from commit e683182c3e)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Optimize the case of `project(_, sort)`. An example is:
`select key from (select * from testData order by key) t`
Before this PR:
```
== Parsed Logical Plan ==
'Project ['key]
'Subquery t
'Sort ['key ASC], true
'Project [*]
'UnresolvedRelation [testData], None
== Analyzed Logical Plan ==
Project [key#0]
Subquery t
Sort [key#0 ASC], true
Project [key#0,value#1]
Subquery testData
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Optimized Logical Plan ==
Project [key#0]
Sort [key#0 ASC], true
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Physical Plan ==
Project [key#0]
Sort [key#0 ASC], true
Exchange (RangePartitioning [key#0 ASC], 5), []
PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
```
After this PR:
```
== Parsed Logical Plan ==
'Project ['key]
'Subquery t
'Sort ['key ASC], true
'Project [*]
'UnresolvedRelation [testData], None
== Analyzed Logical Plan ==
Project [key#0]
Subquery t
Sort [key#0 ASC], true
Project [key#0,value#1]
Subquery testData
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Optimized Logical Plan ==
Sort [key#0 ASC], true
Project [key#0]
LogicalRDD [key#0,value#1], MapPartitionsRDD[1]
== Physical Plan ==
Sort [key#0 ASC], true
Exchange (RangePartitioning [key#0 ASC], 5), []
Project [key#0]
PhysicalRDD [key#0,value#1], MapPartitionsRDD[1]
```
With this rule, we first do column pruning on the table and then do the sorting.
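The rewrite visible in the optimized plans above can be sketched as a tiny rule over a toy plan tree. This is plain Python, not the Catalyst optimizer rule from this PR; the `Relation`, `Sort`, and `Project` classes and the safety condition (every sort key must survive the projection) are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Relation:
    columns: List[str]


@dataclass
class Sort:
    keys: List[str]
    child: "Plan"


@dataclass
class Project:
    columns: List[str]
    child: "Plan"


Plan = Union[Relation, Sort, Project]


def push_project_below_sort(plan: Plan) -> Plan:
    """Rewrite Project(Sort(child)) into Sort(Project(child)) when safe."""
    if isinstance(plan, Project) and isinstance(plan.child, Sort):
        sort = plan.child
        # Assumed safety check: the sort keys must still exist after pruning.
        if set(sort.keys) <= set(plan.columns):
            return Sort(sort.keys, Project(plan.columns, sort.child))
    return plan


table = Relation(["key", "value"])
before = Project(["key"], Sort(["key"], table))
after = push_project_below_sort(before)
unchanged = push_project_below_sort(Project(["value"], Sort(["key"], table)))
print(after)
```

In the first case the sort now sees only the `key` column; in the second the plan is left alone because pruning first would drop the sort key.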
Author: scwf <wangfei1@huawei.com>
This patch had conflicts when merged, resolved by
Committer: Michael Armbrust <michael@databricks.com>
Closes #5838 from scwf/pruning and squashes the following commits:
b00d833 [scwf] address michael's comment
e230155 [scwf] fix tests failure
b09b895 [scwf] improve column pruning
(cherry picked from commit 59250fe514)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Some third-party UDTF extensions generate additional rows in the "GenericUDTF.close()" method, which is supported / documented by Hive.
https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide+UDTF
However, Spark SQL ignores "GenericUDTF.close()", which causes bugs when porting jobs from Hive to Spark SQL.
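A minimal Python sketch of why skipping the close step loses data: the toy generator below buffers its input and emits rows only when closed. `BufferingUdtf` and `run_udtf` are hypothetical names that model the behaviour described above; they are not Hive's GenericUDTF interface or Spark's implementation.

```python
from typing import Any, Iterable, List


class BufferingUdtf:
    """Toy UDTF that emits nothing from process() and flushes in close()."""

    def __init__(self) -> None:
        self.buffer: List[Any] = []

    def process(self, value: Any) -> List[Any]:
        self.buffer.append(value)
        return []  # no rows are forwarded yet

    def close(self) -> List[Any]:
        rows, self.buffer = self.buffer, []
        return rows


def run_udtf(udtf: BufferingUdtf, inputs: Iterable[Any], call_close: bool) -> List[Any]:
    """Drive the UDTF over all inputs, optionally collecting rows from close()."""
    out: List[Any] = []
    for value in inputs:
        out.extend(udtf.process(value))
    if call_close:
        out.extend(udtf.close())
    return out


without_close = run_udtf(BufferingUdtf(), [1, 2, 3], call_close=False)
with_close = run_udtf(BufferingUdtf(), [1, 2, 3], call_close=True)
print(without_close, with_close)
```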
Author: Cheng Hao <hao.cheng@intel.com>
Closes #5383 from chenghao-intel/udtf_close and squashes the following commits:
98b4e4b [Cheng Hao] Support UDTF.close
(cherry picked from commit 0da254fb29)
Signed-off-by: Cheng Lian <lian@databricks.com>
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #5831 from cloud-fan/7276 and squashes the following commits:
ee4a1e1 [Wenchen Fan] fix rebase mistake
a3b565d [Wenchen Fan] refactor
99deb5d [Wenchen Fan] add test
f1f67ad [Wenchen Fan] fix 7276
(cherry picked from commit 4e290522c2)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #6079 from cloud-fan/unapply and squashes the following commits:
40da442 [Wenchen Fan] one more
7d90a05 [Wenchen Fan] cleanup unapply in DataTypes
(cherry picked from commit 831504cf6b)
Signed-off-by: Reynold Xin <rxin@databricks.com>
As a follow-up to https://github.com/apache/spark/pull/5944
Author: Reynold Xin <rxin@databricks.com>
Closes #6064 from rxin/jointype-better-error and squashes the following commits:
7629bf7 [Reynold Xin] [SQL] Show better error messages for incorrect join types in DataFrames.
(cherry picked from commit 4f4dbb030c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
This is the first step: generalize UnresolvedGetField to support map, struct, and array types.
TODO: add `apply` in Scala and `__getitem__` in Python, and unify the `getItem` and `getField` methods into a single API (or should we keep them for compatibility?).
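The idea of a single extraction operator that resolves by the type of the container can be sketched in plain Python. This is only an illustration, not the Catalyst expression: `extract_value` is a hypothetical name, a namedtuple stands in for a struct, and returning `None` for a missing map key or out-of-range index is an assumption of the sketch.

```python
from collections import namedtuple
from typing import Any


def extract_value(container: Any, key: Any) -> Any:
    """Resolve one extraction operator over struct, map, and array values."""
    if isinstance(container, tuple) and hasattr(container, "_fields"):
        return getattr(container, key)  # struct: field by name
    if isinstance(container, dict):
        return container.get(key)  # map: value by key
    if isinstance(container, list):
        # array: element by ordinal
        return container[key] if 0 <= key < len(container) else None
    raise TypeError("cannot extract from " + type(container).__name__)


Point = namedtuple("Point", ["x", "y"])
from_struct = extract_value(Point(1, 2), "y")
from_map = extract_value({"a": 10}, "a")
from_array = extract_value([7, 8, 9], 1)
print(from_struct, from_map, from_array)
```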
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #5744 from cloud-fan/generalize and squashes the following commits:
715c589 [Wenchen Fan] address comments
7ea5b31 [Wenchen Fan] fix python test
4f0833a [Wenchen Fan] add python test
f515d69 [Wenchen Fan] add apply method and test cases
8df6199 [Wenchen Fan] fix python test
239730c [Wenchen Fan] fix test compile
2a70526 [Wenchen Fan] use _bin_op in dataframe.py
6bf72bc [Wenchen Fan] address comments
3f880c3 [Wenchen Fan] add java doc
ab35ab5 [Wenchen Fan] fix python test
b5961a9 [Wenchen Fan] fix style
c9d85f5 [Wenchen Fan] generalize UnresolvedGetField to support all map, struct, and array
(cherry picked from commit 2d05f325dc)
Signed-off-by: Michael Armbrust <michael@databricks.com>