Commit graph

24733 commits

Author SHA1 Message Date
Arun Pandian a0a58cf2ef [SPARK-28464][DOC][SS] Document Kafka source minPartitions option
Adding doc for the kafka source minPartitions option to "Structured Streaming + Kafka Integration Guide"

The text is based on the content in  https://docs.databricks.com/spark/latest/structured-streaming/kafka.html#configuration

Closes #25219 from arunpandianp/SPARK-28464.

Authored-by: Arun Pandian <apandian@groupon.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-21 13:13:30 -07:00
Takeshi Yamamuro 6e65d39576 [SPARK-28189][SQL][FOLLOW-UP] Remove the unnecessary test in DataFrameSuite
## What changes were proposed in this pull request?
This pr is to remove the unnecessary test  in DataFrameSuite.

## How was this patch tested?
N/A

Closes #25216 from maropu/SPARK-28189-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-21 00:07:35 -07:00
Huaxin Gao 72c80ee81c [SPARK-28243][PYSPARK][ML] Remove setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams
## What changes were proposed in this pull request?

Remove deprecated setFeatureSubsetStrategy and setSubsamplingRate from Python TreeEnsembleParams

## How was this patch tested?

Use existing tests.

Closes #25046 from huaxingao/spark-28243.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-07-20 10:44:33 -05:00
Xingbo Jiang 36d7d81d23 [SPARK-27815][SQL][FOLLOWUP][DOC] Update comment that references PushDownPredicate
## What changes were proposed in this pull request?

The optimize rule `PushDownPredicate` has been combined into `PushDownPredicates`, update the comment that references the old rule.

## How was this patch tested?

N/A

Closes #25207 from jiangxb1987/comment.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-20 16:44:28 +09:00
Terry Kim 771616eac9 [SPARK-28282][SQL][PYTHON][TESTS] Convert and port 'inline-table.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from `inline-table.sql` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'inline-table.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/inline-table.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-inline-table.sql.out
index 4e80f0bda5..2cf24e50c8 100644
--- a/sql/core/src/test/resources/sql-tests/results/inline-table.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-inline-table.sql.out
 -3,33 +3,33

 -- !query 0
-select * from values ("one", 1)
+select udf(col1), udf(col2) from values ("one", 1)
 -- !query 0 schema
-struct<col1:string,col2:int>
+struct<CAST(udf(cast(col1 as string)) AS STRING):string,CAST(udf(cast(col2 as string)) AS INT):int>
 -- !query 0 output
 one	1

 -- !query 1
-select * from values ("one", 1) as data
+select udf(col1), udf(udf(col2)) from values ("one", 1) as data
 -- !query 1 schema
-struct<col1:string,col2:int>
+struct<CAST(udf(cast(col1 as string)) AS STRING):string,CAST(udf(cast(cast(udf(cast(col2 as string)) as int) as string)) AS INT):int>
 -- !query 1 output
 one	1

 -- !query 2
-select * from values ("one", 1) as data(a, b)
+select udf(a), b from values ("one", 1) as data(a, b)
 -- !query 2 schema
-struct<a:string,b:int>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:int>
 -- !query 2 output
 one	1

 -- !query 3
-select * from values 1, 2, 3 as data(a)
+select udf(a) from values 1, 2, 3 as data(a)
 -- !query 3 schema
-struct<a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
 -- !query 3 output
 1
 2
 -37,9 +37,9  struct<a:int>

 -- !query 4
-select * from values ("one", 1), ("two", 2), ("three", null) as data(a, b)
+select udf(a), b from values ("one", 1), ("two", 2), ("three", null) as data(a, b)
 -- !query 4 schema
-struct<a:string,b:int>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:int>
 -- !query 4 output
 one	1
 three	NULL
 -47,107 +47,107  two	2

 -- !query 5
-select * from values ("one", null), ("two", null) as data(a, b)
+select udf(a), b from values ("one", null), ("two", null) as data(a, b)
 -- !query 5 schema
-struct<a:string,b:null>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:null>
 -- !query 5 output
 one	NULL
 two	NULL

 -- !query 6
-select * from values ("one", 1), ("two", 2L) as data(a, b)
+select udf(a), b from values ("one", 1), ("two", 2L) as data(a, b)
 -- !query 6 schema
-struct<a:string,b:bigint>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:bigint>
 -- !query 6 output
 one	1
 two	2

 -- !query 7
-select * from values ("one", 1 + 0), ("two", 1 + 3L) as data(a, b)
+select udf(udf(a)), udf(b) from values ("one", 1 + 0), ("two", 1 + 3L) as data(a, b)
 -- !query 7 schema
-struct<a:string,b:bigint>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as string) as string)) AS STRING):string,CAST(udf(cast(b as string)) AS BIGINT):bigint>
 -- !query 7 output
 one	1
 two	4

 -- !query 8
-select * from values ("one", array(0, 1)), ("two", array(2, 3)) as data(a, b)
+select udf(a), b from values ("one", array(0, 1)), ("two", array(2, 3)) as data(a, b)
 -- !query 8 schema
-struct<a:string,b:array<int>>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:array<int>>
 -- !query 8 output
 one	[0,1]
 two	[2,3]

 -- !query 9
-select * from values ("one", 2.0), ("two", 3.0D) as data(a, b)
+select udf(a), b from values ("one", 2.0), ("two", 3.0D) as data(a, b)
 -- !query 9 schema
-struct<a:string,b:double>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:double>
 -- !query 9 output
 one	2.0
 two	3.0

 -- !query 10
-select * from values ("one", rand(5)), ("two", 3.0D) as data(a, b)
+select udf(a), b from values ("one", rand(5)), ("two", 3.0D) as data(a, b)
 -- !query 10 schema
 struct<>
 -- !query 10 output
 org.apache.spark.sql.AnalysisException
-cannot evaluate expression rand(5) in inline table definition; line 1 pos 29
+cannot evaluate expression rand(5) in inline table definition; line 1 pos 37

 -- !query 11
-select * from values ("one", 2.0), ("two") as data(a, b)
+select udf(a), udf(b) from values ("one", 2.0), ("two") as data(a, b)
 -- !query 11 schema
 struct<>
 -- !query 11 output
 org.apache.spark.sql.AnalysisException
-expected 2 columns but found 1 columns in row 1; line 1 pos 14
+expected 2 columns but found 1 columns in row 1; line 1 pos 27

 -- !query 12
-select * from values ("one", array(0, 1)), ("two", struct(1, 2)) as data(a, b)
+select udf(a), udf(b) from values ("one", array(0, 1)), ("two", struct(1, 2)) as data(a, b)
 -- !query 12 schema
 struct<>
 -- !query 12 output
 org.apache.spark.sql.AnalysisException
-incompatible types found in column b for inline table; line 1 pos 14
+incompatible types found in column b for inline table; line 1 pos 27

 -- !query 13
-select * from values ("one"), ("two") as data(a, b)
+select udf(a), udf(b) from values ("one"), ("two") as data(a, b)
 -- !query 13 schema
 struct<>
 -- !query 13 output
 org.apache.spark.sql.AnalysisException
-expected 2 columns but found 1 columns in row 0; line 1 pos 14
+expected 2 columns but found 1 columns in row 0; line 1 pos 27

 -- !query 14
-select * from values ("one", random_not_exist_func(1)), ("two", 2) as data(a, b)
+select udf(a), udf(b) from values ("one", random_not_exist_func(1)), ("two", 2) as data(a, b)
 -- !query 14 schema
 struct<>
 -- !query 14 output
 org.apache.spark.sql.AnalysisException
-Undefined function: 'random_not_exist_func'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 29
+Undefined function: 'random_not_exist_func'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 42

 -- !query 15
-select * from values ("one", count(1)), ("two", 2) as data(a, b)
+select udf(a), udf(b) from values ("one", count(1)), ("two", 2) as data(a, b)
 -- !query 15 schema
 struct<>
 -- !query 15 output
 org.apache.spark.sql.AnalysisException
-cannot evaluate expression count(1) in inline table definition; line 1 pos 29
+cannot evaluate expression count(1) in inline table definition; line 1 pos 42

 -- !query 16
-select * from values (timestamp('1991-12-06 00:00:00.0'), array(timestamp('1991-12-06 01:00:00.0'), timestamp('1991-12-06 12:00:00.0'))) as data(a, b)
+select udf(a), b from values (timestamp('1991-12-06 00:00:00.0'), array(timestamp('1991-12-06 01:00:00.0'), timestamp('1991-12-06 12:00:00.0'))) as data(a, b)
 -- !query 16 schema
-struct<a:timestamp,b:array<timestamp>>
+struct<CAST(udf(cast(a as string)) AS TIMESTAMP):timestamp,b:array<timestamp>>
 -- !query 16 output
 1991-12-06 00:00:00	[1991-12-06 01:00:00.0,1991-12-06 12:00:00.0]

```
</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25124 from imback82/inline-table-sql.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-20 15:21:28 +09:00
Stavros Kontopoulos 9e5e511ca0 [SPARK-28279][SQL][PYTHON][TESTS] Convert and port 'group-analytics.sql' into UDF test base
## What changes were proposed in this pull request?
This PR adds some tests converted from group-analytics.sql to test UDFs. Please see contribution guide of this umbrella ticket - SPARK-27921.

<details><summary>Diff comparing to 'group-analytics.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
index 31e9e08e2c..3439a05727 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
 -13,9 +13,9  struct<>

 -- !query 1
-SELECT a + b, b, udf(SUM(a - b)) FROM testData GROUP BY a + b, b WITH CUBE
+SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH CUBE
 -- !query 1 schema
-struct<(a + b):int,b:int,CAST(udf(cast(sum(cast((a - b) as bigint)) as string)) AS BIGINT):bigint>
+struct<(a + b):int,b:int,sum((a - b)):bigint>
 -- !query 1 output
 2	1	0
 2	NULL	0
 -33,9 +33,9  NULL	NULL	3

 -- !query 2
-SELECT a, udf(b), SUM(b) FROM testData GROUP BY a, b WITH CUBE
+SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH CUBE
 -- !query 2 schema
-struct<a:int,CAST(udf(cast(b as string)) AS INT):int,sum(b):bigint>
+struct<a:int,b:int,sum(b):bigint>
 -- !query 2 output
 1	1	1
 1	2	2
 -52,9 +52,9  NULL	NULL	9

 -- !query 3
-SELECT udf(a + b), b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP
+SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP
 -- !query 3 schema
-struct<CAST(udf(cast((a + b) as string)) AS INT):int,b:int,sum((a - b)):bigint>
+struct<(a + b):int,b:int,sum((a - b)):bigint>
 -- !query 3 output
 2	1	0
 2	NULL	0
 -70,9 +70,9  NULL	NULL	3

 -- !query 4
-SELECT a, b, udf(SUM(b)) FROM testData GROUP BY a, b WITH ROLLUP
+SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH ROLLUP
 -- !query 4 schema
-struct<a:int,b:int,CAST(udf(cast(sum(cast(b as bigint)) as string)) AS BIGINT):bigint>
+struct<a:int,b:int,sum(b):bigint>
 -- !query 4 output
 1	1	1
 1	2	2
 -97,7 +97,7  struct<>

 -- !query 6
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY udf(course), year
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY course, year
 -- !query 6 schema
 struct<course:string,year:int,sum(earnings):bigint>
 -- !query 6 output
 -111,7 +111,7  dotNET	2013	48000

 -- !query 7
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, udf(year)
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, year
 -- !query 7 schema
 struct<course:string,year:int,sum(earnings):bigint>
 -- !query 7 output
 -127,9 +127,9  dotNET	2013	48000

 -- !query 8
-SELECT course, udf(year), SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year)
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year)
 -- !query 8 schema
-struct<course:string,CAST(udf(cast(year as string)) AS INT):int,sum(earnings):bigint>
+struct<course:string,year:int,sum(earnings):bigint>
 -- !query 8 output
 Java	NULL	50000
 NULL	2012	35000
 -138,26 +138,26  dotNET	NULL	63000

 -- !query 9
-SELECT course, year, udf(SUM(earnings)) FROM courseSales GROUP BY course, year GROUPING SETS(course)
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course)
 -- !query 9 schema
-struct<course:string,year:int,CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint>
+struct<course:string,year:int,sum(earnings):bigint>
 -- !query 9 output
 Java	NULL	50000
 dotNET	NULL	63000

 -- !query 10
-SELECT udf(course), year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year)
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year)
 -- !query 10 schema
-struct<CAST(udf(cast(course as string)) AS STRING):string,year:int,sum(earnings):bigint>
+struct<course:string,year:int,sum(earnings):bigint>
 -- !query 10 output
 NULL	2012	35000
 NULL	2013	78000

 -- !query 11
-SELECT course, udf(SUM(earnings)) AS sum FROM courseSales
-GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, udf(sum)
+SELECT course, SUM(earnings) AS sum FROM courseSales
+GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum
 -- !query 11 schema
 struct<course:string,sum:bigint>
 -- !query 11 output
 -173,7 +173,7  dotNET	63000

 -- !query 12
 SELECT course, SUM(earnings) AS sum, GROUPING_ID(course, earnings) FROM courseSales
-GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY udf(course), sum
+GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum
 -- !query 12 schema
 struct<course:string,sum:bigint,grouping_id(course, earnings):int>
 -- !query 12 output
 -188,10 +188,10  dotNET	63000	1

 -- !query 13
-SELECT udf(course), udf(year), GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales
+SELECT course, year, GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales
 GROUP BY CUBE(course, year)
 -- !query 13 schema
-struct<CAST(udf(cast(course as string)) AS STRING):string,CAST(udf(cast(year as string)) AS INT):int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int>
+struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int>
 -- !query 13 output
 Java	2012	0	0	0
 Java	2013	0	0	0
 -205,7 +205,7  dotNET	NULL	0	1	1

 -- !query 14
-SELECT course, udf(year), GROUPING(course) FROM courseSales GROUP BY course, year
+SELECT course, year, GROUPING(course) FROM courseSales GROUP BY course, year
 -- !query 14 schema
 struct<>
 -- !query 14 output
 -214,7 +214,7  grouping() can only be used with GroupingSets/Cube/Rollup;

 -- !query 15
-SELECT course, udf(year), GROUPING_ID(course, year) FROM courseSales GROUP BY course, year
+SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY course, year
 -- !query 15 schema
 struct<>
 -- !query 15 output
 -223,7 +223,7  grouping_id() can only be used with GroupingSets/Cube/Rollup;

 -- !query 16
-SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, udf(year)
+SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year
 -- !query 16 schema
 struct<course:string,year:int,grouping__id:int>
 -- !query 16 output
 -240,7 +240,7  NULL	NULL	3

 -- !query 17
 SELECT course, year FROM courseSales GROUP BY CUBE(course, year)
-HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, udf(year)
+HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, year
 -- !query 17 schema
 struct<course:string,year:int>
 -- !query 17 output
 -250,7 +250,7  dotNET	NULL

 -- !query 18
-SELECT course, udf(year) FROM courseSales GROUP BY course, year HAVING GROUPING(course) > 0
+SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING(course) > 0
 -- !query 18 schema
 struct<>
 -- !query 18 output
 -259,7 +259,7  grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;

 -- !query 19
-SELECT course, udf(udf(year)) FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0
+SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0
 -- !query 19 schema
 struct<>
 -- !query 19 output
 -268,9 +268,9  grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;

 -- !query 20
-SELECT udf(course), year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0
+SELECT course, year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0
 -- !query 20 schema
-struct<CAST(udf(cast(course as string)) AS STRING):string,year:int>
+struct<course:string,year:int>
 -- !query 20 output
 Java	NULL
 NULL	2012
 -281,7 +281,7  dotNET	NULL

 -- !query 21
 SELECT course, year, GROUPING(course), GROUPING(year) FROM courseSales GROUP BY CUBE(course, year)
-ORDER BY GROUPING(course), GROUPING(year), course, udf(year)
+ORDER BY GROUPING(course), GROUPING(year), course, year
 -- !query 21 schema
 struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint>
 -- !query 21 output
 -298,7 +298,7  NULL	NULL	1	1

 -- !query 22
 SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY CUBE(course, year)
-ORDER BY GROUPING(course), GROUPING(year), course, udf(year)
+ORDER BY GROUPING(course), GROUPING(year), course, year
 -- !query 22 schema
 struct<course:string,year:int,grouping_id(course, year):int>
 -- !query 22 output
 -314,7 +314,7  NULL	NULL	3

 -- !query 23
-SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING(course)
+SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING(course)
 -- !query 23 schema
 struct<>
 -- !query 23 output
 -323,7 +323,7  grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;

 -- !query 24
-SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING_ID(course)
+SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING_ID(course)
 -- !query 24 schema
 struct<>
 -- !query 24 output
 -332,7 +332,7  grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;

 -- !query 25
-SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, udf(course), year
+SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year
 -- !query 25 schema
 struct<course:string,year:int>
 -- !query 25 output
 -348,7 +348,7  NULL	NULL

 -- !query 26
-SELECT udf(a + b) AS k1, udf(b) AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2)
+SELECT a + b AS k1, b AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2)
 -- !query 26 schema
 struct<k1:int,k2:int,sum((a - b)):bigint>
 -- !query 26 output
 -368,7 +368,7  NULL	NULL	3

 -- !query 27
-SELECT udf(udf(a + b)) AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b)
+SELECT a + b AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b)
 -- !query 27 schema
 struct<k:int,b:int,sum((a - b)):bigint>
 -- !query 27 output
 -386,9 +386,9  NULL	NULL	3

 -- !query 28
-SELECT udf(a + b), udf(udf(b)) AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k)
+SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k)
 -- !query 28 schema
-struct<CAST(udf(cast((a + b) as string)) AS INT):int,k:int,sum((a - b)):bigint>
+struct<(a + b):int,k:int,sum((a - b)):bigint>
 -- !query 28 output
 NULL	1	3
 NULL	2	0

```

</p>
</details>

## How was this patch tested?

Tested as guided in SPARK-27921.
Verified pandas & pyarrow versions:
```$python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import pyarrow
>>> pyarrow.__version__
'0.14.0'
>>> pandas.__version__
'0.24.2'
```
From the sql output it seems that sql statements are evaluated correctly given that udf returns a string and may change results as Null will be returned as None and will be counted in returned values.

Closes #25196 from skonto/group-analytics.sql.

Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-20 15:19:57 +09:00
huangtianhua aeec6a7b28 [SPARK-28433][SQL][TEST] Remove hardware-dependent 0.0/0.0 and NaN comparison assertions
## What changes were proposed in this pull request?

This PR removes a few hardware-dependent assertions which can cause a failure in `aarch64`.

**x86_64**
```
rootdonotdel-openlab-allinone-l00242678:/home/ubuntu# uname -a
Linux donotdel-openlab-allinone-l00242678 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res0: Int = -4194304
scala> floatToRawIntBits(Float.NaN)
res1: Int = 2143289344
```

**aarch64**
```
[rootarm-huangtianhua spark]# uname -a
Linux arm-huangtianhua 4.14.0-49.el7a.aarch64 #1 SMP Tue Apr 10 17:22:26 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux

scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res1: Int = 2143289344
scala> floatToRawIntBits(Float.NaN)
res2: Int = 2143289344
```

## How was this patch tested?

Pass the Jenkins (This removes the test coverage).

Closes #25186 from huangtianhua/special-test-case-for-aarch64.

Authored-by: huangtianhua <huangtianhua@huawei.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-19 16:02:12 -07:00
Jungtaek Lim (HeartSaVioR) 4196d7bd34 [SPARK-28199][SS][FOLLOWUP] Remove unnecessary annotations for private API
## What changes were proposed in this pull request?

SPARK-28199 (#24996) hid implementations of Triggers into `private[sql]` and encourage end users to use `Trigger.xxx` methods instead.

As I got some post review comment on 7548a8826d (r34366934) we could remove annotations which are meant to be used with public API.

## How was this patch tested?

N/A

Closes #25200 from HeartSaVioR/SPARK-28199-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-19 08:26:42 -07:00
Terry Kim 453cbf3dd8 [SPARK-28284][SQL][PYTHON][TESTS] Convert and port 'join-empty-relation.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from `join-empty-relation.sql` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'join-empty-relation.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/join-empty-relation.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-join-empty-relation.sql.out
index 857073a827..e79d01fb14 100644
--- a/sql/core/src/test/resources/sql-tests/results/join-empty-relation.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-join-empty-relation.sql.out
 -27,111 +27,111  struct<>

 -- !query 3
-SELECT * FROM t1 INNER JOIN empty_table
+SELECT udf(t1.a), udf(empty_table.a) FROM t1 INNER JOIN empty_table ON (udf(t1.a) = udf(udf(empty_table.a)))
 -- !query 3 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
 -- !query 3 output

 -- !query 4
-SELECT * FROM t1 CROSS JOIN empty_table
+SELECT udf(t1.a), udf(udf(empty_table.a)) FROM t1 CROSS JOIN empty_table ON (udf(udf(t1.a)) = udf(empty_table.a))
 -- !query 4 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
 -- !query 4 output

 -- !query 5
-SELECT * FROM t1 LEFT OUTER JOIN empty_table
+SELECT udf(udf(t1.a)), empty_table.a FROM t1 LEFT OUTER JOIN empty_table ON (udf(t1.a) = udf(empty_table.a))
 -- !query 5 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int,a:int>
 -- !query 5 output
 1	NULL

 -- !query 6
-SELECT * FROM t1 RIGHT OUTER JOIN empty_table
+SELECT udf(t1.a), udf(empty_table.a) FROM t1 RIGHT OUTER JOIN empty_table ON (udf(t1.a) = udf(empty_table.a))
 -- !query 6 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
 -- !query 6 output

 -- !query 7
-SELECT * FROM t1 FULL OUTER JOIN empty_table
+SELECT udf(t1.a), empty_table.a FROM t1 FULL OUTER JOIN empty_table ON (udf(t1.a) = udf(empty_table.a))
 -- !query 7 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,a:int>
 -- !query 7 output
 1	NULL

 -- !query 8
-SELECT * FROM t1 LEFT SEMI JOIN empty_table
+SELECT udf(udf(t1.a)) FROM t1 LEFT SEMI JOIN empty_table ON (udf(t1.a) = udf(udf(empty_table.a)))
 -- !query 8 schema
-struct<a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
 -- !query 8 output

 -- !query 9
-SELECT * FROM t1 LEFT ANTI JOIN empty_table
+SELECT udf(t1.a) FROM t1 LEFT ANTI JOIN empty_table ON (udf(t1.a) = udf(empty_table.a))
 -- !query 9 schema
-struct<a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
 -- !query 9 output
 1

 -- !query 10
-SELECT * FROM empty_table INNER JOIN t1
+SELECT udf(empty_table.a), udf(t1.a) FROM empty_table INNER JOIN t1 ON (udf(udf(empty_table.a)) = udf(t1.a))
 -- !query 10 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
 -- !query 10 output

 -- !query 11
-SELECT * FROM empty_table CROSS JOIN t1
+SELECT udf(empty_table.a), udf(udf(t1.a)) FROM empty_table CROSS JOIN t1 ON (udf(empty_table.a) = udf(udf(t1.a)))
 -- !query 11 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
 -- !query 11 output

 -- !query 12
-SELECT * FROM empty_table LEFT OUTER JOIN t1
+SELECT udf(udf(empty_table.a)), udf(t1.a) FROM empty_table LEFT OUTER JOIN t1 ON (udf(empty_table.a) = udf(t1.a))
 -- !query 12 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
 -- !query 12 output

 -- !query 13
-SELECT * FROM empty_table RIGHT OUTER JOIN t1
+SELECT empty_table.a, udf(t1.a) FROM empty_table RIGHT OUTER JOIN t1 ON (udf(empty_table.a) = udf(t1.a))
 -- !query 13 schema
-struct<a:int,a:int>
+struct<a:int,CAST(udf(cast(a as string)) AS INT):int>
 -- !query 13 output
 NULL	1

 -- !query 14
-SELECT * FROM empty_table FULL OUTER JOIN t1
+SELECT empty_table.a, udf(udf(t1.a)) FROM empty_table FULL OUTER JOIN t1 ON (udf(empty_table.a) = udf(t1.a))
 -- !query 14 schema
-struct<a:int,a:int>
+struct<a:int,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
 -- !query 14 output
 NULL	1

 -- !query 15
-SELECT * FROM empty_table LEFT SEMI JOIN t1
+SELECT udf(udf(empty_table.a)) FROM empty_table LEFT SEMI JOIN t1 ON (udf(empty_table.a) = udf(udf(t1.a)))
 -- !query 15 schema
-struct<a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
 -- !query 15 output

 -- !query 16
-SELECT * FROM empty_table LEFT ANTI JOIN t1
+SELECT empty_table.a FROM empty_table LEFT ANTI JOIN t1 ON (udf(empty_table.a) = udf(t1.a))
 -- !query 16 schema
 struct<a:int>
 -- !query 16 output
 -139,56 +139,56  struct<a:int>

 -- !query 17
-SELECT * FROM empty_table INNER JOIN empty_table
+SELECT udf(empty_table.a) FROM empty_table INNER JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(udf(empty_table2.a)))
 -- !query 17 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
 -- !query 17 output

 -- !query 18
-SELECT * FROM empty_table CROSS JOIN empty_table
+SELECT udf(udf(empty_table.a)) FROM empty_table CROSS JOIN empty_table AS empty_table2 ON (udf(udf(empty_table.a)) = udf(empty_table2.a))
 -- !query 18 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
 -- !query 18 output

 -- !query 19
-SELECT * FROM empty_table LEFT OUTER JOIN empty_table
+SELECT udf(empty_table.a) FROM empty_table LEFT OUTER JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(empty_table2.a))
 -- !query 19 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
 -- !query 19 output

 -- !query 20
-SELECT * FROM empty_table RIGHT OUTER JOIN empty_table
+SELECT udf(udf(empty_table.a)) FROM empty_table RIGHT OUTER JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(udf(empty_table2.a)))
 -- !query 20 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
 -- !query 20 output

 -- !query 21
-SELECT * FROM empty_table FULL OUTER JOIN empty_table
+SELECT udf(empty_table.a) FROM empty_table FULL OUTER JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(empty_table2.a))
 -- !query 21 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
 -- !query 21 output

 -- !query 22
-SELECT * FROM empty_table LEFT SEMI JOIN empty_table
+SELECT udf(udf(empty_table.a)) FROM empty_table LEFT SEMI JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(empty_table2.a))
 -- !query 22 schema
-struct<a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
 -- !query 22 output

 -- !query 23
-SELECT * FROM empty_table LEFT ANTI JOIN empty_table
+SELECT udf(empty_table.a) FROM empty_table LEFT ANTI JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(empty_table2.a))
 -- !query 23 schema
-struct<a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
 -- !query 23 output

```
</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25127 from imback82/join-empty-relation-sql.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-19 16:27:26 +09:00
Ievgen Prokhorenko 52ddf038ec [SPARK-28440][MLLIB][TEST] Use TestingUtils to compare floating point values
## What changes were proposed in this pull request?

Use `org.apache.spark.mllib.util.TestingUtils` object across `MLLIB` component to compare floating point values in tests.

## How was this patch tested?

`build/mvn test` - existing tests against updated code.

Closes #25191 from eugen-prokhorenko/mllib-testingutils-double-comparison.

Authored-by: Ievgen Prokhorenko <eugen.prokhorenko@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-18 23:48:12 -07:00
Liang-Chi Hsieh 127bc899ae [SPARK-27707][SQL] Prune unnecessary nested fields from Generate
## What changes were proposed in this pull request?

Performance issue using explode was found when a complex field contains huge array is to get duplicated as the number of exploded array elements. Given example:

```scala
val df = spark.sparkContext.parallelize(Seq(("1",
  Array.fill(M)({
    val i = math.random
    (i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
  })))).toDF("col", "arr")
  .selectExpr("col", "struct(col, arr) as st")
  .selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")
```

The explode causes `st` to be duplicated as many as the exploded elements.

Benchmarks it:

```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU  2.20GHz
[info] generate big nested struct array:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off          52668          53162         699          0.0      877803.4       1.0X
[info] generate big nested struct array wholestage on          47261          49093        1125          0.0      787690.2       1.1X
[info]
```

The query plan:
```
== Physical Plan ==
 Project [col#508, st#512.col AS col1#515, arr_col#519]
 +- Generate explode(st#512.arr), [col#508, st#512], false, [arr_col#519]
    +- Project [_1#503 AS col#508, named_struct(col, _1#503, arr, _2#504) AS st#512]
       +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#503, mapobjects(MapObjects_loopValue84, MapObjects_loopIsNull84,      ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true)))     null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String,     StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._3, true,  false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84,   MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#504]
          +- Scan[obj#534]
```

This patch takes nested column pruning approach to prune unnecessary nested fields. It adds a projection of the needed nested fields as aliases on the child of `Generate`, and substitutes them by alias attributes on the projection on top of `Generate`.

Benchmarks it after the change:
```
 [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
 [info] Intel(R) Core(TM) i7-8750H CPU  2.20GHz
 [info] generate big nested struct array:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 [info] ------------------------------------------------------------------------------------------------------------------------
 [info] generate big nested struct array wholestage off            311            331          28          0.2        5188.6       1.0X
 [info] generate big nested struct array wholestage on            297            312          15          0.2        4947.3       1.0X
 [info]
```

The query plan:
```
== Physical Plan ==
 Project [col#592, _gen_alias_608#608 AS col1#599, arr_col#603]
 +- Generate explode(st#596.arr), [col#592, _gen_alias_608#608], false, [arr_col#603]
    +- Project [_1#587 AS col#592, named_struct(col, _1#587, arr, _2#588) AS st#596, _1#587 AS _gen_alias_608#608]
       +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(in
 put[0, scala.Tuple2, true]))._1, true, false) AS _1#587, mapobjects(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4),
 if (isnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))) null else named_struct(_1,        staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102,              MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String,    StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._2,      true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString,                                                 knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._3, true, false), _4,            staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102,              MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2,      None) AS _2#588]
          +- Scan[obj#586]
```

This behavior is controlled by a SQL config `spark.sql.optimizer.expression.nestedPruning.enabled`.

## How was this patch tested?

Added benchmark.

Closes #24637 from viirya/SPARK-27707.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-18 23:32:07 -07:00
HyukjinKwon 0512af1668 [SPARK-28389][SQL][FOLLOW-UP] Use one example in 'add_months' behavior change at migration guide
## What changes were proposed in this pull request?

This PR proposes to add one example to describe 'add_months'  behaviour change by https://github.com/apache/spark/pull/25153.

**Spark 2.4:**

```sql
select add_months(DATE'2019-02-28', 1)
```

```
+--------------------------------+
|add_months(DATE '2019-02-28', 1)|
+--------------------------------+
|                      2019-03-31|
+--------------------------------+
```

**Current master:**

```sql
select add_months(DATE'2019-02-28', 1)
```

```
+--------------------------------+
|add_months(DATE '2019-02-28', 1)|
+--------------------------------+
|                      2019-03-28|
+--------------------------------+
```

## How was this patch tested?

Manually tested on Spark 2.4.1 and the current master.

Closes #25199 from HyukjinKwon/SPARK-28389.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-19 14:29:16 +09:00
Huaxin Gao cd676e9f5e [SPARK-28277][SQL][PYTHON][TESTS] Convert and port 'except.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from ```except.sql``` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'except.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/except.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out
index c9b712d4d2..27ca7ea226 100644
--- a/sql/core/src/test/resources/sql-tests/results/except.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out
 -30,16 +30,16  struct<>

 -- !query 2
-SELECT * FROM t1 EXCEPT SELECT * FROM t2
+SELECT udf(k), udf(v) FROM t1 EXCEPT SELECT udf(k), udf(v) FROM t2
 -- !query 2 schema
-struct<k:string,v:int>
+struct<CAST(udf(cast(k as string)) AS STRING):string,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 2 output
 three  3
 two    2

 -- !query 3
-SELECT * FROM t1 EXCEPT SELECT * FROM t1 where v <> 1 and v <> 2
+SELECT * FROM t1 EXCEPT SELECT * FROM t1 where udf(v) <> 1 and v <> udf(2)
 -- !query 3 schema
 struct<k:string,v:int>
 -- !query 3 output
 -49,7 +49,7  two   2

 -- !query 4
-SELECT * FROM t1 where v <> 1 and v <> 22 EXCEPT SELECT * FROM t1 where v <> 2 and v >= 3
+SELECT * FROM t1 where udf(v) <> 1 and v <> udf(22) EXCEPT SELECT * FROM t1 where udf(v) <> 2 and v >= udf(3)
 -- !query 4 schema
 struct<k:string,v:int>
 -- !query 4 output
 -59,7 +59,7  two   2
 -- !query 5
 SELECT t1.* FROM t1, t2 where t1.k = t2.k
 EXCEPT
-SELECT t1.* FROM t1, t2 where t1.k = t2.k and t1.k != 'one'
+SELECT t1.* FROM t1, t2 where t1.k = t2.k and t1.k != udf('one')
 -- !query 5 schema
 struct<k:string,v:int>
 -- !query 5 output
 -68,7 +68,7  one   NULL

 -- !query 6
-SELECT * FROM t2 where v >= 1 and v <> 22 EXCEPT SELECT * FROM t1
+SELECT * FROM t2 where v >= udf(1) and udf(v) <> 22 EXCEPT SELECT * FROM t1
 -- !query 6 schema
 struct<k:string,v:int>
 -- !query 6 output
 -77,9 +77,9  one   5

 -- !query 7
-SELECT (SELECT min(k) FROM t2 WHERE t2.k = t1.k) min_t2 FROM t1
+SELECT (SELECT min(udf(k)) FROM t2 WHERE t2.k = t1.k) min_t2 FROM t1
 MINUS
-SELECT (SELECT min(k) FROM t2) abs_min_t2 FROM t1 WHERE  t1.k = 'one'
+SELECT (SELECT udf(min(k)) FROM t2) abs_min_t2 FROM t1 WHERE  t1.k = udf('one')
 -- !query 7 schema
 struct<min_t2:string>
 -- !query 7 output
 -90,16 +90,17  two
 -- !query 8
 SELECT t1.k
 FROM   t1
-WHERE  t1.v <= (SELECT   max(t2.v)
+WHERE  t1.v <= (SELECT   udf(max(udf(t2.v)))
                 FROM     t2
-                WHERE    t2.k = t1.k)
+                WHERE    udf(t2.k) = udf(t1.k))
 MINUS
 SELECT t1.k
 FROM   t1
-WHERE  t1.v >= (SELECT   min(t2.v)
+WHERE  udf(t1.v) >= (SELECT   min(udf(t2.v))
                 FROM     t2
                 WHERE    t2.k = t1.k)
 -- !query 8 schema
-struct<k:string>
+struct<>
 -- !query 8 output
-two
+java.lang.UnsupportedOperationException
+Cannot evaluate expression: udf(cast(null as string))
```

</p>
</details>

## How was this patch tested?
Tested as guided in [SPARK-27921.](https://issues.apache.org/jira/browse/SPARK-27921)

Closes #25101 from huaxingao/spark-28277.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-19 13:44:26 +09:00
Huaxin Gao 20578e81a7 [SPARK-28285][SQL][PYTHON][TESTS] Convert and port 'outer-join.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from ```outer-join.sql``` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'outer-join.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/outer-join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-outer-join.sql.out
index 5db3bae5d0..819f786070 100644
--- a/sql/core/src/test/resources/sql-tests/results/outer-join.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-outer-join.sql.out
 -24,17 +24,17  struct<>

 -- !query 2
 SELECT
-  (SUM(COALESCE(t1.int_col1, t2.int_col0))),
-     ((COALESCE(t1.int_col1, t2.int_col0)) * 2)
+  (udf(SUM(udf(COALESCE(t1.int_col1, t2.int_col0))))),
+     (udf(COALESCE(t1.int_col1, t2.int_col0)) * 2)
 FROM t1
 RIGHT JOIN t2
-  ON (t2.int_col0) = (t1.int_col1)
-GROUP BY GREATEST(COALESCE(t2.int_col1, 109), COALESCE(t1.int_col1, -449)),
+  ON udf(t2.int_col0) = udf(t1.int_col1)
+GROUP BY udf(GREATEST(COALESCE(udf(t2.int_col1), 109), COALESCE(t1.int_col1, udf(-449)))),
          COALESCE(t1.int_col1, t2.int_col0)
-HAVING (SUM(COALESCE(t1.int_col1, t2.int_col0)))
-            > ((COALESCE(t1.int_col1, t2.int_col0)) * 2)
+HAVING (udf(SUM(COALESCE(udf(t1.int_col1), udf(t2.int_col0)))))
+            > (udf(COALESCE(t1.int_col1, t2.int_col0)) * 2)
 -- !query 2 schema
-struct<sum(coalesce(int_col1, int_col0)):bigint,(coalesce(int_col1, int_col0) * 2):int>
+struct<CAST(udf(cast(sum(cast(cast(udf(cast(coalesce(int_col1, int_col0) as string)) as int) as bigint)) as string)) AS BIGINT):bigint,(CAST(udf(cast(coalesce(int_col1, int_col0) as string)) AS INT) * 2):int>
 -- !query 2 output
 -367   -734
 -507   -1014
 -70,10 +70,10  spark.sql.crossJoin.enabled true
 SELECT *
 FROM (
 SELECT
-    COALESCE(t2.int_col1, t1.int_col1) AS int_col
+    udf(COALESCE(udf(t2.int_col1), udf(t1.int_col1))) AS int_col
     FROM t1
     LEFT JOIN t2 ON false
-) t where (t.int_col) is not null
+) t where (udf(t.int_col)) is not null
 -- !query 6 schema
 struct<int_col:int>
 -- !query 6 output
```

</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25103 from huaxingao/spark-28285.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-19 12:16:41 +09:00
Vinod KC d2598fee3b [SPARK-28287][SQL][PYTHON][TESTS] Convert and port 'udaf.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from 'udaf.sql' to test UDFs

<details><summary>Diff comparing to 'udaf.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udaf.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-udaf.sql.out
index f4455bb717..e1747f4667 100644
--- a/sql/core/src/test/resources/sql-tests/results/udaf.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-udaf.sql.out
 -3,6 +3,8

 -- !query 0
+-- This test file was converted from udaf.sql.
+
 CREATE OR REPLACE TEMPORARY VIEW t1 AS SELECT * FROM VALUES
 (1), (2), (3), (4)
 as t1(int_col1)
 -21,15 +23,15  struct<>

 -- !query 2
-SELECT default.myDoubleAvg(int_col1) as my_avg from t1
+SELECT default.myDoubleAvg(udf(int_col1)) as my_avg, udf(default.myDoubleAvg(udf(int_col1))) as my_avg2, udf(default.myDoubleAvg(int_col1)) as my_avg3 from t1
 -- !query 2 schema
-struct<my_avg:double>
+struct<my_avg:double,my_avg2:double,my_avg3:double>
 -- !query 2 output
-102.5
+102.5  102.5   102.5

 -- !query 3
-SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
+SELECT default.myDoubleAvg(udf(int_col1), udf(3)) as my_avg from t1
 -- !query 3 schema
 struct<>
 -- !query 3 output
 -46,12 +48,12  struct<>

 -- !query 5
-SELECT default.udaf1(int_col1) as udaf1 from t1
+SELECT default.udaf1(udf(int_col1)) as udaf1, udf(default.udaf1(udf(int_col1))) as udaf2, udf(default.udaf1(int_col1)) as udaf3 from t1
 -- !query 5 schema
 struct<>
 -- !query 5 output
 org.apache.spark.sql.AnalysisException
-Can not load class 'test.non.existent.udaf' when registering the function 'default.udaf1', please make sure it is on the classpath; line 1 pos 7
+Can not load class 'test.non.existent.udaf' when registering the function 'default.udaf1', please make sure it is on the classpath; line 1 pos 94

 -- !query 6
```

</p>
</details>

## How was this patch tested?

Tested as guided in SPARK-27921.

Closes #25194 from vinodkc/br_Fix_SPARK-27921_3.

Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-19 10:48:13 +09:00
Maxim Gekk 54e058dff2 [SPARK-28416][SQL] Use java.time API in timestampAddInterval
## What changes were proposed in this pull request?

The `DateTimeUtils.timestampAddInterval` method was rewritten by using Java 8 time API. To add months and microseconds, I used the `plusMonths()` and `plus()` methods of `ZonedDateTime`. Also the signature of `timestampAddInterval()` was changed to accept an `ZoneId` instance instead of `TimeZone`. Using `ZoneId` allows to avoid the conversion `TimeZone` -> `ZoneId` on every invoke of `timestampAddInterval()`.

## How was this patch tested?

By existing test suites `DateExpressionsSuite`, `TypeCoercionSuite` and `CollectionExpressionsSuite`.

Closes #25173 from MaxGekk/timestamp-add-interval.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-07-18 19:17:23 -04:00
Yuming Wang 0c21404f7c [SPARK-28312][SQL][TEST] Port numeric.sql
## What changes were proposed in this pull request?

This PR is to port numeric.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/numeric.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/numeric.out

When porting the test cases, found four PostgreSQL specific features that do not exist in Spark SQL:
[SPARK-28315](https://issues.apache.org/jira/browse/SPARK-28315): Decimal can not accept `NaN` as input
[SPARK-28317](https://issues.apache.org/jira/browse/SPARK-28317): Built-in Mathematical Functions: SCALE
[SPARK-28318](https://issues.apache.org/jira/browse/SPARK-28318): Decimal can only support precision up to 38
[SPARK-28322](https://issues.apache.org/jira/browse/SPARK-28322): DIV support decimal type

Also, found four inconsistent behavior:
[SPARK-28316](https://issues.apache.org/jira/browse/SPARK-28316): Decimal precision issue
[SPARK-28324](https://issues.apache.org/jira/browse/SPARK-28324): The LOG function using 10 as the base, but Spark using E
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL insert bad inputs to NULL
[SPARK-28007](https://issues.apache.org/jira/browse/SPARK-28007):  Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres

## How was this patch tested?

N/A

Closes #25092 from wangyum/SPARK-28312.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-18 13:49:51 -07:00
Josh Rosen 3776fbdfde [SPARK-28430][UI] Fix stage table rendering when some tasks' metrics are missing
## What changes were proposed in this pull request?

The Spark UI's stages table misrenders the input/output metrics columns when some tasks are missing input metrics. See the screenshot below for an example of the problem:

![image](https://user-images.githubusercontent.com/50748/61420042-a3abc100-a8b5-11e9-8a92-7986563ee712.png)

This is because those columns' are defined as

```scala
 {if (hasInput(stage)) {
  metricInfo(task) { m =>
    ...
   <td>....</td>
  }
}
```

where `metricInfo` renders the node returned by the closure in case metrics are defined or returns `Nil` in case metrics are not defined. If metrics are undefined then we'll fail to render the empty `<td></td>` tag, causing columns to become misaligned as shown in the screenshot.

To fix this, this patch changes this to

```scala
 {if (hasInput(stage)) {
  <td>{
    metricInfo(task) { m =>
      ...
     Unparsed(...)
    }
  }</td>
}
```

which is an idiom that's already in use for the shuffle read / write columns.

## How was this patch tested?

It isn't. I'm arguing for correctness because the modifications are consistent with rendering methods that work correctly for other columns.

Closes #25183 from JoshRosen/joshrosen/fix-task-table-with-partial-io-metrics.

Authored-by: Josh Rosen <rosenville@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-18 13:15:39 -07:00
zero323 a0c2fa63ab [SPARK-28439][PYTHON][SQL] Add support for count: Column in array_repeat
## What changes were proposed in this pull request?

This adds simple check for `count` argument:

- If it is a `Column` we apply `_to_java_column` before invoking JVM counterpart
- Otherwise we proceed as before.

## How was this patch tested?

Manual testing.

Closes #25193 from zero323/SPARK-28278.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-18 12:58:48 -07:00
Yuming Wang 2cf0491a97 [SPARK-28388][SQL][TEST] Port select_implicit.sql
## What changes were proposed in this pull request?

This PR is to port numeric.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/select_implicit.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/select_implicit.out

When porting the test cases, found one PostgreSQL specific features that do not exist in Spark SQL:
[SPARK-28329](https://issues.apache.org/jira/browse/SPARK-28329): SELECT INTO syntax

## How was this patch tested?

N/A

Closes #25152 from wangyum/SPARK-28388.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-18 08:58:27 -07:00
Yuming Wang 8acc22ca64 [SPARK-28138][SQL][TEST] Port timestamp.sql
## What changes were proposed in this pull request?

This PR is to port timestamp.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/timestamp.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/timestamp.out

When porting the test cases, found five PostgreSQL specific features that do not exist in Spark SQL:
[SPARK-28141](https://issues.apache.org/jira/browse/SPARK-28141): Timestamp type can not accept special values
[SPARK-28259](https://issues.apache.org/jira/browse/SPARK-28259): Date/Time Output Styles and Date Order Conventions
[SPARK-28425](https://issues.apache.org/jira/browse/SPARK-28425): Add more Date/Time Operators
[SPARK-28420](https://issues.apache.org/jira/browse/SPARK-28420): Date/Time Functions: date_part
[SPARK-28137](https://issues.apache.org/jira/browse/SPARK-28137): Data Type Formatting Functions
[SPARK-28432](https://issues.apache.org/jira/browse/SPARK-28432): Date/Time Functions: make_date/make_timestamp

Also, found one inconsistent behavior:
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL insert bad inputs to NULL

## How was this patch tested?

N/A

Closes #25181 from wangyum/SPARK-28138.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-18 08:50:31 -07:00
chitralverma 4b865104b3 [SPARK-28286][SQL][PYTHON][TESTS] Convert and port 'pivot.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from pivot.sql to test UDFs following the combination guide in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'pivot.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pivot.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-pivot.sql.out
index 9a8f783da4..cb9e4d736c 100644
--- a/sql/core/src/test/resources/sql-tests/results/pivot.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-pivot.sql.out
 -1,5 +1,5
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 32
+-- Number of queries: 30

 -- !query 0
 -40,14 +40,14  struct<>

 -- !query 3
 SELECT * FROM (
-  SELECT year, course, earnings FROM courseSales
+  SELECT udf(year), course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 3 schema
-struct<year:int,dotNET:bigint,Java:bigint>
+struct<CAST(udf(cast(year as string)) AS INT):int,dotNET:bigint,Java:bigint>
 -- !query 3 output
 2012   15000   20000
 2013   48000   30000
 -56,7 +56,7  struct<year:int,dotNET:bigint,Java:bigint>
 -- !query 4
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 4 schema
 -71,11 +71,11  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), avg(earnings)
+  udf(sum(earnings)), udf(avg(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 5 schema
-struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_avg(CAST(earnings AS BIGINT)):double,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_avg(CAST(earnings AS BIGINT)):double>
+struct<year:int,dotNET_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(avg(cast(earnings as bigint)) as string)) AS DOUBLE):double,Java_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(avg(cast(earnings as bigint)) as string)) AS DOUBLE):double>
 -- !query 5 output
 2012   15000   7500.0  20000   20000.0
 2013   48000   48000.0 30000   30000.0
 -83,10 +83,10  struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_avg(CAST(earn

 -- !query 6
 SELECT * FROM (
-  SELECT course, earnings FROM courseSales
+  SELECT udf(course) as course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 6 schema
 -100,23 +100,23  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), min(year)
+  udf(sum(udf(earnings))), udf(min(year))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 7 schema
-struct<dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_min(year):int,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_min(year):int>
+struct<dotNET_CAST(udf(cast(sum(cast(cast(udf(cast(earnings as string)) as int) as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(min(year) as string)) AS INT):int,Java_CAST(udf(cast(sum(cast(cast(udf(cast(earnings as string)) as int) as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(min(year) as string)) AS INT):int>
 -- !query 7 output
 63000  2012    50000   2012

 -- !query 8
 SELECT * FROM (
-  SELECT course, year, earnings, s
+  SELECT course, year, earnings, udf(s) as s
   FROM courseSales
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR s IN (1, 2)
 )
 -- !query 8 schema
 -135,11 +135,11  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings), min(s)
+  udf(sum(earnings)), udf(min(s))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 9 schema
-struct<year:int,dotNET_sum(CAST(earnings AS BIGINT)):bigint,dotNET_min(s):int,Java_sum(CAST(earnings AS BIGINT)):bigint,Java_min(s):int>
+struct<year:int,dotNET_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,dotNET_CAST(udf(cast(min(s) as string)) AS INT):int,Java_CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint,Java_CAST(udf(cast(min(s) as string)) AS INT):int>
 -- !query 9 output
 2012   15000   1       20000   1
 2013   48000   2       30000   2
 -152,7 +152,7  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings * s)
+  udf(sum(earnings * s))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 10 schema
 -167,7 +167,7  SELECT 2012_s, 2013_s, 2012_a, 2013_a, c FROM (
   SELECT year y, course c, earnings e FROM courseSales
 )
 PIVOT (
-  sum(e) s, avg(e) a
+  udf(sum(e)) s, udf(avg(e)) a
   FOR y IN (2012, 2013)
 )
 -- !query 11 schema
 -182,7 +182,7  SELECT firstYear_s, secondYear_s, firstYear_a, secondYear_a, c FROM (
   SELECT year y, course c, earnings e FROM courseSales
 )
 PIVOT (
-  sum(e) s, avg(e) a
+  udf(sum(e)) s, udf(avg(e)) a
   FOR y IN (2012 as firstYear, 2013 secondYear)
 )
 -- !query 12 schema
 -195,7 +195,7  struct<firstYear_s:bigint,secondYear_s:bigint,firstYear_a:double,secondYear_a:do
 -- !query 13
 SELECT * FROM courseSales
 PIVOT (
-  abs(earnings)
+  udf(abs(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 13 schema
 -210,7 +210,7  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings), year
+  udf(sum(earnings)), year
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 14 schema
 -225,7 +225,7  SELECT * FROM (
   SELECT course, earnings FROM courseSales
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (2012, 2013)
 )
 -- !query 15 schema
 -240,11 +240,11  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  ceil(sum(earnings)), avg(earnings) + 1 as a1
+  udf(ceil(udf(sum(earnings)))), avg(earnings) + 1 as a1
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 16 schema
-struct<year:int,dotNET_CEIL(sum(CAST(earnings AS BIGINT))):bigint,dotNET_a1:double,Java_CEIL(sum(CAST(earnings AS BIGINT))):bigint,Java_a1:double>
+struct<year:int,dotNET_CAST(udf(cast(CEIL(cast(udf(cast(sum(cast(earnings as bigint)) as string)) as bigint)) as string)) AS BIGINT):bigint,dotNET_a1:double,Java_CAST(udf(cast(CEIL(cast(udf(cast(sum(cast(earnings as bigint)) as string)) as bigint)) as string)) AS BIGINT):bigint,Java_a1:double>
 -- !query 16 output
 2012   15000   7501.0  20000   20001.0
 2013   48000   48001.0 30000   30001.0
 -255,7 +255,7  SELECT * FROM (
   SELECT year, course, earnings FROM courseSales
 )
 PIVOT (
-  sum(avg(earnings))
+  sum(udf(avg(earnings)))
   FOR course IN ('dotNET', 'Java')
 )
 -- !query 17 schema
 -272,7 +272,7  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, year) IN (('dotNET', 2012), ('Java', 2013))
 )
 -- !query 18 schema
 -289,7 +289,7  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, s) IN (('dotNET', 2) as c1, ('Java', 1) as c2)
 )
 -- !query 19 schema
 -306,7 +306,7  SELECT * FROM (
   JOIN years ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, year) IN ('dotNET', 'Java')
 )
 -- !query 20 schema
 -319,7 +319,7  Invalid pivot value 'dotNET': value data type string does not match pivot column
 -- !query 21
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (s, 2013)
 )
 -- !query 21 schema
 -332,7 +332,7  cannot resolve '`s`' given input columns: [coursesales.course, coursesales.earni
 -- !query 22
 SELECT * FROM courseSales
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR year IN (course, 2013)
 )
 -- !query 22 schema
 -343,151 +343,118  Literal expressions required for pivot values, found 'course#x';

 -- !query 23
-SELECT * FROM (
-  SELECT course, year, a
-  FROM courseSales
-  JOIN yearsWithComplexTypes ON year = y
-)
-PIVOT (
-  min(a)
-  FOR course IN ('dotNET', 'Java')
-)
--- !query 23 schema
-struct<year:int,dotNET:array<int>,Java:array<int>>
--- !query 23 output
-2012   [1,1]   [1,1]
-2013   [2,2]   [2,2]
-
-
--- !query 24
-SELECT * FROM (
-  SELECT course, year, y, a
-  FROM courseSales
-  JOIN yearsWithComplexTypes ON year = y
-)
-PIVOT (
-  max(a)
-  FOR (y, course) IN ((2012, 'dotNET'), (2013, 'Java'))
-)
--- !query 24 schema
-struct<year:int,[2012, dotNET]:array<int>,[2013, Java]:array<int>>
--- !query 24 output
-2012   [1,1]   NULL
-2013   NULL    [2,2]
-
-
--- !query 25
 SELECT * FROM (
   SELECT earnings, year, a
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR a IN (array(1, 1), array(2, 2))
 )
--- !query 25 schema
+-- !query 23 schema
 struct<year:int,[1, 1]:bigint,[2, 2]:bigint>
--- !query 25 output
+-- !query 23 output
 2012   35000   NULL
 2013   NULL    78000

--- !query 26
+-- !query 24
 SELECT * FROM (
-  SELECT course, earnings, year, a
+  SELECT course, earnings, udf(year) as year, a
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, a) IN (('dotNET', array(1, 1)), ('Java', array(2, 2)))
 )
--- !query 26 schema
+-- !query 24 schema
 struct<year:int,[dotNET, [1, 1]]:bigint,[Java, [2, 2]]:bigint>
--- !query 26 output
+-- !query 24 output
 2012   15000   NULL
 2013   NULL    30000

--- !query 27
+-- !query 25
 SELECT * FROM (
   SELECT earnings, year, s
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR s IN ((1, 'a'), (2, 'b'))
 )
--- !query 27 schema
+-- !query 25 schema
 struct<year:int,[1, a]:bigint,[2, b]:bigint>
--- !query 27 output
+-- !query 25 output
 2012   35000   NULL
 2013   NULL    78000

--- !query 28
+-- !query 26
 SELECT * FROM (
   SELECT course, earnings, year, s
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, s) IN (('dotNET', (1, 'a')), ('Java', (2, 'b')))
 )
--- !query 28 schema
+-- !query 26 schema
 struct<year:int,[dotNET, [1, a]]:bigint,[Java, [2, b]]:bigint>
--- !query 28 output
+-- !query 26 output
 2012   15000   NULL
 2013   NULL    30000

--- !query 29
+-- !query 27
 SELECT * FROM (
   SELECT earnings, year, m
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR m IN (map('1', 1), map('2', 2))
 )
--- !query 29 schema
+-- !query 27 schema
 struct<>
--- !query 29 output
+-- !query 27 output
 org.apache.spark.sql.AnalysisException
 Invalid pivot column 'm#x'. Pivot columns must be comparable.;

--- !query 30
+-- !query 28
 SELECT * FROM (
   SELECT course, earnings, year, m
   FROM courseSales
   JOIN yearsWithComplexTypes ON year = y
 )
 PIVOT (
-  sum(earnings)
+  udf(sum(earnings))
   FOR (course, m) IN (('dotNET', map('1', 1)), ('Java', map('2', 2)))
 )
--- !query 30 schema
+-- !query 28 schema
 struct<>
--- !query 30 output
+-- !query 28 output
 org.apache.spark.sql.AnalysisException
 Invalid pivot column 'named_struct(course, course#x, m, m#x)'. Pivot columns must be comparable.;

--- !query 31
+-- !query 29
 SELECT * FROM (
-  SELECT course, earnings, "a" as a, "z" as z, "b" as b, "y" as y, "c" as c, "x" as x, "d" as d, "w" as w
+  SELECT course, earnings, udf("a") as a, udf("z") as z, udf("b") as b, udf("y") as y,
+  udf("c") as c, udf("x") as x, udf("d") as d, udf("w") as w
   FROM courseSales
 )
 PIVOT (
-  sum(Earnings)
+  udf(sum(Earnings))
   FOR Course IN ('dotNET', 'Java')
 )
--- !query 31 schema
+-- !query 29 schema
 struct<a:string,z:string,b:string,y:string,c:string,x:string,d:string,w:string,dotNET:bigint,Java:bigint>
--- !query 31 output
+-- !query 29 output
 a      z       b       y       c       x       d       w       63000   50000

```

</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25122 from chitralverma/SPARK-28286.

Authored-by: chitralverma <chitralverma@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-18 22:19:14 +09:00
Terry Kim eaaf1aa2ac [SPARK-28278][SQL][PYTHON][TESTS] Convert and port 'except-all.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from `except-all.sql` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'except-all.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/except-all.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-except-all.sql.out
index 01091a2f75..b7bfad0e53 100644
--- a/sql/core/src/test/resources/sql-tests/results/except-all.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-except-all.sql.out
 -49,11 +49,11  struct<>

 -- !query 4
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
-SELECT * FROM tab2
+SELECT udf(c1) FROM tab2
 -- !query 4 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 4 output
 0
 2
 -62,11 +62,11  NULL

 -- !query 5
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 MINUS ALL
-SELECT * FROM tab2
+SELECT udf(c1) FROM tab2
 -- !query 5 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 5 output
 0
 2
 -75,11 +75,11  NULL

 -- !query 6
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
-SELECT * FROM tab2 WHERE c1 IS NOT NULL
+SELECT udf(c1) FROM tab2 WHERE udf(c1) IS NOT NULL
 -- !query 6 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 6 output
 0
 2
 -89,21 +89,21  NULL

 -- !query 7
-SELECT * FROM tab1 WHERE c1 > 5
+SELECT udf(c1) FROM tab1 WHERE udf(c1) > 5
 EXCEPT ALL
-SELECT * FROM tab2
+SELECT udf(c1) FROM tab2
 -- !query 7 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 7 output

 -- !query 8
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
-SELECT * FROM tab2 WHERE c1 > 6
+SELECT udf(c1) FROM tab2 WHERE udf(c1 > udf(6))
 -- !query 8 schema
-struct<c1:int>
+struct<CAST(udf(cast(c1 as string)) AS INT):int>
 -- !query 8 output
 0
 1
 -117,11 +117,11  NULL

 -- !query 9
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
-SELECT CAST(1 AS BIGINT)
+SELECT CAST(udf(1) AS BIGINT)
 -- !query 9 schema
-struct<c1:bigint>
+struct<CAST(udf(cast(c1 as string)) AS INT):bigint>
 -- !query 9 output
 0
 2
 -134,7 +134,7  NULL

 -- !query 10
-SELECT * FROM tab1
+SELECT udf(c1) FROM tab1
 EXCEPT ALL
 SELECT array(1)
 -- !query 10 schema
 -145,62 +145,62  ExceptAll can only be performed on tables with the compatible column types. arra

 -- !query 11
-SELECT * FROM tab3
+SELECT udf(k), v FROM tab3
 EXCEPT ALL
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 -- !query 11 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 11 output
 1	2
 1	3

 -- !query 12
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 EXCEPT ALL
-SELECT * FROM tab3
+SELECT udf(k), v FROM tab3
 -- !query 12 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 12 output
 2	2
 2	20

 -- !query 13
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 EXCEPT ALL
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 INTERSECT DISTINCT
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 -- !query 13 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 13 output
 2	2
 2	20

 -- !query 14
-SELECT * FROM tab4
+SELECT udf(k), v FROM tab4
 EXCEPT ALL
-SELECT * FROM tab3
+SELECT k, udf(v) FROM tab3
 EXCEPT DISTINCT
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 -- !query 14 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 14 output

 -- !query 15
-SELECT * FROM tab3
+SELECT k, udf(v) FROM tab3
 EXCEPT ALL
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 UNION ALL
-SELECT * FROM tab3
+SELECT udf(k), v FROM tab3
 EXCEPT DISTINCT
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 -- !query 15 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 15 output
 1	3

 -217,83 +217,83  ExceptAll can only be performed on tables with the same number of columns, but t

 -- !query 17
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 EXCEPT ALL
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 UNION
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 EXCEPT DISTINCT
-SELECT * FROM tab4
+SELECT udf(k), udf(v) FROM tab4
 -- !query 17 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 17 output
 1	3

 -- !query 18
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 MINUS ALL
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 UNION
-SELECT * FROM tab3
+SELECT udf(k), udf(v) FROM tab3
 MINUS DISTINCT
-SELECT * FROM tab4
+SELECT k, udf(v) FROM tab4
 -- !query 18 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 18 output
 1	3

 -- !query 19
-SELECT * FROM tab3
+SELECT k, udf(v) FROM tab3
 EXCEPT ALL
-SELECT * FROM tab4
+SELECT udf(k), v FROM tab4
 EXCEPT DISTINCT
-SELECT * FROM tab3
+SELECT k, udf(v) FROM tab3
 EXCEPT DISTINCT
-SELECT * FROM tab4
+SELECT udf(k), v FROM tab4
 -- !query 19 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 19 output

 -- !query 20
 SELECT *
-FROM   (SELECT tab3.k,
-               tab4.v
+FROM   (SELECT tab3.k,
+               udf(tab4.v)
         FROM   tab3
                JOIN tab4
-                 ON tab3.k = tab4.k)
+                 ON udf(tab3.k) = tab4.k)
 EXCEPT ALL
 SELECT *
-FROM   (SELECT tab3.k,
-               tab4.v
+FROM   (SELECT udf(tab3.k),
+               tab4.v
         FROM   tab3
                JOIN tab4
-                 ON tab3.k = tab4.k)
+                 ON tab3.k = udf(tab4.k))
 -- !query 20 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 20 output

 -- !query 21
 SELECT *
-FROM   (SELECT tab3.k,
-               tab4.v
+FROM   (SELECT udf(udf(tab3.k)),
+               udf(tab4.v)
         FROM   tab3
                JOIN tab4
-                 ON tab3.k = tab4.k)
+                 ON udf(udf(tab3.k)) = udf(tab4.k))
 EXCEPT ALL
 SELECT *
-FROM   (SELECT tab4.v AS k,
-               tab3.k AS v
+FROM   (SELECT udf(tab4.v) AS k,
+               udf(udf(tab3.k)) AS v
         FROM   tab3
                JOIN tab4
-                 ON tab3.k = tab4.k)
+                 ON udf(tab3.k) = udf(tab4.k))
 -- !query 21 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 21 output
 1	2
 1	2
 -305,11 +305,11  struct<k:int,v:int>

 -- !query 22
-SELECT v FROM tab3 GROUP BY v
+SELECT udf(v) FROM tab3 GROUP BY v
 EXCEPT ALL
-SELECT k FROM tab4 GROUP BY k
+SELECT udf(k) FROM tab4 GROUP BY k
 -- !query 22 schema
-struct<v:int>
+struct<CAST(udf(cast(v as string)) AS INT):int>
 -- !query 22 output
 3

```
</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25090 from imback82/except-all.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-18 19:51:50 +09:00
Terry Kim 62004f1c0f [SPARK-28283][SQL][PYTHON][TESTS] Convert and port 'intersect-all.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from `intersect-all.sql` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

<details><summary>Diff comparing to 'intersect-all.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-intersect-all.sql.out
index 63dd56ce46..0cb82be2da 100644
--- a/sql/core/src/test/resources/sql-tests/results/intersect-all.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-intersect-all.sql.out
 -34,11 +34,11  struct<>

 -- !query 2
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 -- !query 2 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 2 output
 1	2
 1	2
 -48,11 +48,11  NULL	NULL

 -- !query 3
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab1 WHERE k = 1
+SELECT udf(k), v FROM tab1 WHERE udf(k) = 1
 -- !query 3 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 3 output
 1	2
 1	2
 -61,39 +61,39  struct<k:int,v:int>

 -- !query 4
-SELECT * FROM tab1 WHERE k > 2
+SELECT udf(k), udf(v) FROM tab1 WHERE k > udf(2)
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 4 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 4 output

 -- !query 5
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2 WHERE k > 3
+SELECT udf(k), v FROM tab2 WHERE udf(udf(k)) > 3
 -- !query 5 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 5 output

 -- !query 6
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 INTERSECT ALL
-SELECT CAST(1 AS BIGINT), CAST(2 AS BIGINT)
+SELECT CAST(udf(1) AS BIGINT), CAST(udf(2) AS BIGINT)
 -- !query 6 schema
-struct<k:bigint,v:bigint>
+struct<CAST(udf(cast(k as string)) AS INT):bigint,v:bigint>
 -- !query 6 output
 1	2

 -- !query 7
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT array(1), 2
+SELECT array(1), udf(2)
 -- !query 7 schema
 struct<>
 -- !query 7 output
 -102,9 +102,9  IntersectAll can only be performed on tables with the compatible column types. a

 -- !query 8
-SELECT k FROM tab1
+SELECT udf(k) FROM tab1
 INTERSECT ALL
-SELECT k, v FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 8 schema
 struct<>
 -- !query 8 output
 -113,13 +113,13  IntersectAll can only be performed on tables with the same number of columns, bu

 -- !query 9
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 INTERSECT ALL
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(v) FROM tab2
 -- !query 9 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 9 output
 1	2
 1	2
 -129,15 +129,15  NULL	NULL

 -- !query 10
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT k, udf(udf(v)) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 -- !query 10 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 10 output
 1	2
 1	2
 -148,15 +148,15  NULL	NULL

 -- !query 11
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 EXCEPT
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(k), udf(udf(v)) FROM tab2
 -- !query 11 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 11 output
 1	3

 -165,38 +165,38  struct<k:int,v:int>
 (
   (
     (
-      SELECT * FROM tab1
+      SELECT udf(k), v FROM tab1
       EXCEPT
-      SELECT * FROM tab2
+      SELECT k, udf(v) FROM tab2
     )
     EXCEPT
-    SELECT * FROM tab1
+    SELECT udf(k), udf(v) FROM tab1
   )
   INTERSECT ALL
-  SELECT * FROM tab2
+  SELECT udf(k), udf(v) FROM tab2
 )
 -- !query 12 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 12 output

 -- !query 13
 SELECT *
-FROM   (SELECT tab1.k,
-               tab2.v
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1
                JOIN tab2
-                 ON tab1.k = tab2.k)
+                 ON udf(udf(tab1.k)) = tab2.k)
 INTERSECT ALL
 SELECT *
-FROM   (SELECT tab1.k,
-               tab2.v
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1
                JOIN tab2
-                 ON tab1.k = tab2.k)
+                 ON udf(tab1.k) = udf(udf(tab2.k)))
 -- !query 13 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 13 output
 1	2
 1	2
 -211,30 +211,30  struct<k:int,v:int>

 -- !query 14
 SELECT *
-FROM   (SELECT tab1.k,
-               tab2.v
+FROM   (SELECT udf(tab1.k),
+               udf(tab2.v)
         FROM   tab1
                JOIN tab2
-                 ON tab1.k = tab2.k)
+                 ON udf(tab1.k) = udf(tab2.k))
 INTERSECT ALL
 SELECT *
-FROM   (SELECT tab2.v AS k,
-               tab1.k AS v
+FROM   (SELECT udf(tab2.v) AS k,
+               udf(tab1.k) AS v
         FROM   tab1
                JOIN tab2
-                 ON tab1.k = tab2.k)
+                 ON tab1.k = udf(tab2.k))
 -- !query 14 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 14 output

 -- !query 15
-SELECT v FROM tab1 GROUP BY v
+SELECT udf(v) FROM tab1 GROUP BY v
 INTERSECT ALL
-SELECT k FROM tab2 GROUP BY k
+SELECT udf(udf(k)) FROM tab2 GROUP BY k
 -- !query 15 schema
-struct<v:int>
+struct<CAST(udf(cast(v as string)) AS INT):int>
 -- !query 15 output
 2
 3
 -250,15 +250,15  spark.sql.legacy.setopsPrecedence.enabled	true

 -- !query 17
-SELECT * FROM tab1
+SELECT udf(k), v FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT k, udf(v) FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 INTERSECT ALL
-SELECT * FROM tab2
+SELECT udf(udf(k)), udf(v) FROM tab2
 -- !query 17 schema
-struct<k:int,v:int>
+struct<CAST(udf(cast(k as string)) AS INT):int,v:int>
 -- !query 17 output
 1	2
 1	2
 -268,15 +268,15  NULL	NULL

 -- !query 18
-SELECT * FROM tab1
+SELECT k, udf(v) FROM tab1
 EXCEPT
-SELECT * FROM tab2
+SELECT udf(k), v FROM tab2
 UNION ALL
-SELECT * FROM tab1
+SELECT udf(k), udf(v) FROM tab1
 INTERSECT
-SELECT * FROM tab2
+SELECT udf(k), udf(udf(v)) FROM tab2
 -- !query 18 schema
-struct<k:int,v:int>
+struct<k:int,CAST(udf(cast(v as string)) AS INT):int>
 -- !query 18 output
 1	2
 2	3

```
</p>
</details>

## How was this patch tested?

Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).

Closes #25119 from imback82/intersect-all-sql.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-18 19:49:57 +09:00
Liang-Chi Hsieh 4645ffb08a [SPARK-28276][SQL][PYTHON][TEST] Convert and port 'cross-join.sql' into UDF test base
## What changes were proposed in this pull request?

This PR adds some tests converted from `cross-join.sql'` to test UDFs.

<details><summary>Diff comparing to 'cross-join.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/cross-join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-cross-join.sql.out
index 3833c42bdf..11c1e01d54 100644
--- a/sql/core/src/test/resources/sql-tests/results/cross-join.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-cross-join.sql.out
 -43,7 +43,7  two   2       two     22

 -- !query 3
-SELECT * FROM nt1 cross join nt2 where nt1.k = nt2.k
+SELECT * FROM nt1 cross join nt2 where udf(nt1.k) = udf(nt2.k)
 -- !query 3 schema
 struct<k:string,v1:int,k:string,v2:int>
 -- !query 3 output
 -53,7 +53,7  two   2       two     22

 -- !query 4
-SELECT * FROM nt1 cross join nt2 on (nt1.k = nt2.k)
+SELECT * FROM nt1 cross join nt2 on (udf(nt1.k) = udf(nt2.k))
 -- !query 4 schema
 struct<k:string,v1:int,k:string,v2:int>
 -- !query 4 output
 -63,7 +63,7  two   2       two     22

 -- !query 5
-SELECT * FROM nt1 cross join nt2 where nt1.v1 = 1 and nt2.v2 = 22
+SELECT * FROM nt1 cross join nt2 where udf(nt1.v1) = "1" and udf(nt2.v2) = "22"
 -- !query 5 schema
 struct<k:string,v1:int,k:string,v2:int>
 -- !query 5 output
 -71,12 +71,12  one 1       two     22

-- !query 6
-SELECT a.key, b.key FROM
-(SELECT k key FROM nt1 WHERE v1 < 2) a
+SELECT udf(a.key), udf(b.key) FROM
+(SELECT udf(k) key FROM nt1 WHERE v1 < 2) a
 CROSS JOIN
-(SELECT k key FROM nt2 WHERE v2 = 22) b
+(SELECT udf(k) key FROM nt2 WHERE v2 = 22) b
 -- !query 6 schema
-struct<key:string,key:string>
+struct<udf(key):string,udf(key):string>
 -- !query 6 output
 one    two

 -114,23 +114,29  struct<>

-- !query 11
-select * from ((A join B on (a = b)) cross join C) join D on (a = d)
+select * from ((A join B on (udf(a) = udf(b))) cross join C) join D on (udf(a) = udf(d))
 -- !query 11 schema
-struct<a:string,va:int,b:string,vb:int,c:string,vc:int,d:string,vd:int>
+struct<>
 -- !query 11 output
-one    1       one     1       one     1       one     1
-one    1       one     1       three   3       one     1
-one    1       one     1       two     2       one     1
-three  3       three   3       one     1       three   3
-three  3       three   3       three   3       three   3
-three  3       three   3       two     2       three   3
-two    2       two     2       one     1       two     2
-two    2       two     2       three   3       two     2
-two    2       two     2       two     2       two     2
+org.apache.spark.sql.AnalysisException
+Detected implicit cartesian product for INNER join between logical plans
+Filter (udf(a#x) = udf(b#x))
++- Join Inner
+   :- Project [k#x AS a#x, v1#x AS va#x]
+   :  +- LocalRelation [k#x, v1#x]
+   +- Project [k#x AS b#x, v1#x AS vb#x]
+      +- LocalRelation [k#x, v1#x]
+and
+Project [k#x AS d#x, v1#x AS vd#x]
++- LocalRelation [k#x, v1#x]
+Join condition is missing or trivial.
+Either: use the CROSS JOIN syntax to allow cartesian products between these
+relations, or: enable implicit cartesian products by setting the configuration
+variable spark.sql.crossJoin.enabled=true;

 -- !query 12
-SELECT * FROM nt1 CROSS JOIN nt2 ON (nt1.k > nt2.k)
+SELECT * FROM nt1 CROSS JOIN nt2 ON (udf(nt1.k) > udf(nt2.k))
 -- !query 12 schema
 struct<k:string,v1:int,k:string,v2:int>
 -- !query 12 output
```

</p>
</details>

## How was this patch tested?

Added test.

Closes #25168 from viirya/SPARK-28276.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-18 19:46:00 +09:00
Huaxin Gao 971e832e0e [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored
## What changes were proposed in this pull request?
In the following python code
```
df.write.mode("overwrite").insertInto("table")
```
```insertInto``` ignores ```mode("overwrite")```  and appends by default.

## How was this patch tested?

Add Unit test.

Closes #25175 from huaxingao/spark-28411.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-18 13:37:59 +09:00
Maxim Gekk 70073b19eb [SPARK-27609][PYTHON] Convert values of function options to strings
## What changes were proposed in this pull request?

In the PR, I propose to convert options values to strings by using `to_str()` for the following functions:  `from_csv()`, `to_csv()`, `from_json()`, `to_json()`, `schema_of_csv()` and `schema_of_json()`. This will make handling of function options consistent to option handling in `DataFrameReader`/`DataFrameWriter`.

For example:
```Python
df.select(from_csv(df.value, "s string", {'ignoreLeadingWhiteSpace': True})
```

## How was this patch tested?

Added an example for `from_csv()` which was tested by:
```Shell
./python/run-tests --testnames pyspark.sql.functions
```

Closes #25182 from MaxGekk/options_to_str.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-18 13:37:03 +09:00
Seth Fitzsimmons eb5dc746c2 [SPARK-28097][SQL] Map ByteType to SMALLINT for PostgresDialect
## What changes were proposed in this pull request?

PostgreSQL doesn't have `TINYINT`, which would map directly, but `SMALLINT`s are sufficient for uni-directional translation.

A side-effect of this fix is that `AggregatedDialect` is now usable with multiple dialects targeting `jdbc:postgresql`, as `PostgresDialect.getJDBCType` no longer throws (for which reason backporting this fix would be lovely):

1217996f15/sql/core/src/main/scala/org/apache/spark/sql/jdbc/AggregatedDialect.scala (L42)

`dialects.flatMap` currently throws on the first attempt to get a JDBC type preventing subsequent dialects in the chain from providing an alternative.

## How was this patch tested?

Unit tests.

Closes #24845 from mojodna/postgres-byte-type-mapping.

Authored-by: Seth Fitzsimmons <seth@mojodna.net>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-17 15:10:01 -07:00
HyukjinKwon 28774cd2ea [SPARK-28359][SQL][PYTHON][TESTS] Make integrated UDF tests robust by making UDFs (virtually) no-op
## What changes were proposed in this pull request?

Current UDFs available in `IntegratedUDFTestUtils` are not exactly no-op. It converts input column to strings and outputs to strings.

It causes some issues when we convert and port the tests at SPARK-27921. Integrated UDF test cases share one output file and it should outputs the same. However,

1. Special values are converted into strings differently:

    | Scala      | Python |
    | ---------- | ------ |
    | `null`     | `None` |
    | `Infinity` | `inf`  |
    | `-Infinity`| `-inf` |
    | `NaN`      | `nan`  |

2. Due to float limitation at Python (see https://docs.python.org/3/tutorial/floatingpoint.html), if float is passed into Python and sent back to JVM, the values are potentially not exactly correct. See https://github.com/apache/spark/pull/25128 and https://github.com/apache/spark/pull/25110

To work around this, this PR targets to change the current UDF to be wrapped by cast. So, Input column is casted into string, UDF returns strings as are, and then output column is casted back to the input column.

Roughly:

**Before:**

```
JVM (col1) -> (cast to string within Python) Python (string) -> (string) JVM
```

**After:**

```
JVM (cast col1 to string) -> (string) Python (string) -> (cast back to col1's type) JVM
```

In this way, UDF is virtually no-op although there might be some subtleties due to roundtrip in string cast. I believe this is good enough.

Python native functions and Scala native functions will take strings and output strings as are. So, there will be no potential test failures due to differences of conversion between Python and Scala.

After this fix, for instance, `udf-aggregates_part1.sql` outputs exactly same as `aggregates_part1.sql`:

<details><summary>Diff comparing to 'pgSQL/aggregates_part1.sql'</summary>
<p>

```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
index 51ca1d55869..801735781c7 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part1.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part1.sql.out
 -3,7 +3,7

 -- !query 0
-SELECT avg(four) AS avg_1 FROM onek
+SELECT avg(udf(four)) AS avg_1 FROM onek
 -- !query 0 schema
 struct<avg_1:double>
 -- !query 0 output
 -11,7 +11,7  struct<avg_1:double>

 -- !query 1
-SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100
+SELECT udf(avg(a)) AS avg_32 FROM aggtest WHERE a < 100
 -- !query 1 schema
 struct<avg_32:double>
 -- !query 1 output
 -19,7 +19,7  struct<avg_32:double>

 -- !query 2
-select CAST(avg(b) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
+select CAST(avg(udf(b)) AS Decimal(10,3)) AS avg_107_943 FROM aggtest
 -- !query 2 schema
 struct<avg_107_943:decimal(10,3)>
 -- !query 2 output
 -27,7 +27,7  struct<avg_107_943:decimal(10,3)>

 -- !query 3
-SELECT sum(four) AS sum_1500 FROM onek
+SELECT sum(udf(four)) AS sum_1500 FROM onek
 -- !query 3 schema
 struct<sum_1500:bigint>
 -- !query 3 output
 -35,7 +35,7  struct<sum_1500:bigint>

 -- !query 4
-SELECT sum(a) AS sum_198 FROM aggtest
+SELECT udf(sum(a)) AS sum_198 FROM aggtest
 -- !query 4 schema
 struct<sum_198:bigint>
 -- !query 4 output
 -43,7 +43,7  struct<sum_198:bigint>

 -- !query 5
-SELECT sum(b) AS avg_431_773 FROM aggtest
+SELECT udf(udf(sum(b))) AS avg_431_773 FROM aggtest
 -- !query 5 schema
 struct<avg_431_773:double>
 -- !query 5 output
 -51,7 +51,7  struct<avg_431_773:double>

 -- !query 6
-SELECT max(four) AS max_3 FROM onek
+SELECT udf(max(four)) AS max_3 FROM onek
 -- !query 6 schema
 struct<max_3:int>
 -- !query 6 output
 -59,7 +59,7  struct<max_3:int>

 -- !query 7
-SELECT max(a) AS max_100 FROM aggtest
+SELECT max(udf(a)) AS max_100 FROM aggtest
 -- !query 7 schema
 struct<max_100:int>
 -- !query 7 output
 -67,7 +67,7  struct<max_100:int>

 -- !query 8
-SELECT max(aggtest.b) AS max_324_78 FROM aggtest
+SELECT udf(udf(max(aggtest.b))) AS max_324_78 FROM aggtest
 -- !query 8 schema
 struct<max_324_78:float>
 -- !query 8 output
 -75,237 +75,238  struct<max_324_78:float>

 -- !query 9
-SELECT stddev_pop(b) FROM aggtest
+SELECT stddev_pop(udf(b)) FROM aggtest
 -- !query 9 schema
-struct<stddev_pop(CAST(b AS DOUBLE)):double>
+struct<stddev_pop(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE)):double>
 -- !query 9 output
 131.10703231895047

 -- !query 10
-SELECT stddev_samp(b) FROM aggtest
+SELECT udf(stddev_samp(b)) FROM aggtest
 -- !query 10 schema
-struct<stddev_samp(CAST(b AS DOUBLE)):double>
+struct<CAST(udf(cast(stddev_samp(cast(b as double)) as string)) AS DOUBLE):double>
 -- !query 10 output
 151.38936080399804

 -- !query 11
-SELECT var_pop(b) FROM aggtest
+SELECT var_pop(udf(b)) FROM aggtest
 -- !query 11 schema
-struct<var_pop(CAST(b AS DOUBLE)):double>
+struct<var_pop(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE)):double>
 -- !query 11 output
 17189.053923482323

 -- !query 12
-SELECT var_samp(b) FROM aggtest
+SELECT udf(var_samp(b)) FROM aggtest
 -- !query 12 schema
-struct<var_samp(CAST(b AS DOUBLE)):double>
+struct<CAST(udf(cast(var_samp(cast(b as double)) as string)) AS DOUBLE):double>
 -- !query 12 output
 22918.738564643096

 -- !query 13
-SELECT stddev_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT udf(stddev_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 13 schema
-struct<stddev_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(udf(cast(stddev_pop(cast(cast(b as decimal(38,0)) as double)) as string)) AS DOUBLE):double>
 -- !query 13 output
 131.18117242958306

 -- !query 14
-SELECT stddev_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT stddev_samp(CAST(udf(b) AS Decimal(38,0))) FROM aggtest
 -- !query 14 schema
-struct<stddev_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<stddev_samp(CAST(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 14 output
 151.47497042966097

 -- !query 15
-SELECT var_pop(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT udf(var_pop(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 15 schema
-struct<var_pop(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<CAST(udf(cast(var_pop(cast(cast(b as decimal(38,0)) as double)) as string)) AS DOUBLE):double>
 -- !query 15 output
 17208.5

 -- !query 16
-SELECT var_samp(CAST(b AS Decimal(38,0))) FROM aggtest
+SELECT var_samp(udf(CAST(b AS Decimal(38,0)))) FROM aggtest
 -- !query 16 schema
-struct<var_samp(CAST(CAST(b AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<var_samp(CAST(CAST(udf(cast(cast(b as decimal(38,0)) as string)) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 16 output
 22944.666666666668

 -- !query 17
-SELECT var_pop(1.0), var_samp(2.0)
+SELECT udf(var_pop(1.0)), var_samp(udf(2.0))
 -- !query 17 schema
-struct<var_pop(CAST(1.0 AS DOUBLE)):double,var_samp(CAST(2.0 AS DOUBLE)):double>
+struct<CAST(udf(cast(var_pop(cast(1.0 as double)) as string)) AS DOUBLE):double,var_samp(CAST(CAST(udf(cast(2.0 as string)) AS DECIMAL(2,1)) AS DOUBLE)):double>
 -- !query 17 output
 0.0    NaN

 -- !query 18
-SELECT stddev_pop(CAST(3.0 AS Decimal(38,0))), stddev_samp(CAST(4.0 AS Decimal(38,0)))
+SELECT stddev_pop(udf(CAST(3.0 AS Decimal(38,0)))), stddev_samp(CAST(udf(4.0) AS Decimal(38,0)))
 -- !query 18 schema
-struct<stddev_pop(CAST(CAST(3.0 AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(4.0 AS DECIMAL(38,0)) AS DOUBLE)):double>
+struct<stddev_pop(CAST(CAST(udf(cast(cast(3.0 as decimal(38,0)) as string)) AS DECIMAL(38,0)) AS DOUBLE)):double,stddev_samp(CAST(CAST(CAST(udf(cast(4.0 as string)) AS DECIMAL(2,1)) AS DECIMAL(38,0)) AS DOUBLE)):double>
 -- !query 18 output
 0.0    NaN

 -- !query 19
-select sum(CAST(null AS int)) from range(1,4)
+select sum(udf(CAST(null AS int))) from range(1,4)
 -- !query 19 schema
-struct<sum(CAST(NULL AS INT)):bigint>
+struct<sum(CAST(udf(cast(cast(null as int) as string)) AS INT)):bigint>
 -- !query 19 output
 NULL

 -- !query 20
-select sum(CAST(null AS long)) from range(1,4)
+select sum(udf(CAST(null AS long))) from range(1,4)
 -- !query 20 schema
-struct<sum(CAST(NULL AS BIGINT)):bigint>
+struct<sum(CAST(udf(cast(cast(null as bigint) as string)) AS BIGINT)):bigint>
 -- !query 20 output
 NULL

 -- !query 21
-select sum(CAST(null AS Decimal(38,0))) from range(1,4)
+select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 21 schema
-struct<sum(CAST(NULL AS DECIMAL(38,0))):decimal(38,0)>
+struct<sum(CAST(udf(cast(cast(null as decimal(38,0)) as string)) AS DECIMAL(38,0))):decimal(38,0)>
 -- !query 21 output
 NULL

 -- !query 22
-select sum(CAST(null AS DOUBLE)) from range(1,4)
+select sum(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 22 schema
-struct<sum(CAST(NULL AS DOUBLE)):double>
+struct<sum(CAST(udf(cast(cast(null as double) as string)) AS DOUBLE)):double>
 -- !query 22 output
 NULL

 -- !query 23
-select avg(CAST(null AS int)) from range(1,4)
+select avg(udf(CAST(null AS int))) from range(1,4)
 -- !query 23 schema
-struct<avg(CAST(NULL AS INT)):double>
+struct<avg(CAST(udf(cast(cast(null as int) as string)) AS INT)):double>
 -- !query 23 output
 NULL

 -- !query 24
-select avg(CAST(null AS long)) from range(1,4)
+select avg(udf(CAST(null AS long))) from range(1,4)
 -- !query 24 schema
-struct<avg(CAST(NULL AS BIGINT)):double>
+struct<avg(CAST(udf(cast(cast(null as bigint) as string)) AS BIGINT)):double>
 -- !query 24 output
 NULL

 -- !query 25
-select avg(CAST(null AS Decimal(38,0))) from range(1,4)
+select avg(udf(CAST(null AS Decimal(38,0)))) from range(1,4)
 -- !query 25 schema
-struct<avg(CAST(NULL AS DECIMAL(38,0))):decimal(38,4)>
+struct<avg(CAST(udf(cast(cast(null as decimal(38,0)) as string)) AS DECIMAL(38,0))):decimal(38,4)>
 -- !query 25 output
 NULL

 -- !query 26
-select avg(CAST(null AS DOUBLE)) from range(1,4)
+select avg(udf(CAST(null AS DOUBLE))) from range(1,4)
 -- !query 26 schema
-struct<avg(CAST(NULL AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(null as double) as string)) AS DOUBLE)):double>
 -- !query 26 output
 NULL

 -- !query 27
-select sum(CAST('NaN' AS DOUBLE)) from range(1,4)
+select sum(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 27 schema
-struct<sum(CAST(NaN AS DOUBLE)):double>
+struct<sum(CAST(CAST(udf(cast(NaN as string)) AS STRING) AS DOUBLE)):double>
 -- !query 27 output
 NaN

 -- !query 28
-select avg(CAST('NaN' AS DOUBLE)) from range(1,4)
+select avg(CAST(udf('NaN') AS DOUBLE)) from range(1,4)
 -- !query 28 schema
-struct<avg(CAST(NaN AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(NaN as string)) AS STRING) AS DOUBLE)):double>
 -- !query 28 output
 NaN

 -- !query 30
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('1')) v(x)
 -- !query 30 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 30 output
 Infinity       NaN

 -- !query 31
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('Infinity'), ('Infinity')) v(x)
 -- !query 31 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 31 output
 Infinity       NaN

 -- !query 32
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(CAST(udf(x) AS DOUBLE)), var_pop(CAST(udf(x) AS DOUBLE))
 FROM (VALUES ('-Infinity'), ('Infinity')) v(x)
 -- !query 32 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double,var_pop(CAST(CAST(udf(cast(x as string)) AS STRING) AS DOUBLE)):double>
 -- !query 32 output
 NaN    NaN

 -- !query 33
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (100000003), (100000004), (100000006), (100000007)) v(x)
 -- !query 33 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(x as double) as string)) AS DOUBLE)):double,CAST(udf(cast(var_pop(cast(x as double)) as string)) AS DOUBLE):double>
 -- !query 33 output
 1.00000005E8   2.5

 -- !query 34
-SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
+SELECT avg(udf(CAST(x AS DOUBLE))), udf(var_pop(CAST(x AS DOUBLE)))
 FROM (VALUES (7000000000005), (7000000000007)) v(x)
 -- !query 34 schema
-struct<avg(CAST(x AS DOUBLE)):double,var_pop(CAST(x AS DOUBLE)):double>
+struct<avg(CAST(udf(cast(cast(x as double) as string)) AS DOUBLE)):double,CAST(udf(cast(var_pop(cast(x as double)) as string)) AS DOUBLE):double>
 -- !query 34 output
 7.000000000006E12      1.0

 -- !query 35
-SELECT covar_pop(b, a), covar_samp(b, a) FROM aggtest
+SELECT udf(covar_pop(b, udf(a))), covar_samp(udf(b), a) FROM aggtest
 -- !query 35 schema
-struct<covar_pop(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double,covar_samp(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<CAST(udf(cast(covar_pop(cast(b as double), cast(cast(udf(cast(a as string)) as int) as double)) as string)) AS DOUBLE):double,covar_samp(CAST(CAST(udf(cast(b as string)) AS FLOAT) AS DOUBLE), CAST(a AS DOUBLE)):double>
 -- !query 35 output
 653.6289553875104      871.5052738500139

 -- !query 36
-SELECT corr(b, a) FROM aggtest
+SELECT corr(b, udf(a)) FROM aggtest
 -- !query 36 schema
-struct<corr(CAST(b AS DOUBLE), CAST(a AS DOUBLE)):double>
+struct<corr(CAST(b AS DOUBLE), CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double>
 -- !query 36 output
 0.1396345165178734

 -- !query 37
-SELECT count(four) AS cnt_1000 FROM onek
+SELECT count(udf(four)) AS cnt_1000 FROM onek
 -- !query 37 schema
 struct<cnt_1000:bigint>
 -- !query 37 output
 -313,7 +314,7  struct<cnt_1000:bigint>

 -- !query 38
-SELECT count(DISTINCT four) AS cnt_4 FROM onek
+SELECT udf(count(DISTINCT four)) AS cnt_4 FROM onek
 -- !query 38 schema
 struct<cnt_4:bigint>
 -- !query 38 output
 -321,10 +322,10  struct<cnt_4:bigint>

 -- !query 39
-select ten, count(*), sum(four) from onek
+select ten, udf(count(*)), sum(udf(four)) from onek
 group by ten order by ten
 -- !query 39 schema
-struct<ten:int,count(1):bigint,sum(four):bigint>
+struct<ten:int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint,sum(CAST(udf(cast(four as string)) AS INT)):bigint>
 -- !query 39 output
 0      100     100
 1      100     200
 -339,10 +340,10  struct<ten:int,count(1):bigint,sum(four):bigint>

 -- !query 40
-select ten, count(four), sum(DISTINCT four) from onek
+select ten, count(udf(four)), udf(sum(DISTINCT four)) from onek
 group by ten order by ten
 -- !query 40 schema
-struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>
+struct<ten:int,count(CAST(udf(cast(four as string)) AS INT)):bigint,CAST(udf(cast(sum(distinct cast(four as bigint)) as string)) AS BIGINT):bigint>
 -- !query 40 output
 0      100     2
 1      100     4
 -357,11 +358,11  struct<ten:int,count(four):bigint,sum(DISTINCT four):bigint>

 -- !query 41
-select ten, sum(distinct four) from onek a
+select ten, udf(sum(distinct four)) from onek a
 group by ten
-having exists (select 1 from onek b where sum(distinct a.four) = b.four)
+having exists (select 1 from onek b where udf(sum(distinct a.four)) = b.four)
 -- !query 41 schema
-struct<ten:int,sum(DISTINCT four):bigint>
+struct<ten:int,CAST(udf(cast(sum(distinct cast(four as bigint)) as string)) AS BIGINT):bigint>
 -- !query 41 output
 0      2
 2      2
 -374,23 +375,23  struct<ten:int,sum(DISTINCT four):bigint>
 select ten, sum(distinct four) from onek a
 group by ten
 having exists (select 1 from onek b
-               where sum(distinct a.four + b.four) = b.four)
+               where sum(distinct a.four + b.four) = udf(b.four))
 -- !query 42 schema
 struct<>
 -- !query 42 output
 org.apache.spark.sql.AnalysisException

 Aggregate/Window/Generate expressions are not valid in where clause of the query.
-Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(b.`four` AS BIGINT))]
+Expression in where clause: [(sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT)) = CAST(CAST(udf(cast(four as string)) AS INT) AS BIGINT))]
 Invalid expressions: [sum(DISTINCT CAST((outer() + b.`four`) AS BIGINT))];

 -- !query 43
 select
-  (select max((select i.unique2 from tenk1 i where i.unique1 = o.unique1)))
+  (select udf(max((select i.unique2 from tenk1 i where i.unique1 = o.unique1))))
 from tenk1 o
 -- !query 43 schema
 struct<>
 -- !query 43 output
 org.apache.spark.sql.AnalysisException
-cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 63
+cannot resolve '`o.unique1`' given input columns: [i.even, i.fivethous, i.four, i.hundred, i.odd, i.string4, i.stringu1, i.stringu2, i.ten, i.tenthous, i.thousand, i.twenty, i.two, i.twothousand, i.unique1, i.unique2]; line 2 pos 67
```
</p>
</details>

## How was this patch tested?

Manually tested.

Closes #25130 from HyukjinKwon/SPARK-28359.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-07-17 21:49:43 +08:00
HyukjinKwon 66179fa842 [SPARK-28418][PYTHON][SQL] Wait for event process in 'test_query_execution_listener_on_collect'
## What changes were proposed in this pull request?

It fixes a flaky test:

```
ERROR [0.164s]: test_query_execution_listener_on_collect (pyspark.sql.tests.test_dataframe.QueryExecutionListenerTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/python/pyspark/sql/tests/test_dataframe.py", line 758, in test_query_execution_listener_on_collect
    "The callback from the query execution listener should be called after 'collect'")
AssertionError: The callback from the query execution listener should be called after 'collect'
```

Seems it can be failed because the event was somehow delayed but checked first.

## How was this patch tested?

Manually.

Closes #25177 from HyukjinKwon/SPARK-28418.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-17 18:44:11 +09:00
Marcelo Vanzin 2ddeff97d7 [SPARK-27963][CORE] Allow dynamic allocation without a shuffle service.
This change adds a new option that enables dynamic allocation without
the need for a shuffle service. This mode works by tracking which stages
generate shuffle files, and keeping executors that generate data for those
shuffles alive while the jobs that use them are active.

A separate timeout is also added for shuffle data; so that executors that
hold shuffle data can use a separate timeout before being removed because
of being idle. This allows the shuffle data to be kept around in case it
is needed by some new job, or allow users to be more aggressive in timing
out executors that don't have shuffle data in active use.

The code also hooks up to the context cleaner so that shuffles that are
garbage collected are detected, and the respective executors not held
unnecessarily.

Testing done with added unit tests, and also with TPC-DS workloads on
YARN without a shuffle service.

Closes #24817 from vanzin/SPARK-27963.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-16 16:37:38 -07:00
nooberfsh 1134faecf4 [SPARK-18299][SQL] Allow more aggregations on KeyValueGroupedDataset
## What changes were proposed in this pull request?

Add 4 additional agg to KeyValueGroupedDataset

## How was this patch tested?

New test in DatasetSuite for typed aggregation

Closes #24993 from nooberfsh/sqlagg.

Authored-by: nooberfsh <nooberfsh@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-16 16:35:04 -07:00
Thomas Graves 43d68cd4ff [SPARK-27959][YARN] Change YARN resource configs to use .amount
## What changes were proposed in this pull request?

we are adding in generic resource support into spark where we have suffix for the amount of the resource so that we could support other configs.

Spark on yarn already had added configs to request resources via the configs spark.yarn.{executor/driver/am}.resource=<some amount>, where the <some amount> is value and unit together.  We should change those configs to have a `.amount` suffix on them to match the spark configs and to allow future configs to be more easily added. YARN itself already supports tags and attributes so if we want the user to be able to pass those from spark at some point having a suffix makes sense. it would allow for a spark.yarn.{executor/driver/am}.resource.{resource}.tag= type config.

## How was this patch tested?

Tested via unit tests and manually on a yarn 3.x cluster with GPU resources configured on.

Closes #24989 from tgravescs/SPARK-27959-yarn-resourceconfigs.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-16 10:56:07 -07:00
Yuming Wang 71882f119e [SPARK-28343][FOLLOW-UP][SQL][TEST] Enable spark.sql.function.preferIntegralDivision for PostgreSQL testing
## What changes were proposed in this pull request?

This PR enables `spark.sql.function.preferIntegralDivision` for PostgreSQL testing.

## How was this patch tested?

N/A

Closes #25170 from wangyum/SPARK-28343-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-16 08:46:01 -07:00
zhengruifeng 282a12d331 [SPARK-27944][ML] Unify the behavior of checking empty output column names
## What changes were proposed in this pull request?
In regression/clustering/ovr/als, if an output column name is empty, igore it. And if all names are empty, log a warning msg, then do nothing.

## How was this patch tested?
existing tests

Closes #24793 from zhengruifeng/aft_iso_check_empty_outputCol.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-07-16 09:56:12 -04:00
Gabor Somogyi 113f62dd8c [SPARK-27485][FOLLOWUP] Do not reduce the number of partitions for repartition in adaptive execution - fix compilation
## What changes were proposed in this pull request?

PR builder failed with the following error:
```
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala:714: wrong number of arguments for pattern org.apache.spark.sql.execution.exchange.ShuffleExchangeExec(outputPartitioning: org.apache.spark.sql.catalyst.plans.physical.Partitioning,child: org.apache.spark.sql.execution.SparkPlan,canChangeNumPartitions: Boolean)
[error]                ShuffleExchangeExec(HashPartitioning(leftPartitioningExpressions, _), _), _),
[error]                                   ^
[error] /home/jenkins/workspace/SparkPullRequestBuilder/sql/core/src/test/scala/org/apache/spark/sql/execution/PlannerSuite.scala:716: wrong number of arguments for pattern org.apache.spark.sql.execution.exchange.ShuffleExchangeExec(outputPartitioning: org.apache.spark.sql.catalyst.plans.physical.Partitioning,child: org.apache.spark.sql.execution.SparkPlan,canChangeNumPartitions: Boolean)
[error]                ShuffleExchangeExec(HashPartitioning(rightPartitioningExpressions, _), _), _)) =>
[error]                                   ^
```

## How was this patch tested?

Existing unit test.

Closes #25171 from gaborgsomogyi/SPARK-27485.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: herman <herman@databricks.com>
2019-07-16 12:56:13 +02:00
Yuming Wang f74ad3d700 [SPARK-28129][SQL][TEST] Port float8.sql
## What changes were proposed in this pull request?

This PR is to port float8.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/float8.sql

The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/float8.out

When porting the test cases, found six PostgreSQL specific features that do not exist in Spark SQL:
[SPARK-28060](https://issues.apache.org/jira/browse/SPARK-28060): Double type can not accept some special inputs
[SPARK-28027](https://issues.apache.org/jira/browse/SPARK-28027): Spark SQL does not support prefix operator `` and `|/`
[SPARK-28061](https://issues.apache.org/jira/browse/SPARK-28061): Support for converting float to binary format
[SPARK-23906](https://issues.apache.org/jira/browse/SPARK-23906): Support Truncate number
[SPARK-28134](https://issues.apache.org/jira/browse/SPARK-28134): Missing Trigonometric Functions

Also, found two bug:
[SPARK-28024](https://issues.apache.org/jira/browse/SPARK-28024): Incorrect value when out of range
[SPARK-28135](https://issues.apache.org/jira/browse/SPARK-28135): ceil/ceiling/floor/power returns incorrect values

Also, found four inconsistent behavior:
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL insert bad inputs to NULL
[SPARK-28028](https://issues.apache.org/jira/browse/SPARK-28028): Cast numeric to integral type need round
[SPARK-27923](https://issues.apache.org/jira/browse/SPARK-27923): Spark SQL returns NULL when dividing by zero
[SPARK-28007](https://issues.apache.org/jira/browse/SPARK-28007):  Caret operator (^) means bitwise XOR in Spark/Hive and exponentiation in Postgres

## How was this patch tested?

N/A

Closes #24931 from wangyum/SPARK-28129.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-16 19:31:20 +09:00
Carson Wang d1a1376029 [SPARK-28356][SQL] Do not reduce the number of partitions for repartition in adaptive execution
## What changes were proposed in this pull request?
Adaptive execution reduces the number of post-shuffle partitions at runtime, even for shuffles caused by repartition. However, the user likely wants to get the desired number of partition when he calls repartition even in adaptive execution. This PR adds an internal config to control this and by default adaptive execution will not change the number of post-shuffle partition for repartition.

## How was this patch tested?
New tests added.

Closes #25121 from carsonwang/AE_repartition.

Authored-by: Carson Wang <carson.wang@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-07-16 17:35:46 +08:00
herman 421d9d56ef [SPARK-27485] EnsureRequirements.reorder should handle duplicate expressions gracefully
## What changes were proposed in this pull request?
When reordering joins EnsureRequirements only checks if all the join keys are present in the partitioning expression seq. This is problematic when the joins keys and and partitioning expressions both contain duplicates but not the same number of duplicates for each expression, e.g. `Seq(a, a, b)` vs `Seq(a, b, b)`. This fails with an index lookup failure in the `reorder` function.

This PR fixes this removing the equality checking logic from the `reorderJoinKeys` function, and by doing the multiset equality in the `reorder` function while building the reordered key sequences.

## How was this patch tested?
Added a unit test to the `PlannerSuite` and added an integration test to `JoinSuite`

Closes #25167 from hvanhovell/SPARK-27485.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-07-16 17:09:52 +08:00
Dongjoon Hyun 9a7f01d944 [SPARK-28201][SQL][TEST][FOLLOWUP] Fix Integration test suite according to the new exception message
## What changes were proposed in this pull request?

#25010 breaks the integration test suite due to the changing the user-facing exception like the following. This PR fixes the integration test suite.

```scala
-    require(
-      decimalVal.precision <= precision,
-      s"Decimal precision ${decimalVal.precision} exceeds max precision $precision")
+    if (decimalVal.precision > precision) {
+      throw new ArithmeticException(
+        s"Decimal precision ${decimalVal.precision} exceeds max precision $precision")
+    }
```

## How was this patch tested?

Manual test.
```
$ build/mvn install -DskipTests
$ build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12 test
```

Closes #25165 from dongjoon-hyun/SPARK-28201.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-07-16 16:02:49 +08:00
Yuming Wang 6926849247 [SPARK-28395][SQL] Division operator support integral division
## What changes were proposed in this pull request?

PostgreSQL, Teradata, SQL Server, DB2 and Presto perform integral division with the `/` operator.
But Oracle, Vertica, Hive, MySQL and MariaDB perform fractional division with the `/` operator.

This pr add a flag(`spark.sql.function.preferIntegralDivision`) to control whether to use integral division with the `/` operator.

Examples:

**PostgreSQL**:
```sql
postgres=# select substr(version(), 0, 16), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
     substr      | ?column? |     ?column?     |    ?column?     |     ?column?
-----------------+----------+------------------+-----------------+------------------
 PostgreSQL 11.3 |        3 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)
```
**SQL Server**:
```sql
1> select cast(10 as int) / cast(3 as int), cast(10.1 as float) / cast(3 as int), cast(10 as int) / cast(3.1 as float), cast(10.1 as float)/cast(3.1 as float);
2> go

----------- ------------------------ ------------------------ ------------------------
          3       3.3666666666666667        3.225806451612903        3.258064516129032

(1 rows affected)
```
**DB2**:
```sql
[db2inst12f3c821d36b7 ~]$ db2 "select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double) from table (sysproc.env_get_inst_info())"

1           2                        3                        4
----------- ------------------------ ------------------------ ------------------------
          3   +3.36666666666667E+000   +3.22580645161290E+000   +3.25806451612903E+000

  1 record(s) selected.
```
**Presto**:
```sql
presto> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
 _col0 |       _col1        |       _col2       |       _col3
-------+--------------------+-------------------+-------------------
     3 | 3.3666666666666667 | 3.225806451612903 | 3.258064516129032
(1 row)
```
**Teradata**:
![image](https://user-images.githubusercontent.com/5399861/61200701-e97d5380-a714-11e9-9a1d-57fd99d38c8d.png)

**Oracle**:
```sql
SQL> select 10 / 3 from dual;

      10/3
----------
3.33333333
```
**Vertica**
```sql
dbadmin=> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as float8) / cast(3 as int), cast(10 as int) / cast(3.1 as float8), cast(10.1 as float8)/cast(3.1 as float8);
              version               |       ?column?       |     ?column?     |    ?column?     |     ?column?
------------------------------------+----------------------+------------------+-----------------+------------------
 Vertica Analytic Database v9.1.1-0 | 3.333333333333333333 | 3.36666666666667 | 3.2258064516129 | 3.25806451612903
(1 row)
```
**Hive**:
```sql
hive> select cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
OK
3.3333333333333335	3.3666666666666667	3.225806451612903	3.258064516129032
Time taken: 0.143 seconds, Fetched: 1 row(s)
```
**MariaDB**:
```sql
MariaDB [(none)]> select version(), cast(10 as int) / cast(3 as int), cast(10.1 as double) / cast(3 as int), cast(10 as int) / cast(3.1 as double), cast(10.1 as double)/cast(3.1 as double);
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| version()                            | cast(10 as int) / cast(3 as int) | cast(10.1 as double) / cast(3 as int) | cast(10 as int) / cast(3.1 as double) | cast(10.1 as double)/cast(3.1 as double) |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
| 10.4.6-MariaDB-1:10.4.6+maria~bionic |                           3.3333 |                    3.3666666666666667 |                     3.225806451612903 |                        3.258064516129032 |
+--------------------------------------+----------------------------------+---------------------------------------+---------------------------------------+------------------------------------------+
1 row in set (0.000 sec)
```
**MySQL**:
```sql
mysql>  select version(), 10 / 3, 10 / 3.1, 10.1 / 3, 10.1 / 3.1;
+-----------+--------+----------+----------+------------+
| version() | 10 / 3 | 10 / 3.1 | 10.1 / 3 | 10.1 / 3.1 |
+-----------+--------+----------+----------+------------+
| 8.0.16    | 3.3333 |   3.2258 |  3.36667 |    3.25806 |
+-----------+--------+----------+----------+------------+
1 row in set (0.00 sec)
```
## How was this patch tested?

unit tests

Closes #25158 from wangyum/SPARK-28395.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-07-16 15:43:15 +08:00
Angers be4a55220a [SPARK-28106][SQL] When Spark SQL use "add jar" , before add to SparkContext, check jar path exist first.
## What changes were proposed in this pull request?
ISSUE :  https://issues.apache.org/jira/browse/SPARK-28106
When we use add jar in SQL, it will have three step:

- add jar to HiveClient's classloader
- HiveClientImpl.runHiveSQL("ADD JAR" + PATH)
- SessionStateBuilder.addJar

The second step seems has no impact to the whole process. Since event it failed, we still can execute.
The first step will add jar path to HiveClient's ClassLoader, then we can use the jar in HiveClientImpl
The Third Step will add this jar path to SparkContext. But expect local file path, it will call RpcServer's FileServer to add this to Env, the is you pass wrong path. it will cause error, but if you pass HDFS path or VIEWFS path, it won't check it and just add it to jar Path Map.

Then when next TaskSetManager send out Task, this path will be brought by TaskDescription. Then Executor will call updateDependencies, this method will check all jar path and file path in TaskDescription. Then error happends like below:

![image](https://user-images.githubusercontent.com/46485123/59817635-4a527f80-9353-11e9-9e08-9407b2b54023.png)

## How was this patch tested?
Exist Unit Test
Environment Test

Closes #24909 from AngersZhuuuu/SPARK-28106.

Lead-authored-by: Angers <angers.zhu@gamil.com>
Co-authored-by: 朱夷 <zhuyi01@corp.netease.com>
Signed-off-by: jerryshao <jerryshao@tencent.com>
2019-07-16 15:29:05 +08:00
Liang-Chi Hsieh b94fa979ef [SPARK-28345][SQL][PYTHON] PythonUDF predicate should be able to pushdown to join
## What changes were proposed in this pull request?

A `Filter` predicate using `PythonUDF` can't be push down into join condition, currently. A predicate like that should be able to push down to join condition. For `PythonUDF`s that can't be evaluated in join condition, `PullOutPythonUDFInJoinCondition` will pull them out later.

An example like:

```scala
val pythonTestUDF = TestPythonUDF(name = "udf")

val left = Seq((1, 2), (2, 3)).toDF("a", "b")
val right = Seq((1, 2), (3, 4)).toDF("c", "d")
val df = left.crossJoin(right).where(pythonTestUDF($"a") === pythonTestUDF($"c"))
```

Query plan before the PR:
```
== Physical Plan ==
*(3) Project [a#2121, b#2122, c#2132, d#2133]
+- *(3) Filter (pythonUDF0#2142 = pythonUDF1#2143)
   +- BatchEvalPython [udf(a#2121), udf(c#2132)], [pythonUDF0#2142, pythonUDF1#2143]
      +- BroadcastNestedLoopJoin BuildRight, Cross
         :- *(1) Project [_1#2116 AS a#2121, _2#2117 AS b#2122]
         :  +- LocalTableScan [_1#2116, _2#2117]
         +- BroadcastExchange IdentityBroadcastMode
            +- *(2) Project [_1#2127 AS c#2132, _2#2128 AS d#2133]
               +- LocalTableScan [_1#2127, _2#2128]
```

Query plan after the PR:
```
== Physical Plan ==
*(3) Project [a#2121, b#2122, c#2132, d#2133]
+- *(3) BroadcastHashJoin [pythonUDF0#2142], [pythonUDF0#2143], Cross, BuildRight
   :- BatchEvalPython [udf(a#2121)], [pythonUDF0#2142]
   :  +- *(1) Project [_1#2116 AS a#2121, _2#2117 AS b#2122]
   :     +- LocalTableScan [_1#2116, _2#2117]
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[2, string, true]))
      +- BatchEvalPython [udf(c#2132)], [pythonUDF0#2143]
         +- *(2) Project [_1#2127 AS c#2132, _2#2128 AS d#2133]
            +- LocalTableScan [_1#2127, _2#2128]
```

After this PR, the join can use `BroadcastHashJoin`, instead of `BroadcastNestedLoopJoin`.

## How was this patch tested?

Added tests.

Closes #25106 from viirya/pythonudf-join-condition.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-07-16 16:15:49 +09:00
Maxim Gekk 8e26d4d616 [SPARK-28408][SQL][TEST] Restrict test values for DateType, TimestampType and CalendarIntervalType
## What changes were proposed in this pull request?

Existing random generators in tests produce wide ranges of values that can be out of supported ranges for:
- `DateType`, the valid range is `[0001-01-01, 9999-12-31]`
- `TimestampType` supports values in `[0001-01-01T00:00:00.000000Z, 9999-12-31T23:59:59.999999Z]`
- `CalendarIntervalType` should define intervals for the ranges above.

Dates and timestamps produced by random literal generators are usually out of valid ranges for those types. And tests just check invalid values or values caused by arithmetic overflow.

In the PR, I propose to restrict tested pseudo-random values by valid ranges of `DateType`, `TimestampType` and `CalendarIntervalType`. This should allow to check valid values in test, and avoid wasting time on a priori invalid inputs.

## How was this patch tested?

The changes were checked by `DateExpressionsSuite` and modified `DateTimeUtils.dateAddMonths`:
```Scala
  def dateAddMonths(days: SQLDate, months: Int): SQLDate = {
    localDateToDays(LocalDate.ofEpochDay(days).plusMonths(months))
  }
```

Closes #25166 from MaxGekk/datetime-lit-random-gen.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-15 20:42:33 -07:00
shivsood d8996fd940 [SPARK-28152][SQL] Mapped ShortType to SMALLINT and FloatType to REAL for MsSqlServerDialect
## What changes were proposed in this pull request?
This PR aims to correct mappings in `MsSqlServerDialect`. `ShortType` is mapped to `SMALLINT` and `FloatType` is mapped to `REAL` per [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017) respectively.

ShortType and FloatTypes are not correctly mapped to right JDBC types when using JDBC connector. This results in tables and spark data frame being created with unintended types. The issue was observed when validating against SQLServer.

Refer [JBDC mapping]( https://docs.microsoft.com/en-us/sql/connect/jdbc/using-basic-data-types?view=sql-server-2017  ) for guidance on mappings between SQLServer, JDBC and Java. Note that java "Short" type should be mapped to JDBC "SMALLINT" and java Float should be mapped to JDBC "REAL".

Some example issue that can happen because of wrong mappings
    - Write from df with column type results in a SQL table of with column type as INTEGER as opposed to SMALLINT.Thus a larger table that expected.
    - Read results in a dataframe with type INTEGER as opposed to ShortType

- ShortType has a problem in both the the write and read path
- FloatTypes only have an issue with read path. In the write path Spark data type 'FloatType' is correctly mapped to JDBC equivalent data type 'Real'. But in the read path when JDBC data types need to be converted to Catalyst data types ( getCatalystType) 'Real' gets incorrectly gets mapped to 'DoubleType' rather than 'FloatType'.

Refer #28151 which contained this fix as one part of a larger PR.  Following PR #28151 discussion it was decided to file seperate PRs for each of the fixes.

## How was this patch tested?
UnitTest added in JDBCSuite.scala and these were tested.
Integration test updated and passed in MsSqlServerDialect.scala
E2E test done with SQLServer

Closes #25146 from shivsood/float_short_type_fix.

Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-15 12:12:36 -07:00
Gabor Somogyi 8f7ccc5e9c [SPARK-28404][SS] Fix negative timeout value in RateStreamContinuousPartitionReader
## What changes were proposed in this pull request?

`System.currentTimeMillis` read two times in a loop in `RateStreamContinuousPartitionReader`. If the test machine is slow enough and it spends quite some time between the `while` condition check and the `Thread.sleep` then the timeout value is negative and throws `IllegalArgumentException`.

In this PR I've fixed this issue.

## How was this patch tested?

Existing unit tests.

Closes #25162 from gaborgsomogyi/SPARK-28404.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-07-15 11:01:03 -07:00
Yesheng Ma 2f3997fddc [SPARK-28306][SQL][FOLLOWUP] Fix NormalizeFloatingNumbers rule idempotence for equi-join with <=> predicates
## What changes were proposed in this pull request?
Idempotence of the `NormalizeFloatingNumbers` rule was broken due to the implementation of `ExtractEquiJoinKeys`. There is no reason that we don't remove `EqualNullSafe` join keys from an equi-join's `otherPredicates`.

## How was this patch tested?
A new UT.

Closes #25126 from yeshengm/spark-28306.

Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2019-07-15 10:38:49 -07:00
Marcelo Vanzin 8d1e87ac90 [SPARK-28150][CORE][FOLLOW-UP] Don't try to log in when impersonating.
When fetching delegation tokens for a proxy user, don't try to log in,
since it will fail.

Closes #25141 from vanzin/SPARK-28150.2.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-15 10:32:34 -07:00
朱夷 8ecbb67b3b [SPARK-28311][SQL] Fix STS OpenSession failed return wrong origin PROTOCOL_VERSION
## What changes were proposed in this pull request?

For Thrift server, It's downward compatible. Such as if a PROTOCOL_VERSION_V7 client connect to a  PROTOCOL_VERSION_V8 server, when OpenSession, server will change his response's protocol version to min of (client and server).
`TProtocolVersion protocol = getMinVersion(CLIService.SERVER_VERSION,`
 `       req.getClient_protocol());`
then set it to OpenSession's response.

But if OpenSession failed , it won't execute behavior of reset response's protocol_version.
Then it will return server's origin protocol version.
Finally client will get en error as below:
![image](https://user-images.githubusercontent.com/46485123/61023164-54f4b780-a3db-11e9-8c49-60217b36287b.png)
Since we write a wrong database,, OpenSession failed, right protocol version haven't been rest.

## How was this patch tested?

Since I really don't know how to write unit test about this, so I build a jar with this PR,and retry the error above, then it will return a reasonable Error of DB not found :
![image](https://user-images.githubusercontent.com/46485123/61023923-67242500-a3de-11e9-8e98-8f391a038480.png)

Closes #25083 from AngersZhuuuu/SPARK-28311.

Authored-by: 朱夷 <zhuyi01@corp.netease.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-07-15 09:35:56 -05:00
Maxim Gekk f241fc7776 [SPARK-28389][SQL] Use Java 8 API in add_months
## What changes were proposed in this pull request?

In the PR, I propose to use the `plusMonths()` method of `LocalDate` to add months to a date. This method adds the specified amount to the months field of `LocalDate` in three steps:
1. Add the input months to the month-of-year field
2. Check if the resulting date would be invalid
3. Adjust the day-of-month to the last valid day if necessary

The difference between current behavior and propose one is in handling the last day of month in the original date. For example, adding 1 month to `2019-02-28` will produce `2019-03-28` comparing to the current implementation where the result is `2019-03-31`.

The proposed behavior is implemented in MySQL and PostgreSQL.

## How was this patch tested?

By existing test suites `DateExpressionsSuite`, `DateFunctionsSuite` and `DateTimeUtilsSuite`.

Closes #25153 from MaxGekk/add-months.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-07-15 20:49:39 +08:00