ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Mick Jermsurawong	b79cf0d143	[SPARK-28224][SQL] Check overflow in decimal Sum aggregate ## What changes were proposed in this pull request? - Currently `sum` in aggregates for decimal type can overflow and return null. - `Sum` expression codegens arithmetic on `sql.Decimal` and the output which preserves scale and precision goes into `UnsafeRowWriter`. Here overflowing will be converted to null when writing out. - It also does not go through this branch in `DecimalAggregates` because it's expecting precision of the sum (not the elements to be summed) to be less than 5. `4ebff5b6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (L1400-L1403)` - This PR adds the check at the final result of the sum operator itself. `4ebff5b6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (L372-L376)` https://issues.apache.org/jira/browse/SPARK-28224 ## How was this patch tested? - Added an integration test on dataframe suite cc mgaido91 JoshRosen Closes #25033 from mickjermsurawong-stripe/SPARK-28224. Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-08-20 09:47:04 +09:00
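The overflow handling described in SPARK-28224 can be illustrated outside Spark. What follows is a minimal Python sketch, not the Catalyst implementation: `checked_decimal_sum` and its `null_on_overflow` flag are hypothetical names chosen for illustration. The only idea taken from the commit message is that the check runs once, on the final sum, against the precision and scale of the result type.

```python
from decimal import Decimal, localcontext
from typing import Iterable, Optional


def checked_decimal_sum(
    values: Iterable[Decimal],
    precision: int,
    scale: int,
    null_on_overflow: bool = True,  # hypothetical flag, for illustration only
) -> Optional[Decimal]:
    """Sum decimals, then check the final result against DECIMAL(precision, scale)."""
    with localcontext() as ctx:
        # Work with far more digits than the target type so the running sum is exact.
        ctx.prec = 100
        total = sum(values, Decimal(0))
        # DECIMAL(precision, scale) leaves (precision - scale) digits before the point.
        limit = Decimal(10) ** (precision - scale)
        if abs(total) >= limit:
            if null_on_overflow:
                # Mirrors the behaviour the commit message describes: overflow becomes null.
                return None
            raise ArithmeticError(
                f"sum {total} does not fit DECIMAL({precision}, {scale})"
            )
        return total
```

With `precision=5, scale=2`, summing `Decimal("999.99")` and `Decimal("0.01")` gives `1000.00`, which needs four integer digits and therefore overflows. The sketch returns `None` by default and raises when `null_on_overflow=False`.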
Huaxin Gao	ec14b6eb65	[SPARK-28393][SQL][PYTHON][TESTS] Convert and port 'pgSQL/join.sql' into UDF test base ## What changes were proposed in this pull request? This PR adds some tests converted from ```pgSQL/join.sql``` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921). <details><summary>Diff comparing to 'join.sql'</summary> <p> ```diff diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out index f75fe05196..ad2b5dd0db 100644 --- a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out -240,10 +240,10 struct<> -- !query 27 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t) FROM J1_TBL AS tx -- !query 27 schema -struct<xxx:string,i:int,j:int,t:string> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string> -- !query 27 output 0 NULL zero 1 4 one -259,10 +259,10 struct<xxx:string,i:int,j:int,t:string> -- !query 28 -SELECT '' AS `xxx`, * +SELECT udf(udf('')) AS `xxx`, udf(udf(i)), udf(j), udf(t) FROM J1_TBL tx -- !query 28 schema -struct<xxx:string,i:int,j:int,t:string> +struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string> -- !query 28 output 0 NULL zero 1 4 one -278,10 +278,10 struct<xxx:string,i:int,j:int,t:string> -- !query 29 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, a, udf(udf(b)), c FROM J1_TBL AS t1 (a, b, c) -- !query 29 schema -struct<xxx:string,a:int,b:int,c:string> +struct<xxx:string,a:int,CAST(udf(cast(cast(udf(cast(b as string)) as int) as string)) AS INT):int,c:string> -- !query 29 output 0 NULL zero 1 4 one -297,10 +297,10 
struct<xxx:string,a:int,b:int,c:string> -- !query 30 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(a), udf(b), udf(udf(c)) FROM J1_TBL t1 (a, b, c) -- !query 30 schema -struct<xxx:string,a:int,b:int,c:string> +struct<xxx:string,CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(c as string)) as string) as string)) AS STRING):string> -- !query 30 output 0 NULL zero 1 4 one -316,10 +316,10 struct<xxx:string,a:int,b:int,c:string> -- !query 31 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(a), b, udf(c), udf(d), e FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e) -- !query 31 schema -struct<xxx:string,a:int,b:int,c:string,d:int,e:int> +struct<xxx:string,CAST(udf(cast(a as string)) AS INT):int,b:int,CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(d as string)) AS INT):int,e:int> -- !query 31 output 0 NULL zero 0 NULL 0 NULL zero 1 -1 -423,7 +423,7 struct<xxx:string,a:int,b:int,c:string,d:int,e:int> -- !query 32 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, * FROM J1_TBL CROSS JOIN J2_TBL -- !query 32 schema struct<xxx:string,i:int,j:int,t:string,i:int,k:int> -530,20 +530,20 struct<xxx:string,i:int,j:int,t:string,i:int,k:int> -- !query 33 -SELECT '' AS `xxx`, i, k, t +SELECT udf('') AS `xxx`, udf(i) AS i, udf(k), udf(t) AS t FROM J1_TBL CROSS JOIN J2_TBL -- !query 33 schema struct<> -- !query 33 output org.apache.spark.sql.AnalysisException -Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; line 1 pos 20 +Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; line 1 pos 29 -- !query 34 -SELECT '' AS `xxx`, t1.i, k, t +SELECT udf('') AS `xxx`, udf(t1.i) AS i, udf(k), udf(t) FROM J1_TBL t1 CROSS JOIN J2_TBL t2 -- !query 34 schema -struct<xxx:string,i:int,k:int,t:string> +struct<xxx:string,i:int,CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string> -- !query 34 output 0 -1 zero 0 -3 zero -647,11 +647,11 
struct<xxx:string,i:int,k:int,t:string> -- !query 35 -SELECT '' AS `xxx`, ii, tt, kk +SELECT udf(udf('')) AS `xxx`, udf(udf(ii)) AS ii, udf(udf(tt)) AS tt, udf(udf(kk)) FROM (J1_TBL CROSS JOIN J2_TBL) AS tx (ii, jj, tt, ii2, kk) -- !query 35 schema -struct<xxx:string,ii:int,tt:string,kk:int> +struct<xxx:string,ii:int,tt:string,CAST(udf(cast(cast(udf(cast(kk as string)) as int) as string)) AS INT):int> -- !query 35 output 0 zero -1 0 zero -3 -755,10 +755,10 struct<xxx:string,ii:int,tt:string,kk:int> -- !query 36 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(udf(j1_tbl.i)), udf(j), udf(t), udf(a.i), udf(a.k), udf(b.i), udf(b.k) FROM J1_TBL CROSS JOIN J2_TBL a CROSS JOIN J2_TBL b -- !query 36 schema -struct<xxx:string,i:int,j:int,t:string,i:int,k:int,i:int,k:int> +struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int> -- !query 36 output 0 NULL zero 0 NULL 0 NULL 0 NULL zero 0 NULL 1 -1 -1654,10 +1654,10 struct<xxx:string,i:int,j:int,t:string,i:int,k:int,i:int,k:int> -- !query 37 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i) AS i, udf(j), udf(t) AS t, udf(k) FROM J1_TBL INNER JOIN J2_TBL USING (i) -- !query 37 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,i:int,CAST(udf(cast(j as string)) AS INT):int,t:string,CAST(udf(cast(k as string)) AS INT):int> -- !query 37 output 0 NULL zero NULL 1 4 one -1 -1669,10 +1669,10 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 38 -SELECT '' AS `xxx`, * +SELECT udf(udf('')) AS `xxx`, udf(i), udf(j) AS j, udf(t), udf(k) AS k FROM J1_TBL JOIN J2_TBL USING (i) -- !query 38 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,j:int,CAST(udf(cast(t as 
string)) AS STRING):string,k:int> -- !query 38 output 0 NULL zero NULL 1 4 one -1 -1684,9 +1684,9 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 39 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, * FROM J1_TBL t1 (a, b, c) JOIN J2_TBL t2 (a, d) USING (a) - ORDER BY a, d + ORDER BY udf(udf(a)), udf(d) -- !query 39 schema struct<xxx:string,a:int,b:int,c:string,d:int> -- !query 39 output -1700,10 +1700,10 struct<xxx:string,a:int,b:int,c:string,d:int> -- !query 40 -SELECT '' AS `xxx`, * +SELECT udf(udf('')) AS `xxx`, udf(i), udf(j), udf(t), udf(k) FROM J1_TBL NATURAL JOIN J2_TBL -- !query 40 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int> -- !query 40 output 0 NULL zero NULL 1 4 one -1 -1715,10 +1715,10 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 41 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(udf(udf(a))) AS a, udf(b), udf(c), udf(d) FROM J1_TBL t1 (a, b, c) NATURAL JOIN J2_TBL t2 (a, d) -- !query 41 schema -struct<xxx:string,a:int,b:int,c:string,d:int> +struct<xxx:string,a:int,CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(d as string)) AS INT):int> -- !query 41 output 0 NULL zero NULL 1 4 one -1 -1730,10 +1730,10 struct<xxx:string,a:int,b:int,c:string,d:int> -- !query 42 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(udf(a)), udf(udf(b)), udf(udf(c)) AS c, udf(udf(udf(d))) AS d FROM J1_TBL t1 (a, b, c) NATURAL JOIN J2_TBL t2 (d, a) -- !query 42 schema -struct<xxx:string,a:int,b:int,c:string,d:int> +struct<xxx:string,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(b as string)) as int) as string)) AS INT):int,c:string,d:int> -- !query 42 output 0 NULL zero NULL 2 3 two 2 -1741,10 +1741,10 struct<xxx:string,a:int,b:int,c:string,d:int> -- 
!query 43 -SELECT '' AS `xxx`, * - FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = J2_TBL.i) +SELECT udf('') AS `xxx`, udf(J1_TBL.i), udf(udf(J1_TBL.j)), udf(J1_TBL.t), udf(J2_TBL.i), udf(J2_TBL.k) + FROM J1_TBL JOIN J2_TBL ON (udf(J1_TBL.i) = J2_TBL.i) -- !query 43 schema -struct<xxx:string,i:int,j:int,t:string,i:int,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int> -- !query 43 output 0 NULL zero 0 NULL 1 4 one 1 -1 -1756,10 +1756,10 struct<xxx:string,i:int,j:int,t:string,i:int,k:int> -- !query 44 -SELECT '' AS `xxx`, * - FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = J2_TBL.k) +SELECT udf('') AS `xxx`, udf(udf(J1_TBL.i)), udf(udf(J1_TBL.j)), udf(udf(J1_TBL.t)), J2_TBL.i, J2_TBL.k + FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = udf(J2_TBL.k)) -- !query 44 schema -struct<xxx:string,i:int,j:int,t:string,i:int,k:int> +struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(t as string)) as string) as string)) AS STRING):string,i:int,k:int> -- !query 44 output 0 NULL zero NULL 0 2 3 two 2 2 -1767,10 +1767,10 struct<xxx:string,i:int,j:int,t:string,i:int,k:int> -- !query 45 -SELECT '' AS `xxx`, * - FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i <= J2_TBL.k) +SELECT udf('') AS `xxx`, udf(J1_TBL.i), udf(J1_TBL.j), udf(J1_TBL.t), udf(J2_TBL.i), udf(J2_TBL.k) + FROM J1_TBL JOIN J2_TBL ON (udf(J1_TBL.i) <= udf(udf(J2_TBL.k))) -- !query 45 schema -struct<xxx:string,i:int,j:int,t:string,i:int,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int> -- !query 45 output 0 NULL zero 2 2 0 
NULL zero 2 4 -1784,11 +1784,11 struct<xxx:string,i:int,j:int,t:string,i:int,k:int> -- !query 46 -SELECT '' AS `xxx`, * +SELECT udf(udf('')) AS `xxx`, udf(i), udf(j), udf(t), udf(k) FROM J1_TBL LEFT OUTER JOIN J2_TBL USING (i) - ORDER BY i, k, t + ORDER BY udf(udf(i)), udf(k), udf(t) -- !query 46 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int> -- !query 46 output NULL NULL null NULL NULL 0 zero NULL -1806,11 +1806,11 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 47 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k) FROM J1_TBL LEFT JOIN J2_TBL USING (i) - ORDER BY i, k, t + ORDER BY udf(i), udf(udf(k)), udf(t) -- !query 47 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int> -- !query 47 output NULL NULL null NULL NULL 0 zero NULL -1828,10 +1828,10 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 48 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(udf(i)), udf(j), udf(t), udf(k) FROM J1_TBL RIGHT OUTER JOIN J2_TBL USING (i) -- !query 48 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int> -- !query 48 output 0 NULL zero NULL 1 4 one -1 -1845,10 +1845,10 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 49 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(udf(j)), udf(t), udf(k) FROM J1_TBL RIGHT JOIN J2_TBL USING (i) -- !query 49 schema -struct<xxx:string,i:int,j:int,t:string,k:int> 
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int> -- !query 49 output 0 NULL zero NULL 1 4 one -1 -1862,11 +1862,11 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 50 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(udf(t)), udf(k) FROM J1_TBL FULL OUTER JOIN J2_TBL USING (i) - ORDER BY i, k, t + ORDER BY udf(udf(i)), udf(k), udf(t) -- !query 50 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(t as string)) as string) as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int> -- !query 50 output NULL NULL NULL NULL NULL NULL null NULL -1886,11 +1886,11 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 51 -SELECT '' AS `xxx`, * +SELECT udf('') AS `xxx`, udf(i), udf(j), t, udf(udf(k)) FROM J1_TBL FULL JOIN J2_TBL USING (i) - ORDER BY i, k, t + ORDER BY udf(udf(i)), udf(k), udf(udf(t)) -- !query 51 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,t:string,CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int> -- !query 51 output NULL NULL NULL NULL NULL NULL null NULL -1910,19 +1910,19 struct<xxx:string,i:int,j:int,t:string,k:int> -- !query 52 -SELECT '' AS `xxx`, * - FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (k = 1) +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(udf(k)) + FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (udf(k) = 1) -- !query 52 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(cast(udf(cast(k as string)) as int) as 
string)) AS INT):int> -- !query 52 output -- !query 53 -SELECT '' AS `xxx`, * - FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (i = 1) +SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k) + FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (udf(udf(i)) = udf(1)) -- !query 53 schema -struct<xxx:string,i:int,j:int,t:string,k:int> +struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int> -- !query 53 output 1 4 one -1 -2020,9 +2020,9 ee NULL 42 NULL -- !query 65 SELECT * FROM -(SELECT * FROM t2) as s2 +(SELECT udf(name) as name, t2.n FROM t2) as s2 INNER JOIN -(SELECT * FROM t3) s3 +(SELECT udf(udf(name)) as name, t3.n FROM t3) s3 USING (name) -- !query 65 schema struct<name:string,n:int,n:int> -2033,9 +2033,9 cc 22 23 -- !query 66 SELECT * FROM -(SELECT * FROM t2) as s2 +(SELECT udf(udf(name)) as name, t2.n FROM t2) as s2 LEFT JOIN -(SELECT * FROM t3) s3 +(SELECT udf(name) as name, t3.n FROM t3) s3 USING (name) -- !query 66 schema struct<name:string,n:int,n:int> -2046,13 +2046,13 ee 42 NULL -- !query 67 -SELECT * FROM +SELECT udf(name), udf(udf(s2.n)), udf(s3.n) FROM (SELECT * FROM t2) as s2 FULL JOIN (SELECT * FROM t3) s3 USING (name) -- !query 67 schema -struct<name:string,n:int,n:int> +struct<CAST(udf(cast(name as string)) AS STRING):string,CAST(udf(cast(cast(udf(cast(n as string)) as int) as string)) AS INT):int,CAST(udf(cast(n as string)) AS INT):int> -- !query 67 output bb 12 13 cc 22 23 -2062,9 +2062,9 ee 42 NULL -- !query 68 SELECT * FROM -(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2 +(SELECT udf(udf(name)) as name, udf(n) as s2_n, udf(2) as s2_2 FROM t2) as s2 NATURAL INNER JOIN -(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3 +(SELECT udf(name) as name, udf(udf(n)) as s3_n, udf(3) as s3_2 FROM t3) s3 -- !query 68 schema struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int> -- !query 68 output -2074,9 +2074,9 cc 22 2 23 3 -- 
!query 69 SELECT * FROM -(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2 +(SELECT udf(name) as name, udf(udf(n)) as s2_n, 2 as s2_2 FROM t2) as s2 NATURAL LEFT JOIN -(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3 +(SELECT udf(udf(name)) as name, udf(n) as s3_n, 3 as s3_2 FROM t3) s3 -- !query 69 schema struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int> -- !query 69 output -2087,9 +2087,9 ee 42 2 NULL NULL -- !query 70 SELECT * FROM -(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2 +(SELECT udf(name) as name, udf(n) as s2_n, 2 as s2_2 FROM t2) as s2 NATURAL FULL JOIN -(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3 +(SELECT udf(udf(name)) as name, udf(udf(n)) as s3_n, 3 as s3_2 FROM t3) s3 -- !query 70 schema struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int> -- !query 70 output -2101,11 +2101,11 ee 42 2 NULL NULL -- !query 71 SELECT * FROM -(SELECT name, n as s1_n, 1 as s1_1 FROM t1) as s1 +(SELECT udf(udf(name)) as name, udf(n) as s1_n, 1 as s1_1 FROM t1) as s1 NATURAL INNER JOIN -(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2 +(SELECT udf(name) as name, udf(n) as s2_n, 2 as s2_2 FROM t2) as s2 NATURAL INNER JOIN -(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3 +(SELECT udf(udf(udf(name))) as name, udf(n) as s3_n, 3 as s3_2 FROM t3) s3 -- !query 71 schema struct<name:string,s1_n:int,s1_1:int,s2_n:int,s2_2:int,s3_n:int,s3_2:int> -- !query 71 output -2114,11 +2114,11 bb 11 1 12 2 13 3 -- !query 72 SELECT * FROM -(SELECT name, n as s1_n, 1 as s1_1 FROM t1) as s1 +(SELECT udf(name) as name, udf(n) as s1_n, udf(udf(1)) as s1_1 FROM t1) as s1 NATURAL FULL JOIN -(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2 +(SELECT udf(name) as name, udf(udf(n)) as s2_n, udf(2) as s2_2 FROM t2) as s2 NATURAL FULL JOIN -(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3 +(SELECT udf(udf(name)) as name, udf(n) as s3_n, udf(3) as s3_2 FROM t3) s3 -- !query 72 schema struct<name:string,s1_n:int,s1_1:int,s2_n:int,s2_2:int,s3_n:int,s3_2:int> -- !query 72 output -2129,16 +2129,16 
ee NULL NULL 42 2 NULL NULL -- !query 73 -SELECT * FROM -(SELECT name, n as s1_n FROM t1) as s1 +SELECT name, udf(udf(s1_n)), udf(s2_n), udf(s3_n) FROM +(SELECT name, udf(udf(n)) as s1_n FROM t1) as s1 NATURAL FULL JOIN (SELECT * FROM - (SELECT name, n as s2_n FROM t2) as s2 + (SELECT name, udf(n) as s2_n FROM t2) as s2 NATURAL FULL JOIN - (SELECT name, n as s3_n FROM t3) as s3 + (SELECT name, udf(udf(n)) as s3_n FROM t3) as s3 ) ss2 -- !query 73 schema -struct<name:string,s1_n:int,s2_n:int,s3_n:int> +struct<name:string,CAST(udf(cast(cast(udf(cast(s1_n as string)) as int) as string)) AS INT):int,CAST(udf(cast(s2_n as string)) AS INT):int,CAST(udf(cast(s3_n as string)) AS INT):int> -- !query 73 output bb 11 12 13 cc NULL 22 23 -2151,9 +2151,9 SELECT * FROM (SELECT name, n as s1_n FROM t1) as s1 NATURAL FULL JOIN (SELECT * FROM - (SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2 + (SELECT name, udf(udf(n)) as s2_n, 2 as s2_2 FROM t2) as s2 NATURAL FULL JOIN - (SELECT name, n as s3_n FROM t3) as s3 + (SELECT name, udf(n) as s3_n FROM t3) as s3 ) ss2 -- !query 74 schema struct<name:string,s1_n:int,s2_n:int,s2_2:int,s3_n:int> -2165,13 +2165,13 ee NULL 42 2 NULL -- !query 75 -SELECT * FROM - (SELECT name, n as s1_n FROM t1) as s1 +SELECT s1.name, udf(s1_n), s2.name, udf(udf(s2_n)) FROM + (SELECT name, udf(n) as s1_n FROM t1) as s1 FULL JOIN (SELECT name, 2 as s2_n FROM t2) as s2 -ON (s1_n = s2_n) +ON (udf(udf(s1_n)) = udf(s2_n)) -- !query 75 schema -struct<name:string,s1_n:int,name:string,s2_n:int> +struct<name:string,CAST(udf(cast(s1_n as string)) AS INT):int,name:string,CAST(udf(cast(cast(udf(cast(s2_n as string)) as int) as string)) AS INT):int> -- !query 75 output NULL NULL bb 2 NULL NULL cc 2 -2200,9 +2200,9 struct<> -- !query 78 -select * from x +select udf(udf(x1)), udf(x2) from x -- !query 78 schema -struct<x1:int,x2:int> +struct<CAST(udf(cast(cast(udf(cast(x1 as string)) as int) as string)) AS INT):int,CAST(udf(cast(x2 as string)) AS INT):int> -- !query 78 
output 1 11 2 22 -2212,9 +2212,9 struct<x1:int,x2:int> -- !query 79 -select * from y +select udf(y1), udf(udf(y2)) from y -- !query 79 schema -struct<y1:int,y2:int> +struct<CAST(udf(cast(y1 as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(y2 as string)) as int) as string)) AS INT):int> -- !query 79 output 1 111 2 222 -2223,7 +2223,7 struct<y1:int,y2:int> -- !query 80 -select * from x left join y on (x1 = y1 and x2 is not null) +select * from x left join y on (udf(x1) = udf(udf(y1)) and udf(x2) is not null) -- !query 80 schema struct<x1:int,x2:int,y1:int,y2:int> -- !query 80 output -2235,7 +2235,7 struct<x1:int,x2:int,y1:int,y2:int> -- !query 81 -select * from x left join y on (x1 = y1 and y2 is not null) +select * from x left join y on (udf(udf(x1)) = udf(y1) and udf(y2) is not null) -- !query 81 schema struct<x1:int,x2:int,y1:int,y2:int> -- !query 81 output -2247,8 +2247,8 struct<x1:int,x2:int,y1:int,y2:int> -- !query 82 -select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2) -on (x1 = xx1) +select * from (x left join y on (udf(x1) = udf(udf(y1)))) left join x xx(xx1,xx2) +on (udf(udf(x1)) = udf(xx1)) -- !query 82 schema struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 82 output -2260,8 +2260,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 83 -select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2) -on (x1 = xx1 and x2 is not null) +select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2) +on (udf(x1) = xx1 and udf(x2) is not null) -- !query 83 schema struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 83 output -2273,8 +2273,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 84 -select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2) -on (x1 = xx1 and y2 is not null) +select * from (x left join y on (x1 = udf(y1))) left join x xx(xx1,xx2) +on (udf(x1) = udf(udf(xx1)) and udf(y2) is not null) -- !query 84 schema 
struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 84 output -2286,8 +2286,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 85 -select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2) -on (x1 = xx1 and xx2 is not null) +select * from (x left join y on (udf(x1) = y1)) left join x xx(xx1,xx2) +on (udf(udf(x1)) = udf(xx1) and udf(udf(xx2)) is not null) -- !query 85 schema struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 85 output -2299,8 +2299,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 86 -select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2) -on (x1 = xx1) where (x2 is not null) +select * from (x left join y on (udf(udf(x1)) = udf(udf(y1)))) left join x xx(xx1,xx2) +on (udf(x1) = udf(xx1)) where (udf(x2) is not null) -- !query 86 schema struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 86 output -2310,8 +2310,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 87 -select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2) -on (x1 = xx1) where (y2 is not null) +select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2) +on (udf(x1) = xx1) where (udf(y2) is not null) -- !query 87 schema struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 87 output -2321,8 +2321,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 88 -select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2) -on (x1 = xx1) where (xx2 is not null) +select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2) +on (x1 = udf(xx1)) where (xx2 is not null) -- !query 88 schema struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 88 output -2332,75 +2332,75 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int> -- !query 89 -select count(*) from tenk1 a where unique1 in - (select unique1 from tenk1 b join tenk1 c using (unique1) - where b.unique2 = 42) +select udf(udf(count(*))) from tenk1 a where udf(udf(unique1)) in + (select 
udf(unique1) from tenk1 b join tenk1 c using (unique1) + where udf(udf(b.unique2)) = udf(42)) -- !query 89 schema -struct<count(1):bigint> +struct<CAST(udf(cast(cast(udf(cast(count(1) as string)) as bigint) as string)) AS BIGINT):bigint> -- !query 89 output 1 -- !query 90 -select count(*) from tenk1 x where - x.unique1 in (select a.f1 from int4_tbl a,float8_tbl b where a.f1=b.f1) and - x.unique1 = 0 and - x.unique1 in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=bb.f1) +select udf(count(*)) from tenk1 x where + udf(x.unique1) in (select udf(a.f1) from int4_tbl a,float8_tbl b where udf(udf(a.f1))=b.f1) and + udf(x.unique1) = 0 and + udf(x.unique1) in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=udf(udf(bb.f1))) -- !query 90 schema -struct<count(1):bigint> +struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 90 output 1 -- !query 91 -select count(*) from tenk1 x where - x.unique1 in (select a.f1 from int4_tbl a,float8_tbl b where a.f1=b.f1) and - x.unique1 = 0 and - x.unique1 in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=bb.f1) +select udf(udf(count(*))) from tenk1 x where + udf(x.unique1) in (select udf(a.f1) from int4_tbl a,float8_tbl b where udf(udf(a.f1))=b.f1) and + udf(x.unique1) = 0 and + udf(udf(x.unique1)) in (select udf(aa.f1) from int4_tbl aa,float8_tbl bb where udf(aa.f1)=udf(udf(bb.f1))) -- !query 91 schema -struct<count(1):bigint> +struct<CAST(udf(cast(cast(udf(cast(count(1) as string)) as bigint) as string)) AS BIGINT):bigint> -- !query 91 output 1 -- !query 92 select * from int8_tbl i1 left join (int8_tbl i2 join - (select 123 as x) ss on i2.q1 = x) on i1.q2 = i2.q2 -order by 1, 2 + (select udf(123) as x) ss on udf(udf(i2.q1)) = udf(x)) on udf(udf(i1.q2)) = udf(udf(i2.q2)) +order by udf(udf(1)), 2 -- !query 92 schema struct<q1:bigint,q2:bigint,q1:bigint,q2:bigint,x:int> -- !query 92 output -123 456 123 456 123 -123 4567890123456789 123 4567890123456789 123 4567890123456789 -4567890123456789 NULL 
NULL NULL 4567890123456789 123 NULL NULL NULL +123 456 123 456 123 +123 4567890123456789 123 4567890123456789 123 4567890123456789 4567890123456789 123 4567890123456789 123 -- !query 93 -select count(*) +select udf(count(*)) from - (select t3.tenthous as x1, coalesce(t1.stringu1, t2.stringu1) as x2 + (select udf(t3.tenthous) as x1, udf(coalesce(udf(t1.stringu1), udf(t2.stringu1))) as x2 from tenk1 t1 - left join tenk1 t2 on t1.unique1 = t2.unique1 - join tenk1 t3 on t1.unique2 = t3.unique2) ss, + left join tenk1 t2 on udf(t1.unique1) = udf(t2.unique1) + join tenk1 t3 on t1.unique2 = udf(t3.unique2)) ss, tenk1 t4, tenk1 t5 -where t4.thousand = t5.unique1 and ss.x1 = t4.tenthous and ss.x2 = t5.stringu1 +where udf(t4.thousand) = udf(t5.unique1) and udf(udf(ss.x1)) = t4.tenthous and udf(ss.x2) = udf(udf(t5.stringu1)) -- !query 93 schema -struct<count(1):bigint> +struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 93 output 1000 -- !query 94 -select a.f1, b.f1, t.thousand, t.tenthous from +select udf(a.f1), udf(b.f1), udf(t.thousand), udf(t.tenthous) from tenk1 t, - (select sum(f1)+1 as f1 from int4_tbl i4a) a, - (select sum(f1) as f1 from int4_tbl i4b) b -where b.f1 = t.thousand and a.f1 = b.f1 and (a.f1+b.f1+999) = t.tenthous + (select udf(udf(sum(udf(f1))+1)) as f1 from int4_tbl i4a) a, + (select udf(sum(udf(f1))) as f1 from int4_tbl i4b) b +where b.f1 = udf(t.thousand) and udf(a.f1) = udf(b.f1) and udf((udf(a.f1)+udf(b.f1)+999)) = udf(udf(t.tenthous)) -- !query 94 schema -struct<f1:bigint,f1:bigint,thousand:int,tenthous:int> +struct<CAST(udf(cast(f1 as string)) AS BIGINT):bigint,CAST(udf(cast(f1 as string)) AS BIGINT):bigint,CAST(udf(cast(thousand as string)) AS INT):int,CAST(udf(cast(tenthous as string)) AS INT):int> -- !query 94 output -2408,8 +2408,8 struct<f1:bigint,f1:bigint,thousand:int,tenthous:int> -- !query 95 select * from j1_tbl full join - (select * from j2_tbl order by j2_tbl.i desc, j2_tbl.k asc) j2_tbl - on j1_tbl.i = j2_tbl.i and 
j1_tbl.i = j2_tbl.k + (select * from j2_tbl order by udf(udf(j2_tbl.i)) desc, udf(j2_tbl.k) asc) j2_tbl + on udf(j1_tbl.i) = udf(j2_tbl.i) and udf(j1_tbl.i) = udf(j2_tbl.k) -- !query 95 schema struct<i:int,j:int,t:string,i:int,k:int> -- !query 95 output -2435,13 +2435,13 NULL NULL null NULL NULL -- !query 96 -select count(*) from - (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x +select udf(count(*)) from + (select * from tenk1 x order by udf(x.thousand), udf(udf(x.twothousand)), x.fivethous) x left join - (select * from tenk1 y order by y.unique2) y - on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2 + (select * from tenk1 y order by udf(y.unique2)) y + on udf(x.thousand) = y.unique2 and x.twothousand = udf(y.hundred) and x.fivethous = y.unique2 -- !query 96 schema -struct<count(1):bigint> +struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 96 output 10000 -2507,7 +2507,7 struct<> -- !query 104 -select tt1.*, tt2.* from tt1 left join tt2 on tt1.joincol = tt2.joincol +select tt1.*, tt2.* from tt1 left join tt2 on udf(udf(tt1.joincol)) = udf(tt2.joincol) -- !query 104 schema struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int> -- !query 104 output -2517,7 +2517,7 struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int> -- !query 105 -select tt1.*, tt2.* from tt2 right join tt1 on tt1.joincol = tt2.joincol +select tt1.*, tt2.* 
from tt2 right join tt1 on udf(udf(tt1.joincol)) = udf(udf(tt2.joincol)) -- !query 105 schema struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int> -- !query 105 output -2527,10 +2527,10 struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int> -- !query 106 -select count(*) from tenk1 a, tenk1 b - where a.hundred = b.thousand and (b.fivethous % 10) < 10 +select udf(count(*)) from tenk1 a, tenk1 b + where udf(a.hundred) = b.thousand and udf(udf((b.fivethous % 10)) < 10) -- !query 106 schema -struct<count(1):bigint> +struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 106 output 100000 -2584,14 +2584,14 struct<> -- !query 113 -SELECT a.f1 +SELECT udf(udf(a.f1)) as f1 FROM tt4 a LEFT JOIN ( SELECT b.f1 - FROM tt3 b LEFT JOIN tt3 c ON (b.f1 = c.f1) - WHERE c.f1 IS NULL -) AS d ON (a.f1 = d.f1) -WHERE d.f1 IS NULL + FROM tt3 b LEFT JOIN tt3 c ON udf(b.f1) = udf(c.f1) + WHERE udf(c.f1) IS NULL +) AS d ON udf(a.f1) = d.f1 +WHERE udf(udf(d.f1)) IS NULL -- !query 113 schema struct<f1:int> -- !query 113 output -2621,7 +2621,7 struct<> -- !query 116 -select * from tt5,tt6 where tt5.f1 = tt6.f1 and tt5.f1 = tt5.f2 - tt6.f2 +select * from tt5,tt6 where udf(tt5.f1) = udf(tt6.f1) and udf(tt5.f1) = udf(udf(tt5.f2) - udf(tt6.f2)) -- !query 116 schema struct<f1:int,f2:int,f1:int,f2:int> -- !query 116 output -2649,12 +2649,12 struct<> -- !query 119 -select yy.pkyy as yy_pkyy, yy.pkxx as yy_pkxx, yya.pkyy as yya_pkyy, - xxa.pkxx as xxa_pkxx, xxb.pkxx as xxb_pkxx +select udf(udf(yy.pkyy)) as yy_pkyy, udf(yy.pkxx) as yy_pkxx, udf(yya.pkyy) as yya_pkyy, + udf(xxa.pkxx) as xxa_pkxx, udf(xxb.pkxx) as xxb_pkxx from yy - left join (SELECT * FROM yy where pkyy = 101) as yya ON yy.pkyy = yya.pkyy - left join xx xxa on yya.pkxx = xxa.pkxx - left join xx xxb on coalesce (xxa.pkxx, 1) = xxb.pkxx + left join (SELECT * FROM yy where pkyy = 101) as yya ON udf(yy.pkyy) = udf(yya.pkyy) + left join xx xxa on udf(yya.pkxx) = udf(udf(xxa.pkxx)) + left join xx xxb on udf(udf(coalesce 
(xxa.pkxx, 1))) = udf(xxb.pkxx) -- !query 119 schema struct<yy_pkyy:int,yy_pkxx:int,yya_pkyy:int,xxa_pkxx:int,xxb_pkxx:int> -- !query 119 output -2693,9 +2693,9 struct<> -- !query 123 select * from - zt2 left join zt3 on (f2 = f3) - left join zt1 on (f3 = f1) -where f2 = 53 + zt2 left join zt3 on (udf(f2) = udf(udf(f3))) + left join zt1 on (udf(udf(f3)) = udf(f1)) +where udf(f2) = 53 -- !query 123 schema struct<f2:int,f3:int,f1:int> -- !query 123 output -2712,9 +2712,9 struct<> -- !query 125 select * from - zt2 left join zt3 on (f2 = f3) - left join zv1 on (f3 = f1) -where f2 = 53 + zt2 left join zt3 on (f2 = udf(f3)) + left join zv1 on (udf(f3) = f1) +where udf(udf(f2)) = 53 -- !query 125 schema struct<f2:int,f3:int,f1:int,junk:string> -- !query 125 output -2722,12 +2722,12 struct<f2:int,f3:int,f1:int,junk:string> -- !query 126 -select a.unique2, a.ten, b.tenthous, b.unique2, b.hundred -from tenk1 a left join tenk1 b on a.unique2 = b.tenthous -where a.unique1 = 42 and - ((b.unique2 is null and a.ten = 2) or b.hundred = 3) +select udf(a.unique2), udf(a.ten), udf(b.tenthous), udf(b.unique2), udf(b.hundred) +from tenk1 a left join tenk1 b on a.unique2 = udf(b.tenthous) +where udf(a.unique1) = 42 and + ((udf(b.unique2) is null and udf(a.ten) = 2) or udf(udf(b.hundred)) = udf(udf(3))) -- !query 126 schema -struct<unique2:int,ten:int,tenthous:int,unique2:int,hundred:int> +struct<CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(ten as string)) AS INT):int,CAST(udf(cast(tenthous as string)) AS INT):int,CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(hundred as string)) AS INT):int> -- !query 126 output -2749,7 +2749,7 struct<> -- !query 129 -select * from a left join b on i = x and i = y and x = i +select * from a left join b on udf(i) = x and i = udf(y) and udf(x) = udf(i) -- !query 129 schema struct<i:int,x:int,y:int> -- !query 129 output -2757,11 +2757,11 struct<i:int,x:int,y:int> -- !query 130 -select t1.q2, count(t2.) 
-from int8_tbl t1 left join int8_tbl t2 on (t1.q2 = t2.q1) -group by t1.q2 order by 1 +select udf(t1.q2), udf(count(t2.*)) +from int8_tbl t1 left join int8_tbl t2 on (udf(udf(t1.q2)) = t2.q1) +group by udf(t1.q2) order by 1 -- !query 130 schema -struct<q2:bigint,count(q1, q2):bigint> +struct<CAST(udf(cast(q2 as string)) AS BIGINT):bigint,CAST(udf(cast(count(q1, q2) as string)) AS BIGINT):bigint> -- !query 130 output -4567890123456789 0 123 2 -2770,11 +2770,11 struct<q2:bigint,count(q1, q2):bigint> -- !query 131 -select t1.q2, count(t2.*) -from int8_tbl t1 left join (select * from int8_tbl) t2 on (t1.q2 = t2.q1) -group by t1.q2 order by 1 +select udf(udf(t1.q2)), udf(count(t2.*)) +from int8_tbl t1 left join (select * from int8_tbl) t2 on (udf(udf(t1.q2)) = udf(t2.q1)) +group by udf(udf(t1.q2)) order by 1 -- !query 131 schema -struct<q2:bigint,count(q1, q2):bigint> +struct<CAST(udf(cast(cast(udf(cast(q2 as string)) as bigint) as string)) AS BIGINT):bigint,CAST(udf(cast(count(q1, q2) as string)) AS BIGINT):bigint> -- !query 131 output -4567890123456789 0 123 2 -2783,13 +2783,13 struct<q2:bigint,count(q1, q2):bigint> -- !query 132 -select t1.q2, count(t2.*)
+select udf(t1.q2) as q2, udf(udf(count(t2.))) from int8_tbl t1 left join - (select q1, case when q2=1 then 1 else q2 end as q2 from int8_tbl) t2 - on (t1.q2 = t2.q1) + (select udf(q1) as q1, case when q2=1 then 1 else q2 end as q2 from int8_tbl) t2 + on (udf(t1.q2) = udf(t2.q1)) group by t1.q2 order by 1 -- !query 132 schema -struct<q2:bigint,count(q1, q2):bigint> +struct<q2:bigint,CAST(udf(cast(cast(udf(cast(count(q1, q2) as string)) as bigint) as string)) AS BIGINT):bigint> -- !query 132 output -4567890123456789 0 123 2 -2828,17 +2828,17 struct<> -- !query 136 -select c.name, ss.code, ss.b_cnt, ss.const +select udf(c.name), udf(ss.code), udf(ss.b_cnt), udf(ss.const) from c left join (select a.code, coalesce(b_grp.cnt, 0) as b_cnt, -1 as const from a left join - (select count(1) as cnt, b.a from b group by b.a) as b_grp - on a.code = b_grp.a + (select udf(count(1)) as cnt, b.a as a from b group by b.a) as b_grp + on udf(a.code) = udf(udf(b_grp.a)) ) as ss - on (c.a = ss.code) + on (udf(udf(c.a)) = udf(ss.code)) order by c.name -- !query 136 schema -struct<name:string,code:string,b_cnt:bigint,const:int> +struct<CAST(udf(cast(name as string)) AS STRING):string,CAST(udf(cast(code as string)) AS STRING):string,CAST(udf(cast(b_cnt as string)) AS BIGINT):bigint,CAST(udf(cast(const as string)) AS INT):int> -- !query 136 output A p 2 -1 B q 0 -1 -2852,15 +2852,15 LEFT JOIN ( SELECT sub3.key3, sub4.value2, COALESCE(sub4.value2, 66) as value3 FROM ( SELECT 1 as key3 ) sub3 LEFT JOIN - ( SELECT sub5.key5, COALESCE(sub6.value1, 1) as value2 FROM + ( SELECT udf(sub5.key5) as key5, udf(udf(COALESCE(sub6.value1, 1))) as value2 FROM ( SELECT 1 as key5 ) sub5 LEFT JOIN ( SELECT 2 as key6, 42 as value1 ) sub6 - ON sub5.key5 = sub6.key6 + ON sub5.key5 = udf(sub6.key6) ) sub4 - ON sub4.key5 = sub3.key3 + ON udf(sub4.key5) = sub3.key3 ) sub2 -ON sub1.key1 = sub2.key3 +ON udf(udf(sub1.key1)) = udf(udf(sub2.key3)) -- !query 137 schema struct<key1:int,key3:int,value2:int,value3:int> -- 
!query 137 output -2871,34 +2871,34 struct<key1:int,key3:int,value2:int,value3:int> SELECT * FROM ( SELECT 1 as key1 ) sub1 LEFT JOIN -( SELECT sub3.key3, value2, COALESCE(value2, 66) as value3 FROM +( SELECT udf(sub3.key3) as key3, udf(value2), udf(COALESCE(value2, 66)) as value3 FROM ( SELECT 1 as key3 ) sub3 LEFT JOIN ( SELECT sub5.key5, COALESCE(sub6.value1, 1) as value2 FROM ( SELECT 1 as key5 ) sub5 LEFT JOIN ( SELECT 2 as key6, 42 as value1 ) sub6 - ON sub5.key5 = sub6.key6 + ON udf(udf(sub5.key5)) = sub6.key6 ) sub4 ON sub4.key5 = sub3.key3 ) sub2 -ON sub1.key1 = sub2.key3 +ON sub1.key1 = udf(udf(sub2.key3)) -- !query 138 schema -struct<key1:int,key3:int,value2:int,value3:int> +struct<key1:int,key3:int,CAST(udf(cast(value2 as string)) AS INT):int,value3:int> -- !query 138 output 1 1 1 1 -- !query 139 -SELECT qq, unique1 +SELECT udf(qq), udf(udf(unique1)) FROM - ( SELECT COALESCE(q1, 0) AS qq FROM int8_tbl a ) AS ss1 + ( SELECT udf(COALESCE(q1, 0)) AS qq FROM int8_tbl a ) AS ss1 FULL OUTER JOIN - ( SELECT COALESCE(q2, -1) AS qq FROM int8_tbl b ) AS ss2 + ( SELECT udf(udf(COALESCE(q2, -1))) AS qq FROM int8_tbl b ) AS ss2 USING (qq) - INNER JOIN tenk1 c ON qq = unique2 + INNER JOIN tenk1 c ON udf(qq) = udf(unique2) -- !query 139 schema -struct<qq:bigint,unique1:int> +struct<CAST(udf(cast(qq as string)) AS BIGINT):bigint,CAST(udf(cast(cast(udf(cast(unique1 as string)) as int) as string)) AS INT):int> -- !query 139 output 123 4596 123 4596 -2936,19 +2936,19 struct<> -- !query 143 -select nt3.id +select udf(nt3.id) from nt3 as nt3 left join - (select nt2., (nt2.b1 and ss1.a3) AS b3 + (select nt2., (udf(nt2.b1) and udf(ss1.a3)) AS b3 from nt2 as nt2 left join - (select nt1., (nt1.id is not null) as a3 from nt1) as ss1 - on ss1.id = nt2.nt1_id + (select nt1., (udf(nt1.id) is not null) as a3 from nt1) as ss1 + on ss1.id = udf(udf(nt2.nt1_id)) ) as ss2 - on ss2.id = nt3.nt2_id -where nt3.id = 1 and ss2.b3 + on udf(ss2.id) = nt3.nt2_id +where udf(nt3.id) = 1 and 
udf(ss2.b3) -- !query 143 schema -struct<id:int> +struct<CAST(udf(cast(id as string)) AS INT):int> -- !query 143 output 1 -3003,73 +3003,73 NULL 2147483647 -- !query 146 -select count() from - tenk1 a join tenk1 b on a.unique1 = b.unique2 - left join tenk1 c on a.unique2 = b.unique1 and c.thousand = a.thousand - join int4_tbl on b.thousand = f1 +select udf(count()) from + tenk1 a join tenk1 b on udf(a.unique1) = udf(b.unique2) + left join tenk1 c on udf(a.unique2) = udf(b.unique1) and udf(c.thousand) = udf(udf(a.thousand)) + join int4_tbl on udf(b.thousand) = f1 -- !query 146 schema -struct<count(1):bigint> +struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 146 output 10 -- !query 147 -select b.unique1 from - tenk1 a join tenk1 b on a.unique1 = b.unique2 - left join tenk1 c on b.unique1 = 42 and c.thousand = a.thousand - join int4_tbl i1 on b.thousand = f1 - right join int4_tbl i2 on i2.f1 = b.tenthous - order by 1 +select udf(b.unique1) from + tenk1 a join tenk1 b on udf(a.unique1) = udf(b.unique2) + left join tenk1 c on udf(b.unique1) = 42 and c.thousand = udf(a.thousand) + join int4_tbl i1 on udf(b.thousand) = udf(udf(f1)) + right join int4_tbl i2 on udf(udf(i2.f1)) = udf(b.tenthous) + order by udf(1) -- !query 147 schema -struct<unique1:int> +struct<CAST(udf(cast(unique1 as string)) AS INT):int> -- !query 147 output NULL NULL +0 NULL NULL -0 -- !query 148 select * from ( - select unique1, q1, coalesce(unique1, -1) + q1 as fault - from int8_tbl left join tenk1 on (q2 = unique2) + select udf(unique1), udf(q1), udf(udf(coalesce(unique1, -1)) + udf(q1)) as fault + from int8_tbl left join tenk1 on (udf(q2) = udf(unique2)) ) ss -where fault = 122 -order by fault +where udf(fault) = udf(122) +order by udf(fault) -- !query 148 schema -struct<unique1:int,q1:bigint,fault:bigint> +struct<CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(q1 as string)) AS BIGINT):bigint,fault:bigint> -- !query 148 output NULL 123 122 -- !query 149 -select q1, 
unique2, thousand, hundred - from int8_tbl a left join tenk1 b on q1 = unique2 - where coalesce(thousand,123) = q1 and q1 = coalesce(hundred,123) +select udf(q1), udf(unique2), udf(thousand), udf(hundred) + from int8_tbl a left join tenk1 b on udf(q1) = udf(unique2) + where udf(coalesce(thousand,123)) = udf(q1) and udf(q1) = udf(udf(coalesce(hundred,123))) -- !query 149 schema -struct<q1:bigint,unique2:int,thousand:int,hundred:int> +struct<CAST(udf(cast(q1 as string)) AS BIGINT):bigint,CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(thousand as string)) AS INT):int,CAST(udf(cast(hundred as string)) AS INT):int> -- !query 149 output -- !query 150 -select f1, unique2, case when unique2 is null then f1 else 0 end - from int4_tbl a left join tenk1 b on f1 = unique2 - where (case when unique2 is null then f1 else 0 end) = 0 +select udf(f1), udf(unique2), case when udf(udf(unique2)) is null then udf(f1) else 0 end + from int4_tbl a left join tenk1 b on udf(f1) = udf(udf(unique2)) + where (case when udf(unique2) is null then udf(f1) else 0 end) = 0 -- !query 150 schema -struct<f1:int,unique2:int,CASE WHEN (unique2 IS NULL) THEN f1 ELSE 0 END:int> +struct<CAST(udf(cast(f1 as string)) AS INT):int,CAST(udf(cast(unique2 as string)) AS INT):int,CASE WHEN (CAST(udf(cast(cast(udf(cast(unique2 as string)) as int) as string)) AS INT) IS NULL) THEN CAST(udf(cast(f1 as string)) AS INT) ELSE 0 END:int> -- !query 150 output 0 0 0 -- !query 151 -select a.unique1, b.unique1, c.unique1, coalesce(b.twothousand, a.twothousand) - from tenk1 a left join tenk1 b on b.thousand = a.unique1 left join tenk1 c on c.unique2 = coalesce(b.twothousand, a.twothousand) - where a.unique2 < 10 and coalesce(b.twothousand, a.twothousand) = 44 +select udf(a.unique1), udf(b.unique1), udf(c.unique1), udf(coalesce(b.twothousand, a.twothousand)) + from tenk1 a left join tenk1 b on udf(b.thousand) = a.unique1 left join tenk1 c on udf(c.unique2) = udf(coalesce(b.twothousand, a.twothousand)) + where 
a.unique2 < udf(10) and udf(udf(coalesce(b.twothousand, a.twothousand))) = udf(44) -- !query 151 schema -struct<unique1:int,unique1:int,unique1:int,coalesce(twothousand, twothousand):int> +struct<CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(coalesce(twothousand, twothousand) as string)) AS INT):int> -- !query 151 output -3078,11 +3078,11 struct<unique1:int,unique1:int,unique1:int,coalesce(twothousand, twothousand):in select * from text_tbl t1 inner join int8_tbl i8 - on i8.q2 = 456 + on udf(i8.q2) = udf(udf(456)) right join text_tbl t2 - on t1.f1 = 'doh!' + on udf(t1.f1) = udf(udf('doh!')) left join int4_tbl i4 - on i8.q1 = i4.f1 + on udf(udf(i8.q1)) = i4.f1 -- !query 152 schema struct<f1:string,q1:bigint,q2:bigint,f1:string,f1:int> -- !query 152 output -3092,10 +3092,10 doh! 123 456 hi de ho neighbor NULL -- !query 153 select * from - (select 1 as id) as xx + (select udf(udf(1)) as id) as xx left join - (tenk1 as a1 full join (select 1 as id) as yy on (a1.unique1 = yy.id)) - on (xx.id = coalesce(yy.id)) + (tenk1 as a1 full join (select udf(1) as id) as yy on (udf(a1.unique1) = udf(yy.id))) + on (xx.id = udf(udf(coalesce(yy.id)))) -- !query 153 schema struct<id:int,unique1:int,unique2:int,two:int,four:int,ten:int,twenty:int,hundred:int,thousand:int,twothousand:int,fivethous:int,tenthous:int,odd:int,even:int,stringu1:string,stringu2:string,string4:string,id:int> -- !query 153 output -3103,11 +3103,11 struct<id:int,unique1:int,unique2:int,two:int,four:int,ten:int,twenty:int,hundre -- !query 154 -select a.q2, b.q1 - from int8_tbl a left join int8_tbl b on a.q2 = coalesce(b.q1, 1) - where coalesce(b.q1, 1) > 0 +select udf(a.q2), udf(b.q1) + from int8_tbl a left join int8_tbl b on udf(a.q2) = coalesce(b.q1, 1) + where udf(udf(coalesce(b.q1, 1)) > 0) -- !query 154 schema -struct<q2:bigint,q1:bigint> +struct<CAST(udf(cast(q2 as string)) AS 
BIGINT):bigint,CAST(udf(cast(q1 as string)) AS BIGINT):bigint> -- !query 154 output -4567890123456789 NULL 123 123 -3142,7 +3142,7 struct<> -- !query 157 -select p.* from parent p left join child c on (p.k = c.k) +select p.* from parent p left join child c on (udf(p.k) = udf(c.k)) -- !query 157 schema struct<k:int,pd:int> -- !query 157 output -3153,8 +3153,8 struct<k:int,pd:int> -- !query 158 select p., linked from parent p - left join (select c., true as linked from child c) as ss - on (p.k = ss.k) + left join (select c., udf(udf(true)) as linked from child c) as ss + on (udf(p.k) = udf(udf(ss.k))) -- !query 158 schema struct<k:int,pd:int,linked:boolean> -- !query 158 output -3165,8 +3165,8 struct<k:int,pd:int,linked:boolean> -- !query 159 select p. from - parent p left join child c on (p.k = c.k) - where p.k = 1 and p.k = 2 + parent p left join child c on (udf(p.k) = c.k) + where p.k = udf(1) and udf(udf(p.k)) = udf(udf(2)) -- !query 159 schema struct<k:int,pd:int> -- !query 159 output -3175,8 +3175,8 struct<k:int,pd:int> -- !query 160 select p.* from - (parent p left join child c on (p.k = c.k)) join parent x on p.k = x.k - where p.k = 1 and p.k = 2 + (parent p left join child c on (udf(p.k) = c.k)) join parent x on p.k = udf(x.k) + where udf(p.k) = udf(1) and udf(udf(p.k)) = udf(udf(2)) -- !query 160 schema struct<k:int,pd:int> -- !query 160 output -3204,7 +3204,7 struct<> -- !query 163 -SELECT * FROM b LEFT JOIN a ON (b.a_id = a.id) WHERE (a.id IS NULL OR a.id > 0) +SELECT * FROM b LEFT JOIN a ON (udf(b.a_id) = udf(a.id)) WHERE (udf(udf(a.id)) IS NULL OR udf(a.id) > 0) -- !query 163 schema struct<id:int,a_id:int,id:int> -- !query 163 output -3212,7 +3212,7 struct<id:int,a_id:int,id:int> -- !query 164 -SELECT b.* FROM b LEFT JOIN a ON (b.a_id = a.id) WHERE (a.id IS NULL OR a.id > 0) +SELECT b.* FROM b LEFT JOIN a ON (udf(b.a_id) = udf(a.id)) WHERE (udf(a.id) IS NULL OR udf(udf(a.id)) > 0) -- !query 164 schema struct<id:int,a_id:int> -- !query 164 output 
-3231,13 +3231,13 struct<> -- !query 166 SELECT * FROM - (SELECT 1 AS x) ss1 + (SELECT udf(1) AS x) ss1 LEFT JOIN - (SELECT q1, q2, COALESCE(dat1, q1) AS y - FROM int8_tbl LEFT JOIN innertab ON q2 = id) ss2 + (SELECT udf(q1), udf(q2), udf(COALESCE(dat1, q1)) AS y + FROM int8_tbl LEFT JOIN innertab ON udf(udf(q2)) = id) ss2 ON true -- !query 166 schema -struct<x:int,q1:bigint,q2:bigint,y:bigint> +struct<x:int,CAST(udf(cast(q1 as string)) AS BIGINT):bigint,CAST(udf(cast(q2 as string)) AS BIGINT):bigint,y:bigint> -- !query 166 output 1 123 456 123 1 123 4567890123456789 123 -3248,27 +3248,27 struct<x:int,q1:bigint,q2:bigint,y:bigint> -- !query 167 select * from - int8_tbl x join (int4_tbl x cross join int4_tbl y) j on q1 = f1 + int8_tbl x join (int4_tbl x cross join int4_tbl y) j on udf(q1) = udf(f1) -- !query 167 schema struct<> -- !query 167 output org.apache.spark.sql.AnalysisException -Reference 'f1' is ambiguous, could be: j.f1, j.f1.; line 2 pos 63 +Reference 'f1' is ambiguous, could be: j.f1, j.f1.; line 2 pos 72 -- !query 168 select * from - int8_tbl x join (int4_tbl x cross join int4_tbl y) j on q1 = y.f1 + int8_tbl x join (int4_tbl x cross join int4_tbl y) j on udf(q1) = udf(y.f1) -- !query 168 schema struct<> -- !query 168 output org.apache.spark.sql.AnalysisException -cannot resolve '`y.f1`' given input columns: [j.f1, j.f1, x.q1, x.q2]; line 2 pos 63 +cannot resolve '`y.f1`' given input columns: [j.f1, j.f1, x.q1, x.q2]; line 2 pos 72 -- !query 169 select * from - int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1 + int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on udf(q1) = udf(udf(f1)) -- !query 169 schema struct<q1:bigint,q2:bigint,f1:int,ff:int> -- !query 169 output -3276,69 +3276,69 struct<q1:bigint,q2:bigint,f1:int,ff:int> -- !query 170 -select t1.uunique1 from - tenk1 t1 join tenk2 t2 on t1.two = t2.two +select udf(t1.uunique1) from + tenk1 t1 join tenk2 t2 on t1.two = udf(t2.two) -- !query 170 schema struct<> -- !query 
170 output org.apache.spark.sql.AnalysisException -cannot resolve '`t1.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7 +cannot resolve '`t1.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 11 -- !query 171 -select t2.uunique1 from - tenk1 t1 join tenk2 t2 on t1.two = t2.two +select udf(udf(t2.uunique1)) from + tenk1 t1 join tenk2 t2 on udf(t1.two) = t2.two -- !query 171 schema struct<> -- !query 171 output org.apache.spark.sql.AnalysisException -cannot resolve '`t2.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7 +cannot resolve '`t2.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, 
t2.unique1, t1.unique2, t2.unique2]; line 1 pos 15 -- !query 172 -select uunique1 from - tenk1 t1 join tenk2 t2 on t1.two = t2.two +select udf(uunique1) from + tenk1 t1 join tenk2 t2 on udf(t1.two) = udf(t2.two) -- !query 172 schema struct<> -- !query 172 output org.apache.spark.sql.AnalysisException -cannot resolve '`uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7 +cannot resolve '`uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 11 -- !query 173 -select f1,g from int4_tbl a, (select f1 as g) ss +select udf(udf(f1,g)) from int4_tbl a, (select udf(udf(f1)) as g) ss -- !query 173 schema struct<> -- !query 173 output org.apache.spark.sql.AnalysisException -cannot resolve '`f1`' given input columns: []; line 1 pos 37 +cannot resolve '`f1`' given input columns: []; line 1 pos 55 -- !query 174 -select f1,g from int4_tbl a, (select a.f1 as g) ss +select udf(f1,g) from int4_tbl a, (select a.f1 as g) ss -- !query 174 schema struct<> -- !query 174 output org.apache.spark.sql.AnalysisException -cannot resolve '`a.f1`' given input columns: []; line 1 pos 37 +cannot resolve '`a.f1`' given input columns: []; line 1 pos 42 -- !query 175 -select f1,g from int4_tbl a cross join (select f1 as g) ss +select udf(udf(f1,g)) from int4_tbl a cross join (select udf(f1) as g) 
ss -- !query 175 schema struct<> -- !query 175 output org.apache.spark.sql.AnalysisException -cannot resolve '`f1`' given input columns: []; line 1 pos 47 +cannot resolve '`f1`' given input columns: []; line 1 pos 61 -- !query 176 -select f1,g from int4_tbl a cross join (select a.f1 as g) ss +select udf(f1,g) from int4_tbl a cross join (select udf(udf(a.f1)) as g) ss -- !query 176 schema struct<> -- !query 176 output org.apache.spark.sql.AnalysisException -cannot resolve '`a.f1`' given input columns: []; line 1 pos 47 +cannot resolve '`a.f1`' given input columns: []; line 1 pos 60 -- !query 177 -3383,8 +3383,8 struct<> -- !query 182 select * from j1 -inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2 -where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1 +inner join j2 on udf(j1.id1) = udf(j2.id1) and udf(udf(j1.id2)) = udf(j2.id2) +where udf(j1.id1) % 1000 = 1 and udf(udf(j2.id1) % 1000) = 1 -- !query 182 schema struct<id1:int,id2:int,id1:int,id2:int> -- !query 182 output ``` </p> </details> ## How was this patch tested? Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921). Closes #25371 from huaxingao/spark-28393. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-19 20:10:56 +09:00
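The schema changes in the diff above follow a regular pattern: every `udf(x)` call appears in the output schema as a cast to string, the UDF call, and a cast back to the column type, with only the outermost cast written in upper case. Below is a minimal Python sketch of that naming pattern. It is inferred solely from the schema strings in this diff; the helper name `udf_schema_name` is hypothetical and is not part of Spark or its test framework.

```python
def udf_schema_name(column: str, sql_type: str, depth: int = 1) -> str:
    """Build the schema column name shown in the diff for `depth` nested udf() calls.

    Hypothetical helper inferred from the golden-file output; not Spark code.
    """
    if depth < 1:
        raise ValueError("depth must be at least 1")
    expr = column
    # Inner udf() calls are rendered with lowercase casts.
    for _ in range(depth - 1):
        expr = f"cast(udf(cast({expr} as string)) as {sql_type.lower()})"
    # The outermost udf() call is rendered as an uppercase CAST ... AS TYPE.
    return f"CAST(udf(cast({expr} as string)) AS {sql_type.upper()})"


print(udf_schema_name("i", "int"))
# CAST(udf(cast(i as string)) AS INT)
print(udf_schema_name("i", "int", depth=2))
# CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT)
```

The two printed names match the `query 27` and `query 28` schema lines in the diff, which is a quick way to sanity-check a converted query's golden output by eye.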
Wenchen Fan	97dc4c0bfc	[SPARK-28744][SQL][TEST] rename SharedSQLContext to SharedSparkSession ## What changes were proposed in this pull request? The Spark SQL test framework needs to support 2 kinds of tests: 1. tests inside Spark to test Spark itself (extends `SparkFunSuite`) 2. tests outside of Spark to test Spark applications (introduced at `b57ed2245c`) The class hierarchy of the major testing traits: ![image](https://user-images.githubusercontent.com/3182036/63088526-c0f0af80-bf87-11e9-9bed-c144c2486da9.png) `PlanTestBase`, `SQLTestUtilsBase` and `SharedSparkSession` intentionally don't extend `SparkFunSuite`, so that they can be used for tests outside of Spark. Tests in Spark should extend `QueryTest` and/or `SharedSQLContext` in most cases. However, the name is a little confusing. As a result, some test suites extend `SharedSparkSession` instead of `SharedSQLContext`. `SharedSparkSession` doesn't work well with `SparkFunSuite` as it doesn't have the special handling of thread auditing in `SharedSQLContext`. For example, you will see a warning starting with `===== POSSIBLE THREAD LEAK IN SUITE` when you run `DataFrameSelfJoinSuite`. This PR proposes to rename `SharedSparkSession` to `SharedSparkSessionBase`, and rename `SharedSQLContext` to `SharedSparkSession`. Closes #25463 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-19 19:01:56 +08:00
Peter Toth	f999e00e9f	[SPARK-28356][SHUFFLE][FOLLOWUP] Fix case with different pre-shuffle partition numbers ### What changes were proposed in this pull request? This PR reverts some of the latest changes in `ReduceNumShufflePartitions` to fix the case when there are different pre-shuffle partition numbers in the plan. Please see the new UT for an example. ### Why are the changes needed? Eliminate a bug. ### Does this PR introduce any user-facing change? Yes, some queries that failed will succeed now. ### How was this patch tested? Added new UT. Closes #25479 from peter-toth/SPARK-28356-followup. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-19 15:53:43 +08:00
Eyal Zituny	d75a11d059	[SPARK-27330][SS] support task abort in foreach writer ## What changes were proposed in this pull request? To address cases where a foreach writer task fails without calling the close() method (for example, when a task is interrupted), this PR adds the option to implement an abort() method that is called when the task is aborted. Users should handle resource cleanup (such as closing connections) in the abort() method. ## How was this patch tested? Updated existing unit tests. Closes #24382 from eyalzit/SPARK-27330-foreach-writer-abort. Lead-authored-by: Eyal Zituny <eyal.zituny@equalum.io> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Co-authored-by: eyalzit <eyal.zituny@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-19 14:12:48 +08:00
shivusondur	c96b6154b7	[SPARK-28390][SQL][PYTHON][TESTS][FOLLOW-UP] Update the TODO with actual blocking JIRA IDs ## What changes were proposed in this pull request? Only the TODO message is updated: udf() needs to be added to the GroupBy tests after the following JIRAs are resolved: [SPARK-28386] and [SPARK-26741]. ## How was this patch tested? N/A; only the TODO message is updated. Closes #25415 from shivusondur/jiraFollowup. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-19 13:01:39 +09:00
WeichenXu	4ddad79060	[SPARK-28598][SQL] Few date time manipulation functions does not provide versions supporting Column as input through the Dataframe API ## What changes were proposed in this pull request? Add the following functions: ``` def add_months(startDate: Column, numMonths: Column): Column def date_add(start: Column, days: Column): Column def date_sub(start: Column, days: Column): Column ``` ## How was this patch tested? Unit tests. Closes #25334 from WeichenXu123/datefunc_impr. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-19 11:41:13 +09:00
Dongjoon Hyun	f0834d3a7f	Revert "[SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server" This reverts commit `efbb035902`.	2019-08-18 16:54:24 -07:00
Yuming Wang	efbb035902	[SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server ## What changes were proposed in this pull request? This PR builds a test framework that directly re-runs all the tests in `SQLQueryTestSuite` via Thrift Server. It differs from `SQLQueryTestSuite` in two ways: 1. It cannot support [UDF testing](`44e607e921/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L293-L297)`). 2. It cannot support the `DESC` and `SHOW` commands because `SQLQueryTestSuite` [formats the output](`1882912cca/sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (L38-L50)`). Building this framework uncovered two bugs: [SPARK-28624](https://issues.apache.org/jira/browse/SPARK-28624): `make_date` is inconsistent when reading from table [SPARK-28611](https://issues.apache.org/jira/browse/SPARK-28611): Histogram's height is different It also uncovered two features that ThriftServer cannot support: [SPARK-28636](https://issues.apache.org/jira/browse/SPARK-28636): ThriftServer can not support decimal type with negative scale [SPARK-28637](https://issues.apache.org/jira/browse/SPARK-28637): ThriftServer can not support interval type And two inconsistent behaviors: [SPARK-28620](https://issues.apache.org/jira/browse/SPARK-28620): Double type returned for float type in Beeline/JDBC [SPARK-28619](https://issues.apache.org/jira/browse/SPARK-28619): The golden result file is different when tested by `bin/spark-sql` ## How was this patch tested? N/A Closes #25373 from wangyum/SPARK-28527. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-08-17 19:12:50 -07:00
Gengliang Wang	92bfd9a317	[SPARK-28757][SQL] File table location should include both values of option `path` and `paths` ### What changes were proposed in this pull request? If both options `path` and `paths` are passed to file data source v2, both values of the options should be included as the target paths. ### Why are the changes needed? In V1 implementation, file table location includes both values of option `path` and `paths`. In the refactoring of https://github.com/apache/spark/pull/24025, the value of option `path` is ignored if "paths" are specified. We should make it consistent with V1. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test Closes #25473 from gengliangwang/fixPathOption. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-16 22:27:27 +08:00
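The rule described in the commit above (use the values of both `paths` and `path` as target locations) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Spark implementation: the function name `target_paths` is hypothetical, `paths` is assumed to arrive as a JSON-encoded array string, and the ordering of the result is an arbitrary choice made for this sketch.

```python
import json
from typing import Dict, List


def target_paths(options: Dict[str, str]) -> List[str]:
    """Collect every location named by the `paths` and `path` options.

    Hypothetical helper; the JSON encoding of `paths` and the result order
    are assumptions made for this sketch.
    """
    multi = json.loads(options["paths"]) if "paths" in options else []
    single = [options["path"]] if "path" in options else []
    # Neither option overrides the other: both contribute to the result.
    return multi + single


print(target_paths({"path": "/data/a", "paths": '["/data/b", "/data/c"]'}))
# ['/data/b', '/data/c', '/data/a']
```

The behavior the commit fixes corresponds to returning only `multi` whenever `paths` is present, which silently drops the `path` value.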
Maxim Gekk	96ca734fb7	[SPARK-28745][SQL][TEST] Add benchmarks for `extract()` ## What changes were proposed in this pull request? Added new benchmark `ExtractBenchmark` for the `EXTRACT(field FROM source)` function. It was executed on all currently supported values of the `field` argument: `MILLENNIUM`, `CENTURY`, `DECADE`, `YEAR`, `ISOYEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `DAYOFWEEK`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECONDS`, `MICROSECONDS`, `EPOCH`. The `cast(id as timestamp)` was taken as the `source` argument. ## How was this patch tested? By running the benchmark via: ``` $ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark" ``` Closes #25462 from MaxGekk/extract-benchmark. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-15 12:44:36 -07:00
Burak Yavuz	0526529b31	[SPARK-28666] Support saveAsTable for V2 tables through Session Catalog ## What changes were proposed in this pull request? We add support for the V2SessionCatalog for saveAsTable, such that V2 tables can plug in and leverage existing DataFrameWriter.saveAsTable APIs to write and create tables through the session catalog. ## How was this patch tested? Unit tests. A lot of tests broke under hive when things were not working properly under `ResolveTables`, therefore I believe the current set of tests should be sufficient in testing the table resolution and read code paths. Closes #25402 from brkyvz/saveAsV2. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-15 12:29:34 +08:00
Maxim Gekk	3a4afce96c	[SPARK-28687][SQL] Support `epoch`, `isoyear`, `milliseconds` and `microseconds` at `extract()` ## What changes were proposed in this pull request? In the PR, I propose new expressions `Epoch`, `IsoYear`, `Milliseconds` and `Microseconds`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT): 1. `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision. 2. `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January. 3. `milliseconds` - the seconds field including fractional parts multiplied by 1,000. 4. `microseconds` - the seconds field including fractional parts multiplied by 1,000,000. Here are examples: ```sql spark-sql> SELECT EXTRACT(EPOCH FROM TIMESTAMP '2019-08-11 19:07:30.123456'); 1565550450.123456 spark-sql> SELECT EXTRACT(ISOYEAR FROM DATE '2006-01-01'); 2005 spark-sql> SELECT EXTRACT(MILLISECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456'); 30123.456 spark-sql> SELECT EXTRACT(MICROSECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456'); 30123456 ``` ## How was this patch tested? Added new tests to `DateExpressionsSuite`, and uncommented existing tests in `extract.sql` and `pgSQL/date.sql`. Closes #25408 from MaxGekk/extract-ext3. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-14 08:44:44 -07:00
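The four field definitions in the commit above can be checked against its own examples with a short standard-library Python sketch. This is not Spark code: the function names are hypothetical, and the timestamp is assumed to be interpreted in UTC, which is the zone under which the `EPOCH` example value in the commit message is reproduced.

```python
from datetime import date, datetime, timezone
from decimal import Decimal

UNIX_EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def extract_epoch(ts: datetime) -> Decimal:
    """Seconds since 1970-01-01 00:00:00 with microsecond precision."""
    delta = ts - UNIX_EPOCH
    micros = (delta.days * 86400 + delta.seconds) * 1_000_000 + delta.microseconds
    return Decimal(micros) / Decimal(1_000_000)


def extract_isoyear(d: date) -> int:
    """ISO 8601 week-numbering year that the date falls in."""
    return d.isocalendar()[0]


def extract_microseconds(ts: datetime) -> int:
    """Seconds field, including the fractional part, multiplied by 1,000,000."""
    return ts.second * 1_000_000 + ts.microsecond


def extract_milliseconds(ts: datetime) -> Decimal:
    """Seconds field, including the fractional part, multiplied by 1,000."""
    return Decimal(extract_microseconds(ts)) / Decimal(1000)


# Assumption: the session time zone is UTC.
ts = datetime(2019, 8, 11, 19, 7, 30, 123456, tzinfo=timezone.utc)
print(extract_epoch(ts))               # 1565550450.123456
print(extract_isoyear(date(2006, 1, 1)))  # 2005
print(extract_milliseconds(ts))        # 30123.456
print(extract_microseconds(ts))        # 30123456
```

`Decimal` is used instead of `float` so the fractional digits print exactly as in the commit's examples. The `ISOYEAR` case shows why the field differs from `YEAR`: 2006-01-01 was a Sunday, so it still belongs to the last ISO week of 2005.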
xy_xin	2eeb25e52d	[SPARK-28351][SQL] Support DELETE in DataSource V2 ## What changes were proposed in this pull request? This PR adds DELETE support for V2 datasources. As a first step, this PR only supports delete by source filters: ```scala void delete(Filter[] filters); ``` which cannot handle complicated cases like subqueries. Since it's uncomfortable to embed the implementation of DELETE in the current V2 APIs, a new datasource mix-in called `SupportsMaintenance` is added, similar to `SupportsRead` and `SupportsWrite`. A datasource which can be maintained means we can perform DELETE/UPDATE/MERGE/OPTIMIZE on the datasource, as long as the datasource implements the necessary mix-ins. ## How was this patch tested? New test case. Closes #25115 from xianyinxin/SPARK-28351. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-14 23:38:45 +08:00
John Zhuge	391c7e8f2e	[SPARK-27739][SQL] df.persist should save stats from optimized plan ## What changes were proposed in this pull request? CacheManager.cacheQuery saves the stats from the optimized plan to cache. ## How was this patch tested? Existing tests. Closes #24623 from jzhuge/SPARK-27739. Authored-by: John Zhuge <jzhuge@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-14 19:49:53 +08:00
Edgar Rodriguez	598fcbe5ed	[SPARK-28265][SQL] Add renameTable to TableCatalog API ## What changes were proposed in this pull request? This PR adds the `renameTable` call to the `TableCatalog` API, as described in the [Table Metadata API SPIP](https://docs.google.com/document/d/1zLFiA1VuaWeVxeTDXNg8bL6GP3BVoOZBkewFtEnjEoo/edit#heading=h.m45webtwxf2d). This PR is related to: https://github.com/apache/spark/pull/24246 ## How was this patch tested? Added unit tests and contract tests. Closes #25206 from edgarRd/SPARK-28265-add-rename-table-catalog-api. Authored-by: Edgar Rodriguez <edgar.rd@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-14 14:24:13 +08:00
Dilip Biswal	331f2657d9	[SPARK-27768][SQL] Support Infinity/NaN-related float/double literals case-insensitively ## What changes were proposed in this pull request? Here is the problem description from the JIRA. ``` When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results. SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('infinity'), ('1')) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('infinity'), ('infinity')) v(x); SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE)) FROM (VALUES ('-infinity'), ('infinity')) v(x); The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way. Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html ``` In this PR, the casting code is enhanced to handle these `special` string literals in case insensitive manner. ## How was this patch tested? Added tests in CastSuite and modified existing test suites. Closes #25331 from dilipbiswal/double_infinity. Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-13 16:48:30 -07:00
Maxim Gekk	3d85c54895	[SPARK-28700][SQL] Use DECIMAL type for `sec` in `make_timestamp()` ## What changes were proposed in this pull request? Changed type of `sec` argument in the `make_timestamp()` function from `DOUBLE` to `DECIMAL(8, 6)`. The scale is set to 6 to cover microsecond fractions, and the precision is 2 digits for seconds + 6 digits for microsecond fraction. New type prevents losing precision in some cases, for example: Before: ```sql spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001); 2019-08-12 00:00:58 ``` After: ```sql spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001); 2019-08-12 00:00:58.000001 ``` Also switching to `DECIMAL` fixes rounding `sec` towards "nearest neighbor" unless both neighbors are equidistant, in which case round up. For example: Before: ```sql spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567); 2019-08-12 00:00:00.123456 ``` After: ```sql spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567); 2019-08-12 00:00:00.123457 ``` ## How was this patch tested? This was tested by `DateExpressionsSuite` and `pgSQL/timestamp.sql`. Closes #25421 from MaxGekk/make_timestamp-decimal. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-13 15:51:50 -07:00
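The rounding difference described in SPARK-28700 can be illustrated with Python's `decimal` module. This is a sketch under stated assumptions, not Spark code: `sec_to_micros_half_up` models a `DECIMAL(8, 6)` value rounded half-up to six fractional digits, and `sec_to_micros_truncating` only mimics the "Before" output by truncating a double. Both function names are hypothetical.

```python
from decimal import Decimal, ROUND_HALF_UP

# Six fractional digits, matching the scale of DECIMAL(8, 6).
MICRO = Decimal("0.000001")


def sec_to_micros_half_up(sec: str) -> int:
    """Round the seconds value half-up to microseconds (the 'After' behavior)."""
    quantized = Decimal(sec).quantize(MICRO, rounding=ROUND_HALF_UP)
    return int(quantized * 1_000_000)


def sec_to_micros_truncating(sec: float) -> int:
    """Truncate a double to whole microseconds (illustrates the 'Before' output)."""
    return int(sec * 1_000_000)


after = sec_to_micros_half_up("0.1234567")
before = sec_to_micros_truncating(0.1234567)
```

Here `after` is 123457 and `before` is 123456, which lines up with the `.123457` and `.123456` fractions in the two `make_timestamp` examples.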
Maxim Gekk	f04a766946	[SPARK-28718][SQL] Support `field` synonyms at `extract` ## What changes were proposed in this pull request? In the PR, I propose additional synonyms for the `field` argument of `extract` supported by PostgreSQL. The `extract.sql` is updated to check all supported values of the `field` argument. The list of synonyms was taken from https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/datetime.c . ## How was this patch tested? By running `extract.sql` via: ``` $ build/sbt "sql/test-only *SQLQueryTestSuite -- -z extract.sql" ``` Closes #25438 from MaxGekk/extract-field-synonyms. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-13 15:36:28 -07:00
Liang-Chi Hsieh	e6a0385289	[SPARK-28422][SQL][PYTHON] GROUPED_AGG pandas_udf should work without group by clause ## What changes were proposed in this pull request? A GROUPED_AGG pandas Python UDF does not work without a GROUP BY clause, as in `select udf(id) from table`. This is inconsistent with aggregate functions like sum, count..., and with the Dataset API, e.g. `df.agg(udf(df['id']))`. When we parse a UDF (or an aggregate function) like that from SQL syntax, it is treated as a function in a project. The `GlobalAggregates` rule in analysis turns such a project into an aggregate by looking for aggregate expressions. It should now also look for GROUPED_AGG pandas Python UDFs. ## How was this patch tested? Added tests. Closes #25352 from viirya/SPARK-28422. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-14 00:32:33 +09:00
Yuming Wang	9a7f29023e	[SPARK-28383][SQL] SHOW CREATE TABLE is not supported on a temporary view ## What changes were proposed in this pull request? It throws `Table or view not found` when showing temporary views: ```sql spark-sql> CREATE TEMPORARY VIEW temp_view AS SELECT 1 AS a; spark-sql> show create table temp_view; Error in query: Table or view 'temp_view' not found in database 'default'; ``` It's not easy to support temporary views. This PR changes it to throw `SHOW CREATE TABLE is not supported on a temporary view`: ```sql spark-sql> CREATE TEMPORARY VIEW temp_view AS SELECT 1 AS a; spark-sql> show create table temp_view; Error in query: SHOW CREATE TABLE is not supported on a temporary view: temp_view; ``` ## How was this patch tested? Unit tests. Closes #25149 from wangyum/SPARK-28383. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-12 21:01:19 -07:00
Stavros Kontopoulos	ec84415358	[SPARK-28280][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-by.sql' ## What changes were proposed in this pull request? This PR is a followup of a fix as described in here: https://github.com/apache/spark/pull/25215#issuecomment-517659981 <details><summary>Diff comparing to 'group-by.sql'</summary> <p> ```diff diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out index 3a5df254f2..febe47b5ba 100644 --- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out -13,26 +13,26 struct<> -- !query 1 -SELECT a, COUNT(b) FROM testData +SELECT udf(a), udf(COUNT(b)) FROM testData -- !query 1 schema struct<> -- !query 1 output org.apache.spark.sql.AnalysisException -grouping expressions sequence is empty, and 'testdata.`a`' is not an aggregate function. Wrap '(count(testdata.`b`) AS `count(b)`)' in windowing function(s) or wrap 'testdata.`a`' in first() (or first_value) if you don't care which value you get.; +grouping expressions sequence is empty, and 'testdata.`a`' is not an aggregate function. 
Wrap '(CAST(udf(cast(count(b) as string)) AS BIGINT) AS `CAST(udf(cast(count(b) as string)) AS BIGINT)`)' in windowing function(s) or wrap 'testdata.`a`' in first() (or first_value) if you don't care which value you get.; -- !query 2 -SELECT COUNT(a), COUNT(b) FROM testData +SELECT COUNT(udf(a)), udf(COUNT(b)) FROM testData -- !query 2 schema -struct<count(a):bigint,count(b):bigint> +struct<count(CAST(udf(cast(a as string)) AS INT)):bigint,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint> -- !query 2 output 7 7 -- !query 3 -SELECT a, COUNT(b) FROM testData GROUP BY a +SELECT udf(a), COUNT(udf(b)) FROM testData GROUP BY a -- !query 3 schema -struct<a:int,count(b):bigint> +struct<CAST(udf(cast(a as string)) AS INT):int,count(CAST(udf(cast(b as string)) AS INT)):bigint> -- !query 3 output 1 2 2 2 -41,7 +41,7 NULL 1 -- !query 4 -SELECT a, COUNT(b) FROM testData GROUP BY b +SELECT udf(a), udf(COUNT(udf(b))) FROM testData GROUP BY b -- !query 4 schema struct<> -- !query 4 output -50,9 +50,9 expression 'testdata.`a`' is neither present in the group by, nor is it an aggre -- !query 5 -SELECT COUNT(a), COUNT(b) FROM testData GROUP BY a +SELECT COUNT(udf(a)), COUNT(udf(b)) FROM testData GROUP BY udf(a) -- !query 5 schema -struct<count(a):bigint,count(b):bigint> +struct<count(CAST(udf(cast(a as string)) AS INT)):bigint,count(CAST(udf(cast(b as string)) AS INT)):bigint> -- !query 5 output 0 1 2 2 -61,15 +61,15 struct<count(a):bigint,count(b):bigint> -- !query 6 -SELECT 'foo', COUNT(a) FROM testData GROUP BY 1 +SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1 -- !query 6 schema -struct<foo:string,count(a):bigint> +struct<foo:string,count(CAST(udf(cast(a as string)) AS INT)):bigint> -- !query 6 output foo 7 -- !query 7 -SELECT 'foo' FROM testData WHERE a = 0 GROUP BY 1 +SELECT 'foo' FROM testData WHERE a = 0 GROUP BY udf(1) -- !query 7 schema struct<foo:string> -- !query 7 output -77,25 +77,25 struct<foo:string> -- !query 8 -SELECT 'foo', APPROX_COUNT_DISTINCT(a) FROM 
testData WHERE a = 0 GROUP BY 1 +SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1) -- !query 8 schema -struct<foo:string,approx_count_distinct(a):bigint> +struct<foo:string,CAST(udf(cast(approx_count_distinct(cast(udf(cast(a as string)) as int), 0.05, 0, 0) as string)) AS BIGINT):bigint> -- !query 8 output -- !query 9 -SELECT 'foo', MAX(STRUCT(a)) FROM testData WHERE a = 0 GROUP BY 1 +SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1) -- !query 9 schema -struct<foo:string,max(named_struct(a, a)):struct<a:int>> +struct<foo:string,max(named_struct(col1, CAST(udf(cast(a as string)) AS INT))):struct<col1:int>> -- !query 9 output -- !query 10 -SELECT a + b, COUNT(b) FROM testData GROUP BY a + b +SELECT udf(a + b), udf(COUNT(b)) FROM testData GROUP BY a + b -- !query 10 schema -struct<(a + b):int,count(b):bigint> +struct<CAST(udf(cast((a + b) as string)) AS INT):int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint> -- !query 10 output 2 1 3 2 -105,7 +105,7 NULL 1 -- !query 11 -SELECT a + 2, COUNT(b) FROM testData GROUP BY a + 1 +SELECT udf(a + 2), udf(COUNT(b)) FROM testData GROUP BY a + 1 -- !query 11 schema struct<> -- !query 11 output -114,9 +114,9 expression 'testdata.`a`' is neither present in the group by, nor is it an aggre -- !query 12 -SELECT a + 1 + 1, COUNT(b) FROM testData GROUP BY a + 1 +SELECT udf(a + 1) + 1, udf(COUNT(b)) FROM testData GROUP BY udf(a + 1) -- !query 12 schema -struct<((a + 1) + 1):int,count(b):bigint> +struct<(CAST(udf(cast((a + 1) as string)) AS INT) + 1):int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint> -- !query 12 output 3 2 4 2 -125,26 +125,26 NULL 1 -- !query 13 -SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), STDDEV(a), SUM(a), COUNT(a) +SELECT SKEWNESS(udf(a)), udf(KURTOSIS(a)), udf(MIN(a)), MAX(udf(a)), udf(AVG(udf(a))), udf(VARIANCE(a)), STDDEV(udf(a)), udf(SUM(a)), udf(COUNT(a)) FROM testData -- !query 13 schema 
-struct<skewness(CAST(a AS DOUBLE)):double,kurtosis(CAST(a AS DOUBLE)):double,min(a):int,max(a):int,avg(a):double,var_samp(CAST(a AS DOUBLE)):double,stddev_samp(CAST(a AS DOUBLE)):double,sum(a):bigint,count(a):bigint> +struct<skewness(CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double,CAST(udf(cast(kurtosis(cast(a as double)) as string)) AS DOUBLE):double,CAST(udf(cast(min(a) as string)) AS INT):int,max(CAST(udf(cast(a as string)) AS INT)):int,CAST(udf(cast(avg(cast(cast(udf(cast(a as string)) as int) as bigint)) as string)) AS DOUBLE):double,CAST(udf(cast(var_samp(cast(a as double)) as string)) AS DOUBLE):double,stddev_samp(CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double,CAST(udf(cast(sum(cast(a as bigint)) as string)) AS BIGINT):bigint,CAST(udf(cast(count(a) as string)) AS BIGINT):bigint> -- !query 13 output -0.2723801058145729 -1.5069204152249134 1 3 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7 -- !query 14 -SELECT COUNT(DISTINCT b), COUNT(DISTINCT b, c) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY a +SELECT COUNT(DISTINCT udf(b)), udf(COUNT(DISTINCT b, c)) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY udf(a) -- !query 14 schema -struct<count(DISTINCT b):bigint,count(DISTINCT b, c):bigint> +struct<count(DISTINCT CAST(udf(cast(b as string)) AS INT)):bigint,CAST(udf(cast(count(distinct b, c) as string)) AS BIGINT):bigint> -- !query 14 output 1 1 -- !query 15 -SELECT a AS k, COUNT(b) FROM testData GROUP BY k +SELECT udf(a) AS k, COUNT(udf(b)) FROM testData GROUP BY k -- !query 15 schema -struct<k:int,count(b):bigint> +struct<k:int,count(CAST(udf(cast(b as string)) AS INT)):bigint> -- !query 15 output 1 2 2 2 -153,21 +153,21 NULL 1 -- !query 16 -SELECT a AS k, COUNT(b) FROM testData GROUP BY k HAVING k > 1 +SELECT a AS k, udf(COUNT(b)) FROM testData GROUP BY k HAVING k > 1 -- !query 16 schema -struct<k:int,count(b):bigint> +struct<k:int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint> -- !query 16 output 2 2 3 2 -- !query 
17 -SELECT COUNT(b) AS k FROM testData GROUP BY k +SELECT udf(COUNT(b)) AS k FROM testData GROUP BY k -- !query 17 schema struct<> -- !query 17 output org.apache.spark.sql.AnalysisException -aggregate functions are not allowed in GROUP BY, but found count(testdata.`b`); +aggregate functions are not allowed in GROUP BY, but found CAST(udf(cast(count(b) as string)) AS BIGINT); -- !query 18 -180,7 +180,7 struct<> -- !query 19 -SELECT k AS a, COUNT(v) FROM testDataHasSameNameWithAlias GROUP BY a +SELECT k AS a, udf(COUNT(udf(v))) FROM testDataHasSameNameWithAlias GROUP BY udf(a) -- !query 19 schema struct<> -- !query 19 output -197,32 +197,32 spark.sql.groupByAliases false -- !query 21 -SELECT a AS k, COUNT(b) FROM testData GROUP BY k +SELECT a AS k, udf(COUNT(udf(b))) FROM testData GROUP BY k -- !query 21 schema struct<> -- !query 21 output org.apache.spark.sql.AnalysisException -cannot resolve '`k`' given input columns: [testdata.a, testdata.b]; line 1 pos 47 +cannot resolve '`k`' given input columns: [testdata.a, testdata.b]; line 1 pos 57 -- !query 22 -SELECT a, COUNT(1) FROM testData WHERE false GROUP BY a +SELECT udf(a), COUNT(udf(1)) FROM testData WHERE false GROUP BY udf(a) -- !query 22 schema -struct<a:int,count(1):bigint> +struct<CAST(udf(cast(a as string)) AS INT):int,count(CAST(udf(cast(1 as string)) AS INT)):bigint> -- !query 22 output -- !query 23 -SELECT COUNT(1) FROM testData WHERE false +SELECT udf(COUNT(1)) FROM testData WHERE false -- !query 23 schema -struct<count(1):bigint> +struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 23 output 0 -- !query 24 -SELECT 1 FROM (SELECT COUNT(1) FROM testData WHERE false) t +SELECT 1 FROM (SELECT udf(COUNT(1)) FROM testData WHERE false) t -- !query 24 schema struct<1:int> -- !query 24 output -232,7 +232,7 struct<1:int> -- !query 25 SELECT 1 from ( SELECT 1 AS z, - MIN(a.x) + udf(MIN(a.x)) FROM (select 1 as x) a WHERE false ) b -244,32 +244,32 struct<1:int> -- !query 26 -SELECT corr(DISTINCT x, 
y), corr(DISTINCT y, x), count() +SELECT corr(DISTINCT x, y), udf(corr(DISTINCT y, x)), count() FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y) -- !query 26 schema -struct<corr(DISTINCT CAST(x AS DOUBLE), CAST(y AS DOUBLE)):double,corr(DISTINCT CAST(y AS DOUBLE), CAST(x AS DOUBLE)):double,count(1):bigint> +struct<corr(DISTINCT CAST(x AS DOUBLE), CAST(y AS DOUBLE)):double,CAST(udf(cast(corr(distinct cast(y as double), cast(x as double)) as string)) AS DOUBLE):double,count(1):bigint> -- !query 26 output 1.0 1.0 3 -- !query 27 -SELECT 1 FROM range(10) HAVING true +SELECT udf(1) FROM range(10) HAVING true -- !query 27 schema -struct<1:int> +struct<CAST(udf(cast(1 as string)) AS INT):int> -- !query 27 output 1 -- !query 28 -SELECT 1 FROM range(10) HAVING MAX(id) > 0 +SELECT udf(udf(1)) FROM range(10) HAVING MAX(id) > 0 -- !query 28 schema -struct<1:int> +struct<CAST(udf(cast(cast(udf(cast(1 as string)) as int) as string)) AS INT):int> -- !query 28 output 1 -- !query 29 -SELECT id FROM range(10) HAVING id > 0 +SELECT udf(id) FROM range(10) HAVING id > 0 -- !query 29 schema struct<> -- !query 29 output -291,33 +291,33 struct<> -- !query 31 -SELECT every(v), some(v), any(v) FROM test_agg WHERE 1 = 0 +SELECT udf(every(v)), udf(some(v)), any(v) FROM test_agg WHERE 1 = 0 -- !query 31 schema -struct<every(v):boolean,some(v):boolean,any(v):boolean> +struct<CAST(udf(cast(every(v) as string)) AS BOOLEAN):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean> -- !query 31 output NULL NULL NULL -- !query 32 -SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 4 +SELECT udf(every(udf(v))), some(v), any(v) FROM test_agg WHERE k = 4 -- !query 32 schema -struct<every(v):boolean,some(v):boolean,any(v):boolean> +struct<CAST(udf(cast(every(cast(udf(cast(v as string)) as boolean)) as string)) AS BOOLEAN):boolean,some(v):boolean,any(v):boolean> -- !query 32 output NULL NULL NULL -- !query 33 -SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 5 +SELECT 
every(v), udf(some(v)), any(v) FROM test_agg WHERE k = 5 -- !query 33 schema -struct<every(v):boolean,some(v):boolean,any(v):boolean> +struct<every(v):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean> -- !query 33 output false true true -- !query 34 -SELECT k, every(v), some(v), any(v) FROM test_agg GROUP BY k +SELECT udf(k), every(v), udf(some(v)), any(v) FROM test_agg GROUP BY udf(k) -- !query 34 schema -struct<k:int,every(v):boolean,some(v):boolean,any(v):boolean> +struct<CAST(udf(cast(k as string)) AS INT):int,every(v):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean> -- !query 34 output 1 false true true 2 true true true -327,9 +327,9 struct<k:int,every(v):boolean,some(v):boolean,any(v):boolean> -- !query 35 -SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) = false +SELECT udf(k), every(v) FROM test_agg GROUP BY k HAVING every(v) = false -- !query 35 schema -struct<k:int,every(v):boolean> +struct<CAST(udf(cast(k as string)) AS INT):int,every(v):boolean> -- !query 35 output 1 false 3 false -337,77 +337,77 struct<k:int,every(v):boolean> -- !query 36 -SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) IS NULL +SELECT udf(k), udf(every(v)) FROM test_agg GROUP BY udf(k) HAVING every(v) IS NULL -- !query 36 schema -struct<k:int,every(v):boolean> +struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(every(v) as string)) AS BOOLEAN):boolean> -- !query 36 output 4 NULL -- !query 37 -SELECT k, - Every(v) AS every +SELECT udf(k), + udf(Every(v)) AS every FROM test_agg WHERE k = 2 AND v IN (SELECT Any(v) FROM test_agg WHERE k = 1) -GROUP BY k +GROUP BY udf(k) -- !query 37 schema -struct<k:int,every:boolean> +struct<CAST(udf(cast(k as string)) AS INT):int,every:boolean> -- !query 37 output 2 true -- !query 38 -SELECT k, +SELECT udf(udf(k)), Every(v) AS every FROM test_agg WHERE k = 2 AND v IN (SELECT Every(v) FROM test_agg WHERE k = 1) -GROUP BY k +GROUP BY udf(udf(k)) -- !query 38 schema 
-struct<k:int,every:boolean> +struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,every:boolean> -- !query 38 output -- !query 39 -SELECT every(1) +SELECT every(udf(1)) -- !query 39 schema struct<> -- !query 39 output org.apache.spark.sql.AnalysisException -cannot resolve 'every(1)' due to data type mismatch: Input to function 'every' should have been boolean, but it's [int].; line 1 pos 7 +cannot resolve 'every(CAST(udf(cast(1 as string)) AS INT))' due to data type mismatch: Input to function 'every' should have been boolean, but it's [int].; line 1 pos 7 -- !query 40 -SELECT some(1S) +SELECT some(udf(1S)) -- !query 40 schema struct<> -- !query 40 output org.apache.spark.sql.AnalysisException -cannot resolve 'some(1S)' due to data type mismatch: Input to function 'some' should have been boolean, but it's [smallint].; line 1 pos 7 +cannot resolve 'some(CAST(udf(cast(1 as string)) AS SMALLINT))' due to data type mismatch: Input to function 'some' should have been boolean, but it's [smallint].; line 1 pos 7 -- !query 41 -SELECT any(1L) +SELECT any(udf(1L)) -- !query 41 schema struct<> -- !query 41 output org.apache.spark.sql.AnalysisException -cannot resolve 'any(1L)' due to data type mismatch: Input to function 'any' should have been boolean, but it's [bigint].; line 1 pos 7 +cannot resolve 'any(CAST(udf(cast(1 as string)) AS BIGINT))' due to data type mismatch: Input to function 'any' should have been boolean, but it's [bigint].; line 1 pos 7 -- !query 42 -SELECT every("true") +SELECT udf(every("true")) -- !query 42 schema struct<> -- !query 42 output org.apache.spark.sql.AnalysisException -cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7 +cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 11 -- !query 43 -428,9 +428,9 struct<k:int,v:boolean,every(v) OVER 
(PARTITION BY k ORDER BY v ASC NULLS FIRST -- !query 44 -SELECT k, v, some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg +SELECT k, udf(udf(v)), some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg -- !query 44 schema -struct<k:int,v:boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean> +struct<k:int,CAST(udf(cast(cast(udf(cast(v as string)) as boolean) as string)) AS BOOLEAN):boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean> -- !query 44 output 1 false false 1 true true -445,9 +445,9 struct<k:int,v:boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST R -- !query 45 -SELECT k, v, any(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg +SELECT udf(udf(k)), v, any(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg -- !query 45 schema -struct<k:int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean> +struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean> -- !query 45 output 1 false false 1 true true -462,17 +462,17 struct<k:int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RA -- !query 46 -SELECT count() FROM test_agg HAVING count() > 1L +SELECT udf(count()) FROM test_agg HAVING count() > 1L -- !query 46 schema -struct<count(1):bigint> +struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 46 output 10 -- !query 47 -SELECT k, max(v) FROM test_agg GROUP BY k HAVING max(v) = true +SELECT k, udf(max(v)) FROM test_agg GROUP BY k HAVING max(v) = true -- !query 47 schema -struct<k:int,max(v):boolean> +struct<k:int,CAST(udf(cast(max(v) as string)) AS BOOLEAN):boolean> -- !query 47 output 1 true 2 true -480,7 +480,7 struct<k:int,max(v):boolean> -- !query 48 -SELECT 
* FROM (SELECT COUNT() AS cnt FROM test_agg) WHERE cnt > 1L +SELECT FROM (SELECT udf(COUNT()) AS cnt FROM test_agg) WHERE cnt > 1L -- !query 48 schema struct<cnt:bigint> -- !query 48 output -488,7 +488,7 struct<cnt:bigint> -- !query 49 -SELECT count() FROM test_agg WHERE count() > 1L +SELECT udf(count()) FROM test_agg WHERE count() > 1L -- !query 49 schema struct<> -- !query 49 output -500,7 +500,7 Invalid expressions: [count(1)]; -- !query 50 -SELECT count() FROM test_agg WHERE count() + 1L > 1L +SELECT udf(count()) FROM test_agg WHERE count() + 1L > 1L -- !query 50 schema struct<> -- !query 50 output -512,7 +512,7 Invalid expressions: [count(1)]; -- !query 51 -SELECT count() FROM test_agg WHERE k = 1 or k = 2 or count() + 1L > 1L or max(k) > 1 +SELECT udf(count()) FROM test_agg WHERE k = 1 or k = 2 or count(*) + 1L > 1L or max(k) > 1 -- !query 51 schema struct<> -- !query 51 output ``` </p> </details> ## How was this patch tested? Tested as instructed in SPARK-27921. Closes #25360 from skonto/group-by-followup. Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-13 10:06:32 +09:00
Maxim Gekk	6964128e25	[SPARK-28017][SPARK-28656][SQL][FOLLOW-UP] Restore comments in date.sql ## What changes were proposed in this pull request? Restored comments in `date.sql` removed by `924d794a6f` and `997d153e54`. The comments were introduced by `51379b731d`. ## How was this patch tested? By re-running `date.sql` via: ```shell $ build/sbt "sql/test-only *SQLQueryTestSuite -- -z date.sql" ``` Closes #25422 from MaxGekk/sql-comments-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-12 11:19:19 -07:00
Yuming Wang	47af8925b6	[SPARK-28675][SQL] Remove maskCredentials and use redactOptions ## What changes were proposed in this pull request? This PR replaces `CatalogUtils.maskCredentials` with `SQLConf.get.redactOptions` to match other redacts. ## How was this patch tested? unit test and manual tests: Before this PR: ```sql spark-sql> DESC EXTENDED test_spark_28675; id int NULL # Detailed Table Information Database default Table test_spark_28675 Owner root Created Time Fri Aug 09 08:23:17 GMT-07:00 2019 Last Access Wed Dec 31 17:00:00 GMT-07:00 1969 Created By Spark 3.0.0-SNAPSHOT Type MANAGED Provider org.apache.spark.sql.jdbc Location file:/user/hive/warehouse/test_spark_28675 Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675] spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675'; default test_spark_28675 false Database: default Table: test_spark_28675 Owner: root Created Time: Fri Aug 09 08:23:17 GMT-07:00 2019 Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969 Created By: Spark 3.0.0-SNAPSHOT Type: MANAGED Provider: org.apache.spark.sql.jdbc Location: file:/user/hive/warehouse/test_spark_28675 Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties: [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675] Schema: root \|-- id: integer (nullable = true) ``` After this PR: ```sql spark-sql> DESC EXTENDED test_spark_28675; id int NULL # Detailed Table Information Database default Table test_spark_28675 Owner root Created Time Fri Aug 09 08:19:49 GMT-07:00 2019 Last Access Wed Dec 31 17:00:00 GMT-07:00 1969 Created By Spark 3.0.0-SNAPSHOT Type MANAGED Provider 
org.apache.spark.sql.jdbc Location file:/user/hive/warehouse/test_spark_28675 Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties [url=*******(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675] spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675'; default test_spark_28675 false Database: default Table: test_spark_28675 Owner: root Created Time: Fri Aug 09 08:19:49 GMT-07:00 2019 Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969 Created By: Spark 3.0.0-SNAPSHOT Type: MANAGED Provider: org.apache.spark.sql.jdbc Location: file:/user/hive/warehouse/test_spark_28675 Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat Storage Properties: [url=*******(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675] Schema: root \|-- id: integer (nullable = true) ``` Closes #25395 from wangyum/SPARK-28675. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-10 16:45:59 -07:00
younggyu chun	8535df7261	[MINOR] Fix typos in comments and replace an explicit type with <> ## What changes were proposed in this pull request? This PR fixed typos in comments and replaced the explicit type with '<>' for Java 8+. ## How was this patch tested? Manually tested. Closes #25338 from younggyuchun/younggyu. Authored-by: younggyu chun <younggyuchun@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-08-10 16:47:11 -05:00
Maxim Gekk	924d794a6f	[SPARK-28656][SQL] Support `millennium`, `century` and `decade` at `extract()` ## What changes were proposed in this pull request? In the PR, I propose new expressions `Millennium`, `Century` and `Decade`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT): 1. `millennium` - the current millennium for the given date (or a timestamp implicitly cast to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_. 2. `century` - the current century for the given date (or timestamp). The first century starts at 0001-01-01 AD. 3. `decade` - the current decade for the given date (or timestamp). Actually, this is the year field divided by 10. Here are examples: ```sql spark-sql> SELECT EXTRACT(MILLENNIUM FROM DATE '1981-01-19'); 2 spark-sql> SELECT EXTRACT(CENTURY FROM DATE '1981-01-19'); 20 spark-sql> SELECT EXTRACT(DECADE FROM DATE '1981-01-19'); 198 ``` ## How was this patch tested? Added new tests to `DateExpressionsSuite` and uncommented existing tests in `pgSQL/date.sql`. Closes #25388 from MaxGekk/extract-ext2. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-09 11:18:50 -07:00
gengjiaan	5159876415	[SPARK-28077][SQL][TEST][FOLLOW-UP] Enable Overlay function tests ## What changes were proposed in this pull request? This PR is a follow-up to https://github.com/apache/spark/pull/24918 ## How was this patch tested? Passes Jenkins with the newly updated test files. Closes #25393 from beliefer/enable-overlay-tests. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-08-09 19:05:41 +09:00
Shixiong Zhu	5bb69945e4	[SPARK-28651][SS] Force the schema of Streaming file source to be nullable ## What changes were proposed in this pull request? Right now, batch DataFrame always changes the schema to nullable automatically (See this line: `325bc8e9c6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L399)`). But the streaming file source is missing this. This PR updates the streaming file source schema to force it to be nullable. I also added a flag `spark.sql.streaming.fileSource.schema.forceNullable` to disable this change since some users may rely on the old behavior. ## How was this patch tested? The new unit test. Closes #25382 from zsxwing/SPARK-28651. Authored-by: Shixiong Zhu <zsxwing@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-09 18:54:55 +09:00
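The three field definitions in SPARK-28656 reduce to integer arithmetic on the year. This is a minimal Python sketch, assuming AD years only (`year >= 1`); it is not Spark's implementation and the function names are hypothetical.

```python
def millennium(year: int) -> int:
    """Millennium number; year 2001 starts the third millennium."""
    return (year - 1) // 1000 + 1


def century(year: int) -> int:
    """Century number; the first century starts at year 1."""
    return (year - 1) // 100 + 1


def decade(year: int) -> int:
    """The year field divided by 10."""
    return year // 10


fields_1981 = (millennium(1981), century(1981), decade(1981))
```

For 1981 this yields `(2, 20, 198)`, the same values as the three `EXTRACT` examples. The `year - 1` shift is what places year 2000 in the second millennium and 2001 in the third.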
Burak Yavuz	5368eaa2fc	[SPARK-28565][SQL] DataFrameWriter saveAsTable support for V2 catalogs ## What changes were proposed in this pull request? Adds support for V2 catalogs and the V2SessionCatalog for V2 tables for saveAsTable. If the table can resolve through the V2SessionCatalog, we use SaveMode for datasource v1 for backwards compatibility to select the code path we're going to hit. Depending on the SaveMode: - SaveMode.Append: a) If table exists: Use AppendData.byName b) If table doesn't exist, use CTAS (ignoreIfExists = false) - SaveMode.Overwrite: Use RTAS (orCreate = true) - SaveMode.Ignore: Use CTAS (ignoreIfExists = true) - SaveMode.ErrorIfExists: Use CTAS (ignoreIfExists = false) ## How was this patch tested? Unit tests in DataSourceV2DataFrameSuite Closes #25330 from brkyvz/saveAsTable. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-08-08 22:30:00 -07:00
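The SaveMode dispatch listed in SPARK-28565 can be summarized as a small decision function. The sketch below is a plain-Python restatement of that list, assuming nothing beyond what the message says; `choose_write_plan` and the returned strings are hypothetical labels, not Spark classes or APIs.

```python
def choose_write_plan(save_mode: str, table_exists: bool) -> str:
    """Return a label for the plan the commit message says each SaveMode selects."""
    if save_mode == "Append":
        # Append writes into an existing table, otherwise creates it.
        if table_exists:
            return "AppendData.byName"
        return "CTAS(ignoreIfExists=false)"
    if save_mode == "Overwrite":
        return "RTAS(orCreate=true)"
    if save_mode == "Ignore":
        return "CTAS(ignoreIfExists=true)"
    if save_mode == "ErrorIfExists":
        return "CTAS(ignoreIfExists=false)"
    raise ValueError(f"unknown save mode: {save_mode}")


for mode in ("Append", "Overwrite", "Ignore", "ErrorIfExists"):
    print(mode, "->", choose_write_plan(mode, table_exists=False))
```

Only `Append` depends on whether the table already exists; the other modes map to a single plan each.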
Maxim Gekk	997d153e54	[SPARK-28017][SQL] Support additional levels of truncations by DATE_TRUNC/TRUNC ## What changes were proposed in this pull request? I propose new levels of truncations for the `date_trunc()` and `trunc()` functions: 1. `MICROSECOND` and `MILLISECOND` truncate values of the `TIMESTAMP` type to microsecond and millisecond precision. 2. `DECADE`, `CENTURY` and `MILLENNIUM` truncate dates/timestamps to lowest date of current decade/century/millennium. Also the `WEEK` and `QUARTER` levels have been supported by the `trunc()` function. The function is implemented similarly to `date_trunc` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC to maintain feature parity with it. Here are examples of `TRUNC`: ```sql spark-sql> SELECT TRUNC('2015-10-27', 'DECADE'); 2010-01-01 spark-sql> set spark.sql.datetime.java8API.enabled=true; spark.sql.datetime.java8API.enabled true spark-sql> SELECT TRUNC('1999-10-27', 'millennium'); 1001-01-01 ``` Examples of `DATE_TRUNC`: ```sql spark-sql> SELECT DATE_TRUNC('CENTURY', '2015-03-05T09:32:05.123456'); 2001-01-01T00:00:00Z ``` ## How was this patch tested? Added new tests to `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`, and uncommented existing tests in `pgSQL/date.sql`. Closes #25336 from MaxGekk/date_truct-ext. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-09 12:29:44 +08:00
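The `DECADE`, `CENTURY` and `MILLENNIUM` truncation levels from SPARK-28017 reduce to simple year arithmetic. The following is a minimal Python sketch of that arithmetic, checked against the examples in the message above; it is not Spark's `DateTimeUtils` code, and the function names are hypothetical.

```python
from datetime import date


def trunc_decade(d: date) -> date:
    # First day of the decade. Assumes d.year >= 10, since Python's date
    # cannot represent year 0.
    return date(d.year - d.year % 10, 1, 1)


def trunc_century(d: date) -> date:
    # Centuries start in years ending in 01 (e.g. 1901, 2001).
    return date((d.year - 1) // 100 * 100 + 1, 1, 1)


def trunc_millennium(d: date) -> date:
    # Millennia start in years ending in 001 (e.g. 1001, 2001).
    return date((d.year - 1) // 1000 * 1000 + 1, 1, 1)


print(trunc_decade(date(2015, 10, 27)))
print(trunc_century(date(2015, 3, 5)))
print(trunc_millennium(date(1999, 10, 27)))
```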
Burak Yavuz	c80430f5c9	[SPARK-28572][SQL] Simple analyzer checks for v2 table creation code paths ## What changes were proposed in this pull request? Adds checks around: - The existence of transforms in the table schema (even in nested fields) - Duplications of transforms - Case sensitivity checks around column names in the V2 table creation code paths. ## How was this patch tested? Unit tests. Closes #25305 from brkyvz/v2CreateTable. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-09 12:04:28 +08:00
Yuming Wang	2580c1bfe2	[SPARK-28660][SQL][TEST] Port AGGREGATES.sql [Part 4] ## What changes were proposed in this pull request? This PR is to port AGGREGATES.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L607-L997 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/aggregates.out#L1615-L2289 When porting the test cases, found six PostgreSQL-specific features that do not exist in Spark SQL: [SPARK-27980](https://issues.apache.org/jira/browse/SPARK-27980): Ordered-Set Aggregate Functions [SPARK-28661](https://issues.apache.org/jira/browse/SPARK-28661): Hypothetical-Set Aggregate Functions [SPARK-28382](https://issues.apache.org/jira/browse/SPARK-28382): Array Functions: unnest [SPARK-28663](https://issues.apache.org/jira/browse/SPARK-28663): Aggregate Functions for Statistics [SPARK-28664](https://issues.apache.org/jira/browse/SPARK-28664): ORDER BY in aggregate function [SPARK-28669](https://issues.apache.org/jira/browse/SPARK-28669): System Information Functions ## How was this patch tested? N/A Closes #25392 from wangyum/SPARK-28660. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-08 16:39:32 -07:00
Yuming Wang	d19a56f9db	[SPARK-28642][SQL] Hide credentials in SHOW CREATE TABLE ## What changes were proposed in this pull request? [SPARK-17783](https://issues.apache.org/jira/browse/SPARK-17783) hid credentials in `CREATE` and `DESC FORMATTED/EXTENDED` for a PERSISTENT/TEMP table for JDBC. But `SHOW CREATE TABLE` exposed the credentials: ```sql spark-sql> show create table mysql_federated_sample; CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN) USING org.apache.spark.sql.jdbc OPTIONS ( `url` 'jdbc:mysql://localhost/hive?user=root&password=mypasswd', `driver` 'com.mysql.jdbc.Driver', `dbtable` 'TBLS' ) ``` This PR fixes this issue. ## How was this patch tested? unit tests and manual tests: ```sql spark-sql> show create table mysql_federated_sample; CREATE TABLE `mysql_federated_sample` (`TBL_ID` BIGINT, `CREATE_TIME` INT, `DB_ID` BIGINT, `LAST_ACCESS_TIME` INT, `OWNER` STRING, `RETENTION` INT, `SD_ID` BIGINT, `TBL_NAME` STRING, `TBL_TYPE` STRING, `VIEW_EXPANDED_TEXT` STRING, `VIEW_ORIGINAL_TEXT` STRING, `IS_REWRITE_ENABLED` BOOLEAN) USING org.apache.spark.sql.jdbc OPTIONS ( `url` '*********(redacted)', `driver` 'com.mysql.jdbc.Driver', `dbtable` 'TBLS' ) ``` Closes #25375 from wangyum/SPARK-28642. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-08 16:24:43 -07:00
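The before/after output in SPARK-28642 amounts to replacing sensitive option values before the `OPTIONS (...)` clause is rendered. Below is a minimal Python sketch of that idea, assuming a key-based match; the `SENSITIVE_KEY` pattern and `redact_options` helper are hypothetical and do not reflect how Spark actually selects which options to redact. Only the replacement string is taken from the output shown in the message.

```python
import re

# Replacement text as it appears in the redacted SHOW CREATE TABLE output.
REDACTED = "*********(redacted)"

# Hypothetical pattern: treat any option whose key looks sensitive as secret.
SENSITIVE_KEY = re.compile(r"(?i)url|password|secret|token")


def redact_options(options: dict) -> dict:
    """Return a copy of options with values of sensitive-looking keys masked."""
    return {
        key: REDACTED if SENSITIVE_KEY.search(key) else value
        for key, value in options.items()
    }


opts = {
    "url": "jdbc:mysql://localhost/hive?user=root&password=mypasswd",
    "driver": "com.mysql.jdbc.Driver",
    "dbtable": "TBLS",
}
print(redact_options(opts))
```

Masking the whole `url` value, rather than trying to strip only the `password` query parameter, avoids having to parse every JDBC URL dialect.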
HyukjinKwon	8c0dc38640	[SPARK-28654][SQL] Move "Extract Python UDFs" to the last in optimizer ## What changes were proposed in this pull request? Plans produced after "Extract Python UDFs" are fragile and easily broken by other rules. For instance, if we add a rule such as `PushDownPredicates` to `postHocOptimizationBatches`, the following test in `BatchEvalPythonExecSuite` fails: ```scala test("Python UDF refers to the attributes from more than one child") { val df = Seq(("Hello", 4)).toDF("a", "b") val df2 = Seq(("Hello", 4)).toDF("c", "d") val joinDF = df.crossJoin(df2).where("dummyPythonUDF(a, c) == dummyPythonUDF(d, c)") val qualifiedPlanNodes = joinDF.queryExecution.executedPlan.collect { case b: BatchEvalPythonExec => b } assert(qualifiedPlanNodes.size == 1) } ``` ``` Invalid PythonUDF dummyUDF(a#63, c#74), requires attributes from more than one child. ``` This is because the Python UDF extraction optimization is rolled back as below: ``` === Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates === !Filter (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18)) Join Cross, (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18)) !+- Join Cross :- Project [_1#2 AS a#7, _2#3 AS b#8] ! :- Project [_1#2 AS a#7, _2#3 AS b#8] : +- LocalRelation [_1#2, _2#3] ! : +- LocalRelation [_1#2, _2#3] +- Project [_1#13 AS c#18, _2#14 AS d#19] ! +- Project [_1#13 AS c#18, _2#14 AS d#19] +- LocalRelation [_1#13, _2#14] ! +- LocalRelation [_1#13, _2#14] ``` It seems we should handle the Python UDF cases last, even after the post hoc rules. Note that this follows the approach of previous versions, when these rules were in physical plans (see SPARK-24721 and SPARK-12981). Those optimization rules were supposed to be placed at the end. Note that I intentionally didn't move `ExperimentalMethods` (`spark.experimental.extraStrategies`). This is an explicitly experimental API, and I wanted to keep it available as a just-in-case workaround after this change for now. ## How was this patch tested? Existing tests should cover.
Closes #25386 from HyukjinKwon/SPARK-28654. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-08 20:21:07 +08:00
Yishuang Lu	e58dd4af60	[MINOR][DOC] Fix a typo 'lister' -> 'listener' ## What changes were proposed in this pull request? Fix the typo in java doc. ## How was this patch tested? N/A Signed-off-by: Yishuang Lu <luystu@gmail.com> Closes #25377 from lys0716/dev. Authored-by: Yishuang Lu <luystu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-08 11:12:18 +09:00
Yuming Wang	eeaf1851b2	[SPARK-28617][SQL][TEST] Fix misplacement when comment is at the end of the query ## What changes were proposed in this pull request? This PR fixes the misplacement that occurs when a comment is at the end of the query. Example: Comment for ` SELECT date '5874898-01-01'`: `2d74f14d74/sql/core/src/test/resources/sql-tests/inputs/pgSQL/date.sql (L200)` But the golden file is: `a5a5da78cf/sql/core/src/test/resources/sql-tests/results/pgSQL/date.sql.out (L484-L507)` After this PR: `eeb7405ad0/sql/core/src/test/resources/sql-tests/results/pgSQL/date.sql.out (L482-L501)` ## How was this patch tested? N/A Closes #25357 from wangyum/SPARK-28617. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-07 16:45:23 -07:00
maryannxue	325bc8e9c6	[SPARK-28583][SQL] Subqueries should not call `onUpdatePlan` in Adaptive Query Execution ## What changes were proposed in this pull request? Subqueries do not have their own execution id, thus when calling `AdaptiveSparkPlanExec.onUpdatePlan`, it will actually get the `QueryExecution` instance of the main query, which is wasteful and problematic. It could cause issues like stack overflow or deadlocks in some circumstances. This PR fixes this issue by making `AdaptiveSparkPlanExec` compare the `QueryExecution` object retrieved by the current execution ID against the `QueryExecution` object from which this plan is created, and only update the UI when the two instances are the same. ## How was this patch tested? Manual tests on TPC-DS queries. Closes #25316 from maryannxue/aqe-updateplan-fix. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: herman <herman@databricks.com>	2019-08-07 22:10:17 +02:00
Wenchen Fan	469423f338	[SPARK-28595][SQL] explain should not trigger partition listing ## What changes were proposed in this pull request? Sometimes when you explain a query, you will get stuck for a while. What's worse, you will get stuck again if you explain again. This is caused by `FileSourceScanExec`: 1. In its `toString`, it needs to report the number of partitions it reads. This needs to query the hive metastore. 2. In its `outputOrdering`, it needs to get all the files. This needs to query the hive metastore. This PR fixes this by: 1. `toString` does not need to report the number of partitions it reads. We should report it via SQL metrics. 2. The `outputOrdering` is not very useful. We can only apply it if a) all the bucket columns are read. b) there is only one file in each bucket. This condition is really hard to meet, and even if we meet it, sorting an already sorted file is pretty fast and avoiding the sort is not that useful. I think it's worth giving up this optimization so that explain doesn't need to get stuck. ## How was this patch tested? existing tests Closes #25328 from cloud-fan/ui. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-07 19:14:25 +08:00
gengjiaan	99de6a4240	[SPARK-27924][SQL][TEST][FOLLOW-UP] Enable Boolean-Predicate syntax tests ## What changes were proposed in this pull request? This PR is a follow-up to https://github.com/apache/spark/pull/25074 ## How was this patch tested? Pass Jenkins with the newly updated test files. Closes #25366 from beliefer/uncomment-boolean-test. Authored-by: gengjiaan <gengjiaan@360.cn> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-07 00:34:49 -07:00
mcheah	44e607e921	[SPARK-28238][SQL] Implement DESCRIBE TABLE for Data Source V2 Tables ## What changes were proposed in this pull request? Implements the `DESCRIBE TABLE` logical and physical plans for data source v2 tables. ## How was this patch tested? Added unit tests to `DataSourceV2SQLSuite`. Closes #25040 from mccheah/describe-table-v2. Authored-by: mcheah <mcheah@palantir.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-07 14:26:45 +08:00
WeichenXu	a133175ffa	[SPARK-28615][SQL][DOCS] Add a guide line for dataframe functions to say column signature function is by default ## What changes were proposed in this pull request? Add a guide line for dataframe functions, say: ``` This function APIs usually have methods with Column signature only because it can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons. ``` ## How was this patch tested? N/A Closes #25355 from WeichenXu123/update_functions_guide2. Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-07 10:39:47 +09:00
Nik Vanderhoof	9e931e787d	[SPARK-27905][SQL] Add higher order function 'forall' ## What changes were proposed in this pull request? Adds the higher order function `forall`, which tests an array to see if a predicate holds for every element. The function is implemented in `org.apache.spark.sql.catalyst.expressions.ArrayForAll`. The function is added to the function registry under the pretty name `forall`. ## How was this patch tested? I've added appropriate unit tests for the new ArrayForAll expression in `sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala`. Also added tests for the function in `sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala`. Not sure who is best to ask about this PR so: HyukjinKwon rxin gatorsmile ueshin srowen hvanhovell gatorsmile Closes #24761 from nvander1/feature/for_all. Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com> Co-authored-by: Nik <nikolasrvanderhoof@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2019-08-06 14:25:53 -07:00
Maxim Gekk	9e3aab8b95	[SPARK-28623][SQL] Support `dow`, `isodow` and `doy` by `extract()` ## What changes were proposed in this pull request? In the PR, I propose to use existing expressions `DayOfYear`, `WeekDay` and `DayOfWeek`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT): 1. `dow` - the day of the week as Sunday (0) to Saturday (6) 2. `isodow` - the day of the week as Monday (1) to Sunday (7) 3. `doy` - the day of the year (1 - 365/366) Here are examples: ```sql spark-sql> SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40'); 5 spark-sql> SELECT EXTRACT(ISODOW FROM TIMESTAMP '2001-02-18 20:38:40'); 7 spark-sql> SELECT EXTRACT(DOY FROM TIMESTAMP '2001-02-16 20:38:40'); 47 ``` ## How was this patch tested? Updated `extract.sql`. Closes #25367 from MaxGekk/extract-ext. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-06 13:39:49 -07:00
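The three numbering conventions in SPARK-28623 are easy to confuse, so here is a small Python sketch of them using only the standard library. It illustrates the semantics listed above and is not Spark's `DayOfWeek`/`WeekDay`/`DayOfYear` code; the function names are hypothetical.

```python
from datetime import date


def dow(d: date) -> int:
    # Sunday (0) to Saturday (6): isoweekday() gives Monday=1..Sunday=7,
    # so taking it modulo 7 maps Sunday to 0 and leaves the rest unchanged.
    return d.isoweekday() % 7


def isodow(d: date) -> int:
    # Monday (1) to Sunday (7), the ISO 8601 numbering.
    return d.isoweekday()


def doy(d: date) -> int:
    # Day of the year, 1 to 365 (or 366 in leap years).
    return d.timetuple().tm_yday


print(dow(date(2001, 2, 16)), isodow(date(2001, 2, 18)), doy(date(2001, 2, 16)))
```

The two weekday fields differ only on Sundays, where `dow` yields 0 and `isodow` yields 7.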
HyukjinKwon	bab88c48b1	[SPARK-28622][SQL][PYTHON] Rename PullOutPythonUDFInJoinCondition to ExtractPythonUDFFromJoinCondition and move to 'Extract Python UDFs' ## What changes were proposed in this pull request? This PR renames `PullOutPythonUDFInJoinCondition` to `ExtractPythonUDFFromJoinCondition` and moves it to 'Extract Python UDFs' together with the other Python UDF related rules. Currently the `PullOutPythonUDFInJoinCondition` rule sits alone, outside the other 'Extract Python UDFs' rules, and the name `ExtractPythonUDFFromJoinCondition` matches the existing Python UDF extraction rules. ## How was this patch tested? Existing tests should cover. Closes #25358 from HyukjinKwon/move-python-join-rule. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-08-05 23:36:35 -07:00
Udbhav30	150dbc5dc2	[SPARK-28391][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into groupby clause in 'pgSQL/select_implicit.sql' ## What changes were proposed in this pull request? This PR adds UDF cases into group by clause in 'pgSQL/select_implicit.sql' <details><summary>Diff comparing to 'pgSQL/select_implicit.sql'</summary> <p> ```diff diff --git a/home/root1/src/spark/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-select_implicit.sql.out b/home/root1/src/spark/sql/core/src/test/resources/sql-tests/results/pgSQL/select_implicit.sql.out index 17303b2..0675820 100755 --- a/home/root1/src/spark/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-select_implicit.sql.out +++ b/home/root1/src/spark/sql/core/src/test/resources/sql-tests/results/pgSQL/select_implicit.sql.out -91,11 +91,9 struct<> -- !query 11 -SELECT udf(c), udf(count()) FROM test_missing_target GROUP BY -udf(test_missing_target.c) -ORDER BY udf(c) +SELECT c, count() FROM test_missing_target GROUP BY test_missing_target.c ORDER BY c -- !query 11 schema -struct<CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<c:string,count(1):bigint> -- !query 11 output ABAB 2 BBBB 2 -106,10 +104,9 cccc 2 -- !query 12 -SELECT udf(count()) FROM test_missing_target GROUP BY udf(test_missing_target.c) -ORDER BY udf(c) +SELECT count() FROM test_missing_target GROUP BY test_missing_target.c ORDER BY c -- !query 12 schema -struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<count(1):bigint> -- !query 12 output 2 2 -120,18 +117,18 struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 13 -SELECT udf(count()) FROM test_missing_target GROUP BY udf(a) ORDER BY udf(b) +SELECT count() FROM test_missing_target GROUP BY a ORDER BY b -- !query 13 schema struct<> -- !query 13 output org.apache.spark.sql.AnalysisException -cannot resolve '`b`' given input columns: [CAST(udf(cast(count(1) as string)) AS BIGINT)]; line 1 pos 
75 +cannot resolve '`b`' given input columns: [count(1)]; line 1 pos 61 -- !query 14 -SELECT udf(count()) FROM test_missing_target GROUP BY udf(b) ORDER BY udf(b) +SELECT count() FROM test_missing_target GROUP BY b ORDER BY b -- !query 14 schema -struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<count(1):bigint> -- !query 14 output 1 2 -140,10 +137,10 struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 15 -SELECT udf(test_missing_target.b), udf(count()) - FROM test_missing_target GROUP BY udf(b) ORDER BY udf(b) +SELECT test_missing_target.b, count() + FROM test_missing_target GROUP BY b ORDER BY b -- !query 15 schema -struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<b:int,count(1):bigint> -- !query 15 output 1 1 2 2 -152,9 +149,9 struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string) -- !query 16 -SELECT udf(c) FROM test_missing_target ORDER BY udf(a) +SELECT c FROM test_missing_target ORDER BY a -- !query 16 schema -struct<CAST(udf(cast(c as string)) AS STRING):string> +struct<c:string> -- !query 16 output XXXX ABAB -169,10 +166,9 CCCC -- !query 17 -SELECT udf(count()) FROM test_missing_target GROUP BY udf(b) ORDER BY udf(b) -desc +SELECT count() FROM test_missing_target GROUP BY b ORDER BY b desc -- !query 17 schema -struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<count(1):bigint> -- !query 17 output 4 3 -181,17 +177,17 struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 18 -SELECT udf(count()) FROM test_missing_target ORDER BY udf(1) desc +SELECT count() FROM test_missing_target ORDER BY 1 desc -- !query 18 schema -struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<count(1):bigint> -- !query 18 output 10 -- !query 19 -SELECT udf(c), udf(count()) FROM test_missing_target GROUP BY 1 ORDER BY 1 +SELECT c, count() FROM test_missing_target GROUP BY 1 ORDER BY 1 -- !query 19 schema 
-struct<CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<c:string,count(1):bigint> -- !query 19 output ABAB 2 BBBB 2 -202,30 +198,30 cccc 2 -- !query 20 -SELECT udf(c), udf(count()) FROM test_missing_target GROUP BY 3 +SELECT c, count() FROM test_missing_target GROUP BY 3 -- !query 20 schema struct<> -- !query 20 output org.apache.spark.sql.AnalysisException -GROUP BY position 3 is not in select list (valid range is [1, 2]); line 1 pos 63 +GROUP BY position 3 is not in select list (valid range is [1, 2]); line 1 pos 53 -- !query 21 -SELECT udf(count()) FROM test_missing_target x, test_missing_target y - WHERE udf(x.a) = udf(y.a) - GROUP BY udf(b) ORDER BY udf(b) +SELECT count() FROM test_missing_target x, test_missing_target y + WHERE x.a = y.a + GROUP BY b ORDER BY b -- !query 21 schema struct<> -- !query 21 output org.apache.spark.sql.AnalysisException -Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 14 +Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 10 -- !query 22 -SELECT udf(a), udf(a) FROM test_missing_target - ORDER BY udf(a) +SELECT a, a FROM test_missing_target + ORDER BY a -- !query 22 schema -struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int> +struct<a:int,a:int> -- !query 22 output 0 0 1 1 -240,10 +236,10 struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS IN -- !query 23 -SELECT udf(udf(a)/2), udf(udf(a)/2) FROM test_missing_target - ORDER BY udf(udf(a)/2) +SELECT a/2, a/2 FROM test_missing_target + ORDER BY a/2 -- !query 23 schema -struct<CAST(udf(cast((cast(udf(cast(a as string)) as int) div 2) as string)) AS INT):int,CAST(udf(cast((cast(udf(cast(a as string)) as int) div 2) as string)) AS INT):int> +struct<(a div 2):int,(a div 2):int> -- !query 23 output 0 0 0 0 -258,10 +254,10 struct<CAST(udf(cast((cast(udf(cast(a as string)) as int) div 2) as string)) AS -- !query 24 -SELECT udf(a/2), udf(a/2) FROM 
test_missing_target - GROUP BY udf(a/2) ORDER BY udf(a/2) +SELECT a/2, a/2 FROM test_missing_target + GROUP BY a/2 ORDER BY a/2 -- !query 24 schema -struct<CAST(udf(cast((a div 2) as string)) AS INT):int,CAST(udf(cast((a div 2) as string)) AS INT):int> +struct<(a div 2):int,(a div 2):int> -- !query 24 output 0 0 1 1 -271,11 +267,11 struct<CAST(udf(cast((a div 2) as string)) AS INT):int,CAST(udf(cast((a div 2) a -- !query 25 -SELECT udf(x.b), udf(count()) FROM test_missing_target x, test_missing_target y - WHERE udf(x.a) = udf(y.a) - GROUP BY udf(x.b) ORDER BY udf(x.b) +SELECT x.b, count() FROM test_missing_target x, test_missing_target y + WHERE x.a = y.a + GROUP BY x.b ORDER BY x.b -- !query 25 schema -struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<b:int,count(1):bigint> -- !query 25 output 1 1 2 2 -284,11 +280,11 struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string) -- !query 26 -SELECT udf(count()) FROM test_missing_target x, test_missing_target y - WHERE udf(x.a) = udf(y.a) - GROUP BY udf(x.b) ORDER BY udf(x.b) +SELECT count() FROM test_missing_target x, test_missing_target y + WHERE x.a = y.a + GROUP BY x.b ORDER BY x.b -- !query 26 schema -struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> +struct<count(1):bigint> -- !query 26 output 1 2 -297,22 +293,22 struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint> -- !query 27 -SELECT udf(a%2), udf(count(udf(b))) FROM test_missing_target -GROUP BY udf(test_missing_target.a%2) -ORDER BY udf(test_missing_target.a%2) +SELECT a%2, count(b) FROM test_missing_target +GROUP BY test_missing_target.a%2 +ORDER BY test_missing_target.a%2 -- !query 27 schema -struct<CAST(udf(cast((a % 2) as string)) AS INT):int,CAST(udf(cast(count(cast(udf(cast(b as string)) as int)) as string)) AS BIGINT):bigint> +struct<(a % 2):int,count(b):bigint> -- !query 27 output 0 5 1 5 -- !query 28 -SELECT udf(count(c)) FROM test_missing_target 
-GROUP BY udf(lower(test_missing_target.c)) -ORDER BY udf(lower(test_missing_target.c)) +SELECT count(c) FROM test_missing_target +GROUP BY lower(test_missing_target.c) +ORDER BY lower(test_missing_target.c) -- !query 28 schema -struct<CAST(udf(cast(count(c) as string)) AS BIGINT):bigint> +struct<count(c):bigint> -- !query 28 output 2 3 -321,18 +317,18 struct<CAST(udf(cast(count(c) as string)) AS BIGINT):bigint> -- !query 29 -SELECT udf(count(udf(a))) FROM test_missing_target GROUP BY udf(a) ORDER BY udf(b) +SELECT count(a) FROM test_missing_target GROUP BY a ORDER BY b -- !query 29 schema struct<> -- !query 29 output org.apache.spark.sql.AnalysisException -cannot resolve '`b`' given input columns: [CAST(udf(cast(count(cast(udf(cast(a as string)) as int)) as string)) AS BIGINT)]; line 1 pos 80 +cannot resolve '`b`' given input columns: [count(a)]; line 1 pos 61 -- !query 30 -SELECT udf(count(b)) FROM test_missing_target GROUP BY udf(b/2) ORDER BY udf(b/2) +SELECT count(b) FROM test_missing_target GROUP BY b/2 ORDER BY b/2 -- !query 30 schema -struct<CAST(udf(cast(count(b) as string)) AS BIGINT):bigint> +struct<count(b):bigint> -- !query 30 output 1 5 -340,10 +336,10 struct<CAST(udf(cast(count(b) as string)) AS BIGINT):bigint> -- !query 31 -SELECT udf(lower(test_missing_target.c)), udf(count(udf(c))) - FROM test_missing_target GROUP BY udf(lower(c)) ORDER BY udf(lower(c)) +SELECT lower(test_missing_target.c), count(c) + FROM test_missing_target GROUP BY lower(c) ORDER BY lower(c) -- !query 31 schema -struct<CAST(udf(cast(lower(c) as string)) AS STRING):string,CAST(udf(cast(count(cast(udf(cast(c as string)) as string)) as string)) AS BIGINT):bigint> +struct<lower(c):string,count(c):bigint> -- !query 31 output abab 2 bbbb 3 -352,9 +348,9 xxxx 1 -- !query 32 -SELECT udf(a) FROM test_missing_target ORDER BY udf(upper(udf(d))) +SELECT a FROM test_missing_target ORDER BY upper(d) -- !query 32 schema -struct<CAST(udf(cast(a as string)) AS INT):int> +struct<a:int> -- !query 
32 output 0 1 -369,33 +365,32 struct<CAST(udf(cast(a as string)) AS INT):int> -- !query 33 -SELECT udf(count(b)) FROM test_missing_target - GROUP BY udf((b + 1) / 2) ORDER BY udf((b + 1) / 2) desc +SELECT count(b) FROM test_missing_target + GROUP BY (b + 1) / 2 ORDER BY (b + 1) / 2 desc -- !query 33 schema -struct<CAST(udf(cast(count(b) as string)) AS BIGINT):bigint> +struct<count(b):bigint> -- !query 33 output 7 3 -- !query 34 -SELECT udf(count(udf(x.a))) FROM test_missing_target x, test_missing_target y - WHERE udf(x.a) = udf(y.a) - GROUP BY udf(b/2) ORDER BY udf(b/2) +SELECT count(x.a) FROM test_missing_target x, test_missing_target y + WHERE x.a = y.a + GROUP BY b/2 ORDER BY b/2 -- !query 34 schema struct<> -- !query 34 output org.apache.spark.sql.AnalysisException -Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 14 +Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 10 -- !query 35 -SELECT udf(x.b/2), udf(count(udf(x.b))) FROM test_missing_target x, -test_missing_target y - WHERE udf(x.a) = udf(y.a) - GROUP BY udf(x.b/2) ORDER BY udf(x.b/2) +SELECT x.b/2, count(x.b) FROM test_missing_target x, test_missing_target y + WHERE x.a = y.a + GROUP BY x.b/2 ORDER BY x.b/2 -- !query 35 schema -struct<CAST(udf(cast((b div 2) as string)) AS INT):int,CAST(udf(cast(count(cast(udf(cast(b as string)) as int)) as string)) AS BIGINT):bigint> +struct<(b div 2):int,count(b):bigint> -- !query 35 output 0 1 1 5 -403,14 +398,14 struct<CAST(udf(cast((b div 2) as string)) AS INT):int,CAST(udf(cast(count(cast( -- !query 36 -SELECT udf(count(udf(b))) FROM test_missing_target x, test_missing_target y - WHERE udf(x.a) = udf(y.a) - GROUP BY udf(x.b/2) +SELECT count(b) FROM test_missing_target x, test_missing_target y + WHERE x.a = y.a + GROUP BY x.b/2 -- !query 36 schema struct<> -- !query 36 output org.apache.spark.sql.AnalysisException -Reference 'b' is ambiguous, could be: x.b, y.b.; line 1 pos 21 +Reference 'b' is ambiguous, could be: x.b, y.b.; line 1 pos 13 -- 
!query 37 ``` </p> </details> ## How was this patch tested? Tested as Guided in SPARK-27921 Closes #25350 from Udbhav30/master. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-06 15:14:32 +09:00
HyukjinKwon	da3d4b6a35	[SPARK-28537][SQL][HOTFIX][FOLLOW-UP] Add supportColumnar in DebugExec ## What changes were proposed in this pull request? This PR adds `supportColumnar` to `DebugExec`. It seems there was a conflict between https://github.com/apache/spark/pull/25274 and https://github.com/apache/spark/pull/25264 Currently tests are broken in Jenkins: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108687/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108688/ https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108693/ ``` org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: ColumnarToRow +- InMemoryTableScan [id#356956L] +- InMemoryRelation [id#356956L], StorageLevel(disk, memory, deserialized, 1 replicas) +- (1) Range (0, 5, step=1, splits=2) Stacktrace sbt.ForkMain$ForkError: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: ColumnarToRow +- InMemoryTableScan [id#356956L] +- InMemoryRelation [id#356956L], StorageLevel(disk, memory, deserialized, 1 replicas) +- (1) Range (0, 5, step=1, splits=2) at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56) at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:431) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:323) at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287) ``` ## How was this patch tested? Manually tested the failed test. Closes #25365 from HyukjinKwon/SPARK-28537. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-06 15:08:15 +09:00
Stavros Kontopoulos	4a2c662315	[SPARK-27921][PYTHON][SQL][TESTS][FOLLOW-UP] Add UDF cases into group by clause in 'udf-group-analytics.sql' ## What changes were proposed in this pull request? This PR is a followup of a fix as described in here: #25215 (comment) <details><summary>Diff comparing to 'group-analytics.sql'</summary> <p> ```diff diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out index 3439a05727..de297ab166 100644 --- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out +++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out -13,9 +13,9 struct<> -- !query 1 -SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH CUBE +SELECT udf(a + b), b, udf(SUM(a - b)) FROM testData GROUP BY udf(a + b), b WITH CUBE -- !query 1 schema -struct<(a + b):int,b:int,sum((a - b)):bigint> +struct<CAST(udf(cast((a + b) as string)) AS INT):int,b:int,CAST(udf(cast(sum(cast((a - b) as bigint)) as string)) AS BIGINT):bigint> -- !query 1 output 2 1 0 2 NULL 0 -33,9 +33,9 NULL NULL 3 -- !query 2 -SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH CUBE +SELECT udf(a), udf(b), SUM(b) FROM testData GROUP BY udf(a), b WITH CUBE -- !query 2 schema -struct<a:int,b:int,sum(b):bigint> +struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(b as string)) AS INT):int,sum(b):bigint> -- !query 2 output 1 1 1 1 2 2 -52,9 +52,9 NULL NULL 9 -- !query 3 -SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP +SELECT udf(a + b), b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP -- !query 3 schema -struct<(a + b):int,b:int,sum((a - b)):bigint> +struct<CAST(udf(cast((a + b) as string)) AS INT):int,b:int,sum((a - b)):bigint> -- !query 3 output 2 1 0 2 NULL 0 -70,9 +70,9 NULL NULL 3 -- !query 4 -SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH ROLLUP +SELECT udf(a), b, udf(SUM(b)) FROM 
testData GROUP BY udf(a), b WITH ROLLUP -- !query 4 schema -struct<a:int,b:int,sum(b):bigint> +struct<CAST(udf(cast(a as string)) AS INT):int,b:int,CAST(udf(cast(sum(cast(b as bigint)) as string)) AS BIGINT):bigint> -- !query 4 output 1 1 1 1 2 2 -97,7 +97,7 struct<> -- !query 6 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY course, year +SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY udf(course), year -- !query 6 schema struct<course:string,year:int,sum(earnings):bigint> -- !query 6 output -111,7 +111,7 dotNET 2013 48000 -- !query 7 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, year +SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, udf(year) -- !query 7 schema struct<course:string,year:int,sum(earnings):bigint> -- !query 7 output -127,9 +127,9 dotNET 2013 48000 -- !query 8 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year) +SELECT course, udf(year), SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year) -- !query 8 schema -struct<course:string,year:int,sum(earnings):bigint> +struct<course:string,CAST(udf(cast(year as string)) AS INT):int,sum(earnings):bigint> -- !query 8 output Java NULL 50000 NULL 2012 35000 -138,26 +138,26 dotNET NULL 63000 -- !query 9 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course) +SELECT course, year, udf(SUM(earnings)) FROM courseSales GROUP BY course, year GROUPING SETS(course) -- !query 9 schema -struct<course:string,year:int,sum(earnings):bigint> +struct<course:string,year:int,CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint> -- !query 9 output Java NULL 50000 dotNET NULL 63000 -- !query 10 -SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year) +SELECT udf(course), 
year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year) -- !query 10 schema -struct<course:string,year:int,sum(earnings):bigint> +struct<CAST(udf(cast(course as string)) AS STRING):string,year:int,sum(earnings):bigint> -- !query 10 output NULL 2012 35000 NULL 2013 78000 -- !query 11 -SELECT course, SUM(earnings) AS sum FROM courseSales -GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum +SELECT course, udf(SUM(earnings)) AS sum FROM courseSales +GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, udf(sum) -- !query 11 schema struct<course:string,sum:bigint> -- !query 11 output -173,7 +173,7 dotNET 63000 -- !query 12 SELECT course, SUM(earnings) AS sum, GROUPING_ID(course, earnings) FROM courseSales -GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum +GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY udf(course), sum -- !query 12 schema struct<course:string,sum:bigint,grouping_id(course, earnings):int> -- !query 12 output -188,10 +188,10 dotNET 63000 1 -- !query 13 -SELECT course, year, GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales +SELECT udf(course), udf(year), GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales GROUP BY CUBE(course, year) -- !query 13 schema -struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int> +struct<CAST(udf(cast(course as string)) AS STRING):string,CAST(udf(cast(year as string)) AS INT):int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int> -- !query 13 output Java 2012 0 0 0 Java 2013 0 0 0 -205,7 +205,7 dotNET NULL 0 1 1 -- !query 14 -SELECT course, year, GROUPING(course) FROM courseSales GROUP BY course, year +SELECT course, udf(year), GROUPING(course) FROM courseSales GROUP BY course, udf(year) -- !query 14 schema 
struct<> -- !query 14 output -214,7 +214,7 grouping() can only be used with GroupingSets/Cube/Rollup; -- !query 15 -SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY course, year +SELECT course, udf(year), GROUPING_ID(course, year) FROM courseSales GROUP BY udf(course), year -- !query 15 schema struct<> -- !query 15 output -223,7 +223,7 grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 16 -SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year +SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, udf(year) -- !query 16 schema struct<course:string,year:int,grouping__id:int> -- !query 16 output -240,7 +240,7 NULL NULL 3 -- !query 17 SELECT course, year FROM courseSales GROUP BY CUBE(course, year) -HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, year +HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, udf(year) -- !query 17 schema struct<course:string,year:int> -- !query 17 output -250,7 +250,7 dotNET NULL -- !query 18 -SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING(course) > 0 +SELECT course, udf(year) FROM courseSales GROUP BY udf(course), year HAVING GROUPING(course) > 0 -- !query 18 schema struct<> -- !query 18 output -259,7 +259,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 19 -SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0 +SELECT course, udf(udf(year)) FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0 -- !query 19 schema struct<> -- !query 19 output -268,9 +268,9 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 20 -SELECT course, year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0 +SELECT udf(course), year FROM courseSales GROUP BY CUBE(course, year) HAVING 
grouping__id > 0 -- !query 20 schema -struct<course:string,year:int> +struct<CAST(udf(cast(course as string)) AS STRING):string,year:int> -- !query 20 output Java NULL NULL 2012 -281,7 +281,7 dotNET NULL -- !query 21 SELECT course, year, GROUPING(course), GROUPING(year) FROM courseSales GROUP BY CUBE(course, year) -ORDER BY GROUPING(course), GROUPING(year), course, year +ORDER BY GROUPING(course), GROUPING(year), course, udf(year) -- !query 21 schema struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint> -- !query 21 output -298,7 +298,7 NULL NULL 1 1 -- !query 22 SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY CUBE(course, year) -ORDER BY GROUPING(course), GROUPING(year), course, year +ORDER BY GROUPING(course), GROUPING(year), course, udf(year) -- !query 22 schema struct<course:string,year:int,grouping_id(course, year):int> -- !query 22 output -314,7 +314,7 NULL NULL 3 -- !query 23 -SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING(course) +SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING(course) -- !query 23 schema struct<> -- !query 23 output -323,7 +323,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 24 -SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING_ID(course) +SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING_ID(course) -- !query 24 schema struct<> -- !query 24 output -332,7 +332,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup; -- !query 25 -SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year +SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, udf(course), year -- !query 25 schema struct<course:string,year:int> -- !query 25 output -348,7 +348,7 NULL NULL -- !query 26 -SELECT a + b AS k1, b AS k2, SUM(a - b) FROM testData GROUP BY 
CUBE(k1, k2) +SELECT udf(a + b) AS k1, udf(b) AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2) -- !query 26 schema struct<k1:int,k2:int,sum((a - b)):bigint> -- !query 26 output -368,7 +368,7 NULL NULL 3 -- !query 27 -SELECT a + b AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b) +SELECT udf(udf(a + b)) AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b) -- !query 27 schema struct<k:int,b:int,sum((a - b)):bigint> -- !query 27 output -386,9 +386,9 NULL NULL 3 -- !query 28 -SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k) +SELECT udf(a + b), udf(udf(b)) AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k) -- !query 28 schema -struct<(a + b):int,k:int,sum((a - b)):bigint> +struct<CAST(udf(cast((a + b) as string)) AS INT):int,k:int,sum((a - b)):bigint> -- !query 28 output NULL 1 3 NULL 2 0 ``` </p> </details> ## How was this patch tested? Tested as instructed in SPARK-27921. Closes #25362 from skonto/group-analytics-followup. Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-06 15:00:28 +09:00
Jungtaek Lim (HeartSaVioR)	128ea37bda	[SPARK-28601][CORE][SQL] Use StandardCharsets.UTF_8 instead of "UTF-8" string representation, and get rid of UnsupportedEncodingException ## What changes were proposed in this pull request? This patch keeps usage consistent wherever the UTF-8 charset is needed by using `StandardCharsets.UTF_8` instead of the string "UTF-8". If a String is required, `StandardCharsets.UTF_8.name()` is used. This change also gets rid of `UnsupportedEncodingException`, since a `Charset` is passed instead of a `String` whenever possible. It also changes some private Catalyst helper methods to operate on encodings as `Charset` objects rather than strings. ## How was this patch tested? Existing unit tests. Closes #25335 from HeartSaVioR/SPARK-28601. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-05 20:45:54 -07:00
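The difference described in SPARK-28601 can be illustrated with a minimal Java sketch. The class and method names here (`Utf8Example`, `encodeByName`, `encodeByCharset`, `charsetName`) are hypothetical and do not come from the Spark code base; only the JDK calls are real.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Spark.
class Utf8Example {
    // Charset given by name: getBytes(String) declares a checked exception,
    // so the caller must handle a failure that cannot occur for "UTF-8".
    static byte[] encodeByName(String s) {
        try {
            return s.getBytes("UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Charset given as an object: no checked exception, no try/catch.
    static byte[] encodeByCharset(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // When an API only accepts a charset name, derive it from the constant.
    static String charsetName() {
        return StandardCharsets.UTF_8.name();
    }

    public static void main(String[] args) {
        String text = "h\u00e9llo"; // contains one non-ASCII character
        // Both approaches produce the same bytes.
        System.out.println(Arrays.equals(encodeByName(text), encodeByCharset(text)));
        System.out.println(charsetName());
    }
}
```

Both methods return identical bytes; the `Charset` overload simply removes the exception handling that the name-based overload forces on the caller.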
Wenchen Fan	03e3006312	[SPARK-28213][SQL][FOLLOWUP] code cleanup and bug fix for columnar execution framework ## What changes were proposed in this pull request? I did a post-hoc review of https://github.com/apache/spark/pull/25008 , and would like to propose some cleanups/fixes/improvements: 1. Do not track the scanTime metric in `ColumnarToRowExec`. This metric is specific to file scan, and doesn't make sense for a general batch-to-row operator. 2. Because of 1, we need to track scanTime when building RDDs in the file scan node. 3. Use `RDD#mapPartitionsInternal` instead of `flatMap` in several places, as `mapPartitionsInternal` is created for Spark SQL and we use it in almost all the SQL operators. 4. Add `limitNotReachedCond` in `ColumnarToRowExec`. This was in the `ColumnarBatchScan` before and is critical for performance. 5. Clarify the relationship between codegen stage and columnar stage. The whole-stage-codegen framework is completely row-based, so these 2 kinds of stages can NEVER overlap. When they are adjacent, it's either a `RowToColumnarExec` above `WholeStageExec`, or a `ColumnarToRowExec` above the `InputAdapter`. 6. Reuse the `ColumnarBatch` in `RowToColumnarExec`. We don't need to create a new one every time, just need to reset it. 7. Do not skip testing full scan node in `LogicalPlanTagInSparkPlanSuite`. 8. Add back the removed tests in `WholeStageCodegenSuite`. ## How was this patch tested? Existing tests. Closes #25264 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-06 10:11:18 +08:00
Wenchen Fan	6fb79af48c	[SPARK-28344][SQL] detect ambiguous self-join and fail the query ## What changes were proposed in this pull request? This is an alternative solution to https://github.com/apache/spark/pull/24442 . It fails the query if an ambiguous self-join is detected, instead of trying to disambiguate it. The problem is that it's hard to come up with a reasonable rule to disambiguate; the rule proposed by #24442 is mostly a heuristic. ### background of the self-join problem: This is a long-standing bug and I've seen many people complaining about it in JIRA/dev list. A typical example: ``` val df1 = … val df2 = df1.filter(...) df1.join(df2, df1("a") > df2("a")) // returns empty result ``` The root cause is that `Dataset.apply` is so powerful that users think it returns a column reference which can point to the column of the Dataset anywhere. This is not true in many cases. `Dataset.apply` returns an `AttributeReference`. Different Datasets may share the same `AttributeReference`. In the example above, `df2` adds a Filter operator above the logical plan of `df1`, and the Filter operator preserves the output `AttributeReference` of its child. This means `df1("a")` is exactly the same as `df2("a")`, and `df1("a") > df2("a")` always evaluates to false. ### The rule to detect ambiguous column reference caused by self join: We can reuse the infra in #24442 : 1. each Dataset has a globally unique id. 2. the `AttributeReference` returned by `Dataset.apply` carries the ID and column position (e.g. 3rd column of the Dataset) via metadata. 3. the logical plan of a `Dataset` carries the ID via `TreeNodeTag`. When a self-join happens, the analyzer asks the right side plan of the join to re-generate output attributes with new exprIds. Based on it, a simple rule to detect ambiguous self join is: 1. find all column references (i.e. `AttributeReference`s with Dataset ID and col position) in the root node of a query plan. 2. 
for each column reference, traverse the query plan tree, find a sub-plan that carries Dataset ID and the ID is the same as the one in the column reference. 3. get the corresponding output attribute of the sub-plan by the col position in the column reference. 4. if the corresponding output attribute has a different exprID than the column reference, then it means this sub-plan is on the right side of a self-join and has regenerated its output attributes. This is an ambiguous self join because the column reference points to a table being self-joined. ## How was this patch tested? existing tests and new test cases Closes #25107 from cloud-fan/new-self-join. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-06 10:06:36 +08:00
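The four-step detection rule from SPARK-28344 can be sketched as a toy model in plain Java. Everything here is an assumption made for illustration: `SelfJoinCheck`, `ColumnRef`, and `Plan` are hypothetical stand-ins rather than Spark classes, and the demo assumes the regenerated right side of the join keeps its Dataset ID tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: these classes are hypothetical stand-ins, not Spark's API.
class SelfJoinCheck {
    /** Stand-in for an AttributeReference tagged with Dataset ID and column position. */
    static final class ColumnRef {
        final long datasetId;
        final int colPos;
        final long exprId;
        ColumnRef(long datasetId, int colPos, long exprId) {
            this.datasetId = datasetId;
            this.colPos = colPos;
            this.exprId = exprId;
        }
    }

    /** Stand-in for a logical plan node; datasetId is null when the node has no tag. */
    static final class Plan {
        final Long datasetId;
        final long[] outputExprIds;
        final List<Plan> children;
        Plan(Long datasetId, long[] outputExprIds, Plan... children) {
            this.datasetId = datasetId;
            this.outputExprIds = outputExprIds;
            this.children = Arrays.asList(children);
        }
    }

    // Steps 2-4: look for a sub-plan with the reference's Dataset ID whose output
    // attribute at the reference's column position has a different exprId.
    static boolean isAmbiguous(Plan plan, ColumnRef ref) {
        boolean sameDataset =
            plan.datasetId != null && plan.datasetId.longValue() == ref.datasetId;
        if (sameDataset
                && ref.colPos < plan.outputExprIds.length
                && plan.outputExprIds[ref.colPos] != ref.exprId) {
            return true;
        }
        for (Plan child : plan.children) {
            if (isAmbiguous(child, ref)) {
                return true;
            }
        }
        return false;
    }

    // Step 1 is simplified: the column references are passed in explicitly.
    static List<ColumnRef> ambiguousRefs(Plan root, List<ColumnRef> refs) {
        List<ColumnRef> result = new ArrayList<>();
        for (ColumnRef ref : refs) {
            if (isAmbiguous(root, ref)) {
                result.add(ref);
            }
        }
        return result;
    }

    // Models df1.join(df2, df1("a") > df2("a")) where df2 = df1.filter(...).
    static int selfJoinDemo() {
        Plan left = new Plan(1L, new long[] {10});
        // Right side after the analyzer regenerated exprIds (10 became 20).
        Plan right = new Plan(2L, new long[] {20}, new Plan(1L, new long[] {20}));
        Plan join = new Plan(null, new long[] {10, 20}, left, right);
        // Both references still carry the original exprId 10.
        List<ColumnRef> refs =
            Arrays.asList(new ColumnRef(1, 0, 10), new ColumnRef(2, 0, 10));
        return ambiguousRefs(join, refs).size();
    }

    public static void main(String[] args) {
        System.out.println("ambiguous column references: " + selfJoinDemo());
    }
}
```

In this model a single flagged reference would be enough to fail the query, which mirrors the commit's choice to reject ambiguous self-joins rather than guess which side a column refers to.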

5848 commits