## What changes were proposed in this pull request?
The expression `IntegralDivide`, which corresponds to the `div` operator, support only integral type. Postgres, though, allows it to work also with decimals.
The PR adds the support to decimal operands for this operation in order to have feature parity with postgres.
## How was this patch tested?
added UTs
Closes#25136 from mgaido91/SPARK-28322.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR changes subquery reuse in Adaptive Query Execution from compile-time static reuse to execution-time dynamic reuse. This PR adds a `ReuseAdaptiveSubquery` rule that applies to a query stage after it is created and before it is executed. The new dynamic reuse enables subqueries to be reused across all different subquery levels.
### Why are the changes needed?
This is an improvement to the current subquery reuse in Adaptive Query Execution, which allows subquery reuse to happen in a lazy fashion as well as at different subquery levels.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Passed existing tests.
Closes#25471 from maryannxue/aqe-dynamic-sub-reuse.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This is a pure refactor PR, which creates a new class `CatalogManager` to track the registered v2 catalogs, and provide the catalog up functionality.
`CatalogManager` also tracks the current catalog/namespace. We will implement corresponding commands in other PRs, like `USE CATALOG my_catalog`
## How was this patch tested?
existing tests
Closes#25368 from cloud-fan/refactor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
# What changes were proposed in this pull request?
This patch modifies the explanation of guarantee for ForeachWriter as it doesn't guarantee same output for `(partitionId, epochId)`. Refer the description of [SPARK-28650](https://issues.apache.org/jira/browse/SPARK-28650) for more details.
Spark itself still guarantees same output for same epochId (batch) if the preconditions are met, 1) source is always providing the same input records for same offset request. 2) the query is idempotent in overall (indeterministic calculation like now(), random() can break this).
Assuming breaking preconditions as an exceptional case (the preconditions are implicitly required even before), we still can describe the guarantee with `epochId`, though it will be harder to leverage the guarantee: 1) ForeachWriter should implement a feature to track whether all the partitions are written successfully for given `epochId` 2) There's pretty less chance to leverage the fact, as the chance for Spark to successfully write all partitions and fail to checkpoint the batch is small.
Credit to zsxwing on discovering the broken guarantee.
## How was this patch tested?
This is just a documentation change, both on javadoc and guide doc.
Closes#25407 from HeartSaVioR/SPARK-28650.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
## What changes were proposed in this pull request?
Create Hive Partitioned Table without specifying data type for partition column will success unexpectedly.
```HiveQL
// create a hive table partition by b, but the data type of b isn't specified.
CREATE TABLE tbl(a int) PARTITIONED BY (b) STORED AS parquet
```
In https://issues.apache.org/jira/browse/SPARK-26435 , PARTITIONED BY clause are extended to support Hive CTAS as following:
```ANTLR
// Before
(PARTITIONED BY '(' partitionColumns=colTypeList ')'
// After
(PARTITIONED BY '(' partitionColumns=colTypeList ')'|
PARTITIONED BY partitionColumnNames=identifierList) |
```
Create Table Statement like above case will pass the syntax check, and recognized as (PARTITIONED BY partitionColumnNames=identifierList) 。
This PR will check this case in visitCreateHiveTable and throw a exception which contains explicit error message to user.
## How was this patch tested?
Added tests.
Closes#25390 from lidinghao/hive-ddl-fix.
Authored-by: lihao <lihaowhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Some newer JDKs use the tzdata2018i database, which changes how certain (obscure) historical dates and timezones are handled. As previously, we can pretty much safely ignore these in tests, as the value may vary by JDK.
### Why are the changes needed?
Test otherwise fails using, for example, JDK 1.8.0_222. https://bugs.openjdk.java.net/browse/JDK-8215982 has a full list of JDKs which has this.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests
Closes#25504 from srowen/SPARK-28775.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up of #24761 which added a higher-order function `ArrayForAll`.
The PR mistakenly removed the `prettyName` from `ArrayExists` and forgot to add it to `ArrayForAll`.
### Why are the changes needed?
This reverts the `prettyName` back to `ArrayExists` not to affect explained plans, and adds it to `ArrayForAll` to clarify the `prettyName` as the same as the expressions around.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25501 from ueshin/issues/SPARK-27905/pretty_names.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR adds some tests converted from ```pgSQL/join.sql``` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'join.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out
index f75fe05196..ad2b5dd0db 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/join.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-join.sql.out
-240,10 +240,10 struct<>
-- !query 27
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t)
FROM J1_TBL AS tx
-- !query 27 schema
-struct<xxx:string,i:int,j:int,t:string>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string>
-- !query 27 output
0 NULL zero
1 4 one
-259,10 +259,10 struct<xxx:string,i:int,j:int,t:string>
-- !query 28
-SELECT '' AS `xxx`, *
+SELECT udf(udf('')) AS `xxx`, udf(udf(i)), udf(j), udf(t)
FROM J1_TBL tx
-- !query 28 schema
-struct<xxx:string,i:int,j:int,t:string>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string>
-- !query 28 output
0 NULL zero
1 4 one
-278,10 +278,10 struct<xxx:string,i:int,j:int,t:string>
-- !query 29
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, a, udf(udf(b)), c
FROM J1_TBL AS t1 (a, b, c)
-- !query 29 schema
-struct<xxx:string,a:int,b:int,c:string>
+struct<xxx:string,a:int,CAST(udf(cast(cast(udf(cast(b as string)) as int) as string)) AS INT):int,c:string>
-- !query 29 output
0 NULL zero
1 4 one
-297,10 +297,10 struct<xxx:string,a:int,b:int,c:string>
-- !query 30
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(a), udf(b), udf(udf(c))
FROM J1_TBL t1 (a, b, c)
-- !query 30 schema
-struct<xxx:string,a:int,b:int,c:string>
+struct<xxx:string,CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(c as string)) as string) as string)) AS STRING):string>
-- !query 30 output
0 NULL zero
1 4 one
-316,10 +316,10 struct<xxx:string,a:int,b:int,c:string>
-- !query 31
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(a), b, udf(c), udf(d), e
FROM J1_TBL t1 (a, b, c), J2_TBL t2 (d, e)
-- !query 31 schema
-struct<xxx:string,a:int,b:int,c:string,d:int,e:int>
+struct<xxx:string,CAST(udf(cast(a as string)) AS INT):int,b:int,CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(d as string)) AS INT):int,e:int>
-- !query 31 output
0 NULL zero 0 NULL
0 NULL zero 1 -1
-423,7 +423,7 struct<xxx:string,a:int,b:int,c:string,d:int,e:int>
-- !query 32
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, *
FROM J1_TBL CROSS JOIN J2_TBL
-- !query 32 schema
struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
-530,20 +530,20 struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
-- !query 33
-SELECT '' AS `xxx`, i, k, t
+SELECT udf('') AS `xxx`, udf(i) AS i, udf(k), udf(t) AS t
FROM J1_TBL CROSS JOIN J2_TBL
-- !query 33 schema
struct<>
-- !query 33 output
org.apache.spark.sql.AnalysisException
-Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; line 1 pos 20
+Reference 'i' is ambiguous, could be: default.j1_tbl.i, default.j2_tbl.i.; line 1 pos 29
-- !query 34
-SELECT '' AS `xxx`, t1.i, k, t
+SELECT udf('') AS `xxx`, udf(t1.i) AS i, udf(k), udf(t)
FROM J1_TBL t1 CROSS JOIN J2_TBL t2
-- !query 34 schema
-struct<xxx:string,i:int,k:int,t:string>
+struct<xxx:string,i:int,CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string>
-- !query 34 output
0 -1 zero
0 -3 zero
-647,11 +647,11 struct<xxx:string,i:int,k:int,t:string>
-- !query 35
-SELECT '' AS `xxx`, ii, tt, kk
+SELECT udf(udf('')) AS `xxx`, udf(udf(ii)) AS ii, udf(udf(tt)) AS tt, udf(udf(kk))
FROM (J1_TBL CROSS JOIN J2_TBL)
AS tx (ii, jj, tt, ii2, kk)
-- !query 35 schema
-struct<xxx:string,ii:int,tt:string,kk:int>
+struct<xxx:string,ii:int,tt:string,CAST(udf(cast(cast(udf(cast(kk as string)) as int) as string)) AS INT):int>
-- !query 35 output
0 zero -1
0 zero -3
-755,10 +755,10 struct<xxx:string,ii:int,tt:string,kk:int>
-- !query 36
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(udf(j1_tbl.i)), udf(j), udf(t), udf(a.i), udf(a.k), udf(b.i), udf(b.k)
FROM J1_TBL CROSS JOIN J2_TBL a CROSS JOIN J2_TBL b
-- !query 36 schema
-struct<xxx:string,i:int,j:int,t:string,i:int,k:int,i:int,k:int>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int>
-- !query 36 output
0 NULL zero 0 NULL 0 NULL
0 NULL zero 0 NULL 1 -1
-1654,10 +1654,10 struct<xxx:string,i:int,j:int,t:string,i:int,k:int,i:int,k:int>
-- !query 37
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i) AS i, udf(j), udf(t) AS t, udf(k)
FROM J1_TBL INNER JOIN J2_TBL USING (i)
-- !query 37 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,i:int,CAST(udf(cast(j as string)) AS INT):int,t:string,CAST(udf(cast(k as string)) AS INT):int>
-- !query 37 output
0 NULL zero NULL
1 4 one -1
-1669,10 +1669,10 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 38
-SELECT '' AS `xxx`, *
+SELECT udf(udf('')) AS `xxx`, udf(i), udf(j) AS j, udf(t), udf(k) AS k
FROM J1_TBL JOIN J2_TBL USING (i)
-- !query 38 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,j:int,CAST(udf(cast(t as string)) AS STRING):string,k:int>
-- !query 38 output
0 NULL zero NULL
1 4 one -1
-1684,9 +1684,9 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 39
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, *
FROM J1_TBL t1 (a, b, c) JOIN J2_TBL t2 (a, d) USING (a)
- ORDER BY a, d
+ ORDER BY udf(udf(a)), udf(d)
-- !query 39 schema
struct<xxx:string,a:int,b:int,c:string,d:int>
-- !query 39 output
-1700,10 +1700,10 struct<xxx:string,a:int,b:int,c:string,d:int>
-- !query 40
-SELECT '' AS `xxx`, *
+SELECT udf(udf('')) AS `xxx`, udf(i), udf(j), udf(t), udf(k)
FROM J1_TBL NATURAL JOIN J2_TBL
-- !query 40 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
-- !query 40 output
0 NULL zero NULL
1 4 one -1
-1715,10 +1715,10 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 41
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(udf(udf(a))) AS a, udf(b), udf(c), udf(d)
FROM J1_TBL t1 (a, b, c) NATURAL JOIN J2_TBL t2 (a, d)
-- !query 41 schema
-struct<xxx:string,a:int,b:int,c:string,d:int>
+struct<xxx:string,a:int,CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(d as string)) AS INT):int>
-- !query 41 output
0 NULL zero NULL
1 4 one -1
-1730,10 +1730,10 struct<xxx:string,a:int,b:int,c:string,d:int>
-- !query 42
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(udf(a)), udf(udf(b)), udf(udf(c)) AS c, udf(udf(udf(d))) AS d
FROM J1_TBL t1 (a, b, c) NATURAL JOIN J2_TBL t2 (d, a)
-- !query 42 schema
-struct<xxx:string,a:int,b:int,c:string,d:int>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(b as string)) as int) as string)) AS INT):int,c:string,d:int>
-- !query 42 output
0 NULL zero NULL
2 3 two 2
-1741,10 +1741,10 struct<xxx:string,a:int,b:int,c:string,d:int>
-- !query 43
-SELECT '' AS `xxx`, *
- FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = J2_TBL.i)
+SELECT udf('') AS `xxx`, udf(J1_TBL.i), udf(udf(J1_TBL.j)), udf(J1_TBL.t), udf(J2_TBL.i), udf(J2_TBL.k)
+ FROM J1_TBL JOIN J2_TBL ON (udf(J1_TBL.i) = J2_TBL.i)
-- !query 43 schema
-struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int>
-- !query 43 output
0 NULL zero 0 NULL
1 4 one 1 -1
-1756,10 +1756,10 struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
-- !query 44
-SELECT '' AS `xxx`, *
- FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = J2_TBL.k)
+SELECT udf('') AS `xxx`, udf(udf(J1_TBL.i)), udf(udf(J1_TBL.j)), udf(udf(J1_TBL.t)), J2_TBL.i, J2_TBL.k
+ FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i = udf(J2_TBL.k))
-- !query 44 schema
-struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(t as string)) as string) as string)) AS STRING):string,i:int,k:int>
-- !query 44 output
0 NULL zero NULL 0
2 3 two 2 2
-1767,10 +1767,10 struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
-- !query 45
-SELECT '' AS `xxx`, *
- FROM J1_TBL JOIN J2_TBL ON (J1_TBL.i <= J2_TBL.k)
+SELECT udf('') AS `xxx`, udf(J1_TBL.i), udf(J1_TBL.j), udf(J1_TBL.t), udf(J2_TBL.i), udf(J2_TBL.k)
+ FROM J1_TBL JOIN J2_TBL ON (udf(J1_TBL.i) <= udf(udf(J2_TBL.k)))
-- !query 45 schema
-struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(k as string)) AS INT):int>
-- !query 45 output
0 NULL zero 2 2
0 NULL zero 2 4
-1784,11 +1784,11 struct<xxx:string,i:int,j:int,t:string,i:int,k:int>
-- !query 46
-SELECT '' AS `xxx`, *
+SELECT udf(udf('')) AS `xxx`, udf(i), udf(j), udf(t), udf(k)
FROM J1_TBL LEFT OUTER JOIN J2_TBL USING (i)
- ORDER BY i, k, t
+ ORDER BY udf(udf(i)), udf(k), udf(t)
-- !query 46 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
-- !query 46 output
NULL NULL null NULL
NULL 0 zero NULL
-1806,11 +1806,11 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 47
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k)
FROM J1_TBL LEFT JOIN J2_TBL USING (i)
- ORDER BY i, k, t
+ ORDER BY udf(i), udf(udf(k)), udf(t)
-- !query 47 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
-- !query 47 output
NULL NULL null NULL
NULL 0 zero NULL
-1828,10 +1828,10 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 48
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(udf(i)), udf(j), udf(t), udf(k)
FROM J1_TBL RIGHT OUTER JOIN J2_TBL USING (i)
-- !query 48 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(cast(udf(cast(i as string)) as int) as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
-- !query 48 output
0 NULL zero NULL
1 4 one -1
-1845,10 +1845,10 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 49
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(udf(j)), udf(t), udf(k)
FROM J1_TBL RIGHT JOIN J2_TBL USING (i)
-- !query 49 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(j as string)) as int) as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
-- !query 49 output
0 NULL zero NULL
1 4 one -1
-1862,11 +1862,11 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 50
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(udf(t)), udf(k)
FROM J1_TBL FULL OUTER JOIN J2_TBL USING (i)
- ORDER BY i, k, t
+ ORDER BY udf(udf(i)), udf(k), udf(t)
-- !query 50 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(t as string)) as string) as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
-- !query 50 output
NULL NULL NULL NULL
NULL NULL null NULL
-1886,11 +1886,11 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 51
-SELECT '' AS `xxx`, *
+SELECT udf('') AS `xxx`, udf(i), udf(j), t, udf(udf(k))
FROM J1_TBL FULL JOIN J2_TBL USING (i)
- ORDER BY i, k, t
+ ORDER BY udf(udf(i)), udf(k), udf(udf(t))
-- !query 51 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,t:string,CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int>
-- !query 51 output
NULL NULL NULL NULL
NULL NULL null NULL
-1910,19 +1910,19 struct<xxx:string,i:int,j:int,t:string,k:int>
-- !query 52
-SELECT '' AS `xxx`, *
- FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (k = 1)
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(udf(k))
+ FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (udf(k) = 1)
-- !query 52 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int>
-- !query 52 output
-- !query 53
-SELECT '' AS `xxx`, *
- FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (i = 1)
+SELECT udf('') AS `xxx`, udf(i), udf(j), udf(t), udf(k)
+ FROM J1_TBL LEFT JOIN J2_TBL USING (i) WHERE (udf(udf(i)) = udf(1))
-- !query 53 schema
-struct<xxx:string,i:int,j:int,t:string,k:int>
+struct<xxx:string,CAST(udf(cast(i as string)) AS INT):int,CAST(udf(cast(j as string)) AS INT):int,CAST(udf(cast(t as string)) AS STRING):string,CAST(udf(cast(k as string)) AS INT):int>
-- !query 53 output
1 4 one -1
-2020,9 +2020,9 ee NULL 42 NULL
-- !query 65
SELECT * FROM
-(SELECT * FROM t2) as s2
+(SELECT udf(name) as name, t2.n FROM t2) as s2
INNER JOIN
-(SELECT * FROM t3) s3
+(SELECT udf(udf(name)) as name, t3.n FROM t3) s3
USING (name)
-- !query 65 schema
struct<name:string,n:int,n:int>
-2033,9 +2033,9 cc 22 23
-- !query 66
SELECT * FROM
-(SELECT * FROM t2) as s2
+(SELECT udf(udf(name)) as name, t2.n FROM t2) as s2
LEFT JOIN
-(SELECT * FROM t3) s3
+(SELECT udf(name) as name, t3.n FROM t3) s3
USING (name)
-- !query 66 schema
struct<name:string,n:int,n:int>
-2046,13 +2046,13 ee 42 NULL
-- !query 67
-SELECT * FROM
+SELECT udf(name), udf(udf(s2.n)), udf(s3.n) FROM
(SELECT * FROM t2) as s2
FULL JOIN
(SELECT * FROM t3) s3
USING (name)
-- !query 67 schema
-struct<name:string,n:int,n:int>
+struct<CAST(udf(cast(name as string)) AS STRING):string,CAST(udf(cast(cast(udf(cast(n as string)) as int) as string)) AS INT):int,CAST(udf(cast(n as string)) AS INT):int>
-- !query 67 output
bb 12 13
cc 22 23
-2062,9 +2062,9 ee 42 NULL
-- !query 68
SELECT * FROM
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(udf(name)) as name, udf(n) as s2_n, udf(2) as s2_2 FROM t2) as s2
NATURAL INNER JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(name) as name, udf(udf(n)) as s3_n, udf(3) as s3_2 FROM t3) s3
-- !query 68 schema
struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
-- !query 68 output
-2074,9 +2074,9 cc 22 2 23 3
-- !query 69
SELECT * FROM
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(name) as name, udf(udf(n)) as s2_n, 2 as s2_2 FROM t2) as s2
NATURAL LEFT JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(udf(name)) as name, udf(n) as s3_n, 3 as s3_2 FROM t3) s3
-- !query 69 schema
struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
-- !query 69 output
-2087,9 +2087,9 ee 42 2 NULL NULL
-- !query 70
SELECT * FROM
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(name) as name, udf(n) as s2_n, 2 as s2_2 FROM t2) as s2
NATURAL FULL JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(udf(name)) as name, udf(udf(n)) as s3_n, 3 as s3_2 FROM t3) s3
-- !query 70 schema
struct<name:string,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
-- !query 70 output
-2101,11 +2101,11 ee 42 2 NULL NULL
-- !query 71
SELECT * FROM
-(SELECT name, n as s1_n, 1 as s1_1 FROM t1) as s1
+(SELECT udf(udf(name)) as name, udf(n) as s1_n, 1 as s1_1 FROM t1) as s1
NATURAL INNER JOIN
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(name) as name, udf(n) as s2_n, 2 as s2_2 FROM t2) as s2
NATURAL INNER JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(udf(udf(name))) as name, udf(n) as s3_n, 3 as s3_2 FROM t3) s3
-- !query 71 schema
struct<name:string,s1_n:int,s1_1:int,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
-- !query 71 output
-2114,11 +2114,11 bb 11 1 12 2 13 3
-- !query 72
SELECT * FROM
-(SELECT name, n as s1_n, 1 as s1_1 FROM t1) as s1
+(SELECT udf(name) as name, udf(n) as s1_n, udf(udf(1)) as s1_1 FROM t1) as s1
NATURAL FULL JOIN
-(SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+(SELECT udf(name) as name, udf(udf(n)) as s2_n, udf(2) as s2_2 FROM t2) as s2
NATURAL FULL JOIN
-(SELECT name, n as s3_n, 3 as s3_2 FROM t3) s3
+(SELECT udf(udf(name)) as name, udf(n) as s3_n, udf(3) as s3_2 FROM t3) s3
-- !query 72 schema
struct<name:string,s1_n:int,s1_1:int,s2_n:int,s2_2:int,s3_n:int,s3_2:int>
-- !query 72 output
-2129,16 +2129,16 ee NULL NULL 42 2 NULL NULL
-- !query 73
-SELECT * FROM
-(SELECT name, n as s1_n FROM t1) as s1
+SELECT name, udf(udf(s1_n)), udf(s2_n), udf(s3_n) FROM
+(SELECT name, udf(udf(n)) as s1_n FROM t1) as s1
NATURAL FULL JOIN
(SELECT * FROM
- (SELECT name, n as s2_n FROM t2) as s2
+ (SELECT name, udf(n) as s2_n FROM t2) as s2
NATURAL FULL JOIN
- (SELECT name, n as s3_n FROM t3) as s3
+ (SELECT name, udf(udf(n)) as s3_n FROM t3) as s3
) ss2
-- !query 73 schema
-struct<name:string,s1_n:int,s2_n:int,s3_n:int>
+struct<name:string,CAST(udf(cast(cast(udf(cast(s1_n as string)) as int) as string)) AS INT):int,CAST(udf(cast(s2_n as string)) AS INT):int,CAST(udf(cast(s3_n as string)) AS INT):int>
-- !query 73 output
bb 11 12 13
cc NULL 22 23
-2151,9 +2151,9 SELECT * FROM
(SELECT name, n as s1_n FROM t1) as s1
NATURAL FULL JOIN
(SELECT * FROM
- (SELECT name, n as s2_n, 2 as s2_2 FROM t2) as s2
+ (SELECT name, udf(udf(n)) as s2_n, 2 as s2_2 FROM t2) as s2
NATURAL FULL JOIN
- (SELECT name, n as s3_n FROM t3) as s3
+ (SELECT name, udf(n) as s3_n FROM t3) as s3
) ss2
-- !query 74 schema
struct<name:string,s1_n:int,s2_n:int,s2_2:int,s3_n:int>
-2165,13 +2165,13 ee NULL 42 2 NULL
-- !query 75
-SELECT * FROM
- (SELECT name, n as s1_n FROM t1) as s1
+SELECT s1.name, udf(s1_n), s2.name, udf(udf(s2_n)) FROM
+ (SELECT name, udf(n) as s1_n FROM t1) as s1
FULL JOIN
(SELECT name, 2 as s2_n FROM t2) as s2
-ON (s1_n = s2_n)
+ON (udf(udf(s1_n)) = udf(s2_n))
-- !query 75 schema
-struct<name:string,s1_n:int,name:string,s2_n:int>
+struct<name:string,CAST(udf(cast(s1_n as string)) AS INT):int,name:string,CAST(udf(cast(cast(udf(cast(s2_n as string)) as int) as string)) AS INT):int>
-- !query 75 output
NULL NULL bb 2
NULL NULL cc 2
-2200,9 +2200,9 struct<>
-- !query 78
-select * from x
+select udf(udf(x1)), udf(x2) from x
-- !query 78 schema
-struct<x1:int,x2:int>
+struct<CAST(udf(cast(cast(udf(cast(x1 as string)) as int) as string)) AS INT):int,CAST(udf(cast(x2 as string)) AS INT):int>
-- !query 78 output
1 11
2 22
-2212,9 +2212,9 struct<x1:int,x2:int>
-- !query 79
-select * from y
+select udf(y1), udf(udf(y2)) from y
-- !query 79 schema
-struct<y1:int,y2:int>
+struct<CAST(udf(cast(y1 as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(y2 as string)) as int) as string)) AS INT):int>
-- !query 79 output
1 111
2 222
-2223,7 +2223,7 struct<y1:int,y2:int>
-- !query 80
-select * from x left join y on (x1 = y1 and x2 is not null)
+select * from x left join y on (udf(x1) = udf(udf(y1)) and udf(x2) is not null)
-- !query 80 schema
struct<x1:int,x2:int,y1:int,y2:int>
-- !query 80 output
-2235,7 +2235,7 struct<x1:int,x2:int,y1:int,y2:int>
-- !query 81
-select * from x left join y on (x1 = y1 and y2 is not null)
+select * from x left join y on (udf(udf(x1)) = udf(y1) and udf(y2) is not null)
-- !query 81 schema
struct<x1:int,x2:int,y1:int,y2:int>
-- !query 81 output
-2247,8 +2247,8 struct<x1:int,x2:int,y1:int,y2:int>
-- !query 82
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1)
+select * from (x left join y on (udf(x1) = udf(udf(y1)))) left join x xx(xx1,xx2)
+on (udf(udf(x1)) = udf(xx1))
-- !query 82 schema
struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 82 output
-2260,8 +2260,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 83
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1 and x2 is not null)
+select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2)
+on (udf(x1) = xx1 and udf(x2) is not null)
-- !query 83 schema
struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 83 output
-2273,8 +2273,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 84
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1 and y2 is not null)
+select * from (x left join y on (x1 = udf(y1))) left join x xx(xx1,xx2)
+on (udf(x1) = udf(udf(xx1)) and udf(y2) is not null)
-- !query 84 schema
struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 84 output
-2286,8 +2286,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 85
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1 and xx2 is not null)
+select * from (x left join y on (udf(x1) = y1)) left join x xx(xx1,xx2)
+on (udf(udf(x1)) = udf(xx1) and udf(udf(xx2)) is not null)
-- !query 85 schema
struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 85 output
-2299,8 +2299,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 86
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1) where (x2 is not null)
+select * from (x left join y on (udf(udf(x1)) = udf(udf(y1)))) left join x xx(xx1,xx2)
+on (udf(x1) = udf(xx1)) where (udf(x2) is not null)
-- !query 86 schema
struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 86 output
-2310,8 +2310,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 87
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1) where (y2 is not null)
+select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2)
+on (udf(x1) = xx1) where (udf(y2) is not null)
-- !query 87 schema
struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 87 output
-2321,8 +2321,8 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 88
-select * from (x left join y on (x1 = y1)) left join x xx(xx1,xx2)
-on (x1 = xx1) where (xx2 is not null)
+select * from (x left join y on (udf(x1) = udf(y1))) left join x xx(xx1,xx2)
+on (x1 = udf(xx1)) where (xx2 is not null)
-- !query 88 schema
struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 88 output
-2332,75 +2332,75 struct<x1:int,x2:int,y1:int,y2:int,xx1:int,xx2:int>
-- !query 89
-select count(*) from tenk1 a where unique1 in
- (select unique1 from tenk1 b join tenk1 c using (unique1)
- where b.unique2 = 42)
+select udf(udf(count(*))) from tenk1 a where udf(udf(unique1)) in
+ (select udf(unique1) from tenk1 b join tenk1 c using (unique1)
+ where udf(udf(b.unique2)) = udf(42))
-- !query 89 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(cast(udf(cast(count(1) as string)) as bigint) as string)) AS BIGINT):bigint>
-- !query 89 output
1
-- !query 90
-select count(*) from tenk1 x where
- x.unique1 in (select a.f1 from int4_tbl a,float8_tbl b where a.f1=b.f1) and
- x.unique1 = 0 and
- x.unique1 in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=bb.f1)
+select udf(count(*)) from tenk1 x where
+ udf(x.unique1) in (select udf(a.f1) from int4_tbl a,float8_tbl b where udf(udf(a.f1))=b.f1) and
+ udf(x.unique1) = 0 and
+ udf(x.unique1) in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=udf(udf(bb.f1)))
-- !query 90 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 90 output
1
-- !query 91
-select count(*) from tenk1 x where
- x.unique1 in (select a.f1 from int4_tbl a,float8_tbl b where a.f1=b.f1) and
- x.unique1 = 0 and
- x.unique1 in (select aa.f1 from int4_tbl aa,float8_tbl bb where aa.f1=bb.f1)
+select udf(udf(count(*))) from tenk1 x where
+ udf(x.unique1) in (select udf(a.f1) from int4_tbl a,float8_tbl b where udf(udf(a.f1))=b.f1) and
+ udf(x.unique1) = 0 and
+ udf(udf(x.unique1)) in (select udf(aa.f1) from int4_tbl aa,float8_tbl bb where udf(aa.f1)=udf(udf(bb.f1)))
-- !query 91 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(cast(udf(cast(count(1) as string)) as bigint) as string)) AS BIGINT):bigint>
-- !query 91 output
1
-- !query 92
select * from int8_tbl i1 left join (int8_tbl i2 join
- (select 123 as x) ss on i2.q1 = x) on i1.q2 = i2.q2
-order by 1, 2
+ (select udf(123) as x) ss on udf(udf(i2.q1)) = udf(x)) on udf(udf(i1.q2)) = udf(udf(i2.q2))
+order by udf(udf(1)), 2
-- !query 92 schema
struct<q1:bigint,q2:bigint,q1:bigint,q2:bigint,x:int>
-- !query 92 output
-123 456 123 456 123
-123 4567890123456789 123 4567890123456789 123
4567890123456789 -4567890123456789 NULL NULL NULL
4567890123456789 123 NULL NULL NULL
+123 456 123 456 123
+123 4567890123456789 123 4567890123456789 123
4567890123456789 4567890123456789 123 4567890123456789 123
-- !query 93
-select count(*)
+select udf(count(*))
from
- (select t3.tenthous as x1, coalesce(t1.stringu1, t2.stringu1) as x2
+ (select udf(t3.tenthous) as x1, udf(coalesce(udf(t1.stringu1), udf(t2.stringu1))) as x2
from tenk1 t1
- left join tenk1 t2 on t1.unique1 = t2.unique1
- join tenk1 t3 on t1.unique2 = t3.unique2) ss,
+ left join tenk1 t2 on udf(t1.unique1) = udf(t2.unique1)
+ join tenk1 t3 on t1.unique2 = udf(t3.unique2)) ss,
tenk1 t4,
tenk1 t5
-where t4.thousand = t5.unique1 and ss.x1 = t4.tenthous and ss.x2 = t5.stringu1
+where udf(t4.thousand) = udf(t5.unique1) and udf(udf(ss.x1)) = t4.tenthous and udf(ss.x2) = udf(udf(t5.stringu1))
-- !query 93 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 93 output
1000
-- !query 94
-select a.f1, b.f1, t.thousand, t.tenthous from
+select udf(a.f1), udf(b.f1), udf(t.thousand), udf(t.tenthous) from
tenk1 t,
- (select sum(f1)+1 as f1 from int4_tbl i4a) a,
- (select sum(f1) as f1 from int4_tbl i4b) b
-where b.f1 = t.thousand and a.f1 = b.f1 and (a.f1+b.f1+999) = t.tenthous
+ (select udf(udf(sum(udf(f1))+1)) as f1 from int4_tbl i4a) a,
+ (select udf(sum(udf(f1))) as f1 from int4_tbl i4b) b
+where b.f1 = udf(t.thousand) and udf(a.f1) = udf(b.f1) and udf((udf(a.f1)+udf(b.f1)+999)) = udf(udf(t.tenthous))
-- !query 94 schema
-struct<f1:bigint,f1:bigint,thousand:int,tenthous:int>
+struct<CAST(udf(cast(f1 as string)) AS BIGINT):bigint,CAST(udf(cast(f1 as string)) AS BIGINT):bigint,CAST(udf(cast(thousand as string)) AS INT):int,CAST(udf(cast(tenthous as string)) AS INT):int>
-- !query 94 output
-2408,8 +2408,8 struct<f1:bigint,f1:bigint,thousand:int,tenthous:int>
-- !query 95
select * from
j1_tbl full join
- (select * from j2_tbl order by j2_tbl.i desc, j2_tbl.k asc) j2_tbl
- on j1_tbl.i = j2_tbl.i and j1_tbl.i = j2_tbl.k
+ (select * from j2_tbl order by udf(udf(j2_tbl.i)) desc, udf(j2_tbl.k) asc) j2_tbl
+ on udf(j1_tbl.i) = udf(j2_tbl.i) and udf(j1_tbl.i) = udf(j2_tbl.k)
-- !query 95 schema
struct<i:int,j:int,t:string,i:int,k:int>
-- !query 95 output
-2435,13 +2435,13 NULL NULL null NULL NULL
-- !query 96
-select count(*) from
- (select * from tenk1 x order by x.thousand, x.twothousand, x.fivethous) x
+select udf(count(*)) from
+ (select * from tenk1 x order by udf(x.thousand), udf(udf(x.twothousand)), x.fivethous) x
left join
- (select * from tenk1 y order by y.unique2) y
- on x.thousand = y.unique2 and x.twothousand = y.hundred and x.fivethous = y.unique2
+ (select * from tenk1 y order by udf(y.unique2)) y
+ on udf(x.thousand) = y.unique2 and x.twothousand = udf(y.hundred) and x.fivethous = y.unique2
-- !query 96 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 96 output
10000
-2507,7 +2507,7 struct<>
-- !query 104
-select tt1.*, tt2.* from tt1 left join tt2 on tt1.joincol = tt2.joincol
+select tt1.*, tt2.* from tt1 left join tt2 on udf(udf(tt1.joincol)) = udf(tt2.joincol)
-- !query 104 schema
struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int>
-- !query 104 output
-2517,7 +2517,7 struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int>
-- !query 105
-select tt1.*, tt2.* from tt2 right join tt1 on tt1.joincol = tt2.joincol
+select tt1.*, tt2.* from tt2 right join tt1 on udf(udf(tt1.joincol)) = udf(udf(tt2.joincol))
-- !query 105 schema
struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int>
-- !query 105 output
-2527,10 +2527,10 struct<tt1_id:int,joincol:int,tt2_id:int,joincol:int>
-- !query 106
-select count(*) from tenk1 a, tenk1 b
- where a.hundred = b.thousand and (b.fivethous % 10) < 10
+select udf(count(*)) from tenk1 a, tenk1 b
+ where udf(a.hundred) = b.thousand and udf(udf((b.fivethous % 10)) < 10)
-- !query 106 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 106 output
100000
-2584,14 +2584,14 struct<>
-- !query 113
-SELECT a.f1
+SELECT udf(udf(a.f1)) as f1
FROM tt4 a
LEFT JOIN (
SELECT b.f1
- FROM tt3 b LEFT JOIN tt3 c ON (b.f1 = c.f1)
- WHERE c.f1 IS NULL
-) AS d ON (a.f1 = d.f1)
-WHERE d.f1 IS NULL
+ FROM tt3 b LEFT JOIN tt3 c ON udf(b.f1) = udf(c.f1)
+ WHERE udf(c.f1) IS NULL
+) AS d ON udf(a.f1) = d.f1
+WHERE udf(udf(d.f1)) IS NULL
-- !query 113 schema
struct<f1:int>
-- !query 113 output
-2621,7 +2621,7 struct<>
-- !query 116
-select * from tt5,tt6 where tt5.f1 = tt6.f1 and tt5.f1 = tt5.f2 - tt6.f2
+select * from tt5,tt6 where udf(tt5.f1) = udf(tt6.f1) and udf(tt5.f1) = udf(udf(tt5.f2) - udf(tt6.f2))
-- !query 116 schema
struct<f1:int,f2:int,f1:int,f2:int>
-- !query 116 output
-2649,12 +2649,12 struct<>
-- !query 119
-select yy.pkyy as yy_pkyy, yy.pkxx as yy_pkxx, yya.pkyy as yya_pkyy,
- xxa.pkxx as xxa_pkxx, xxb.pkxx as xxb_pkxx
+select udf(udf(yy.pkyy)) as yy_pkyy, udf(yy.pkxx) as yy_pkxx, udf(yya.pkyy) as yya_pkyy,
+ udf(xxa.pkxx) as xxa_pkxx, udf(xxb.pkxx) as xxb_pkxx
from yy
- left join (SELECT * FROM yy where pkyy = 101) as yya ON yy.pkyy = yya.pkyy
- left join xx xxa on yya.pkxx = xxa.pkxx
- left join xx xxb on coalesce (xxa.pkxx, 1) = xxb.pkxx
+ left join (SELECT * FROM yy where pkyy = 101) as yya ON udf(yy.pkyy) = udf(yya.pkyy)
+ left join xx xxa on udf(yya.pkxx) = udf(udf(xxa.pkxx))
+ left join xx xxb on udf(udf(coalesce (xxa.pkxx, 1))) = udf(xxb.pkxx)
-- !query 119 schema
struct<yy_pkyy:int,yy_pkxx:int,yya_pkyy:int,xxa_pkxx:int,xxb_pkxx:int>
-- !query 119 output
-2693,9 +2693,9 struct<>
-- !query 123
select * from
- zt2 left join zt3 on (f2 = f3)
- left join zt1 on (f3 = f1)
-where f2 = 53
+ zt2 left join zt3 on (udf(f2) = udf(udf(f3)))
+ left join zt1 on (udf(udf(f3)) = udf(f1))
+where udf(f2) = 53
-- !query 123 schema
struct<f2:int,f3:int,f1:int>
-- !query 123 output
-2712,9 +2712,9 struct<>
-- !query 125
select * from
- zt2 left join zt3 on (f2 = f3)
- left join zv1 on (f3 = f1)
-where f2 = 53
+ zt2 left join zt3 on (f2 = udf(f3))
+ left join zv1 on (udf(f3) = f1)
+where udf(udf(f2)) = 53
-- !query 125 schema
struct<f2:int,f3:int,f1:int,junk:string>
-- !query 125 output
-2722,12 +2722,12 struct<f2:int,f3:int,f1:int,junk:string>
-- !query 126
-select a.unique2, a.ten, b.tenthous, b.unique2, b.hundred
-from tenk1 a left join tenk1 b on a.unique2 = b.tenthous
-where a.unique1 = 42 and
- ((b.unique2 is null and a.ten = 2) or b.hundred = 3)
+select udf(a.unique2), udf(a.ten), udf(b.tenthous), udf(b.unique2), udf(b.hundred)
+from tenk1 a left join tenk1 b on a.unique2 = udf(b.tenthous)
+where udf(a.unique1) = 42 and
+ ((udf(b.unique2) is null and udf(a.ten) = 2) or udf(udf(b.hundred)) = udf(udf(3)))
-- !query 126 schema
-struct<unique2:int,ten:int,tenthous:int,unique2:int,hundred:int>
+struct<CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(ten as string)) AS INT):int,CAST(udf(cast(tenthous as string)) AS INT):int,CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(hundred as string)) AS INT):int>
-- !query 126 output
-2749,7 +2749,7 struct<>
-- !query 129
-select * from a left join b on i = x and i = y and x = i
+select * from a left join b on udf(i) = x and i = udf(y) and udf(x) = udf(i)
-- !query 129 schema
struct<i:int,x:int,y:int>
-- !query 129 output
-2757,11 +2757,11 struct<i:int,x:int,y:int>
-- !query 130
-select t1.q2, count(t2.*)
-from int8_tbl t1 left join int8_tbl t2 on (t1.q2 = t2.q1)
-group by t1.q2 order by 1
+select udf(t1.q2), udf(count(t2.*))
+from int8_tbl t1 left join int8_tbl t2 on (udf(udf(t1.q2)) = t2.q1)
+group by udf(t1.q2) order by 1
-- !query 130 schema
-struct<q2:bigint,count(q1, q2):bigint>
+struct<CAST(udf(cast(q2 as string)) AS BIGINT):bigint,CAST(udf(cast(count(q1, q2) as string)) AS BIGINT):bigint>
-- !query 130 output
-4567890123456789 0
123 2
-2770,11 +2770,11 struct<q2:bigint,count(q1, q2):bigint>
-- !query 131
-select t1.q2, count(t2.*)
-from int8_tbl t1 left join (select * from int8_tbl) t2 on (t1.q2 = t2.q1)
-group by t1.q2 order by 1
+select udf(udf(t1.q2)), udf(count(t2.*))
+from int8_tbl t1 left join (select * from int8_tbl) t2 on (udf(udf(t1.q2)) = udf(t2.q1))
+group by udf(udf(t1.q2)) order by 1
-- !query 131 schema
-struct<q2:bigint,count(q1, q2):bigint>
+struct<CAST(udf(cast(cast(udf(cast(q2 as string)) as bigint) as string)) AS BIGINT):bigint,CAST(udf(cast(count(q1, q2) as string)) AS BIGINT):bigint>
-- !query 131 output
-4567890123456789 0
123 2
-2783,13 +2783,13 struct<q2:bigint,count(q1, q2):bigint>
-- !query 132
-select t1.q2, count(t2.*)
+select udf(t1.q2) as q2, udf(udf(count(t2.*)))
from int8_tbl t1 left join
- (select q1, case when q2=1 then 1 else q2 end as q2 from int8_tbl) t2
- on (t1.q2 = t2.q1)
+ (select udf(q1) as q1, case when q2=1 then 1 else q2 end as q2 from int8_tbl) t2
+ on (udf(t1.q2) = udf(t2.q1))
group by t1.q2 order by 1
-- !query 132 schema
-struct<q2:bigint,count(q1, q2):bigint>
+struct<q2:bigint,CAST(udf(cast(cast(udf(cast(count(q1, q2) as string)) as bigint) as string)) AS BIGINT):bigint>
-- !query 132 output
-4567890123456789 0
123 2
-2828,17 +2828,17 struct<>
-- !query 136
-select c.name, ss.code, ss.b_cnt, ss.const
+select udf(c.name), udf(ss.code), udf(ss.b_cnt), udf(ss.const)
from c left join
(select a.code, coalesce(b_grp.cnt, 0) as b_cnt, -1 as const
from a left join
- (select count(1) as cnt, b.a from b group by b.a) as b_grp
- on a.code = b_grp.a
+ (select udf(count(1)) as cnt, b.a as a from b group by b.a) as b_grp
+ on udf(a.code) = udf(udf(b_grp.a))
) as ss
- on (c.a = ss.code)
+ on (udf(udf(c.a)) = udf(ss.code))
order by c.name
-- !query 136 schema
-struct<name:string,code:string,b_cnt:bigint,const:int>
+struct<CAST(udf(cast(name as string)) AS STRING):string,CAST(udf(cast(code as string)) AS STRING):string,CAST(udf(cast(b_cnt as string)) AS BIGINT):bigint,CAST(udf(cast(const as string)) AS INT):int>
-- !query 136 output
A p 2 -1
B q 0 -1
-2852,15 +2852,15 LEFT JOIN
( SELECT sub3.key3, sub4.value2, COALESCE(sub4.value2, 66) as value3 FROM
( SELECT 1 as key3 ) sub3
LEFT JOIN
- ( SELECT sub5.key5, COALESCE(sub6.value1, 1) as value2 FROM
+ ( SELECT udf(sub5.key5) as key5, udf(udf(COALESCE(sub6.value1, 1))) as value2 FROM
( SELECT 1 as key5 ) sub5
LEFT JOIN
( SELECT 2 as key6, 42 as value1 ) sub6
- ON sub5.key5 = sub6.key6
+ ON sub5.key5 = udf(sub6.key6)
) sub4
- ON sub4.key5 = sub3.key3
+ ON udf(sub4.key5) = sub3.key3
) sub2
-ON sub1.key1 = sub2.key3
+ON udf(udf(sub1.key1)) = udf(udf(sub2.key3))
-- !query 137 schema
struct<key1:int,key3:int,value2:int,value3:int>
-- !query 137 output
-2871,34 +2871,34 struct<key1:int,key3:int,value2:int,value3:int>
SELECT * FROM
( SELECT 1 as key1 ) sub1
LEFT JOIN
-( SELECT sub3.key3, value2, COALESCE(value2, 66) as value3 FROM
+( SELECT udf(sub3.key3) as key3, udf(value2), udf(COALESCE(value2, 66)) as value3 FROM
( SELECT 1 as key3 ) sub3
LEFT JOIN
( SELECT sub5.key5, COALESCE(sub6.value1, 1) as value2 FROM
( SELECT 1 as key5 ) sub5
LEFT JOIN
( SELECT 2 as key6, 42 as value1 ) sub6
- ON sub5.key5 = sub6.key6
+ ON udf(udf(sub5.key5)) = sub6.key6
) sub4
ON sub4.key5 = sub3.key3
) sub2
-ON sub1.key1 = sub2.key3
+ON sub1.key1 = udf(udf(sub2.key3))
-- !query 138 schema
-struct<key1:int,key3:int,value2:int,value3:int>
+struct<key1:int,key3:int,CAST(udf(cast(value2 as string)) AS INT):int,value3:int>
-- !query 138 output
1 1 1 1
-- !query 139
-SELECT qq, unique1
+SELECT udf(qq), udf(udf(unique1))
FROM
- ( SELECT COALESCE(q1, 0) AS qq FROM int8_tbl a ) AS ss1
+ ( SELECT udf(COALESCE(q1, 0)) AS qq FROM int8_tbl a ) AS ss1
FULL OUTER JOIN
- ( SELECT COALESCE(q2, -1) AS qq FROM int8_tbl b ) AS ss2
+ ( SELECT udf(udf(COALESCE(q2, -1))) AS qq FROM int8_tbl b ) AS ss2
USING (qq)
- INNER JOIN tenk1 c ON qq = unique2
+ INNER JOIN tenk1 c ON udf(qq) = udf(unique2)
-- !query 139 schema
-struct<qq:bigint,unique1:int>
+struct<CAST(udf(cast(qq as string)) AS BIGINT):bigint,CAST(udf(cast(cast(udf(cast(unique1 as string)) as int) as string)) AS INT):int>
-- !query 139 output
123 4596
123 4596
-2936,19 +2936,19 struct<>
-- !query 143
-select nt3.id
+select udf(nt3.id)
from nt3 as nt3
left join
- (select nt2.*, (nt2.b1 and ss1.a3) AS b3
+ (select nt2.*, (udf(nt2.b1) and udf(ss1.a3)) AS b3
from nt2 as nt2
left join
- (select nt1.*, (nt1.id is not null) as a3 from nt1) as ss1
- on ss1.id = nt2.nt1_id
+ (select nt1.*, (udf(nt1.id) is not null) as a3 from nt1) as ss1
+ on ss1.id = udf(udf(nt2.nt1_id))
) as ss2
- on ss2.id = nt3.nt2_id
-where nt3.id = 1 and ss2.b3
+ on udf(ss2.id) = nt3.nt2_id
+where udf(nt3.id) = 1 and udf(ss2.b3)
-- !query 143 schema
-struct<id:int>
+struct<CAST(udf(cast(id as string)) AS INT):int>
-- !query 143 output
1
-3003,73 +3003,73 NULL 2147483647
-- !query 146
-select count(*) from
- tenk1 a join tenk1 b on a.unique1 = b.unique2
- left join tenk1 c on a.unique2 = b.unique1 and c.thousand = a.thousand
- join int4_tbl on b.thousand = f1
+select udf(count(*)) from
+ tenk1 a join tenk1 b on udf(a.unique1) = udf(b.unique2)
+ left join tenk1 c on udf(a.unique2) = udf(b.unique1) and udf(c.thousand) = udf(udf(a.thousand))
+ join int4_tbl on udf(b.thousand) = f1
-- !query 146 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 146 output
10
-- !query 147
-select b.unique1 from
- tenk1 a join tenk1 b on a.unique1 = b.unique2
- left join tenk1 c on b.unique1 = 42 and c.thousand = a.thousand
- join int4_tbl i1 on b.thousand = f1
- right join int4_tbl i2 on i2.f1 = b.tenthous
- order by 1
+select udf(b.unique1) from
+ tenk1 a join tenk1 b on udf(a.unique1) = udf(b.unique2)
+ left join tenk1 c on udf(b.unique1) = 42 and c.thousand = udf(a.thousand)
+ join int4_tbl i1 on udf(b.thousand) = udf(udf(f1))
+ right join int4_tbl i2 on udf(udf(i2.f1)) = udf(b.tenthous)
+ order by udf(1)
-- !query 147 schema
-struct<unique1:int>
+struct<CAST(udf(cast(unique1 as string)) AS INT):int>
-- !query 147 output
NULL
NULL
+0
NULL
NULL
-0
-- !query 148
select * from
(
- select unique1, q1, coalesce(unique1, -1) + q1 as fault
- from int8_tbl left join tenk1 on (q2 = unique2)
+ select udf(unique1), udf(q1), udf(udf(coalesce(unique1, -1)) + udf(q1)) as fault
+ from int8_tbl left join tenk1 on (udf(q2) = udf(unique2))
) ss
-where fault = 122
-order by fault
+where udf(fault) = udf(122)
+order by udf(fault)
-- !query 148 schema
-struct<unique1:int,q1:bigint,fault:bigint>
+struct<CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(q1 as string)) AS BIGINT):bigint,fault:bigint>
-- !query 148 output
NULL 123 122
-- !query 149
-select q1, unique2, thousand, hundred
- from int8_tbl a left join tenk1 b on q1 = unique2
- where coalesce(thousand,123) = q1 and q1 = coalesce(hundred,123)
+select udf(q1), udf(unique2), udf(thousand), udf(hundred)
+ from int8_tbl a left join tenk1 b on udf(q1) = udf(unique2)
+ where udf(coalesce(thousand,123)) = udf(q1) and udf(q1) = udf(udf(coalesce(hundred,123)))
-- !query 149 schema
-struct<q1:bigint,unique2:int,thousand:int,hundred:int>
+struct<CAST(udf(cast(q1 as string)) AS BIGINT):bigint,CAST(udf(cast(unique2 as string)) AS INT):int,CAST(udf(cast(thousand as string)) AS INT):int,CAST(udf(cast(hundred as string)) AS INT):int>
-- !query 149 output
-- !query 150
-select f1, unique2, case when unique2 is null then f1 else 0 end
- from int4_tbl a left join tenk1 b on f1 = unique2
- where (case when unique2 is null then f1 else 0 end) = 0
+select udf(f1), udf(unique2), case when udf(udf(unique2)) is null then udf(f1) else 0 end
+ from int4_tbl a left join tenk1 b on udf(f1) = udf(udf(unique2))
+ where (case when udf(unique2) is null then udf(f1) else 0 end) = 0
-- !query 150 schema
-struct<f1:int,unique2:int,CASE WHEN (unique2 IS NULL) THEN f1 ELSE 0 END:int>
+struct<CAST(udf(cast(f1 as string)) AS INT):int,CAST(udf(cast(unique2 as string)) AS INT):int,CASE WHEN (CAST(udf(cast(cast(udf(cast(unique2 as string)) as int) as string)) AS INT) IS NULL) THEN CAST(udf(cast(f1 as string)) AS INT) ELSE 0 END:int>
-- !query 150 output
0 0 0
-- !query 151
-select a.unique1, b.unique1, c.unique1, coalesce(b.twothousand, a.twothousand)
- from tenk1 a left join tenk1 b on b.thousand = a.unique1 left join tenk1 c on c.unique2 = coalesce(b.twothousand, a.twothousand)
- where a.unique2 < 10 and coalesce(b.twothousand, a.twothousand) = 44
+select udf(a.unique1), udf(b.unique1), udf(c.unique1), udf(coalesce(b.twothousand, a.twothousand))
+ from tenk1 a left join tenk1 b on udf(b.thousand) = a.unique1 left join tenk1 c on udf(c.unique2) = udf(coalesce(b.twothousand, a.twothousand))
+ where a.unique2 < udf(10) and udf(udf(coalesce(b.twothousand, a.twothousand))) = udf(44)
-- !query 151 schema
-struct<unique1:int,unique1:int,unique1:int,coalesce(twothousand, twothousand):int>
+struct<CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(unique1 as string)) AS INT):int,CAST(udf(cast(coalesce(twothousand, twothousand) as string)) AS INT):int>
-- !query 151 output
-3078,11 +3078,11 struct<unique1:int,unique1:int,unique1:int,coalesce(twothousand, twothousand):in
select * from
text_tbl t1
inner join int8_tbl i8
- on i8.q2 = 456
+ on udf(i8.q2) = udf(udf(456))
right join text_tbl t2
- on t1.f1 = 'doh!'
+ on udf(t1.f1) = udf(udf('doh!'))
left join int4_tbl i4
- on i8.q1 = i4.f1
+ on udf(udf(i8.q1)) = i4.f1
-- !query 152 schema
struct<f1:string,q1:bigint,q2:bigint,f1:string,f1:int>
-- !query 152 output
-3092,10 +3092,10 doh! 123 456 hi de ho neighbor NULL
-- !query 153
select * from
- (select 1 as id) as xx
+ (select udf(udf(1)) as id) as xx
left join
- (tenk1 as a1 full join (select 1 as id) as yy on (a1.unique1 = yy.id))
- on (xx.id = coalesce(yy.id))
+ (tenk1 as a1 full join (select udf(1) as id) as yy on (udf(a1.unique1) = udf(yy.id)))
+ on (xx.id = udf(udf(coalesce(yy.id))))
-- !query 153 schema
struct<id:int,unique1:int,unique2:int,two:int,four:int,ten:int,twenty:int,hundred:int,thousand:int,twothousand:int,fivethous:int,tenthous:int,odd:int,even:int,stringu1:string,stringu2:string,string4:string,id:int>
-- !query 153 output
-3103,11 +3103,11 struct<id:int,unique1:int,unique2:int,two:int,four:int,ten:int,twenty:int,hundre
-- !query 154
-select a.q2, b.q1
- from int8_tbl a left join int8_tbl b on a.q2 = coalesce(b.q1, 1)
- where coalesce(b.q1, 1) > 0
+select udf(a.q2), udf(b.q1)
+ from int8_tbl a left join int8_tbl b on udf(a.q2) = coalesce(b.q1, 1)
+ where udf(udf(coalesce(b.q1, 1)) > 0)
-- !query 154 schema
-struct<q2:bigint,q1:bigint>
+struct<CAST(udf(cast(q2 as string)) AS BIGINT):bigint,CAST(udf(cast(q1 as string)) AS BIGINT):bigint>
-- !query 154 output
-4567890123456789 NULL
123 123
-3142,7 +3142,7 struct<>
-- !query 157
-select p.* from parent p left join child c on (p.k = c.k)
+select p.* from parent p left join child c on (udf(p.k) = udf(c.k))
-- !query 157 schema
struct<k:int,pd:int>
-- !query 157 output
-3153,8 +3153,8 struct<k:int,pd:int>
-- !query 158
select p.*, linked from parent p
- left join (select c.*, true as linked from child c) as ss
- on (p.k = ss.k)
+ left join (select c.*, udf(udf(true)) as linked from child c) as ss
+ on (udf(p.k) = udf(udf(ss.k)))
-- !query 158 schema
struct<k:int,pd:int,linked:boolean>
-- !query 158 output
-3165,8 +3165,8 struct<k:int,pd:int,linked:boolean>
-- !query 159
select p.* from
- parent p left join child c on (p.k = c.k)
- where p.k = 1 and p.k = 2
+ parent p left join child c on (udf(p.k) = c.k)
+ where p.k = udf(1) and udf(udf(p.k)) = udf(udf(2))
-- !query 159 schema
struct<k:int,pd:int>
-- !query 159 output
-3175,8 +3175,8 struct<k:int,pd:int>
-- !query 160
select p.* from
- (parent p left join child c on (p.k = c.k)) join parent x on p.k = x.k
- where p.k = 1 and p.k = 2
+ (parent p left join child c on (udf(p.k) = c.k)) join parent x on p.k = udf(x.k)
+ where udf(p.k) = udf(1) and udf(udf(p.k)) = udf(udf(2))
-- !query 160 schema
struct<k:int,pd:int>
-- !query 160 output
-3204,7 +3204,7 struct<>
-- !query 163
-SELECT * FROM b LEFT JOIN a ON (b.a_id = a.id) WHERE (a.id IS NULL OR a.id > 0)
+SELECT * FROM b LEFT JOIN a ON (udf(b.a_id) = udf(a.id)) WHERE (udf(udf(a.id)) IS NULL OR udf(a.id) > 0)
-- !query 163 schema
struct<id:int,a_id:int,id:int>
-- !query 163 output
-3212,7 +3212,7 struct<id:int,a_id:int,id:int>
-- !query 164
-SELECT b.* FROM b LEFT JOIN a ON (b.a_id = a.id) WHERE (a.id IS NULL OR a.id > 0)
+SELECT b.* FROM b LEFT JOIN a ON (udf(b.a_id) = udf(a.id)) WHERE (udf(a.id) IS NULL OR udf(udf(a.id)) > 0)
-- !query 164 schema
struct<id:int,a_id:int>
-- !query 164 output
-3231,13 +3231,13 struct<>
-- !query 166
SELECT * FROM
- (SELECT 1 AS x) ss1
+ (SELECT udf(1) AS x) ss1
LEFT JOIN
- (SELECT q1, q2, COALESCE(dat1, q1) AS y
- FROM int8_tbl LEFT JOIN innertab ON q2 = id) ss2
+ (SELECT udf(q1), udf(q2), udf(COALESCE(dat1, q1)) AS y
+ FROM int8_tbl LEFT JOIN innertab ON udf(udf(q2)) = id) ss2
ON true
-- !query 166 schema
-struct<x:int,q1:bigint,q2:bigint,y:bigint>
+struct<x:int,CAST(udf(cast(q1 as string)) AS BIGINT):bigint,CAST(udf(cast(q2 as string)) AS BIGINT):bigint,y:bigint>
-- !query 166 output
1 123 456 123
1 123 4567890123456789 123
-3248,27 +3248,27 struct<x:int,q1:bigint,q2:bigint,y:bigint>
-- !query 167
select * from
- int8_tbl x join (int4_tbl x cross join int4_tbl y) j on q1 = f1
+ int8_tbl x join (int4_tbl x cross join int4_tbl y) j on udf(q1) = udf(f1)
-- !query 167 schema
struct<>
-- !query 167 output
org.apache.spark.sql.AnalysisException
-Reference 'f1' is ambiguous, could be: j.f1, j.f1.; line 2 pos 63
+Reference 'f1' is ambiguous, could be: j.f1, j.f1.; line 2 pos 72
-- !query 168
select * from
- int8_tbl x join (int4_tbl x cross join int4_tbl y) j on q1 = y.f1
+ int8_tbl x join (int4_tbl x cross join int4_tbl y) j on udf(q1) = udf(y.f1)
-- !query 168 schema
struct<>
-- !query 168 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`y.f1`' given input columns: [j.f1, j.f1, x.q1, x.q2]; line 2 pos 63
+cannot resolve '`y.f1`' given input columns: [j.f1, j.f1, x.q1, x.q2]; line 2 pos 72
-- !query 169
select * from
- int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on q1 = f1
+ int8_tbl x join (int4_tbl x cross join int4_tbl y(ff)) j on udf(q1) = udf(udf(f1))
-- !query 169 schema
struct<q1:bigint,q2:bigint,f1:int,ff:int>
-- !query 169 output
-3276,69 +3276,69 struct<q1:bigint,q2:bigint,f1:int,ff:int>
-- !query 170
-select t1.uunique1 from
- tenk1 t1 join tenk2 t2 on t1.two = t2.two
+select udf(t1.uunique1) from
+ tenk1 t1 join tenk2 t2 on t1.two = udf(t2.two)
-- !query 170 schema
struct<>
-- !query 170 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`t1.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7
+cannot resolve '`t1.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 11
-- !query 171
-select t2.uunique1 from
- tenk1 t1 join tenk2 t2 on t1.two = t2.two
+select udf(udf(t2.uunique1)) from
+ tenk1 t1 join tenk2 t2 on udf(t1.two) = t2.two
-- !query 171 schema
struct<>
-- !query 171 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`t2.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7
+cannot resolve '`t2.uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 15
-- !query 172
-select uunique1 from
- tenk1 t1 join tenk2 t2 on t1.two = t2.two
+select udf(uunique1) from
+ tenk1 t1 join tenk2 t2 on udf(t1.two) = udf(t2.two)
-- !query 172 schema
struct<>
-- !query 172 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 7
+cannot resolve '`uunique1`' given input columns: [t1.even, t2.even, t1.fivethous, t2.fivethous, t1.four, t2.four, t1.hundred, t2.hundred, t1.odd, t2.odd, t1.string4, t2.string4, t1.stringu1, t2.stringu1, t1.stringu2, t2.stringu2, t1.ten, t2.ten, t1.tenthous, t2.tenthous, t1.thousand, t2.thousand, t1.twenty, t2.twenty, t1.two, t2.two, t1.twothousand, t2.twothousand, t1.unique1, t2.unique1, t1.unique2, t2.unique2]; line 1 pos 11
-- !query 173
-select f1,g from int4_tbl a, (select f1 as g) ss
+select udf(udf(f1,g)) from int4_tbl a, (select udf(udf(f1)) as g) ss
-- !query 173 schema
struct<>
-- !query 173 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`f1`' given input columns: []; line 1 pos 37
+cannot resolve '`f1`' given input columns: []; line 1 pos 55
-- !query 174
-select f1,g from int4_tbl a, (select a.f1 as g) ss
+select udf(f1,g) from int4_tbl a, (select a.f1 as g) ss
-- !query 174 schema
struct<>
-- !query 174 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`a.f1`' given input columns: []; line 1 pos 37
+cannot resolve '`a.f1`' given input columns: []; line 1 pos 42
-- !query 175
-select f1,g from int4_tbl a cross join (select f1 as g) ss
+select udf(udf(f1,g)) from int4_tbl a cross join (select udf(f1) as g) ss
-- !query 175 schema
struct<>
-- !query 175 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`f1`' given input columns: []; line 1 pos 47
+cannot resolve '`f1`' given input columns: []; line 1 pos 61
-- !query 176
-select f1,g from int4_tbl a cross join (select a.f1 as g) ss
+select udf(f1,g) from int4_tbl a cross join (select udf(udf(a.f1)) as g) ss
-- !query 176 schema
struct<>
-- !query 176 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`a.f1`' given input columns: []; line 1 pos 47
+cannot resolve '`a.f1`' given input columns: []; line 1 pos 60
-- !query 177
-3383,8 +3383,8 struct<>
-- !query 182
select * from j1
-inner join j2 on j1.id1 = j2.id1 and j1.id2 = j2.id2
-where j1.id1 % 1000 = 1 and j2.id1 % 1000 = 1
+inner join j2 on udf(j1.id1) = udf(j2.id1) and udf(udf(j1.id2)) = udf(j2.id2)
+where udf(j1.id1) % 1000 = 1 and udf(udf(j2.id1) % 1000) = 1
-- !query 182 schema
struct<id1:int,id2:int,id1:int,id2:int>
-- !query 182 output
```
</p>
</details>
## How was this patch tested?
Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
Closes#25371 from huaxingao/spark-28393.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
The Spark SQL test framework needs to support 2 kinds of tests:
1. tests inside Spark to test Spark itself (extends `SparkFunSuite`)
2. test outside of Spark to test Spark applications (introduced at b57ed2245c)
The class hierarchy of the major testing traits:
![image](https://user-images.githubusercontent.com/3182036/63088526-c0f0af80-bf87-11e9-9bed-c144c2486da9.png)
`PlanTestBase`, `SQLTestUtilsBase` and `SharedSparkSession` intentionally don't extend `SparkFunSuite`, so that they can be used for tests outside of Spark. Tests in Spark should extends `QueryTest` and/or `SharedSQLContext` in most cases.
However, the name is a little confusing. As a result, some test suites extend `SharedSparkSession` instead of `SharedSQLContext`. `SharedSparkSession` doesn't work well with `SparkFunSuite` as it doesn't have the special handling of thread auditing in `SharedSQLContext`. For example, you will see a warning starting with `===== POSSIBLE THREAD LEAK IN SUITE` when you run `DataFrameSelfJoinSuite`.
This PR proposes to rename `SharedSparkSession` to `SharedSparkSessionBase`, and rename `SharedSQLContext` to `SharedSparkSession`.
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
Please review https://spark.apache.org/contributing.html before opening a pull request.
Closes#25463 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR reverts some of the latest changes in `ReduceNumShufflePartitions` to fix the case when there are different pre-shuffle partition numbers in the plan. Please see the new UT for an example.
### Why are the changes needed?
Eliminate a bug.
### Does this PR introduce any user-facing change?
Yes, some queries that failed will succeed now.
### How was this patch tested?
Added new UT.
Closes#25479 from peter-toth/SPARK-28356-followup.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
in order to address cases where foreach writer task is failing without calling the close() method, (for example when a task is interrupted) added the option to implement an abort() method that will be called when the task is aborted. users should handle resource cleanup (such as connections) in the abort() method
## How was this patch tested?
update existing unit tests.
Closes#24382 from eyalzit/SPARK-27330-foreach-writer-abort.
Lead-authored-by: Eyal Zituny <eyal.zituny@equalum.io>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: eyalzit <eyal.zituny@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
only todo message updated. Need to add udf() for GroupBy Tests, after resolving following jira
[SPARK-28386] and [SPARK-26741]
## How was this patch tested?
NA, only TODO message updated.
Closes#25415 from shivusondur/jiraFollowup.
Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Add following functions:
```
def add_months(startDate: Column, numMonths: Column): Column
def date_add(start: Column, days: Column): Column
def date_sub(start: Column, days: Column): Column
```
## How was this patch tested?
UT.
Please review https://spark.apache.org/contributing.html before opening a pull request.
Closes#25334 from WeichenXu123/datefunc_impr.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR makes analysis error messages more meaningful when the function does not support the modifier DISTINCT:
```sql
postgres=# select upper(distinct a) from (values('a'), ('b')) v(a);
ERROR: DISTINCT specified, but upper is not an aggregate function
LINE 1: select upper(distinct a) from (values('a'), ('b')) v(a);
spark-sql> select upper(distinct a) from (values('a'), ('b')) v(a);
Error in query: upper does not support the modifier DISTINCT; line 1 pos 7
spark-sql>
```
After this pr:
```sql
spark-sql> select upper(distinct a) from (values('a'), ('b')) v(a);
Error in query: DISTINCT specified, but upper is not an aggregate function; line 1 pos 7
spark-sql>
```
## How was this patch tested?
Unit test
Closes#25486 from wangyum/DISTINCT.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
If both options `path` and `paths` are passed to file data source v2, both values of the options should be included as the target paths.
### Why are the changes needed?
In V1 implementation, file table location includes both values of option `path` and `paths`.
In the refactoring of https://github.com/apache/spark/pull/24025, the value of option `path` is ignored if "paths" are specified. We should make it consistent with V1.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test
Closes#25473 from gengliangwang/fixPathOption.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
**Before Fix**
When a non existent permanent function is dropped, generic NoSuchFunctionException was thrown.- which printed "This function is neither a registered temporary function nor a permanent function registered in the database" .
This creates a ambiguity when a temp function in the same name exist.
**After Fix**
NoSuchPermanentFunctionException will be thrown, which will print
"NoSuchPermanentFunctionException:Function not found in database "
## How was this patch tested?
Unit test was run and corrected the UT.
Closes#25394 from PavithraRamachandran/funcIssue.
Lead-authored-by: pavithra <pavi.rams@gmail.com>
Co-authored-by: pavithraramachandran <pavi.rams@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
Added new benchmark `ExtractBenchmark` for the `EXTRACT(field FROM source)` function. It was executed on all currently supported values of the `field` argument: `MILLENNIUM`, `CENTURY`, `DECADE`, `YEAR`, `ISOYEAR`, `QUARTER`, `MONTH`, `WEEK`, `DAY`, `DAYOFWEEK`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECONDS`, `MICROSECONDS`, `EPOCH`. The `cast(id as timestamp)` was taken as the `source` argument.
## How was this patch tested?
By running the benchmark via:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"
```
Closes#25462 from MaxGekk/extract-benchmark.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Hive using incorrect **InputFormat**(`org.apache.hadoop.mapred.SequenceFileInputFormat`) to read Spark's **Parquet** bucketed data source table.
Spark side:
```sql
spark-sql> CREATE TABLE t (c1 INT, c2 INT) USING parquet CLUSTERED BY (c1) SORTED BY (c1) INTO 2 BUCKETS;
2019-04-29 17:52:05 WARN HiveExternalCatalog:66 - Persisting bucketed data source table `default`.`t` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive.
spark-sql> DESC FORMATTED t;
c1 int NULL
c2 int NULL
# Detailed Table Information
Database default
Table t
Owner yumwang
Created Time Mon Apr 29 17:52:05 CST 2019
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.4.0
Type MANAGED
Provider parquet
Num Buckets 2
Bucket Columns [`c1`]
Sort Columns [`c1`]
Table Properties [transient_lastDdlTime=1556531525]
Location file:/user/hive/warehouse/t
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties [serialization.format=1]
```
Hive side:
```sql
hive> DESC FORMATTED t;
OK
# col_name data_type comment
c1 int
c2 int
# Detailed Table Information
Database: default
Owner: root
CreateTime: Wed May 08 03:38:46 GMT-07:00 2019
LastAccessTime: UNKNOWN
Retention: 0
Location: file:/user/hive/warehouse/t
Table Type: MANAGED_TABLE
Table Parameters:
bucketing_version spark
spark.sql.create.version 3.0.0-SNAPSHOT
spark.sql.sources.provider parquet
spark.sql.sources.schema.bucketCol.0 c1
spark.sql.sources.schema.numBucketCols 1
spark.sql.sources.schema.numBuckets 2
spark.sql.sources.schema.numParts 1
spark.sql.sources.schema.numSortCols 1
spark.sql.sources.schema.part.0 {\"type\":\"struct\",\"fields\":[{\"name\":\"c1\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}},{\"name\":\"c2\",\"type\":\"integer\",\"nullable\":true,\"metadata\":{}}]}
spark.sql.sources.schema.sortCol.0 c1
transient_lastDdlTime 1557311926
# Storage Information
SerDe Library: org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
Compressed: No
Num Buckets: -1
Bucket Columns: []
Sort Columns: []
Storage Desc Params:
path file:/user/hive/warehouse/t
serialization.format 1
```
So it's non-bucketed table at Hive side. This pr set the `SerDe` correctly so Hive can read these tables.
Related code:
33f3c48cac/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (L976-L990)f9776e3892/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala (L444-L459)
## How was this patch tested?
unit tests
Closes#24486 from wangyum/SPARK-27592.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
We add support for the V2SessionCatalog for saveAsTable, such that V2 tables can plug in and leverage existing DataFrameWriter.saveAsTable APIs to write and create tables through the session catalog.
## How was this patch tested?
Unit tests. A lot of tests broke under hive when things were not working properly under `ResolveTables`, therefore I believe the current set of tests should be sufficient in testing the table resolution and read code paths.
Closes#25402 from brkyvz/saveAsV2.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose new expressions `Epoch`, `IsoYear`, `Milliseconds` and `Microseconds`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):
1. `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision.
2. `isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
3. `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
4. `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.
Here are examples:
```sql
spark-sql> SELECT EXTRACT(EPOCH FROM TIMESTAMP '2019-08-11 19:07:30.123456');
1565550450.123456
spark-sql> SELECT EXTRACT(ISOYEAR FROM DATE '2006-01-01');
2005
spark-sql> SELECT EXTRACT(MILLISECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456');
30123.456
spark-sql> SELECT EXTRACT(MICROSECONDS FROM TIMESTAMP '2019-08-11 19:07:30.123456');
30123456
```
## How was this patch tested?
Added new tests to `DateExpressionsSuite`, and uncommented existing tests in `extract.sql` and `pgSQL/date.sql`.
Closes#25408 from MaxGekk/extract-ext3.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This pr adds DELETE support for V2 datasources. As a first step, this pr only support delete by source filters:
```scala
void delete(Filter[] filters);
```
which could not deal with complicated cases like subqueries.
Since it's uncomfortable to embed the implementation of DELETE in the current V2 APIs, a new mix-in of datasource is added, which is called `SupportsMaintenance`, similar to `SupportsRead` and `SupportsWrite`. A datasource which can be maintained means we can perform DELETE/UPDATE/MERGE/OPTIMIZE on the datasource, as long as the datasource implements the necessary mix-ins.
## How was this patch tested?
new test case.
Please review https://spark.apache.org/contributing.html before opening a pull request.
Closes#25115 from xianyinxin/SPARK-28351.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
CacheManager.cacheQuery saves the stats from the optimized plan to cache.
## How was this patch tested?
Existing testss.
Closes#24623 from jzhuge/SPARK-27739.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Here is the problem description from the JIRA.
```
When the inputs contain the constant 'infinity', Spark SQL does not generate the expected results.
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('1'), (CAST('infinity' AS DOUBLE))) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('infinity'), ('1')) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('infinity'), ('infinity')) v(x);
SELECT avg(CAST(x AS DOUBLE)), var_pop(CAST(x AS DOUBLE))
FROM (VALUES ('-infinity'), ('infinity')) v(x);
The root cause: Spark SQL does not recognize the special constants in a case insensitive way. In PostgreSQL, they are recognized in a case insensitive way.
Link: https://www.postgresql.org/docs/9.3/datatype-numeric.html
```
In this PR, the casting code is enhanced to handle these `special` string literals in case insensitive manner.
## How was this patch tested?
Added tests in CastSuite and modified existing test suites.
Closes#25331 from dilipbiswal/double_infinity.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Changed type of `sec` argument in the `make_timestamp()` function from `DOUBLE` to `DECIMAL(8, 6)`. The scale is set to 6 to cover microsecond fractions, and the precision is 2 digits for seconds + 6 digits for microsecond fraction. New type prevents losing precision in some cases, for example:
Before:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001);
2019-08-12 00:00:58
```
After:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 58.000001);
2019-08-12 00:00:58.000001
```
Also switching to `DECIMAL` fixes rounding `sec` towards "nearest neighbor" unless both neighbors are equidistant, in which case round up. For example:
Before:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567);
2019-08-12 00:00:00.123456
```
After:
```sql
spark-sql> select make_timestamp(2019, 8, 12, 0, 0, 0.1234567);
2019-08-12 00:00:00.123457
```
## How was this patch tested?
This was tested by `DateExpressionsSuite` and `pgSQL/timestamp.sql`.
Closes#25421 from MaxGekk/make_timestamp-decimal.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
In the PR, I propose additional synonyms for the `field` argument of `extract` supported by PostgreSQL. The `extract.sql` is updated to check all supported values of the `field` argument. The list of synonyms was taken from https://github.com/postgres/postgres/blob/master/src/backend/utils/adt/datetime.c .
## How was this patch tested?
By running `extract.sql` via:
```
$ build/sbt "sql/test-only *SQLQueryTestSuite -- -z extract.sql"
```
Closes#25438 from MaxGekk/extract-field-synonyms.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
A GROUPED_AGG pandas python udf can't work, if without group by clause, like `select udf(id) from table`.
This doesn't match with aggregate function like sum, count..., and also dataset API like `df.agg(udf(df['id']))`.
When we parse a udf (or an aggregate function) like that from SQL syntax, it is known as a function in a project. `GlobalAggregates` rule in analysis makes such project as aggregate, by looking for aggregate expressions. At the moment, we should also look for GROUPED_AGG pandas python udf.
## How was this patch tested?
Added tests.
Closes#25352 from viirya/SPARK-28422.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
#25242 proposed to disallow upcasting complex data types to string type, however, upcasting from null type to any types should still be safe.
## How was this patch tested?
Add corresponding case in `CastSuite`.
Closes#25425 from jiangxb1987/nullToString.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
It throws `Table or view not found` when showing temporary views:
```sql
spark-sql> CREATE TEMPORARY VIEW temp_view AS SELECT 1 AS a;
spark-sql> show create table temp_view;
Error in query: Table or view 'temp_view' not found in database 'default';
```
It's not easy to support temporary views. This pr changed it to throws `SHOW CREATE TABLE is not supported on a temporary view`:
```sql
spark-sql> CREATE TEMPORARY VIEW temp_view AS SELECT 1 AS a;
spark-sql> show create table temp_view;
Error in query: SHOW CREATE TABLE is not supported on a temporary view: temp_view;
```
## How was this patch tested?
unit tests
Closes#25149 from wangyum/SPARK-28383.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR skip more test when testing with `JAVA_9` or later:
1. Skip `HiveExternalCatalogVersionsSuite` when testing with `JAVA_9` or later because our previous version does not support `JAVA_9` or later.
2. Skip 3 tests in `HiveSparkSubmitSuite` because the `spark.sql.hive.metastore.version` of these tests is lower than `2.0`, however Datanucleus 3.x seem does not support `JAVA_9` or later. Hive upgrade Datanucleus to 4.x from Hive 2.0([HIVE-6113](https://issues.apache.org/jira/browse/HIVE-6113)):
```
[info] Cause: org.datanucleus.exceptions.NucleusException: The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is available.
[info] at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.getDatastoreMappingClass(RDBMSMappingManager.java:1215)
[info] at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.createDatastoreMapping(RDBMSMappingManager.java:1378)
[info] at org.datanucleus.store.rdbms.table.AbstractClassTable.addDatastoreId(AbstractClassTable.java:392)
[info] at org.datanucleus.store.rdbms.table.ClassTable.initializePK(ClassTable.java:1087)
[info] at org.datanucleus.store.rdbms.table.ClassTable.preInitialize(ClassTable.java:247)
```
Please note that this exclude only the tests related to the old metastore library, some other tests of `HiveSparkSubmitSuite` still fail on JDK9+.
## How was this patch tested?
manual tests:
Test with JDK 11:
```
[info] HiveExternalCatalogVersionsSuite:
[info] - backward compatibility !!! CANCELED !!! (37 milliseconds)
[info] HiveSparkSubmitSuite:
...
[info] - SPARK-8020: set sql conf in spark conf !!! CANCELED !!! (30 milliseconds)
[info] org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (HiveSparkSubmitSuite.scala:130)
...
[info] - SPARK-9757 Persist Parquet relation with decimal column !!! CANCELED !!! (1 millisecond)
[info] org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (HiveSparkSubmitSuite.scala:168)
...
[info] - SPARK-16901: set javax.jdo.option.ConnectionURL !!! CANCELED !!! (1 millisecond)
[info] org.apache.commons.lang3.SystemUtils.isJavaVersionAtLeast(JAVA_9) was true (HiveSparkSubmitSuite.scala:260)
...
```
Closes#25426 from wangyum/SPARK-28703.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR is a followup of a fix as described in here: https://github.com/apache/spark/pull/25215#issuecomment-517659981
<details><summary>Diff comparing to 'group-by.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
index 3a5df254f2..febe47b5ba 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
-13,26 +13,26 struct<>
-- !query 1
-SELECT a, COUNT(b) FROM testData
+SELECT udf(a), udf(COUNT(b)) FROM testData
-- !query 1 schema
struct<>
-- !query 1 output
org.apache.spark.sql.AnalysisException
-grouping expressions sequence is empty, and 'testdata.`a`' is not an aggregate function. Wrap '(count(testdata.`b`) AS `count(b)`)' in windowing function(s) or wrap 'testdata.`a`' in first() (or first_value) if you don't care which value you get.;
+grouping expressions sequence is empty, and 'testdata.`a`' is not an aggregate function. Wrap '(CAST(udf(cast(count(b) as string)) AS BIGINT) AS `CAST(udf(cast(count(b) as string)) AS BIGINT)`)' in windowing function(s) or wrap 'testdata.`a`' in first() (or first_value) if you don't care which value you get.;
-- !query 2
-SELECT COUNT(a), COUNT(b) FROM testData
+SELECT COUNT(udf(a)), udf(COUNT(b)) FROM testData
-- !query 2 schema
-struct<count(a):bigint,count(b):bigint>
+struct<count(CAST(udf(cast(a as string)) AS INT)):bigint,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 2 output
7 7
-- !query 3
-SELECT a, COUNT(b) FROM testData GROUP BY a
+SELECT udf(a), COUNT(udf(b)) FROM testData GROUP BY a
-- !query 3 schema
-struct<a:int,count(b):bigint>
+struct<CAST(udf(cast(a as string)) AS INT):int,count(CAST(udf(cast(b as string)) AS INT)):bigint>
-- !query 3 output
1 2
2 2
-41,7 +41,7 NULL 1
-- !query 4
-SELECT a, COUNT(b) FROM testData GROUP BY b
+SELECT udf(a), udf(COUNT(udf(b))) FROM testData GROUP BY b
-- !query 4 schema
struct<>
-- !query 4 output
-50,9 +50,9 expression 'testdata.`a`' is neither present in the group by, nor is it an aggre
-- !query 5
-SELECT COUNT(a), COUNT(b) FROM testData GROUP BY a
+SELECT COUNT(udf(a)), COUNT(udf(b)) FROM testData GROUP BY udf(a)
-- !query 5 schema
-struct<count(a):bigint,count(b):bigint>
+struct<count(CAST(udf(cast(a as string)) AS INT)):bigint,count(CAST(udf(cast(b as string)) AS INT)):bigint>
-- !query 5 output
0 1
2 2
-61,15 +61,15 struct<count(a):bigint,count(b):bigint>
-- !query 6
-SELECT 'foo', COUNT(a) FROM testData GROUP BY 1
+SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1
-- !query 6 schema
-struct<foo:string,count(a):bigint>
+struct<foo:string,count(CAST(udf(cast(a as string)) AS INT)):bigint>
-- !query 6 output
foo 7
-- !query 7
-SELECT 'foo' FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo' FROM testData WHERE a = 0 GROUP BY udf(1)
-- !query 7 schema
struct<foo:string>
-- !query 7 output
-77,25 +77,25 struct<foo:string>
-- !query 8
-SELECT 'foo', APPROX_COUNT_DISTINCT(a) FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1)
-- !query 8 schema
-struct<foo:string,approx_count_distinct(a):bigint>
+struct<foo:string,CAST(udf(cast(approx_count_distinct(cast(udf(cast(a as string)) as int), 0.05, 0, 0) as string)) AS BIGINT):bigint>
-- !query 8 output
-- !query 9
-SELECT 'foo', MAX(STRUCT(a)) FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY udf(1)
-- !query 9 schema
-struct<foo:string,max(named_struct(a, a)):struct<a:int>>
+struct<foo:string,max(named_struct(col1, CAST(udf(cast(a as string)) AS INT))):struct<col1:int>>
-- !query 9 output
-- !query 10
-SELECT a + b, COUNT(b) FROM testData GROUP BY a + b
+SELECT udf(a + b), udf(COUNT(b)) FROM testData GROUP BY a + b
-- !query 10 schema
-struct<(a + b):int,count(b):bigint>
+struct<CAST(udf(cast((a + b) as string)) AS INT):int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 10 output
2 1
3 2
-105,7 +105,7 NULL 1
-- !query 11
-SELECT a + 2, COUNT(b) FROM testData GROUP BY a + 1
+SELECT udf(a + 2), udf(COUNT(b)) FROM testData GROUP BY a + 1
-- !query 11 schema
struct<>
-- !query 11 output
-114,9 +114,9 expression 'testdata.`a`' is neither present in the group by, nor is it an aggre
-- !query 12
-SELECT a + 1 + 1, COUNT(b) FROM testData GROUP BY a + 1
+SELECT udf(a + 1) + 1, udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)
-- !query 12 schema
-struct<((a + 1) + 1):int,count(b):bigint>
+struct<(CAST(udf(cast((a + 1) as string)) AS INT) + 1):int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 12 output
3 2
4 2
-125,26 +125,26 NULL 1
-- !query 13
-SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), STDDEV(a), SUM(a), COUNT(a)
+SELECT SKEWNESS(udf(a)), udf(KURTOSIS(a)), udf(MIN(a)), MAX(udf(a)), udf(AVG(udf(a))), udf(VARIANCE(a)), STDDEV(udf(a)), udf(SUM(a)), udf(COUNT(a))
FROM testData
-- !query 13 schema
-struct<skewness(CAST(a AS DOUBLE)):double,kurtosis(CAST(a AS DOUBLE)):double,min(a):int,max(a):int,avg(a):double,var_samp(CAST(a AS DOUBLE)):double,stddev_samp(CAST(a AS DOUBLE)):double,sum(a):bigint,count(a):bigint>
+struct<skewness(CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double,CAST(udf(cast(kurtosis(cast(a as double)) as string)) AS DOUBLE):double,CAST(udf(cast(min(a) as string)) AS INT):int,max(CAST(udf(cast(a as string)) AS INT)):int,CAST(udf(cast(avg(cast(cast(udf(cast(a as string)) as int) as bigint)) as string)) AS DOUBLE):double,CAST(udf(cast(var_samp(cast(a as double)) as string)) AS DOUBLE):double,stddev_samp(CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double,CAST(udf(cast(sum(cast(a as bigint)) as string)) AS BIGINT):bigint,CAST(udf(cast(count(a) as string)) AS BIGINT):bigint>
-- !query 13 output
-0.2723801058145729 -1.5069204152249134 1 3 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
-- !query 14
-SELECT COUNT(DISTINCT b), COUNT(DISTINCT b, c) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY a
+SELECT COUNT(DISTINCT udf(b)), udf(COUNT(DISTINCT b, c)) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY udf(a)
-- !query 14 schema
-struct<count(DISTINCT b):bigint,count(DISTINCT b, c):bigint>
+struct<count(DISTINCT CAST(udf(cast(b as string)) AS INT)):bigint,CAST(udf(cast(count(distinct b, c) as string)) AS BIGINT):bigint>
-- !query 14 output
1 1
-- !query 15
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k
+SELECT udf(a) AS k, COUNT(udf(b)) FROM testData GROUP BY k
-- !query 15 schema
-struct<k:int,count(b):bigint>
+struct<k:int,count(CAST(udf(cast(b as string)) AS INT)):bigint>
-- !query 15 output
1 2
2 2
-153,21 +153,21 NULL 1
-- !query 16
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k HAVING k > 1
+SELECT a AS k, udf(COUNT(b)) FROM testData GROUP BY k HAVING k > 1
-- !query 16 schema
-struct<k:int,count(b):bigint>
+struct<k:int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 16 output
2 2
3 2
-- !query 17
-SELECT COUNT(b) AS k FROM testData GROUP BY k
+SELECT udf(COUNT(b)) AS k FROM testData GROUP BY k
-- !query 17 schema
struct<>
-- !query 17 output
org.apache.spark.sql.AnalysisException
-aggregate functions are not allowed in GROUP BY, but found count(testdata.`b`);
+aggregate functions are not allowed in GROUP BY, but found CAST(udf(cast(count(b) as string)) AS BIGINT);
-- !query 18
-180,7 +180,7 struct<>
-- !query 19
-SELECT k AS a, COUNT(v) FROM testDataHasSameNameWithAlias GROUP BY a
+SELECT k AS a, udf(COUNT(udf(v))) FROM testDataHasSameNameWithAlias GROUP BY udf(a)
-- !query 19 schema
struct<>
-- !query 19 output
-197,32 +197,32 spark.sql.groupByAliases false
-- !query 21
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k
+SELECT a AS k, udf(COUNT(udf(b))) FROM testData GROUP BY k
-- !query 21 schema
struct<>
-- !query 21 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`k`' given input columns: [testdata.a, testdata.b]; line 1 pos 47
+cannot resolve '`k`' given input columns: [testdata.a, testdata.b]; line 1 pos 57
-- !query 22
-SELECT a, COUNT(1) FROM testData WHERE false GROUP BY a
+SELECT udf(a), COUNT(udf(1)) FROM testData WHERE false GROUP BY udf(a)
-- !query 22 schema
-struct<a:int,count(1):bigint>
+struct<CAST(udf(cast(a as string)) AS INT):int,count(CAST(udf(cast(1 as string)) AS INT)):bigint>
-- !query 22 output
-- !query 23
-SELECT COUNT(1) FROM testData WHERE false
+SELECT udf(COUNT(1)) FROM testData WHERE false
-- !query 23 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 23 output
0
-- !query 24
-SELECT 1 FROM (SELECT COUNT(1) FROM testData WHERE false) t
+SELECT 1 FROM (SELECT udf(COUNT(1)) FROM testData WHERE false) t
-- !query 24 schema
struct<1:int>
-- !query 24 output
-232,7 +232,7 struct<1:int>
-- !query 25
SELECT 1 from (
SELECT 1 AS z,
- MIN(a.x)
+ udf(MIN(a.x))
FROM (select 1 as x) a
WHERE false
) b
-244,32 +244,32 struct<1:int>
-- !query 26
-SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
+SELECT corr(DISTINCT x, y), udf(corr(DISTINCT y, x)), count(*)
FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y)
-- !query 26 schema
-struct<corr(DISTINCT CAST(x AS DOUBLE), CAST(y AS DOUBLE)):double,corr(DISTINCT CAST(y AS DOUBLE), CAST(x AS DOUBLE)):double,count(1):bigint>
+struct<corr(DISTINCT CAST(x AS DOUBLE), CAST(y AS DOUBLE)):double,CAST(udf(cast(corr(distinct cast(y as double), cast(x as double)) as string)) AS DOUBLE):double,count(1):bigint>
-- !query 26 output
1.0 1.0 3
-- !query 27
-SELECT 1 FROM range(10) HAVING true
+SELECT udf(1) FROM range(10) HAVING true
-- !query 27 schema
-struct<1:int>
+struct<CAST(udf(cast(1 as string)) AS INT):int>
-- !query 27 output
1
-- !query 28
-SELECT 1 FROM range(10) HAVING MAX(id) > 0
+SELECT udf(udf(1)) FROM range(10) HAVING MAX(id) > 0
-- !query 28 schema
-struct<1:int>
+struct<CAST(udf(cast(cast(udf(cast(1 as string)) as int) as string)) AS INT):int>
-- !query 28 output
1
-- !query 29
-SELECT id FROM range(10) HAVING id > 0
+SELECT udf(id) FROM range(10) HAVING id > 0
-- !query 29 schema
struct<>
-- !query 29 output
-291,33 +291,33 struct<>
-- !query 31
-SELECT every(v), some(v), any(v) FROM test_agg WHERE 1 = 0
+SELECT udf(every(v)), udf(some(v)), any(v) FROM test_agg WHERE 1 = 0
-- !query 31 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<CAST(udf(cast(every(v) as string)) AS BOOLEAN):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
-- !query 31 output
NULL NULL NULL
-- !query 32
-SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 4
+SELECT udf(every(udf(v))), some(v), any(v) FROM test_agg WHERE k = 4
-- !query 32 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<CAST(udf(cast(every(cast(udf(cast(v as string)) as boolean)) as string)) AS BOOLEAN):boolean,some(v):boolean,any(v):boolean>
-- !query 32 output
NULL NULL NULL
-- !query 33
-SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 5
+SELECT every(v), udf(some(v)), any(v) FROM test_agg WHERE k = 5
-- !query 33 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<every(v):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
-- !query 33 output
false true true
-- !query 34
-SELECT k, every(v), some(v), any(v) FROM test_agg GROUP BY k
+SELECT udf(k), every(v), udf(some(v)), any(v) FROM test_agg GROUP BY udf(k)
-- !query 34 schema
-struct<k:int,every(v):boolean,some(v):boolean,any(v):boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,every(v):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
-- !query 34 output
1 false true true
2 true true true
-327,9 +327,9 struct<k:int,every(v):boolean,some(v):boolean,any(v):boolean>
-- !query 35
-SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) = false
+SELECT udf(k), every(v) FROM test_agg GROUP BY k HAVING every(v) = false
-- !query 35 schema
-struct<k:int,every(v):boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,every(v):boolean>
-- !query 35 output
1 false
3 false
-337,77 +337,77 struct<k:int,every(v):boolean>
-- !query 36
-SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) IS NULL
+SELECT udf(k), udf(every(v)) FROM test_agg GROUP BY udf(k) HAVING every(v) IS NULL
-- !query 36 schema
-struct<k:int,every(v):boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,CAST(udf(cast(every(v) as string)) AS BOOLEAN):boolean>
-- !query 36 output
4 NULL
-- !query 37
-SELECT k,
- Every(v) AS every
+SELECT udf(k),
+ udf(Every(v)) AS every
FROM test_agg
WHERE k = 2
AND v IN (SELECT Any(v)
FROM test_agg
WHERE k = 1)
-GROUP BY k
+GROUP BY udf(k)
-- !query 37 schema
-struct<k:int,every:boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,every:boolean>
-- !query 37 output
2 true
-- !query 38
-SELECT k,
+SELECT udf(udf(k)),
Every(v) AS every
FROM test_agg
WHERE k = 2
AND v IN (SELECT Every(v)
FROM test_agg
WHERE k = 1)
-GROUP BY k
+GROUP BY udf(udf(k))
-- !query 38 schema
-struct<k:int,every:boolean>
+struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,every:boolean>
-- !query 38 output
-- !query 39
-SELECT every(1)
+SELECT every(udf(1))
-- !query 39 schema
struct<>
-- !query 39 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'every(1)' due to data type mismatch: Input to function 'every' should have been boolean, but it's [int].; line 1 pos 7
+cannot resolve 'every(CAST(udf(cast(1 as string)) AS INT))' due to data type mismatch: Input to function 'every' should have been boolean, but it's [int].; line 1 pos 7
-- !query 40
-SELECT some(1S)
+SELECT some(udf(1S))
-- !query 40 schema
struct<>
-- !query 40 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'some(1S)' due to data type mismatch: Input to function 'some' should have been boolean, but it's [smallint].; line 1 pos 7
+cannot resolve 'some(CAST(udf(cast(1 as string)) AS SMALLINT))' due to data type mismatch: Input to function 'some' should have been boolean, but it's [smallint].; line 1 pos 7
-- !query 41
-SELECT any(1L)
+SELECT any(udf(1L))
-- !query 41 schema
struct<>
-- !query 41 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'any(1L)' due to data type mismatch: Input to function 'any' should have been boolean, but it's [bigint].; line 1 pos 7
+cannot resolve 'any(CAST(udf(cast(1 as string)) AS BIGINT))' due to data type mismatch: Input to function 'any' should have been boolean, but it's [bigint].; line 1 pos 7
-- !query 42
-SELECT every("true")
+SELECT udf(every("true"))
-- !query 42 schema
struct<>
-- !query 42 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7
+cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 11
-- !query 43
-428,9 +428,9 struct<k:int,v:boolean,every(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST
-- !query 44
-SELECT k, v, some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
+SELECT k, udf(udf(v)), some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
-- !query 44 schema
-struct<k:int,v:boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
+struct<k:int,CAST(udf(cast(cast(udf(cast(v as string)) as boolean) as string)) AS BOOLEAN):boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
-- !query 44 output
1 false false
1 true true
-445,9 +445,9 struct<k:int,v:boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST R
-- !query 45
-SELECT k, v, any(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
+SELECT udf(udf(k)), v, any(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
-- !query 45 schema
-struct<k:int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
+struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
-- !query 45 output
1 false false
1 true true
-462,17 +462,17 struct<k:int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RA
-- !query 46
-SELECT count(*) FROM test_agg HAVING count(*) > 1L
+SELECT udf(count(*)) FROM test_agg HAVING count(*) > 1L
-- !query 46 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 46 output
10
-- !query 47
-SELECT k, max(v) FROM test_agg GROUP BY k HAVING max(v) = true
+SELECT k, udf(max(v)) FROM test_agg GROUP BY k HAVING max(v) = true
-- !query 47 schema
-struct<k:int,max(v):boolean>
+struct<k:int,CAST(udf(cast(max(v) as string)) AS BOOLEAN):boolean>
-- !query 47 output
1 true
2 true
-480,7 +480,7 struct<k:int,max(v):boolean>
-- !query 48
-SELECT * FROM (SELECT COUNT(*) AS cnt FROM test_agg) WHERE cnt > 1L
+SELECT * FROM (SELECT udf(COUNT(*)) AS cnt FROM test_agg) WHERE cnt > 1L
-- !query 48 schema
struct<cnt:bigint>
-- !query 48 output
-488,7 +488,7 struct<cnt:bigint>
-- !query 49
-SELECT count(*) FROM test_agg WHERE count(*) > 1L
+SELECT udf(count(*)) FROM test_agg WHERE count(*) > 1L
-- !query 49 schema
struct<>
-- !query 49 output
-500,7 +500,7 Invalid expressions: [count(1)];
-- !query 50
-SELECT count(*) FROM test_agg WHERE count(*) + 1L > 1L
+SELECT udf(count(*)) FROM test_agg WHERE count(*) + 1L > 1L
-- !query 50 schema
struct<>
-- !query 50 output
-512,7 +512,7 Invalid expressions: [count(1)];
-- !query 51
-SELECT count(*) FROM test_agg WHERE k = 1 or k = 2 or count(*) + 1L > 1L or max(k) > 1
+SELECT udf(count(*)) FROM test_agg WHERE k = 1 or k = 2 or count(*) + 1L > 1L or max(k) > 1
-- !query 51 schema
struct<>
-- !query 51 output
```
</p>
</details>
## How was this patch tested?
Tested as instructed in SPARK-27921.
Closes#25360 from skonto/group-by-followup.
Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
While processing the Rowdata in the server side ColumnValue BigDecimal type value processed by server has to converted to the HiveDecmal data type for successful processing of query using Hive ODBC client.As per current logic corresponding to the Decimal column datatype, the Spark server uses BigDecimal, and the ODBC client uses HiveDecimal. If the data type does not match, the client fail to parse
Since this handing was missing the query executed in Hive ODBC client wont return or provides result to the user even though the decimal type column value data present.
## How was this patch tested?
Manual test report and impact assessment is done using existing test-cases
Before fix
![decimal_odbc](https://user-images.githubusercontent.com/12999161/53440179-e74a7f00-3a29-11e9-93db-83f2ae37ef16.PNG)
After Fix
![hive_odbc](https://user-images.githubusercontent.com/12999161/53679519-70e0a200-3cf3-11e9-9437-9c27d2e5056d.PNG)
Closes#23899 from sujith71955/master_decimalissue.
Authored-by: s71955 <sujithchacko.2010@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Restored comments in `date.sql` removed by 924d794a6f and 997d153e54 . The comments was introduced by 51379b731d .
## How was this patch tested?
By re-running `date.sql` via:
```shell
$ build/sbt "sql/test-only *SQLQueryTestSuite -- -z date.sql"
```
Closes#25422 from MaxGekk/sql-comments-followup.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR moves `udf_radians` from `HiveCompatibilitySuite` to `HiveQuerySuite` to make it easy to test with JDK 11 because it returns different value from JDK 9:
```java
public class TestRadians {
public static void main(String[] args) {
System.out.println(java.lang.Math.toRadians(57.2958));
}
}
```
```sh
[rootspark-3267648 ~]# javac TestRadians.java
[rootspark-3267648 ~]# /usr/lib/jdk-9.0.4+11/bin/java TestRadians
1.0000003575641672
[rootspark-3267648 ~]# /usr/lib/jdk-11.0.3/bin/java TestRadians
1.0000003575641672
[rootspark-3267648 ~]# /usr/lib/jdk8u222-b10/bin/java TestRadians
1.000000357564167
```
## How was this patch tested?
manual tests
Closes#25417 from wangyum/SPARK-28686.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
It seems Datanucleus 3.x can not support JDK 11:
```java
[info] Cause: org.datanucleus.exceptions.NucleusException: The java type java.lang.Long (jdbc-type="", sql-type="") cant be mapped for this datastore. No mapping is available.
[info] at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.getDatastoreMappingClass(RDBMSMappingManager.java:1215)
[info] at org.datanucleus.store.rdbms.mapping.RDBMSMappingManager.createDatastoreMapping(RDBMSMappingManager.java:1378)
[info] at org.datanucleus.store.rdbms.table.AbstractClassTable.addDatastoreId(AbstractClassTable.java:392)
[info] at org.datanucleus.store.rdbms.table.ClassTable.initializePK(ClassTable.java:1087)
[info] at org.datanucleus.store.rdbms.table.ClassTable.preInitialize(ClassTable.java:247)
```
Hive upgrade Datanucleus to 4.x from Hive 2.0([HIVE-6113](https://issues.apache.org/jira/browse/HIVE-6113)). This PR makes it skip `0.12`, `0.13`, `0.14`, `1.0`, `1.1` and `1.2` when testing with JDK 11.
Note that, this pr will not fix sql read hive materialized view. It's another issue:
```
3.0: sql read hive materialized view *** FAILED *** (1 second, 521 milliseconds)
3.1: sql read hive materialized view *** FAILED *** (1 second, 536 milliseconds)
```
## How was this patch tested?
manual tests:
```shell
export JAVA_HOME="/usr/lib/jdk-11.0.3"
build/sbt "hive/test-only *.VersionsSuite *.HiveClientSuites" -Phive -Phadoop-3.2
```
Closes#25405 from wangyum/SPARK-28685.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR replaces `CatalogUtils.maskCredentials` with `SQLConf.get.redactOptions` to match other redacts.
## How was this patch tested?
unit test and manual tests:
Before this PR:
```sql
spark-sql> DESC EXTENDED test_spark_28675;
id int NULL
# Detailed Table Information
Database default
Table test_spark_28675
Owner root
Created Time Fri Aug 09 08:23:17 GMT-07:00 2019
Last Access Wed Dec 31 17:00:00 GMT-07:00 1969
Created By Spark 3.0.0-SNAPSHOT
Type MANAGED
Provider org.apache.spark.sql.jdbc
Location file:/user/hive/warehouse/test_spark_28675
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675';
default test_spark_28675 false Database: default
Table: test_spark_28675
Owner: root
Created Time: Fri Aug 09 08:23:17 GMT-07:00 2019
Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969
Created By: Spark 3.0.0-SNAPSHOT
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Location: file:/user/hive/warehouse/test_spark_28675
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [url=###, driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
Schema: root
|-- id: integer (nullable = true)
```
After this PR:
```sql
spark-sql> DESC EXTENDED test_spark_28675;
id int NULL
# Detailed Table Information
Database default
Table test_spark_28675
Owner root
Created Time Fri Aug 09 08:19:49 GMT-07:00 2019
Last Access Wed Dec 31 17:00:00 GMT-07:00 1969
Created By Spark 3.0.0-SNAPSHOT
Type MANAGED
Provider org.apache.spark.sql.jdbc
Location file:/user/hive/warehouse/test_spark_28675
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties [url=*********(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
spark-sql> SHOW TABLE EXTENDED LIKE 'test_spark_28675';
default test_spark_28675 false Database: default
Table: test_spark_28675
Owner: root
Created Time: Fri Aug 09 08:19:49 GMT-07:00 2019
Last Access: Wed Dec 31 17:00:00 GMT-07:00 1969
Created By: Spark 3.0.0-SNAPSHOT
Type: MANAGED
Provider: org.apache.spark.sql.jdbc
Location: file:/user/hive/warehouse/test_spark_28675
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [url=*********(redacted), driver=com.mysql.jdbc.Driver, dbtable=test_spark_28675]
Schema: root
|-- id: integer (nullable = true)
```
Closes#25395 from wangyum/SPARK-28675.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR fixed typos in comments and replace the explicit type with '<>' for Java 8+.
## How was this patch tested?
Manually tested.
Closes#25338 from younggyuchun/younggyu.
Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose new expressions `Millennium`, `Century` and `Decade`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):
1. `millennium` - the current millennium for given date (or a timestamp implicitly casted to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
2. `century` - the current millennium for given date (or timestamp). The first century starts at 0001-01-01 AD.
3. `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10.
Here are examples:
```sql
spark-sql> SELECT EXTRACT(MILLENNIUM FROM DATE '1981-01-19');
2
spark-sql> SELECT EXTRACT(CENTURY FROM DATE '1981-01-19');
20
spark-sql> SELECT EXTRACT(DECADE FROM DATE '1981-01-19');
198
```
## How was this patch tested?
Added new tests to `DateExpressionsSuite` and uncommented existing tests in `pgSQL/date.sql`.
Closes#25388 from MaxGekk/extract-ext2.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR is a follow-up to https://github.com/apache/spark/pull/24918
## How was this patch tested?
Pass the Jenkins with the newly update test files.
Closes#25393 from beliefer/enable-overlay-tests.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
Right now, batch DataFrame always changes the schema to nullable automatically (See this line: 325bc8e9c6/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L399)). But streaming file source is missing this.
This PR updates the streaming file source schema to force it be nullable. I also added a flag `spark.sql.streaming.fileSource.schema.forceNullable` to disable this change since some users may rely on the old behavior.
## How was this patch tested?
The new unit test.
Closes#25382 from zsxwing/SPARK-28651.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Adds support for V2 catalogs and the V2SessionCatalog for V2 tables for saveAsTable.
If the table can resolve through the V2SessionCatalog, we use SaveMode for datasource v1 for backwards compatibility to select the code path we're going to hit.
Depending on the SaveMode:
- SaveMode.Append:
a) If table exists: Use AppendData.byName
b) If table doesn't exist, use CTAS (ignoreIfExists = false)
- SaveMode.Overwrite: Use RTAS (orCreate = true)
- SaveMode.Ignore: Use CTAS (ignoreIfExists = true)
- SaveMode.ErrorIfExists: Use CTAS (ignoreIfExists = false)
## How was this patch tested?
Unit tests in DataSourceV2DataFrameSuite
Closes#25330 from brkyvz/saveAsTable.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
I propose new levels of truncations for the `date_trunc()` and `trunc()` functions:
1. `MICROSECOND` and `MILLISECOND` truncate values of the `TIMESTAMP` type to microsecond and millisecond precision.
2. `DECADE`, `CENTURY` and `MILLENNIUM` truncate dates/timestamps to lowest date of current decade/century/millennium.
Also the `WEEK` and `QUARTER` levels have been supported by the `trunc()` function.
The function is implemented similarly to `date_trunc` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC to maintain feature parity with it.
Here are examples of `TRUNC`:
```sql
spark-sql> SELECT TRUNC('2015-10-27', 'DECADE');
2010-01-01
spark-sql> set spark.sql.datetime.java8API.enabled=true;
spark.sql.datetime.java8API.enabled true
spark-sql> SELECT TRUNC('1999-10-27', 'millennium');
1001-01-01
```
Examples of `DATE_TRUNC`:
```sql
spark-sql> SELECT DATE_TRUNC('CENTURY', '2015-03-05T09:32:05.123456');
2001-01-01T00:00:00Z
```
## How was this patch tested?
Added new tests to `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`, and uncommented existing tests in `pgSQL/date.sql`.
Closes#25336 from MaxGekk/date_truct-ext.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Adds checks around:
- The existence of transforms in the table schema (even in nested fields)
- Duplications of transforms
- Case sensitivity checks around column names
in the V2 table creation code paths.
## How was this patch tested?
Unit tests.
Closes#25305 from brkyvz/v2CreateTable.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Plans after "Extract Python UDFs" are very flaky and error-prone to other rules.
For instance, if we add some rules, for instance, `PushDownPredicates` in `postHocOptimizationBatches`, the test in `BatchEvalPythonExecSuite` fails:
```scala
test("Python UDF refers to the attributes from more than one child") {
val df = Seq(("Hello", 4)).toDF("a", "b")
val df2 = Seq(("Hello", 4)).toDF("c", "d")
val joinDF = df.crossJoin(df2).where("dummyPythonUDF(a, c) == dummyPythonUDF(d, c)")
val qualifiedPlanNodes = joinDF.queryExecution.executedPlan.collect {
case b: BatchEvalPythonExec => b
}
assert(qualifiedPlanNodes.size == 1)
}
```
```
Invalid PythonUDF dummyUDF(a#63, c#74), requires attributes from more than one child.
```
This is because Python UDF extraction optimization is rolled back as below:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.PushDownPredicates ===
!Filter (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18)) Join Cross, (dummyUDF(a#7, c#18) = dummyUDF(d#19, c#18))
!+- Join Cross :- Project [_1#2 AS a#7, _2#3 AS b#8]
! :- Project [_1#2 AS a#7, _2#3 AS b#8] : +- LocalRelation [_1#2, _2#3]
! : +- LocalRelation [_1#2, _2#3] +- Project [_1#13 AS c#18, _2#14 AS d#19]
! +- Project [_1#13 AS c#18, _2#14 AS d#19] +- LocalRelation [_1#13, _2#14]
! +- LocalRelation [_1#13, _2#14]
```
Seems we should do Python UDFs cases at the last even after post hoc rules.
Note that this actually rather follows the way in previous versions when those were in physical plans (see SPARK-24721 and SPARK-12981). Those optimization rules were supposed to be placed at the end.
Note that I intentionally didn't move `ExperimentalMethods` (`spark.experimental.extraStrategies`). This is an explicit experimental API and I wanted to just-in-case workaround after this change for now.
## How was this patch tested?
Existing tests should cover.
Closes#25386 from HyukjinKwon/SPARK-28654.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR port [HIVE-10646](https://issues.apache.org/jira/browse/HIVE-10646) to fix Hive 0.12's JDBC client can not handle `NULL_TYPE`:
```sql
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:10000> select null;
org.apache.thrift.transport.TTransportException
at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
at org.apache.thrift.transport.TSaslTransport.readLength(TSaslTransport.java:346)
at org.apache.thrift.transport.TSaslTransport.readFrame(TSaslTransport.java:423)
at org.apache.thrift.transport.TSaslTransport.read(TSaslTransport.java:405)
```
Server log:
```
19/08/07 09:34:07 ERROR TThreadPoolServer: Error occurred during processing of message.
java.lang.NullPointerException
at org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:388)
at org.apache.hive.service.cli.thrift.TRow$TRowStandardScheme.write(TRow.java:338)
at org.apache.hive.service.cli.thrift.TRow.write(TRow.java:288)
at org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:605)
at org.apache.hive.service.cli.thrift.TRowSet$TRowSetStandardScheme.write(TRowSet.java:525)
at org.apache.hive.service.cli.thrift.TRowSet.write(TRowSet.java:455)
at org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:550)
at org.apache.hive.service.cli.thrift.TFetchResultsResp$TFetchResultsRespStandardScheme.write(TFetchResultsResp.java:486)
at org.apache.hive.service.cli.thrift.TFetchResultsResp.write(TFetchResultsResp.java:412)
at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13192)
at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result$FetchResults_resultStandardScheme.write(TCLIService.java:13156)
at org.apache.hive.service.cli.thrift.TCLIService$FetchResults_result.write(TCLIService.java:13107)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:58)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:819)
```
## How was this patch tested?
unit tests
Closes#25378 from wangyum/SPARK-28644.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR fix Hive 0.12 JDBC client can not handle binary type:
```sql
Connected to: Hive (version 3.0.0-SNAPSHOT)
Driver: Hive (version 0.12.0)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Beeline version 0.12.0 by Apache Hive
0: jdbc:hive2://localhost:10000> SELECT cast('ABC' as binary);
Error: java.lang.ClassCastException: [B incompatible with java.lang.String (state=,code=0)
```
Server log:
```
19/08/07 10:10:04 WARN ThriftCLIService: Error fetching results:
java.lang.RuntimeException: java.lang.ClassCastException: [B incompatible with java.lang.String
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:83)
at org.apache.hive.service.cli.session.HiveSessionProxy.access$000(HiveSessionProxy.java:36)
at org.apache.hive.service.cli.session.HiveSessionProxy$1.run(HiveSessionProxy.java:63)
at java.security.AccessController.doPrivileged(AccessController.java:770)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:59)
at com.sun.proxy.$Proxy26.fetchResults(Unknown Source)
at org.apache.hive.service.cli.CLIService.fetchResults(CLIService.java:455)
at org.apache.hive.service.cli.thrift.ThriftCLIService.FetchResults(ThriftCLIService.java:621)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1553)
at org.apache.hive.service.cli.thrift.TCLIService$Processor$FetchResults.getResult(TCLIService.java:1538)
at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:38)
at org.apache.thrift.TBaseProcessor.process(TBaseProcessor.java:39)
at org.apache.hive.service.auth.TSetIpAddressProcessor.process(TSetIpAddressProcessor.java:53)
at org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:310)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:819)
Caused by: java.lang.ClassCastException: [B incompatible with java.lang.String
at org.apache.hive.service.cli.ColumnValue.toTColumnValue(ColumnValue.java:198)
at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:60)
at org.apache.hive.service.cli.RowBasedSet.addRow(RowBasedSet.java:32)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.$anonfun$getNextRowSet$1(SparkExecuteStatementOperation.scala:151)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation$$Lambda$1923.000000009113BFE0.apply(Unknown Source)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.withSchedulerPool(SparkExecuteStatementOperation.scala:299)
at org.apache.spark.sql.hive.thriftserver.SparkExecuteStatementOperation.getNextRowSet(SparkExecuteStatementOperation.scala:113)
at org.apache.hive.service.cli.operation.OperationManager.getOperationNextRowSet(OperationManager.java:220)
at org.apache.hive.service.cli.session.HiveSessionImpl.fetchResults(HiveSessionImpl.java:785)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hive.service.cli.session.HiveSessionProxy.invoke(HiveSessionProxy.java:78)
... 18 more
```
## How was this patch tested?
unit tests
Closes#25379 from wangyum/SPARK-28474.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Fix the typo in java doc.
## How was this patch tested?
N/A
Signed-off-by: Yishuang Lu <luystugmail.com>
Closes#25377 from lys0716/dev.
Authored-by: Yishuang Lu <luystu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR makes `spark.sql.function.preferIntegralDivision` to internal configuration because it is only used for PostgreSQL test cases.
More details:
https://github.com/apache/spark/pull/25158#discussion_r309764541
## How was this patch tested?
N/A
Closes#25376 from wangyum/SPARK-28395-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In `Catalogs.load`, the `pluginClassName` in the following code
```
String pluginClassName = conf.getConfString("spark.sql.catalog." + name, null);
```
is always null for built-in catalogs, e.g there is a SQLConf entry `spark.sql.catalog.session`.
This is because of https://github.com/apache/spark/pull/18852: SQLConf.conf.getConfString(key, null) always returns null.
## How was this patch tested?
Apply code changes of https://github.com/apache/spark/pull/24768 and tried loading session catalog.
Closes#25094 from gengliangwang/fixCatalogLoad.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
The flag `spark.sql.decimalOperations.nullOnOverflow` is not honored by the `Cast` operator. This means that a casting which causes an overflow currently returns `null`.
The PR makes `Cast` respecting that flag, ie. when it is turned to false and a decimal overflow occurs, an exception id thrown.
## How was this patch tested?
Added UT
Closes#25253 from mgaido91/SPARK-28470.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
Subqueries do not have their own execution id, thus when calling `AdaptiveSparkPlanExec.onUpdatePlan`, it will actually get the `QueryExecution` instance of the main query, which is wasteful and problematic. It could cause issues like stack overflow or dead locks in some circumstances.
This PR fixes this issue by making `AdaptiveSparkPlanExec` compare the `QueryExecution` object retrieved by current execution ID against the `QueryExecution` object from which this plan is created, and only update the UI when the two instances are the same.
## How was this patch tested?
Manual tests on TPC-DS queries.
Closes#25316 from maryannxue/aqe-updateplan-fix.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: herman <herman@databricks.com>
## What changes were proposed in this pull request?
Sometimes when you explain a query, you will get stuck for a while. What's worse, you will get stuck again if you explain again.
This is caused by `FileSourceScanExec`:
1. In its `toString`, it needs to report the number of partitions it reads. This needs to query the hive metastore.
2. In its `outputOrdering`, it needs to get all the files. This needs to query the hive metastore.
This PR fixes by:
1. `toString` do not need to report the number of partitions it reads. We should report it via SQL metrics.
2. The `outputOrdering` is not very useful. We can only apply it if a) all the bucket columns are read. b) there is only one file in each bucket. This condition is really hard to meet, and even if we meet, sorting an already sorted file is pretty fast and avoiding the sort is not that useful. I think it's worth to give up this optimization so that explain don't need to get stuck.
## How was this patch tested?
existing tests
Closes#25328 from cloud-fan/ui.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR is a follow-up to https://github.com/apache/spark/pull/25074
## How was this patch tested?
Pass the Jenkins with the newly update test files.
Closes#25366 from beliefer/uncomment-boolean-test.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Implements the `DESCRIBE TABLE` logical and physical plans for data source v2 tables.
## How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.
Closes#25040 from mccheah/describe-table-v2.
Authored-by: mcheah <mcheah@palantir.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Add a guide line for dataframe functions, say:
```
This function APIs usually have methods with Column signature only because it can support not only Column but also other types such as a native string. The other variants currently exist for historical reasons.
```
## How was this patch tested?
N/A
Closes#25355 from WeichenXu123/update_functions_guide2.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Add's the higher order function `forall`, which tests an array to see if a predicate holds for every element.
The function is implemented in `org.apache.spark.sql.catalyst.expressions.ArrayForAll`.
The function is added to the function registry under the pretty name `forall`.
## How was this patch tested?
I've added appropriate unit tests for the new ArrayForAll expression in
`sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala`.
Also added tests for the function in `sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala`.
Not sure who is best to ask about this PR so:
HyukjinKwon rxin gatorsmile ueshin srowen hvanhovell gatorsmile
Closes#24761 from nvander1/feature/for_all.
Lead-authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Co-authored-by: Nik <nikolasrvanderhoof@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
## What changes were proposed in this pull request?
In the PR, I propose to use existing expressions `DayOfYear`, `WeekDay` and `DayOfWeek`, and support additional parameters of `extract()` for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT):
1. `dow` - the day of the week as Sunday (0) to Saturday (6)
2. `isodow` - the day of the week as Monday (1) to Sunday (7)
3. `doy` - the day of the year (1 - 365/366)
Here are examples:
```sql
spark-sql> SELECT EXTRACT(DOW FROM TIMESTAMP '2001-02-16 20:38:40');
5
spark-sql> SELECT EXTRACT(ISODOW FROM TIMESTAMP '2001-02-18 20:38:40');
7
spark-sql> SELECT EXTRACT(DOY FROM TIMESTAMP '2001-02-16 20:38:40');
47
```
## How was this patch tested?
Updated `extract.sql`.
Closes#25367 from MaxGekk/extract-ext.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR targets to rename `PullOutPythonUDFInJoinCondition` to `ExtractPythonUDFFromJoinCondition` and move to 'Extract Python UDFs' together with other Python UDF related rules.
Currently `PullOutPythonUDFInJoinCondition` rule is alone outside of other 'Extract Python UDFs' rules together.
and the name `ExtractPythonUDFFromJoinCondition` is matched to existing Python UDF extraction rules.
## How was this patch tested?
Existing tests should cover.
Closes#25358 from HyukjinKwon/move-python-join-rule.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR adds UDF cases into group by clause in 'pgSQL/select_implicit.sql'
<details><summary>Diff comparing to 'pgSQL/select_implicit.sql'</summary>
<p>
```diff
diff --git a/home/root1/src/spark/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-select_implicit.sql.out b/home/root1/src/spark/sql/core/src/test/resources/sql-tests/results/pgSQL/select_implicit.sql.out
index 17303b2..0675820 100755
--- a/home/root1/src/spark/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-select_implicit.sql.out
+++ b/home/root1/src/spark/sql/core/src/test/resources/sql-tests/results/pgSQL/select_implicit.sql.out
-91,11 +91,9 struct<>
-- !query 11
-SELECT udf(c), udf(count(*)) FROM test_missing_target GROUP BY
-udf(test_missing_target.c)
-ORDER BY udf(c)
+SELECT c, count(*) FROM test_missing_target GROUP BY test_missing_target.c ORDER BY c
-- !query 11 schema
-struct<CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<c:string,count(1):bigint>
-- !query 11 output
ABAB 2
BBBB 2
-106,10 +104,9 cccc 2
-- !query 12
-SELECT udf(count(*)) FROM test_missing_target GROUP BY udf(test_missing_target.c)
-ORDER BY udf(c)
+SELECT count(*) FROM test_missing_target GROUP BY test_missing_target.c ORDER BY c
-- !query 12 schema
-struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<count(1):bigint>
-- !query 12 output
2
2
-120,18 +117,18 struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 13
-SELECT udf(count(*)) FROM test_missing_target GROUP BY udf(a) ORDER BY udf(b)
+SELECT count(*) FROM test_missing_target GROUP BY a ORDER BY b
-- !query 13 schema
struct<>
-- !query 13 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`b`' given input columns: [CAST(udf(cast(count(1) as string)) AS BIGINT)]; line 1 pos 75
+cannot resolve '`b`' given input columns: [count(1)]; line 1 pos 61
-- !query 14
-SELECT udf(count(*)) FROM test_missing_target GROUP BY udf(b) ORDER BY udf(b)
+SELECT count(*) FROM test_missing_target GROUP BY b ORDER BY b
-- !query 14 schema
-struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<count(1):bigint>
-- !query 14 output
1
2
-140,10 +137,10 struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 15
-SELECT udf(test_missing_target.b), udf(count(*))
- FROM test_missing_target GROUP BY udf(b) ORDER BY udf(b)
+SELECT test_missing_target.b, count(*)
+ FROM test_missing_target GROUP BY b ORDER BY b
-- !query 15 schema
-struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<b:int,count(1):bigint>
-- !query 15 output
1 1
2 2
-152,9 +149,9 struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string)
-- !query 16
-SELECT udf(c) FROM test_missing_target ORDER BY udf(a)
+SELECT c FROM test_missing_target ORDER BY a
-- !query 16 schema
-struct<CAST(udf(cast(c as string)) AS STRING):string>
+struct<c:string>
-- !query 16 output
XXXX
ABAB
-169,10 +166,9 CCCC
-- !query 17
-SELECT udf(count(*)) FROM test_missing_target GROUP BY udf(b) ORDER BY udf(b)
-desc
+SELECT count(*) FROM test_missing_target GROUP BY b ORDER BY b desc
-- !query 17 schema
-struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<count(1):bigint>
-- !query 17 output
4
3
-181,17 +177,17 struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 18
-SELECT udf(count(*)) FROM test_missing_target ORDER BY udf(1) desc
+SELECT count(*) FROM test_missing_target ORDER BY 1 desc
-- !query 18 schema
-struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<count(1):bigint>
-- !query 18 output
10
-- !query 19
-SELECT udf(c), udf(count(*)) FROM test_missing_target GROUP BY 1 ORDER BY 1
+SELECT c, count(*) FROM test_missing_target GROUP BY 1 ORDER BY 1
-- !query 19 schema
-struct<CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<c:string,count(1):bigint>
-- !query 19 output
ABAB 2
BBBB 2
-202,30 +198,30 cccc 2
-- !query 20
-SELECT udf(c), udf(count(*)) FROM test_missing_target GROUP BY 3
+SELECT c, count(*) FROM test_missing_target GROUP BY 3
-- !query 20 schema
struct<>
-- !query 20 output
org.apache.spark.sql.AnalysisException
-GROUP BY position 3 is not in select list (valid range is [1, 2]); line 1 pos 63
+GROUP BY position 3 is not in select list (valid range is [1, 2]); line 1 pos 53
-- !query 21
-SELECT udf(count(*)) FROM test_missing_target x, test_missing_target y
- WHERE udf(x.a) = udf(y.a)
- GROUP BY udf(b) ORDER BY udf(b)
+SELECT count(*) FROM test_missing_target x, test_missing_target y
+ WHERE x.a = y.a
+ GROUP BY b ORDER BY b
-- !query 21 schema
struct<>
-- !query 21 output
org.apache.spark.sql.AnalysisException
-Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 14
+Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 10
-- !query 22
-SELECT udf(a), udf(a) FROM test_missing_target
- ORDER BY udf(a)
+SELECT a, a FROM test_missing_target
+ ORDER BY a
-- !query 22 schema
-struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
+struct<a:int,a:int>
-- !query 22 output
0 0
1 1
-240,10 +236,10 struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS IN
-- !query 23
-SELECT udf(udf(a)/2), udf(udf(a)/2) FROM test_missing_target
- ORDER BY udf(udf(a)/2)
+SELECT a/2, a/2 FROM test_missing_target
+ ORDER BY a/2
-- !query 23 schema
-struct<CAST(udf(cast((cast(udf(cast(a as string)) as int) div 2) as string)) AS INT):int,CAST(udf(cast((cast(udf(cast(a as string)) as int) div 2) as string)) AS INT):int>
+struct<(a div 2):int,(a div 2):int>
-- !query 23 output
0 0
0 0
-258,10 +254,10 struct<CAST(udf(cast((cast(udf(cast(a as string)) as int) div 2) as string)) AS
-- !query 24
-SELECT udf(a/2), udf(a/2) FROM test_missing_target
- GROUP BY udf(a/2) ORDER BY udf(a/2)
+SELECT a/2, a/2 FROM test_missing_target
+ GROUP BY a/2 ORDER BY a/2
-- !query 24 schema
-struct<CAST(udf(cast((a div 2) as string)) AS INT):int,CAST(udf(cast((a div 2) as string)) AS INT):int>
+struct<(a div 2):int,(a div 2):int>
-- !query 24 output
0 0
1 1
-271,11 +267,11 struct<CAST(udf(cast((a div 2) as string)) AS INT):int,CAST(udf(cast((a div 2) a
-- !query 25
-SELECT udf(x.b), udf(count(*)) FROM test_missing_target x, test_missing_target y
- WHERE udf(x.a) = udf(y.a)
- GROUP BY udf(x.b) ORDER BY udf(x.b)
+SELECT x.b, count(*) FROM test_missing_target x, test_missing_target y
+ WHERE x.a = y.a
+ GROUP BY x.b ORDER BY x.b
-- !query 25 schema
-struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<b:int,count(1):bigint>
-- !query 25 output
1 1
2 2
-284,11 +280,11 struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string)
-- !query 26
-SELECT udf(count(*)) FROM test_missing_target x, test_missing_target y
- WHERE udf(x.a) = udf(y.a)
- GROUP BY udf(x.b) ORDER BY udf(x.b)
+SELECT count(*) FROM test_missing_target x, test_missing_target y
+ WHERE x.a = y.a
+ GROUP BY x.b ORDER BY x.b
-- !query 26 schema
-struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
+struct<count(1):bigint>
-- !query 26 output
1
2
-297,22 +293,22 struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 27
-SELECT udf(a%2), udf(count(udf(b))) FROM test_missing_target
-GROUP BY udf(test_missing_target.a%2)
-ORDER BY udf(test_missing_target.a%2)
+SELECT a%2, count(b) FROM test_missing_target
+GROUP BY test_missing_target.a%2
+ORDER BY test_missing_target.a%2
-- !query 27 schema
-struct<CAST(udf(cast((a % 2) as string)) AS INT):int,CAST(udf(cast(count(cast(udf(cast(b as string)) as int)) as string)) AS BIGINT):bigint>
+struct<(a % 2):int,count(b):bigint>
-- !query 27 output
0 5
1 5
-- !query 28
-SELECT udf(count(c)) FROM test_missing_target
-GROUP BY udf(lower(test_missing_target.c))
-ORDER BY udf(lower(test_missing_target.c))
+SELECT count(c) FROM test_missing_target
+GROUP BY lower(test_missing_target.c)
+ORDER BY lower(test_missing_target.c)
-- !query 28 schema
-struct<CAST(udf(cast(count(c) as string)) AS BIGINT):bigint>
+struct<count(c):bigint>
-- !query 28 output
2
3
-321,18 +317,18 struct<CAST(udf(cast(count(c) as string)) AS BIGINT):bigint>
-- !query 29
-SELECT udf(count(udf(a))) FROM test_missing_target GROUP BY udf(a) ORDER BY udf(b)
+SELECT count(a) FROM test_missing_target GROUP BY a ORDER BY b
-- !query 29 schema
struct<>
-- !query 29 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`b`' given input columns: [CAST(udf(cast(count(cast(udf(cast(a as string)) as int)) as string)) AS BIGINT)]; line 1 pos 80
+cannot resolve '`b`' given input columns: [count(a)]; line 1 pos 61
-- !query 30
-SELECT udf(count(b)) FROM test_missing_target GROUP BY udf(b/2) ORDER BY udf(b/2)
+SELECT count(b) FROM test_missing_target GROUP BY b/2 ORDER BY b/2
-- !query 30 schema
-struct<CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
+struct<count(b):bigint>
-- !query 30 output
1
5
-340,10 +336,10 struct<CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 31
-SELECT udf(lower(test_missing_target.c)), udf(count(udf(c)))
- FROM test_missing_target GROUP BY udf(lower(c)) ORDER BY udf(lower(c))
+SELECT lower(test_missing_target.c), count(c)
+ FROM test_missing_target GROUP BY lower(c) ORDER BY lower(c)
-- !query 31 schema
-struct<CAST(udf(cast(lower(c) as string)) AS STRING):string,CAST(udf(cast(count(cast(udf(cast(c as string)) as string)) as string)) AS BIGINT):bigint>
+struct<lower(c):string,count(c):bigint>
-- !query 31 output
abab 2
bbbb 3
-352,9 +348,9 xxxx 1
-- !query 32
-SELECT udf(a) FROM test_missing_target ORDER BY udf(upper(udf(d)))
+SELECT a FROM test_missing_target ORDER BY upper(d)
-- !query 32 schema
-struct<CAST(udf(cast(a as string)) AS INT):int>
+struct<a:int>
-- !query 32 output
0
1
-369,33 +365,32 struct<CAST(udf(cast(a as string)) AS INT):int>
-- !query 33
-SELECT udf(count(b)) FROM test_missing_target
- GROUP BY udf((b + 1) / 2) ORDER BY udf((b + 1) / 2) desc
+SELECT count(b) FROM test_missing_target
+ GROUP BY (b + 1) / 2 ORDER BY (b + 1) / 2 desc
-- !query 33 schema
-struct<CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
+struct<count(b):bigint>
-- !query 33 output
7
3
-- !query 34
-SELECT udf(count(udf(x.a))) FROM test_missing_target x, test_missing_target y
- WHERE udf(x.a) = udf(y.a)
- GROUP BY udf(b/2) ORDER BY udf(b/2)
+SELECT count(x.a) FROM test_missing_target x, test_missing_target y
+ WHERE x.a = y.a
+ GROUP BY b/2 ORDER BY b/2
-- !query 34 schema
struct<>
-- !query 34 output
org.apache.spark.sql.AnalysisException
-Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 14
+Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 10
-- !query 35
-SELECT udf(x.b/2), udf(count(udf(x.b))) FROM test_missing_target x,
-test_missing_target y
- WHERE udf(x.a) = udf(y.a)
- GROUP BY udf(x.b/2) ORDER BY udf(x.b/2)
+SELECT x.b/2, count(x.b) FROM test_missing_target x, test_missing_target y
+ WHERE x.a = y.a
+ GROUP BY x.b/2 ORDER BY x.b/2
-- !query 35 schema
-struct<CAST(udf(cast((b div 2) as string)) AS INT):int,CAST(udf(cast(count(cast(udf(cast(b as string)) as int)) as string)) AS BIGINT):bigint>
+struct<(b div 2):int,count(b):bigint>
-- !query 35 output
0 1
1 5
-403,14 +398,14 struct<CAST(udf(cast((b div 2) as string)) AS INT):int,CAST(udf(cast(count(cast(
-- !query 36
-SELECT udf(count(udf(b))) FROM test_missing_target x, test_missing_target y
- WHERE udf(x.a) = udf(y.a)
- GROUP BY udf(x.b/2)
+SELECT count(b) FROM test_missing_target x, test_missing_target y
+ WHERE x.a = y.a
+ GROUP BY x.b/2
-- !query 36 schema
struct<>
-- !query 36 output
org.apache.spark.sql.AnalysisException
-Reference 'b' is ambiguous, could be: x.b, y.b.; line 1 pos 21
+Reference 'b' is ambiguous, could be: x.b, y.b.; line 1 pos 13
-- !query 37
```
</p>
</details>
## How was this patch tested?
Tested as Guided in SPARK-27921
Closes#25350 from Udbhav30/master.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR add supportColumnar in DebugExec. Seems there was a conflict between https://github.com/apache/spark/pull/25274 and https://github.com/apache/spark/pull/25264
Currently tests are broken in Jenkins:
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108687/https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108688/https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/108693/
```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree: ColumnarToRow +- InMemoryTableScan [id#356956L] +- InMemoryRelation [id#356956L], StorageLevel(disk, memory, deserialized, 1 replicas) +- *(1) Range (0, 5, step=1, splits=2)
Stacktrace
sbt.ForkMain$ForkError: org.apache.spark.sql.catalyst.errors.package$TreeNodeException: makeCopy, tree:
ColumnarToRow
+- InMemoryTableScan [id#356956L]
+- InMemoryRelation [id#356956L], StorageLevel(disk, memory, deserialized, 1 replicas)
+- *(1) Range (0, 5, step=1, splits=2)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:431)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:404)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:323)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:287)
```
## How was this patch tested?
Manually tested the failed test.
Closes#25365 from HyukjinKwon/SPARK-28537.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR is a followup of a fix as described in here: #25215 (comment)
<details><summary>Diff comparing to 'group-analytics.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
index 3439a05727..de297ab166 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
-13,9 +13,9 struct<>
-- !query 1
-SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH CUBE
+SELECT udf(a + b), b, udf(SUM(a - b)) FROM testData GROUP BY udf(a + b), b WITH CUBE
-- !query 1 schema
-struct<(a + b):int,b:int,sum((a - b)):bigint>
+struct<CAST(udf(cast((a + b) as string)) AS INT):int,b:int,CAST(udf(cast(sum(cast((a - b) as bigint)) as string)) AS BIGINT):bigint>
-- !query 1 output
2 1 0
2 NULL 0
-33,9 +33,9 NULL NULL 3
-- !query 2
-SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH CUBE
+SELECT udf(a), udf(b), SUM(b) FROM testData GROUP BY udf(a), b WITH CUBE
-- !query 2 schema
-struct<a:int,b:int,sum(b):bigint>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(b as string)) AS INT):int,sum(b):bigint>
-- !query 2 output
1 1 1
1 2 2
-52,9 +52,9 NULL NULL 9
-- !query 3
-SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP
+SELECT udf(a + b), b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP
-- !query 3 schema
-struct<(a + b):int,b:int,sum((a - b)):bigint>
+struct<CAST(udf(cast((a + b) as string)) AS INT):int,b:int,sum((a - b)):bigint>
-- !query 3 output
2 1 0
2 NULL 0
-70,9 +70,9 NULL NULL 3
-- !query 4
-SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH ROLLUP
+SELECT udf(a), b, udf(SUM(b)) FROM testData GROUP BY udf(a), b WITH ROLLUP
-- !query 4 schema
-struct<a:int,b:int,sum(b):bigint>
+struct<CAST(udf(cast(a as string)) AS INT):int,b:int,CAST(udf(cast(sum(cast(b as bigint)) as string)) AS BIGINT):bigint>
-- !query 4 output
1 1 1
1 2 2
-97,7 +97,7 struct<>
-- !query 6
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY course, year
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY udf(course), year
-- !query 6 schema
struct<course:string,year:int,sum(earnings):bigint>
-- !query 6 output
-111,7 +111,7 dotNET 2013 48000
-- !query 7
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, year
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, udf(year)
-- !query 7 schema
struct<course:string,year:int,sum(earnings):bigint>
-- !query 7 output
-127,9 +127,9 dotNET 2013 48000
-- !query 8
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year)
+SELECT course, udf(year), SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year)
-- !query 8 schema
-struct<course:string,year:int,sum(earnings):bigint>
+struct<course:string,CAST(udf(cast(year as string)) AS INT):int,sum(earnings):bigint>
-- !query 8 output
Java NULL 50000
NULL 2012 35000
-138,26 +138,26 dotNET NULL 63000
-- !query 9
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course)
+SELECT course, year, udf(SUM(earnings)) FROM courseSales GROUP BY course, year GROUPING SETS(course)
-- !query 9 schema
-struct<course:string,year:int,sum(earnings):bigint>
+struct<course:string,year:int,CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint>
-- !query 9 output
Java NULL 50000
dotNET NULL 63000
-- !query 10
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year)
+SELECT udf(course), year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year)
-- !query 10 schema
-struct<course:string,year:int,sum(earnings):bigint>
+struct<CAST(udf(cast(course as string)) AS STRING):string,year:int,sum(earnings):bigint>
-- !query 10 output
NULL 2012 35000
NULL 2013 78000
-- !query 11
-SELECT course, SUM(earnings) AS sum FROM courseSales
-GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum
+SELECT course, udf(SUM(earnings)) AS sum FROM courseSales
+GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, udf(sum)
-- !query 11 schema
struct<course:string,sum:bigint>
-- !query 11 output
-173,7 +173,7 dotNET 63000
-- !query 12
SELECT course, SUM(earnings) AS sum, GROUPING_ID(course, earnings) FROM courseSales
-GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum
+GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY udf(course), sum
-- !query 12 schema
struct<course:string,sum:bigint,grouping_id(course, earnings):int>
-- !query 12 output
-188,10 +188,10 dotNET 63000 1
-- !query 13
-SELECT course, year, GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales
+SELECT udf(course), udf(year), GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales
GROUP BY CUBE(course, year)
-- !query 13 schema
-struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int>
+struct<CAST(udf(cast(course as string)) AS STRING):string,CAST(udf(cast(year as string)) AS INT):int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int>
-- !query 13 output
Java 2012 0 0 0
Java 2013 0 0 0
-205,7 +205,7 dotNET NULL 0 1 1
-- !query 14
-SELECT course, year, GROUPING(course) FROM courseSales GROUP BY course, year
+SELECT course, udf(year), GROUPING(course) FROM courseSales GROUP BY course, udf(year)
-- !query 14 schema
struct<>
-- !query 14 output
-214,7 +214,7 grouping() can only be used with GroupingSets/Cube/Rollup;
-- !query 15
-SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY course, year
+SELECT course, udf(year), GROUPING_ID(course, year) FROM courseSales GROUP BY udf(course), year
-- !query 15 schema
struct<>
-- !query 15 output
-223,7 +223,7 grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 16
-SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year
+SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, udf(year)
-- !query 16 schema
struct<course:string,year:int,grouping__id:int>
-- !query 16 output
-240,7 +240,7 NULL NULL 3
-- !query 17
SELECT course, year FROM courseSales GROUP BY CUBE(course, year)
-HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, year
+HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, udf(year)
-- !query 17 schema
struct<course:string,year:int>
-- !query 17 output
-250,7 +250,7 dotNET NULL
-- !query 18
-SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING(course) > 0
+SELECT course, udf(year) FROM courseSales GROUP BY udf(course), year HAVING GROUPING(course) > 0
-- !query 18 schema
struct<>
-- !query 18 output
-259,7 +259,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 19
-SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0
+SELECT course, udf(udf(year)) FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0
-- !query 19 schema
struct<>
-- !query 19 output
-268,9 +268,9 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 20
-SELECT course, year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0
+SELECT udf(course), year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0
-- !query 20 schema
-struct<course:string,year:int>
+struct<CAST(udf(cast(course as string)) AS STRING):string,year:int>
-- !query 20 output
Java NULL
NULL 2012
-281,7 +281,7 dotNET NULL
-- !query 21
SELECT course, year, GROUPING(course), GROUPING(year) FROM courseSales GROUP BY CUBE(course, year)
-ORDER BY GROUPING(course), GROUPING(year), course, year
+ORDER BY GROUPING(course), GROUPING(year), course, udf(year)
-- !query 21 schema
struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint>
-- !query 21 output
-298,7 +298,7 NULL NULL 1 1
-- !query 22
SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY CUBE(course, year)
-ORDER BY GROUPING(course), GROUPING(year), course, year
+ORDER BY GROUPING(course), GROUPING(year), course, udf(year)
-- !query 22 schema
struct<course:string,year:int,grouping_id(course, year):int>
-- !query 22 output
-314,7 +314,7 NULL NULL 3
-- !query 23
-SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING(course)
+SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING(course)
-- !query 23 schema
struct<>
-- !query 23 output
-323,7 +323,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 24
-SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING_ID(course)
+SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING_ID(course)
-- !query 24 schema
struct<>
-- !query 24 output
-332,7 +332,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 25
-SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year
+SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, udf(course), year
-- !query 25 schema
struct<course:string,year:int>
-- !query 25 output
-348,7 +348,7 NULL NULL
-- !query 26
-SELECT a + b AS k1, b AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2)
+SELECT udf(a + b) AS k1, udf(b) AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2)
-- !query 26 schema
struct<k1:int,k2:int,sum((a - b)):bigint>
-- !query 26 output
-368,7 +368,7 NULL NULL 3
-- !query 27
-SELECT a + b AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b)
+SELECT udf(udf(a + b)) AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b)
-- !query 27 schema
struct<k:int,b:int,sum((a - b)):bigint>
-- !query 27 output
-386,9 +386,9 NULL NULL 3
-- !query 28
-SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k)
+SELECT udf(a + b), udf(udf(b)) AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k)
-- !query 28 schema
-struct<(a + b):int,k:int,sum((a - b)):bigint>
+struct<CAST(udf(cast((a + b) as string)) AS INT):int,k:int,sum((a - b)):bigint>
-- !query 28 output
NULL 1 3
NULL 2 0
```
</p>
</details>
## How was this patch tested?
Tested as instructed in SPARK-27921.
Closes#25362 from skonto/group-analytics-followup.
Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This patch tries to keep consistency whenever UTF-8 charset is needed, as using `StandardCharsets.UTF_8` instead of using "UTF-8". If the String type is needed, `StandardCharsets.UTF_8.name()` is used.
This change also brings the benefit of getting rid of `UnsupportedEncodingException`, as we're providing `Charset` instead of `String` whenever possible.
This also changes some private Catalyst helper methods to operate on encodings as `Charset` objects rather than strings.
## How was this patch tested?
Existing unit tests.
Closes#25335 from HeartSaVioR/SPARK-28601.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
I did a post-hoc review of https://github.com/apache/spark/pull/25008 , and would like to propose some cleanups/fixes/improvements:
1. Do not track the scanTime metrics in `ColumnarToRowExec`. This metrics is specific to file scan, and doesn't make sense for a general batch-to-row operator.
2. Because of 2, we need to track scanTime when building RDDs in the file scan node.
3. use `RDD#mapPartitionsInternal` instead of `flatMap` in several places, as `mapPartitionsInternal` is created for Spark SQL and we use it in almost all the SQL operators.
4. Add `limitNotReachedCond` in `ColumnarToRowExec`. This was in the `ColumnarBatchScan` before and is critical for performance.
5. Clear the relationship between codegen stage and columnar stage. The whole-stage-codegen framework is completely row-based, so these 2 kinds of stages can NEVER overlap. When they are adjacent, it's either a `RowToColumnarExec` above `WholeStageExec`, or a `ColumnarToRowExec` above the `InputAdapter`.
6. Reuse the `ColumnarBatch` in `RowToColumnarExec`. We don't need to create a new one every time, just need to reset it.
7. Do not skip testing full scan node in `LogicalPlanTagInSparkPlanSuite`
8. Add back the removed tests in `WholeStageCodegenSuite`.
## How was this patch tested?
existing tests
Closes#25264 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This is an alternative solution of https://github.com/apache/spark/pull/24442 . It fails the query if ambiguous self join is detected, instead of trying to disambiguate it. The problem is that, it's hard to come up with a reasonable rule to disambiguate, the rule proposed by #24442 is mostly a heuristic.
### background of the self-join problem:
This is a long-standing bug and I've seen many people complaining about it in JIRA/dev list.
A typical example:
```
val df1 = …
val df2 = df1.filter(...)
df1.join(df2, df1("a") > df2("a")) // returns empty result
```
The root cause is, `Dataset.apply` is so powerful that users think it returns a column reference which can point to the column of the Dataset at anywhere. This is not true in many cases. `Dataset.apply` returns an `AttributeReference` . Different Datasets may share the same `AttributeReference`. In the example above, `df2` adds a Filter operator above the logical plan of `df1`, and the Filter operator reserves the output `AttributeReference` of its child. This means, `df1("a")` is exactly the same as `df2("a")`, and `df1("a") > df2("a")` always evaluates to false.
### The rule to detect ambiguous column reference caused by self join:
We can reuse the infra in #24442 :
1. each Dataset has a globally unique id.
2. the `AttributeReference` returned by `Dataset.apply` carries the ID and column position(e.g. 3rd column of the Dataset) via metadata.
3. the logical plan of a `Dataset` carries the ID via `TreeNodeTag`
When self-join happens, the analyzer asks the right side plan of join to re-generate output attributes with new exprIds. Based on it, a simple rule to detect ambiguous self join is:
1. find all column references (i.e. `AttributeReference`s with Dataset ID and col position) in the root node of a query plan.
2. for each column reference, traverse the query plan tree, find a sub-plan that carries Dataset ID and the ID is the same as the one in the column reference.
3. get the corresponding output attribute of the sub-plan by the col position in the column reference.
4. if the corresponding output attribute has a different exprID than the column reference, then it means this sub-plan is on the right side of a self-join and has regenerated its output attributes. This is an ambiguous self join because the column reference points to a table being self-joined.
## How was this patch tested?
existing tests and new test cases
Closes#25107 from cloud-fan/new-self-join.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
DebugExec does not implement doExecuteBroadcast and doExecuteColumnar so we can't debug broadcast or columnar related query.
One example for broadcast is here.
```
val df1 = Seq(1, 2, 3).toDF
val df2 = Seq(1, 2, 3).toDF
val joined = df1.join(df2, df1("value") === df2("value"))
joined.debug()
java.lang.UnsupportedOperationException: Debug does not implement doExecuteBroadcast
...
```
Another for columnar is here.
```
val df = Seq(1, 2, 3).toDF
df.persist
df.debug()
java.lang.IllegalStateException: Internal Error class org.apache.spark.sql.execution.debug.package$DebugExec has column support mismatch:
...
```
## How was this patch tested?
Additional test cases in DebuggingSuite.
Closes#25274 from sarutak/fix-debugexec.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
- DataFrameWriter.insertInto should match column names by position.
- Clean up test cases.
## How was this patch tested?
New tests:
- insertInto: append by position
- insertInto: overwrite partitioned table in static mode by position
- insertInto: overwrite partitioned table in dynamic mode by position
Closes#25353 from jzhuge/SPARK-28178-bypos.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This adds an interface for catalog plugins that exposes namespace operations:
* `listNamespaces`
* `namespaceExists`
* `loadNamespaceMetadata`
* `createNamespace`
* `alterNamespace`
* `dropNamespace`
## How was this patch tested?
API only. Existing tests for regressions.
Closes#24560 from rdblue/SPARK-27661-add-catalog-namespace-api.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
It's hard to know if the query needs to be sorted like [`SQLQueryTestSuite.isSorted`](2ecc39c8d3/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L375-L380)) when building a test framework for Thriftserver. So we can sort both the `outputs` and the `expectedOutputs. However, we removed leading write space in the golden result file. This can lead to inconsistent results.
This PR makes it does not remove leading write space in the golden result file. Trailing write space still needs to be removed.
## How was this patch tested?
N/A
Closes#25351 from wangyum/SPARK-28614.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Explained why "Subquery" and "Join Reorder" optimization batches should be `FixedPoint(1)`, which was introduced in SPARK-28532 and SPARK-28530.
## How was this patch tested?
Existing UTs.
Closes#25320 from yeshengm/SPARK-28530-followup.
Lead-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
See discussion on the JIRA (and dev). At heart, we find that math.log and math.pow can actually return slightly different results across platforms because of hardware optimizations. For the actual SQL log and pow functions, I propose that we should use StrictMath instead to ensure the answers are already the same. (This should have the benefit of helping tests pass on aarch64.)
Further, the atanh function (which is not part of java.lang.Math) can be implemented in a slightly different and more accurate way.
## How was this patch tested?
Existing tests (which will need to be changed).
Some manual testing locally to understand the numeric issues.
Closes#25279 from srowen/SPARK-28519.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This PR implements Spark's own GetFunctionsOperation which mitigates the differences between Spark SQL and Hive UDFs. But our implementation is different from Hive's implementation:
- Our implementation always returns results. Hive only returns results when [(null == catalogName || "".equals(catalogName)) && (null == schemaName || "".equals(schemaName))](https://github.com/apache/hive/blob/rel/release-3.1.1/service/src/java/org/apache/hive/service/cli/operation/GetFunctionsOperation.java#L101-L119).
- Our implementation pads the `REMARKS` field with the function usage - Hive returns an empty string.
- Our implementation does not support `FUNCTION_TYPE`, but Hive does.
## How was this patch tested?
unit tests
Closes#25252 from wangyum/SPARK-28510.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
When PythonUDF is used in group by, and it is also in aggregate expression, like
```
SELECT pyUDF(a + 1), COUNT(b) FROM testData GROUP BY pyUDF(a + 1)
```
It causes analysis exception in `CheckAnalysis`, like
```
org.apache.spark.sql.AnalysisException: expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function.
```
First, `CheckAnalysis` can't check semantic equality between PythonUDFs.
Second, even we make it possible, runtime exception will be thrown
```
org.apache.spark.sql.catalyst.errors.package$TreeNodeException: Binding attribute, tree: pythonUDF1#8615
...
Cause: java.lang.RuntimeException: Couldn't find pythonUDF1#8615 in [cast(pythonUDF0#8614 as int)#8617,count(b#8599)#8607L]
```
The cause is, `ExtractPythonUDFs` extracts both PythonUDFs in group by and aggregate expression. The PythonUDFs are two different aliases now in the logical aggregate. In runtime, we can't bind the resulting expression in aggregate to its grouping and aggregate attributes.
This patch proposes a rule `ExtractGroupingPythonUDFFromAggregate` to extract PythonUDFs in group by and evaluate them before aggregate. We replace the group by PythonUDF in aggregate expression with aliased result.
The query plan of query `SELECT pyUDF(a + 1), pyUDF(COUNT(b)) FROM testData GROUP BY pyUDF(a + 1)`, like
```
== Optimized Logical Plan ==
Project [CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, cast(pythonUDF0#8616 as bigint) AS CAST(pyUDF(cast(count(b) as string)) AS BIGINT)#8610L]
+- BatchEvalPython [pyUDF(cast(agg#8613L as string))], [pythonUDF0#8616]
+- Aggregate [cast(groupingPythonUDF#8614 as int)], [cast(groupingPythonUDF#8614 as int) AS CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, count(b#8599) AS agg#8613L]
+- Project [pythonUDF0#8615 AS groupingPythonUDF#8614, b#8599]
+- BatchEvalPython [pyUDF(cast((a#8598 + 1) as string))], [pythonUDF0#8615]
+- LocalRelation [a#8598, b#8599]
== Physical Plan ==
*(3) Project [CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, cast(pythonUDF0#8616 as bigint) AS CAST(pyUDF(cast(count(b) as string)) AS BIGINT)#8610L]
+- BatchEvalPython [pyUDF(cast(agg#8613L as string))], [pythonUDF0#8616]
+- *(2) HashAggregate(keys=[cast(groupingPythonUDF#8614 as int)#8617], functions=[count(b#8599)], output=[CAST(pyUDF(cast((a + 1) as string)) AS INT)#8608, agg#8613L])
+- Exchange hashpartitioning(cast(groupingPythonUDF#8614 as int)#8617, 5), true
+- *(1) HashAggregate(keys=[cast(groupingPythonUDF#8614 as int) AS cast(groupingPythonUDF#8614 as int)#8617], functions=[partial_count(b#8599)], output=[cast(groupingPythonUDF#8614 as int)#8617, count#8619L])
+- *(1) Project [pythonUDF0#8615 AS groupingPythonUDF#8614, b#8599]
+- BatchEvalPython [pyUDF(cast((a#8598 + 1) as string))], [pythonUDF0#8615]
+- LocalTableScan [a#8598, b#8599]
```
## How was this patch tested?
Added tests.
Closes#25215 from viirya/SPARK-28445.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
```sql
spark-sql> select cast(1);
19/07/26 00:54:17 ERROR SparkSQLDriver: Failed in [select cast(1)]
java.lang.UnsupportedOperationException: empty.init
at scala.collection.TraversableLike$class.init(TraversableLike.scala:451)
at scala.collection.mutable.ArrayOps$ofInt.scala$collection$IndexedSeqOptimized$$super$init(ArrayOps.scala:234)
at scala.collection.IndexedSeqOptimized$class.init(IndexedSeqOptimized.scala:135)
at scala.collection.mutable.ArrayOps$ofInt.init(ArrayOps.scala:234)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$7$$anonfun$11.apply(FunctionRegistry.scala:565)
at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$7$$anonfun$11.apply(FunctionRegistry.scala:558)
at scala.Option.getOrElse(Option.scala:121)
```
The reason is that we did not handle the case [`validParametersCount.length == 0`](2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L588)) because the [parameter types](2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L589)) can be `Expression`, `DataType` and `Option`. This PR makes it handle the case `validParametersCount.length == 0`.
## How was this patch tested?
unit tests
Closes#25261 from wangyum/SPARK-28521.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Prior to this change, in an executor, on each heartbeat, memory metrics are polled and sent in the heartbeat. The heartbeat interval is 10s by default. With this change, in an executor, memory metrics can optionally be polled in a separate poller at a shorter interval.
For each executor, we use a map of (stageId, stageAttemptId) to (count of running tasks, executor metric peaks) to track what stages are active as well as the per-stage memory metric peaks. When polling the executor memory metrics, we attribute the memory to the active stage(s), and update the peaks. In a heartbeat, we send the per-stage peaks (for stages active at that time), and then reset the peaks. The semantics would be that the per-stage peaks sent in each heartbeat are the peaks since the last heartbeat.
We also keep a map of taskId to memory metric peaks. This tracks the metric peaks during the lifetime of the task. The polling thread updates this as well. At end of a task, we send the peak metric values in the task result. In case of task failure, we send the peak metric values in the `TaskFailedReason`.
We continue to do the stage-level aggregation in the EventLoggingListener.
For the driver, we still only poll on heartbeats. What the driver sends will be the current values of the metrics in the driver at the time of the heartbeat. This is semantically the same as before.
## How was this patch tested?
Unit tests. Manually tested applications on an actual system and checked the event logs; the metrics appear in the SparkListenerTaskEnd and SparkListenerStageExecutorMetrics events.
Closes#23767 from wypoon/wypoon_SPARK-26329.
Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
## What changes were proposed in this pull request?
Logging in driver when loading single large unsplittable file via `sc.textFile` or csv/json datasouce.
Current condition triggering logging is
* only generate one partition
* file is unsplittable, possible reason is:
- compressed by unsplittable compression algo such as gzip.
- multiLine mode in csv/json datasource
- wholeText mode in text datasource
* file size exceed the config threshold `spark.io.warning.largeFileThreshold` (default value is 1GB)
## How was this patch tested?
Manually test.
Generate one gzip file exceeding 1GB,
```
base64 -b 50 /dev/urandom | head -c 2000000000 > file1.txt
cat file1.txt | gzip > file1.gz
```
then launch spark-shell,
run
```
sc.textFile("file:///path/to/file1.gz").count()
```
Will print log like:
```
WARN HadoopRDD: Loading one large unsplittable file file:/.../f1.gz with only one partition, because the file is compressed by unsplittable compression codec
```
run
```
sc.textFile("file:///path/to/file1.txt").count()
```
Will print log like:
```
WARN HadoopRDD: Loading one large file file:/.../f1.gz with only one partition, we can increase partition numbers by the `minPartitions` argument in method `sc.textFile
```
run
```
spark.read.csv("file:///path/to/file1.gz").count
```
Will print log like:
```
WARN CSVScan: Loading one large unsplittable file file:/.../f1.gz with only one partition, the reason is: the file is compressed by unsplittable compression codec
```
run
```
spark.read.option("multiLine", true).csv("file:///path/to/file1.gz").count
```
Will print log like:
```
WARN CSVScan: Loading one large unsplittable file file:/.../f1.gz with only one partition, the reason is: the csv datasource is set multiLine mode
```
JSON and Text datasource also tested with similar cases.
Please review https://spark.apache.org/contributing.html before opening a pull request.
Closes#25134 from WeichenXu123/log_gz.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
When an overflow occurs performing an arithmetic operation, we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard.
## How was this patch tested?
added UT + existing UTs (improved)
Closes#21599 from mgaido91/SPARK-24598.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR moves `replaceFunctionName(usage: String, functionName: String)`
from `DescribeFunctionCommand` to `ExpressionInfo` in order to make `ExpressionInfo` returns actual name instead of placeholder. We can get `ExpressionInfo`s directly through `SessionCatalog.lookupFunctionInfo` API and get the real names.
## How was this patch tested?
unit tests
Closes#25314 from wangyum/SPARK-28581.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR aims to support ANSI SQL `Boolean-Predicate` syntax.
```sql
expression IS [NOT] TRUE
expression IS [NOT] FALSE
expression IS [NOT] UNKNOWN
```
There are some mainstream database support this syntax.
- **PostgreSQL:** https://www.postgresql.org/docs/9.1/functions-comparison.html
- **Hive:** https://issues.apache.org/jira/browse/HIVE-13583
- **Redshift:** https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
- **Vertica:** https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/Boolean-predicate.htm
For example:
```sql
spark-sql> select null is true, null is not true;
false true
spark-sql> select false is true, false is not true;
false true
spark-sql> select true is true, true is not true;
true false
spark-sql> select null is false, null is not false;
false true
spark-sql> select false is false, false is not false;
true false
spark-sql> select true is false, true is not false;
false true
spark-sql> select null is unknown, null is not unknown;
true false
spark-sql> select false is unknown, false is not unknown;
false true
spark-sql> select true is unknown, true is not unknown;
false true
```
**Note**: A null input is treated as the logical value "unknown".
## How was this patch tested?
Pass the Jenkins with the newly added test cases.
Closes#25074 from beliefer/ansi-sql-boolean-test.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR makes the optimizer rule PullupCorrelatedPredicates idempotent.
## How was this patch tested?
A new test PullupCorrelatedPredicatesSuite
Closes#25268 from dilipbiswal/pr-25164.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
Support multiple catalogs in the following InsertInto use cases:
- DataFrameWriter.insertInto("catalog.db.tbl")
Support matrix:
SaveMode|Partitioned Table|Partition Overwrite Mode|Action
--------|-----------------|------------------------|------
Append|*|*|AppendData
Overwrite|no|*|OverwriteByExpression(true)
Overwrite|yes|STATIC|OverwriteByExpression(true)
Overwrite|yes|DYNAMIC|OverwritePartitionsDynamic
## How was this patch tested?
New tests.
All existing catalyst and sql/core tests.
Closes#24980 from jzhuge/SPARK-28178-pr.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Right now `Error` is not sent to `QueryExecutionListener.onFailure`. If there is any `Error` (such as `AssertionError`) when running a query, `QueryExecutionListener.onFailure` cannot be triggered.
This PR changes `onFailure` to accept a `Throwable` instead.
## How was this patch tested?
Jenkins
Closes#25292 from zsxwing/fix-QueryExecutionListener.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
New function `make_timestamp()` takes 6 columns `year`, `month`, `day`, `hour`, `min`, `sec` + optionally `timezone`, and makes new column of the `TIMESTAMP` type. If values in the input columns are `null` or out of valid ranges, the function returns `null`. Valid ranges are:
- `year` - `[1, 9999]`
- `month` - `[1, 12]`
- `day` - `[1, 31]`
- `hour` - `[0, 23]`
- `min` - `[0, 59]`
- `sec` - `[0, 60]`. If the `sec` argument equals to 60, the seconds field is set to 0 and 1 minute is added to the final timestamp.
- `timezone` - an identifier of timezone. Actual database of timezones can be found there: https://www.iana.org/time-zones.
Also constructed timestamp must be valid otherwise `make_timestamp` returns `null`.
The function is implemented similarly to `make_timestamp` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html to maintain feature parity with it.
Here is an example:
```sql
select make_timestamp(2014, 12, 28, 6, 30, 45.887);
2014-12-28 06:30:45.887
select make_timestamp(2014, 12, 28, 6, 30, 45.887, 'CET');
2014-12-28 10:30:45.887
select make_timestamp(2019, 6, 30, 23, 59, 60)
2019-07-01 00:00:00
```
Returned value has Spark Catalyst type `TIMESTAMP` which is similar to Oracle's `TIMESTAMP WITH LOCAL TIME ZONE` (see https://docs.oracle.com/cd/B28359_01/server.111/b28298/ch4datetime.htm#i1006169) where data is stored in the session time zone, and the time zone offset is not stored as part of the column data. When users retrieve the data, Spark returns it in the session time zone specified by the SQL config `spark.sql.session.timeZone`.
## How was this patch tested?
Added new tests to `DateExpressionsSuite`, and uncommented a test for `make_timestamp` in `pgSQL/timestamp.sql`.
Closes#25220 from MaxGekk/make_timestamp.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
These are what I found during working on #22282.
- Remove unused value: `UnsafeArraySuite#defaultTz`
- Remove redundant new modifier to the case class, `KafkaSourceRDDPartition`
- Remove unused variables from `RDD.scala`
- Remove trailing space from `structured-streaming-kafka-integration.md`
- Remove redundant parameter from `ArrowConvertersSuite`: `nullable` is `true` by default.
- Remove leading empty line: `UnsafeRow`
- Remove trailing empty line: `KafkaTestUtils`
- Remove unthrown exception type: `UnsafeMapData`
- Replace unused declarations: `expressions`
- Remove duplicated default parameter: `AnalysisErrorSuite`
- `ObjectExpressionsSuite`: remove duplicated parameters, conversions and unused variable
Closes#25251 from dongjinleekr/cleanup/201907.
Authored-by: Lee Dongjin <dongjin@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR aims to add a SQL function alias `random` to the existing `rand` function.
Please note that this adds the alias to SQL layer only because this is for PostgreSQL feature parity.
- [PostgreSQL Random function](https://www.postgresql.org/docs/11/functions-math.html)
- [SPARK-23160 Port window.sql](https://github.com/apache/spark/pull/24881/files#diff-14489bae6b27814d4cde0456a7ae75c8R702)
- [SPARK-28406 Port union.sql](https://github.com/apache/spark/pull/25163/files#diff-23a3430e0e1ff88830cbb43701da1f2cR402)
## How was this patch tested?
Manual.
```sql
spark-sql> DESCRIBE FUNCTION random;
Function: random
Class: org.apache.spark.sql.catalyst.expressions.Rand
Usage: random([seed]) - Returns a random value with independent and identically distributed (i.i.d.) uniformly distributed values in [0, 1).
```
Closes#25282 from dongjoon-hyun/SPARK-28086.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the PR, I propose to use `uuuu` for years instead of `yyyy` in date/timestamp patterns without the era pattern `G` (https://docs.oracle.com/javase/8/docs/api/java/time/format/DateTimeFormatter.html). **Parsing/formatting of positive years (current era) will be the same.** The difference is in formatting negative years belong to previous era - BC (Before Christ).
I replaced the `yyyy` pattern by `uuuu` everywhere except:
1. Test, Suite & Benchmark. Existing tests must work as is.
2. `SimpleDateFormat` because it doesn't support the `uuuu` pattern.
3. Comments and examples (except comments related to already replaced patterns).
Before the changes, the year of common era `100` and the year of BC era `-99`, showed similarly as `100`. After the changes negative years will be formatted with the `-` sign.
Before:
```Scala
scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show
+----------+
| value|
+----------+
|0100-01-01|
+----------+
```
After:
```Scala
scala> Seq(java.time.LocalDate.of(-99, 1, 1)).toDF().show
+-----------+
| value|
+-----------+
|-0099-01-01|
+-----------+
```
## How was this patch tested?
By existing test suites, and added tests for negative years to `DateFormatterSuite` and `TimestampFormatterSuite`.
Closes#25230 from MaxGekk/year-pattern-uuuu.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Code is not generated for LocalTableScanExec although proper situations.
If a LocalTableScanExec plan has the direct parent plan which supports WholeStageCodegen,
the LocalTableScanExec plan also should be within a WholeStageCodegen domain.
But code is not generated for LocalTableScanExec and InputAdapter is inserted for now.
```
val df1 = spark.createDataset(1 to 10).toDF
val df2 = spark.createDataset(1 to 10).toDF
val df3 = df1.join(df2, df1("value") === df2("value"))
df3.explain(true)
...
== Physical Plan ==
*(1) BroadcastHashJoin [value#1], [value#6], Inner, BuildRight
:- LocalTableScan [value#1] // LocalTableScanExec is not within a WholeStageCodegen domain
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- LocalTableScan [value#6]
```
```
scala> df3.queryExecution.executedPlan.children.head.children.head.getClass
res4: Class[_ <: org.apache.spark.sql.execution.SparkPlan] = class org.apache.spark.sql.execution.InputAdapter
```
For the current implementation of LocalTableScanExec, codegen is enabled in case `parent` is not null
but `parent` is set in `consume`, which is called after `insertInputAdapter` so it doesn't work as intended.
After applying this cnahge, we can get following plan, which means LocalTableScanExec is within a WholeStageCodegen domain.
```
== Physical Plan ==
*(1) BroadcastHashJoin [value#63], [value#68], Inner, BuildRight
:- *(1) LocalTableScan [value#63]
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint)))
+- LocalTableScan [value#68]
## How was this patch tested?
New test cases are added into WholeStageCodegenSuite.
Closes#25260 from sarutak/localtablescan-improvement.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
## What changes were proposed in this pull request?
Fix for ```SPARK-28441 (PythonUDF used in correlated scalar subquery causes UnsupportedOperationException)``` is in. Re-enable the commented out test for ```udf(max(udf(column))) ```
## How was this patch tested?
use existing test ```udf-except.sql```
Closes#25278 from huaxingao/spark-28277n.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
In case of CatalogFileIndex datasource table, sizeInBytes is always coming as default size in bytes, which is 8.0EB (Even when the user give fallBackToHdfsForStatsEnabled=true) . So, the datasource table which has CatalogFileIndex, always prefer SortMergeJoin, instead of BroadcastJoin, even though the size is below broadcast join threshold.
In this PR, In case of CatalogFileIndex table, if we enable "fallBackToHdfsForStatsEnabled=true", then the computeStatistics get the sizeInBytes from the hdfs and we get the actual size of the table. Hence, during join operation, when the table size is below broadcast threshold, it will prefer broadCastHashJoin instead of SortMergeJoin.
Added UT
Closes#22502 from shahidki31/SPARK-25474.
Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
`ObjectAggregationIterator` shows a directional info message to increase `spark.sql.objectHashAggregate.sortBased.fallbackThreshold` when the size of the in-memory hash map grows too large and it falls back to sort-based aggregation.
However, we don't know how much we need to increase. This PR adds the size of the current in-memory hash map size to the log message.
**BEFORE**
```
15:21:41.669 Executor task launch worker for task 0 INFO
ObjectAggregationIterator: Aggregation hash map reaches threshold capacity (2 entries), ...
```
**AFTER**
```
15:20:05.742 Executor task launch worker for task 0 INFO
ObjectAggregationIterator: Aggregation hash map size 2 reaches threshold capacity (2 entries), ...
```
## How was this patch tested?
Manual. For example, run `ObjectHashAggregateSuite.scala`'s `typed_count fallback to sort-based aggregation` and search the above message in `target/unit-tests.log`.
Closes#25276 from dongjoon-hyun/SPARK-28545.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR add support typed `interval` expression:
```sql
spark-sql> select interval 'interval 3 year 1 hour';
interval 3 years 1 hours
spark-sql>
```
Please note that this pr did not add a cast alias for `interval` type like [other types](2d74f14d74/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala (L529-L541)) because neither PostgreSQL nor Hive supports this syntax.
## How was this patch tested?
unit tests
Closes#25241 from wangyum/SPARK-28424.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
We should add `assume(shouldTestPythonUDFs)`. Maybe it's not a biggie in general but it can matter in other venders' testing base. For instance, if somebody launches a test in a minimal docker image, it might make the tests failed suddenly.
This skipping stuff isn't completely new in our test base. See `TestUtils.testCommandAvailable` for instance.
## How was this patch tested?
Manually tested.
Closes#25272 from HyukjinKwon/SPARK-28441.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Since for AQP the cost for joins can change between multiple runs, there is no reason that we have an idempotence enforcement on this optimizer batch. We thus make it `FixedPoint(1)` instead of `Once`.
## How was this patch tested?
Existing UTs.
Closes#25266 from yeshengm/SPARK-28530.
Lead-authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In the Catalyst optimizer, the batch subquery actually calls the optimizer recursively. Therefore it makes no sense to enforce idempotence on it and we change this batch to `FixedPoint(1)`.
## How was this patch tested?
Existing UTs.
Closes#25267 from yeshengm/SPARK-28532.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In SPARK-15370, We checked the expression at the root of the correlated subquery, in order to fix count bug. If a `PythonUDF` in in the checking path, evaluating it causes the failure as we can't statically evaluate `PythonUDF`. The Python UDF test added at SPARK-28277 shows this issue.
If we can statically evaluate the expression, we intercept NULL values coming from the outer join and replace them with the value that the subquery's expression like before, if it is not, we replace them with the `PythonUDF` expression, with statically evaluated parameters.
After this, the last query in `udf-except.sql` which throws `java.lang.UnsupportedOperationException` can be run:
```
SELECT t1.k
FROM t1
WHERE t1.v <= (SELECT udf(max(udf(t2.v)))
FROM t2
WHERE udf(t2.k) = udf(t1.k))
MINUS
SELECT t1.k
FROM t1
WHERE udf(t1.v) >= (SELECT min(udf(t2.v))
FROM t2
WHERE t2.k = t1.k)
-- !query 2 schema
struct<k:string>
-- !query 2 output
two
```
Note that this issue is also for other non-foldable expressions, like rand. As like PythonUDF, we can't call `eval` on this kind of expressions in optimization. The evaluation needs to defer to query runtime.
## How was this patch tested?
Added tests.
Closes#25204 from viirya/SPARK-28441.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
How to reproduce this issue:
```shell
build/sbt clean package -Phive -Phive-thriftserver -Phadoop-3.2
export SPARK_PREPEND_CLASSES=true
sbin/start-thriftserver.sh
[rootspark-3267648 spark]# bin/beeline -u jdbc:hive2://localhost:10000/default -e "select cast(1 as decimal(38, 18));"
Connecting to jdbc:hive2://localhost:10000/default
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 2.3.5)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Error: java.lang.ClassCastException: java.math.BigDecimal incompatible with org.apache.hadoop.hive.common.type.HiveDecimal (state=,code=0)
Closing: 0: jdbc:hive2://localhost:10000/default
```
This pr fix this issue.
## How was this patch tested?
unit tests
Closes#25217 from wangyum/SPARK-28463.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In adaptive query processing (AQE), query plans are optimized on the fly during execution. However, a few `Once` rules can be problematic for such optimization since they can either generate wrong plan/unnecessary intermediate plan nodes.
This PR enforces idempotence for "Once" batches that are supposed to run once. This is a key enabler for AQE re-optimization and can improve robustness for existing optimizer rules.
Once batches that are currently not idempotent are marked in a blacklist. We will submit followup PRs to fix idempotence of these rules.
## How was this patch tested?
Existing UTs. Failing Once rules are temporarily blacklisted.
Closes#25249 from yeshengm/idempotence-checker.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR adds some tests converted from 'union.sql' to test UDFs
<details><summary>Diff comparing to 'union.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/union.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-union.sql.out
index b023df825d..84b5e10dbe 100644
--- a/sql/core/src/test/resources/sql-tests/results/union.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-union.sql.out
-19,10 +19,10 struct<>
-- !query 2
-SELECT *
-FROM (SELECT * FROM t1
+SELECT udf(c1) as c1, udf(c2) as c2
+FROM (SELECT udf(c1) as c1, udf(c2) as c2 FROM t1
UNION ALL
- SELECT * FROM t1)
+ SELECT udf(c1) as c1, udf(c2) as c2 FROM t1)
-- !query 2 schema
struct<c1:int,c2:string>
-- !query 2 output
-33,12 +33,12 struct<c1:int,c2:string>
-- !query 3
-SELECT *
-FROM (SELECT * FROM t1
+SELECT udf(c1) as c1, udf(c2) as c2
+FROM (SELECT udf(c1) as c1, udf(c2) as c2 FROM t1
UNION ALL
- SELECT * FROM t2
+ SELECT udf(c1) as c1, udf(c2) as c2 FROM t2
UNION ALL
- SELECT * FROM t2)
+ SELECT udf(c1) as c1, udf(c2) as c2 FROM t2)
-- !query 3 schema
struct<c1:decimal(11,1),c2:string>
-- !query 3 output
-51,11 +51,11 struct<c1:decimal(11,1),c2:string>
-- !query 4
-SELECT a
-FROM (SELECT 0 a, 0 b
+SELECT udf(udf(a)) as a
+FROM (SELECT udf(0) a, udf(0) b
UNION ALL
- SELECT SUM(1) a, CAST(0 AS BIGINT) b
- UNION ALL SELECT 0 a, 0 b) T
+ SELECT udf(SUM(1)) a, udf(CAST(0 AS BIGINT)) b
+ UNION ALL SELECT udf(0) a, udf(0) b) T
-- !query 4 schema
struct<a:bigint>
-- !query 4 output
-89,13 +89,13 struct<>
-- !query 8
-SELECT 1 AS x,
- col
-FROM (SELECT col AS col
- FROM (SELECT p1.col AS col
+SELECT udf(1) AS x,
+ udf(col) as col
+FROM (SELECT udf(col) AS col
+ FROM (SELECT udf(p1.col) AS col
FROM p1 CROSS JOIN p2
UNION ALL
- SELECT col
+ SELECT udf(col)
FROM p3) T1) T2
-- !query 8 schema
struct<x:int,col:int>
-105,9 +105,9 struct<x:int,col:int>
-- !query 9
-SELECT map(1, 2), 'str'
+SELECT map(1, 2), udf('str') as str
UNION ALL
-SELECT map(1, 2, 3, NULL), 1
+SELECT map(1, 2, 3, NULL), udf(1)
-- !query 9 schema
struct<map(1, 2):map<int,int>,str:string>
-- !query 9 output
-116,9 +116,9 struct<map(1, 2):map<int,int>,str:string>
-- !query 10
-SELECT array(1, 2), 'str'
+SELECT array(1, 2), udf('str') as str
UNION ALL
-SELECT array(1, 2, 3, NULL), 1
+SELECT array(1, 2, 3, NULL), udf(1)
-- !query 10 schema
struct<array(1, 2):array<int>,str:string>
-- !query 10 output
```
</p>
</details>
## How was this patch tested?
Tested as guided in SPARK-27921.
Closes#25202 from yiheng/fix_28289.
Authored-by: Yiheng Wang <yihengw@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR fixes the log messages like `attempt 0stage 9.0` by adding a comma followed by space. These are all instances in `DataWritingSparkTask` which was introduced at 6d16b9885d. This should be fixed in `branch-2.4`, too.
```
19/07/25 18:35:01 INFO DataWritingSparkTask: Commit authorized for partition 65 (task 153, attempt 0stage 9.0)
19/07/25 18:35:01 INFO DataWritingSparkTask: Committed partition 65 (task 153, attempt 0stage 9.0)
```
## How was this patch tested?
This only changes log messages. Pass the Jenkins with the existing tests.
Closes#25257 from dongjoon-hyun/DataWritingSparkTask.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Support multiple catalogs in the following InsertTable use cases:
- INSERT INTO [TABLE] catalog.db.tbl
- INSERT OVERWRITE TABLE catalog.db.tbl
Support matrix:
Overwrite|Partitioned Table|Partition Clause |Partition Overwrite Mode|Action
---------|-----------------|-----------------|------------------------|-----
false|*|*|*|AppendData
true|no|(empty)|*|OverwriteByExpression(true)
true|yes|p1,p2 or p1 or p2 or (empty)|STATIC|OverwriteByExpression(true)
true|yes|p2,p2 or p1 or p2 or (empty)|DYNAMIC|OverwritePartitionsDynamic
true|yes|p1=23,p2=3|*|OverwriteByExpression(p1=23 and p2=3)
true|yes|p1=23,p2 or p1=23|STATIC|OverwriteByExpression(p1=23)
true|yes|p1=23,p2 or p1=23|DYNAMIC|OverwritePartitionsDynamic
Notes:
- Assume the partitioned table has 2 partitions: p1 and p2.
- `STATIC` is the default Partition Overwrite Mode for data source tables.
- DSv2 tables currently do not support `IfPartitionNotExists`.
## How was this patch tested?
New tests.
All existing catalyst and sql/core tests.
Closes#24832 from jzhuge/SPARK-27845-pr.
Lead-authored-by: Ryan Blue <blue@apache.org>
Co-authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
This PR adds some tests converted from window.sql to test UDFs. Please see the contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'xxx.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/window.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out
index 367dc4f513..9354d5e311 100644
--- a/sql/core/src/test/resources/sql-tests/results/window.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-window.sql.out
-21,10 +21,10 struct<>
-- !query 1
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val ROWS CURRENT ROW) FROM testData
-ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY cate ORDER BY udf(val) ROWS CURRENT ROW) FROM testData
+ORDER BY cate, udf(val)
-- !query 1 schema
-struct<val:int,cate:string,count(val) OVER (PARTITION BY cate ORDER BY val ASC NULLS FIRST ROWS BETWEEN CURRENT ROW AND CURRENT ROW):bigint>
+struct<CAST(udf(cast(val as string)) AS INT):int,cate:string,count(val) OVER (PARTITION BY cate ORDER BY CAST(udf(cast(val as string)) AS INT) ASC NULLS FIRST ROWS BETWEEN CURRENT ROW AND CURRENT ROW):bigint>
-- !query 1 output
NULL NULL 0
3 NULL 1
-38,10 +38,10 NULL a 0
-- !query 2
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
-ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, sum(val) OVER(PARTITION BY cate ORDER BY udf(val)
+ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING) FROM testData ORDER BY cate, udf(val)
-- !query 2 schema
-struct<val:int,cate:string,sum(val) OVER (PARTITION BY cate ORDER BY val ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING):bigint>
+struct<CAST(udf(cast(val as string)) AS INT):int,cate:string,sum(val) OVER (PARTITION BY cate ORDER BY CAST(udf(cast(val as string)) AS INT) ASC NULLS FIRST ROWS BETWEEN UNBOUNDED PRECEDING AND 1 FOLLOWING):bigint>
-- !query 2 output
NULL NULL 3
3 NULL 3
-55,20 +55,20 NULL a 1
-- !query 3
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
-ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY cate, val_long
+SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY cate ORDER BY udf(val_long)
+ROWS BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY udf(cate), val_long
-- !query 3 schema
struct<>
-- !query 3 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to data type mismatch: The data type of the upper bound 'bigint' does not match the expected data type 'int'.; line 1 pos 41
+cannot resolve 'ROWS BETWEEN CURRENT ROW AND 2147483648L FOLLOWING' due to data type mismatch: The data type of the upper bound 'bigint' does not match the expected data type 'int'.; line 1 pos 46
-- !query 4
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val RANGE 1 PRECEDING) FROM testData
-ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY udf(cate) ORDER BY val RANGE 1 PRECEDING) FROM testData
+ORDER BY cate, udf(val)
-- !query 4 schema
-struct<val:int,cate:string,count(val) OVER (PARTITION BY cate ORDER BY val ASC NULLS FIRST RANGE BETWEEN 1 PRECEDING AND CURRENT ROW):bigint>
+struct<CAST(udf(cast(val as string)) AS INT):int,cate:string,count(val) OVER (PARTITION BY CAST(udf(cast(cate as string)) AS STRING) ORDER BY val ASC NULLS FIRST RANGE BETWEEN 1 PRECEDING AND CURRENT ROW):bigint>
-- !query 4 output
NULL NULL 0
3 NULL 1
-82,10 +82,10 NULL a 0
-- !query 5
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val
-RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
+SELECT val, udf(cate), sum(val) OVER(PARTITION BY udf(cate) ORDER BY val
+RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY udf(cate), val
-- !query 5 schema
-struct<val:int,cate:string,sum(val) OVER (PARTITION BY cate ORDER BY val ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING):bigint>
+struct<val:int,CAST(udf(cast(cate as string)) AS STRING):string,sum(val) OVER (PARTITION BY CAST(udf(cast(cate as string)) AS STRING) ORDER BY val ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING):bigint>
-- !query 5 output
NULL NULL NULL
3 NULL 3
-99,10 +99,10 NULL a NULL
-- !query 6
-SELECT val_long, cate, sum(val_long) OVER(PARTITION BY cate ORDER BY val_long
-RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY cate, val_long
+SELECT val_long, udf(cate), sum(val_long) OVER(PARTITION BY udf(cate) ORDER BY val_long
+RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING) FROM testData ORDER BY udf(cate), val_long
-- !query 6 schema
-struct<val_long:bigint,cate:string,sum(val_long) OVER (PARTITION BY cate ORDER BY val_long ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING):bigint>
+struct<val_long:bigint,CAST(udf(cast(cate as string)) AS STRING):string,sum(val_long) OVER (PARTITION BY CAST(udf(cast(cate as string)) AS STRING) ORDER BY val_long ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 2147483648 FOLLOWING):bigint>
-- !query 6 output
NULL NULL NULL
1 NULL 1
-116,10 +116,10 NULL b NULL
-- !query 7
-SELECT val_double, cate, sum(val_double) OVER(PARTITION BY cate ORDER BY val_double
-RANGE BETWEEN CURRENT ROW AND 2.5 FOLLOWING) FROM testData ORDER BY cate, val_double
+SELECT val_double, udf(cate), sum(val_double) OVER(PARTITION BY udf(cate) ORDER BY val_double
+RANGE BETWEEN CURRENT ROW AND 2.5 FOLLOWING) FROM testData ORDER BY udf(cate), val_double
-- !query 7 schema
-struct<val_double:double,cate:string,sum(val_double) OVER (PARTITION BY cate ORDER BY val_double ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND CAST(2.5 AS DOUBLE) FOLLOWING):double>
+struct<val_double:double,CAST(udf(cast(cate as string)) AS STRING):string,sum(val_double) OVER (PARTITION BY CAST(udf(cast(cate as string)) AS STRING) ORDER BY val_double ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND CAST(2.5 AS DOUBLE) FOLLOWING):double>
-- !query 7 output
NULL NULL NULL
1.0 NULL 1.0
-133,10 +133,10 NULL NULL NULL
-- !query 8
-SELECT val_date, cate, max(val_date) OVER(PARTITION BY cate ORDER BY val_date
-RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING) FROM testData ORDER BY cate, val_date
+SELECT val_date, udf(cate), max(val_date) OVER(PARTITION BY udf(cate) ORDER BY val_date
+RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING) FROM testData ORDER BY udf(cate), val_date
-- !query 8 schema
-struct<val_date:date,cate:string,max(val_date) OVER (PARTITION BY cate ORDER BY val_date ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING):date>
+struct<val_date:date,CAST(udf(cast(cate as string)) AS STRING):string,max(val_date) OVER (PARTITION BY CAST(udf(cast(cate as string)) AS STRING) ORDER BY val_date ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 2 FOLLOWING):date>
-- !query 8 output
NULL NULL NULL
2017-08-01 NULL 2017-08-01
-150,11 +150,11 NULL NULL NULL
-- !query 9
-SELECT val_timestamp, cate, avg(val_timestamp) OVER(PARTITION BY cate ORDER BY val_timestamp
+SELECT val_timestamp, udf(cate), avg(val_timestamp) OVER(PARTITION BY udf(cate) ORDER BY val_timestamp
RANGE BETWEEN CURRENT ROW AND interval 23 days 4 hours FOLLOWING) FROM testData
-ORDER BY cate, val_timestamp
+ORDER BY udf(cate), val_timestamp
-- !query 9 schema
-struct<val_timestamp:timestamp,cate:string,avg(CAST(val_timestamp AS DOUBLE)) OVER (PARTITION BY cate ORDER BY val_timestamp ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND interval 3 weeks 2 days 4 hours FOLLOWING):double>
+struct<val_timestamp:timestamp,CAST(udf(cast(cate as string)) AS STRING):string,avg(CAST(val_timestamp AS DOUBLE)) OVER (PARTITION BY CAST(udf(cast(cate as string)) AS STRING) ORDER BY val_timestamp ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND interval 3 weeks 2 days 4 hours FOLLOWING):double>
-- !query 9 output
NULL NULL NULL
2017-07-31 17:00:00 NULL 1.5015456E9
-168,10 +168,10 NULL NULL NULL
-- !query 10
-SELECT val, cate, sum(val) OVER(PARTITION BY cate ORDER BY val DESC
+SELECT val, udf(cate), sum(val) OVER(PARTITION BY cate ORDER BY val DESC
RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
-- !query 10 schema
-struct<val:int,cate:string,sum(val) OVER (PARTITION BY cate ORDER BY val DESC NULLS LAST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING):bigint>
+struct<val:int,CAST(udf(cast(cate as string)) AS STRING):string,sum(val) OVER (PARTITION BY cate ORDER BY val DESC NULLS LAST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING):bigint>
-- !query 10 output
NULL NULL NULL
3 NULL 3
-185,58 +185,58 NULL a NULL
-- !query 11
-SELECT val, cate, count(val) OVER(PARTITION BY cate
-ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY udf(cate)
+ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING) FROM testData ORDER BY cate, udf(val)
-- !query 11 schema
struct<>
-- !query 11 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING' due to data type mismatch: Window frame upper bound '1' does not follow the lower bound 'unboundedfollowing$()'.; line 1 pos 33
+cannot resolve 'ROWS BETWEEN UNBOUNDED FOLLOWING AND 1 FOLLOWING' due to data type mismatch: Window frame upper bound '1' does not follow the lower bound 'unboundedfollowing$()'.; line 1 pos 38
-- !query 12
-SELECT val, cate, count(val) OVER(PARTITION BY cate
-RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY udf(cate)
+RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, udf(val)
-- !query 12 schema
struct<>
-- !query 12 output
org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: A range window frame cannot be used in an unordered window specification.; line 1 pos 33
+cannot resolve '(PARTITION BY CAST(udf(cast(cate as string)) AS STRING) RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: A range window frame cannot be used in an unordered window specification.; line 1 pos 38
-- !query 13
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val, cate
-RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY udf(cate) ORDER BY udf(val), cate
+RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, udf(val)
-- !query 13 schema
struct<>
-- !query 13 output
org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` ORDER BY testdata.`val` ASC NULLS FIRST, testdata.`cate` ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: A range window frame with value boundaries cannot be used in a window specification with multiple order by expressions: val#x ASC NULLS FIRST,cate#x ASC NULLS FIRST; line 1 pos 33
+cannot resolve '(PARTITION BY CAST(udf(cast(cate as string)) AS STRING) ORDER BY CAST(udf(cast(val as string)) AS INT) ASC NULLS FIRST, testdata.`cate` ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: A range window frame with value boundaries cannot be used in a window specification with multiple order by expressions: cast(udf(cast(val#x as string)) as int) ASC NULLS FIRST,cate#x ASC NULLS FIRST; line 1 pos 38
-- !query 14
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY current_timestamp
-RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY udf(cate) ORDER BY current_timestamp
+RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING) FROM testData ORDER BY cate, udf(val)
-- !query 14 schema
struct<>
-- !query 14 output
org.apache.spark.sql.AnalysisException
-cannot resolve '(PARTITION BY testdata.`cate` ORDER BY current_timestamp() ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: The data type 'timestamp' used in the order specification does not match the data type 'int' which is used in the range frame.; line 1 pos 33
+cannot resolve '(PARTITION BY CAST(udf(cast(cate as string)) AS STRING) ORDER BY current_timestamp() ASC NULLS FIRST RANGE BETWEEN CURRENT ROW AND 1 FOLLOWING)' due to data type mismatch: The data type 'timestamp' used in the order specification does not match the data type 'int' which is used in the range frame.; line 1 pos 38
-- !query 15
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val
-RANGE BETWEEN 1 FOLLOWING AND 1 PRECEDING) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY udf(cate) ORDER BY val
+RANGE BETWEEN 1 FOLLOWING AND 1 PRECEDING) FROM testData ORDER BY udf(cate), val
-- !query 15 schema
struct<>
-- !query 15 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'RANGE BETWEEN 1 FOLLOWING AND 1 PRECEDING' due to data type mismatch: The lower bound of a window frame must be less than or equal to the upper bound; line 1 pos 33
+cannot resolve 'RANGE BETWEEN 1 FOLLOWING AND 1 PRECEDING' due to data type mismatch: The lower bound of a window frame must be less than or equal to the upper bound; line 1 pos 38
-- !query 16
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val
-RANGE BETWEEN CURRENT ROW AND current_date PRECEDING) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY udf(cate) ORDER BY udf(val)
+RANGE BETWEEN CURRENT ROW AND current_date PRECEDING) FROM testData ORDER BY cate, val(val)
-- !query 16 schema
struct<>
-- !query 16 output
-245,48 +245,48 org.apache.spark.sql.catalyst.parser.ParseException
Frame bound value must be a literal.(line 2, pos 30)
== SQL ==
-SELECT val, cate, count(val) OVER(PARTITION BY cate ORDER BY val
-RANGE BETWEEN CURRENT ROW AND current_date PRECEDING) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, count(val) OVER(PARTITION BY udf(cate) ORDER BY udf(val)
+RANGE BETWEEN CURRENT ROW AND current_date PRECEDING) FROM testData ORDER BY cate, val(val)
------------------------------^^^
-- !query 17
-SELECT val, cate,
-max(val) OVER w AS max,
-min(val) OVER w AS min,
-min(val) OVER w AS min,
-count(val) OVER w AS count,
-sum(val) OVER w AS sum,
-avg(val) OVER w AS avg,
-stddev(val) OVER w AS stddev,
-first_value(val) OVER w AS first_value,
-first_value(val, true) OVER w AS first_value_ignore_null,
-first_value(val, false) OVER w AS first_value_contain_null,
-last_value(val) OVER w AS last_value,
-last_value(val, true) OVER w AS last_value_ignore_null,
-last_value(val, false) OVER w AS last_value_contain_null,
+SELECT udf(val), cate,
+max(udf(val)) OVER w AS max,
+min(udf(val)) OVER w AS min,
+min(udf(val)) OVER w AS min,
+count(udf(val)) OVER w AS count,
+sum(udf(val)) OVER w AS sum,
+avg(udf(val)) OVER w AS avg,
+stddev(udf(val)) OVER w AS stddev,
+first_value(udf(val)) OVER w AS first_value,
+first_value(udf(val), true) OVER w AS first_value_ignore_null,
+first_value(udf(val), false) OVER w AS first_value_contain_null,
+last_value(udf(val)) OVER w AS last_value,
+last_value(udf(val), true) OVER w AS last_value_ignore_null,
+last_value(udf(val), false) OVER w AS last_value_contain_null,
rank() OVER w AS rank,
dense_rank() OVER w AS dense_rank,
cume_dist() OVER w AS cume_dist,
percent_rank() OVER w AS percent_rank,
ntile(2) OVER w AS ntile,
row_number() OVER w AS row_number,
-var_pop(val) OVER w AS var_pop,
-var_samp(val) OVER w AS var_samp,
-approx_count_distinct(val) OVER w AS approx_count_distinct,
-covar_pop(val, val_long) OVER w AS covar_pop,
-corr(val, val_long) OVER w AS corr,
-stddev_samp(val) OVER w AS stddev_samp,
-stddev_pop(val) OVER w AS stddev_pop,
-collect_list(val) OVER w AS collect_list,
-collect_set(val) OVER w AS collect_set,
-skewness(val_double) OVER w AS skewness,
-kurtosis(val_double) OVER w AS kurtosis
+var_pop(udf(val)) OVER w AS var_pop,
+var_samp(udf(val)) OVER w AS var_samp,
+approx_count_distinct(udf(val)) OVER w AS approx_count_distinct,
+covar_pop(udf(val), udf(val_long)) OVER w AS covar_pop,
+corr(udf(val), udf(val_long)) OVER w AS corr,
+stddev_samp(udf(val)) OVER w AS stddev_samp,
+stddev_pop(udf(val)) OVER w AS stddev_pop,
+collect_list(udf(val)) OVER w AS collect_list,
+collect_set(udf(val)) OVER w AS collect_set,
+skewness(udf(val_double)) OVER w AS skewness,
+kurtosis(udf(val_double)) OVER w AS kurtosis
FROM testData
-WINDOW w AS (PARTITION BY cate ORDER BY val)
-ORDER BY cate, val
+WINDOW w AS (PARTITION BY udf(cate) ORDER BY udf(val))
+ORDER BY cate, udf(val)
-- !query 17 schema
-struct<val:int,cate:string,max:int,min:int,min:int,count:bigint,sum:bigint,avg:double,stddev:double,first_value:int,first_value_ignore_null:int,first_value_contain_null:int,last_value:int,last_value_ignore_null:int,last_value_contain_null:int,rank:int,dense_rank:int,cume_dist:double,percent_rank:double,ntile:int,row_number:int,var_pop:double,var_samp:double,approx_count_distinct:bigint,covar_pop:double,corr:double,stddev_samp:double,stddev_pop:double,collect_list:array<int>,collect_set:array<int>,skewness:double,kurtosis:double>
+struct<CAST(udf(cast(val as string)) AS INT):int,cate:string,max:int,min:int,min:int,count:bigint,sum:bigint,avg:double,stddev:double,first_value:int,first_value_ignore_null:int,first_value_contain_null:int,last_value:int,last_value_ignore_null:int,last_value_contain_null:int,rank:int,dense_rank:int,cume_dist:double,percent_rank:double,ntile:int,row_number:int,var_pop:double,var_samp:double,approx_count_distinct:bigint,covar_pop:double,corr:double,stddev_samp:double,stddev_pop:double,collect_list:array<int>,collect_set:array<int>,skewness:double,kurtosis:double>
-- !query 17 output
NULL NULL NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL NULL NULL 1 1 0.5 0.0 1 1 NULL NULL 0 NULL NULL
NULL NULL [] [] NULL NULL
3 NULL 3 3 3 1 3 3.0 NaN NULL 3 NULL 3 3 3 2 2 1.0 1.0 2 2 0.0 NaN 1 0.0 NaN
NaN 0.0 [3] [3] NaN NaN
-300,9 +300,9 NULL a NULL NULL NULL 0 NULL NULL NULL NULL NULL NULL NULL NULL NULL 1 1 0.25 0.
-- !query 18
-SELECT val, cate, avg(null) OVER(PARTITION BY cate ORDER BY val) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, avg(null) OVER(PARTITION BY cate ORDER BY val) FROM testData ORDER BY cate, val
-- !query 18 schema
-struct<val:int,cate:string,avg(CAST(NULL AS DOUBLE)) OVER (PARTITION BY cate ORDER BY val ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):double>
+struct<CAST(udf(cast(val as string)) AS INT):int,cate:string,avg(CAST(NULL AS DOUBLE)) OVER (PARTITION BY cate ORDER BY val ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):double>
-- !query 18 output
NULL NULL NULL
3 NULL NULL
-316,7 +316,7 NULL a NULL
-- !query 19
-SELECT val, cate, row_number() OVER(PARTITION BY cate) FROM testData ORDER BY cate, val
+SELECT udf(val), cate, row_number() OVER(PARTITION BY cate) FROM testData ORDER BY cate, udf(val)
-- !query 19 schema
struct<>
-- !query 19 output
-325,9 +325,9 Window function row_number() requires window to be ordered, please add ORDER BY
-- !query 20
-SELECT val, cate, sum(val) OVER(), avg(val) OVER() FROM testData ORDER BY cate, val
+SELECT udf(val), cate, sum(val) OVER(), avg(val) OVER() FROM testData ORDER BY cate, val
-- !query 20 schema
-struct<val:int,cate:string,sum(CAST(val AS BIGINT)) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):bigint,avg(CAST(val AS BIGINT)) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):double>
+struct<CAST(udf(cast(val as string)) AS INT):int,cate:string,sum(CAST(val AS BIGINT)) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):bigint,avg(CAST(val AS BIGINT)) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING):double>
-- !query 20 output
NULL NULL 13 1.8571428571428572
3 NULL 13 1.8571428571428572
-341,7 +341,7 NULL a 13 1.8571428571428572
-- !query 21
-SELECT val, cate,
+SELECT udf(val), cate,
first_value(false) OVER w AS first_value,
first_value(true, true) OVER w AS first_value_ignore_null,
first_value(false, false) OVER w AS first_value_contain_null,
-352,7 +352,7 FROM testData
WINDOW w AS ()
ORDER BY cate, val
-- !query 21 schema
-struct<val:int,cate:string,first_value:boolean,first_value_ignore_null:boolean,first_value_contain_null:boolean,last_value:boolean,last_value_ignore_null:boolean,last_value_contain_null:boolean>
+struct<CAST(udf(cast(val as string)) AS INT):int,cate:string,first_value:boolean,first_value_ignore_null:boolean,first_value_contain_null:boolean,last_value:boolean,last_value_ignore_null:boolean,last_value_contain_null:boolean>
-- !query 21 output
NULL NULL false true false false true false
3 NULL false true false false true false
-366,12 +366,12 NULL a false true false false true false
-- !query 22
-SELECT cate, sum(val) OVER (w)
+SELECT udf(cate), sum(val) OVER (w)
FROM testData
WHERE val is not null
WINDOW w AS (PARTITION BY cate ORDER BY val)
-- !query 22 schema
-struct<cate:string,sum(CAST(val AS BIGINT)) OVER (PARTITION BY cate ORDER BY val ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):bigint>
+struct<CAST(udf(cast(cate as string)) AS STRING):string,sum(CAST(val AS BIGINT)) OVER (PARTITION BY cate ORDER BY val ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):bigint>
-- !query 22 output
NULL 3
a 2
```
</p>
</details>
## How was this patch tested?
Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
Closes#25195 from younggyuchun/master.
Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
In the current implementation. complex types like Array/Map/StructType are allowed to upcast as StringType.
This is not safe casting. We should disallow it.
## How was this patch tested?
Update the existing test case
Closes#25242 from gengliangwang/fixUpCastStringType.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
changed the test according to steps mentioned in SPARK-27921
<details>
<summary>difference comparing to select_having.sql</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/select_having.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-select_having.sql.out
index 02536eb..f731d11 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/select_having.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-select_having.sql.out
-91,54 +91,54 struct<>
-- !query 11
-SELECT b, c FROM test_having
- GROUP BY b, c HAVING count(*) = 1 ORDER BY b, c
+SELECT udf(b), udf(c) FROM test_having
+ GROUP BY b, c HAVING udf(count(*)) = 1 ORDER BY udf(b), udf(c)
-- !query 11 schema
-struct<b:int,c:string>
+struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(c as string)) AS STRING):string>
-- !query 11 output
1 XXXX
3 bbbb
-- !query 12
-SELECT b, c FROM test_having
- GROUP BY b, c HAVING b = 3 ORDER BY b, c
+SELECT udf(b), udf(c) FROM test_having
+ GROUP BY b, c HAVING udf(b) = 3 ORDER BY udf(b), udf(c)
-- !query 12 schema
-struct<b:int,c:string>
+struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(c as string)) AS STRING):string>
-- !query 12 output
3 BBBB
3 bbbb
-- !query 13
-SELECT c, max(a) FROM test_having
- GROUP BY c HAVING count(*) > 2 OR min(a) = max(a)
+SELECT udf(c), max(udf(a)) FROM test_having
+ GROUP BY c HAVING udf(count(*)) > 2 OR udf(min(a)) = udf(max(a))
ORDER BY c
-- !query 13 schema
-struct<c:string,max(a):int>
+struct<CAST(udf(cast(c as string)) AS STRING):string,max(CAST(udf(cast(a as string)) AS INT)):int>
-- !query 13 output
XXXX 0
bbbb 5
-- !query 14
-SELECT min(a), max(a) FROM test_having HAVING min(a) = max(a)
+SELECT udf(udf(min(udf(a)))), udf(udf(max(udf(a)))) FROM test_having HAVING udf(udf(min(udf(a)))) = udf(udf(max(udf(a))))
-- !query 14 schema
-struct<min(a):int,max(a):int>
+struct<CAST(udf(cast(cast(udf(cast(min(cast(udf(cast(a as string)) as int)) as string)) as int) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(max(cast(udf(cast(a as string)) as int)) as string)) as int) as string)) AS INT):int>
-- !query 14 output
-- !query 15
-SELECT min(a), max(a) FROM test_having HAVING min(a) < max(a)
+SELECT udf(min(udf(a))), udf(udf(max(a))) FROM test_having HAVING udf(min(a)) < udf(max(udf(a)))
-- !query 15 schema
-struct<min(a):int,max(a):int>
+struct<CAST(udf(cast(min(cast(udf(cast(a as string)) as int)) as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(max(a) as string)) as int) as string)) AS INT):int>
-- !query 15 output
0 9
-- !query 16
-SELECT a FROM test_having HAVING min(a) < max(a)
+SELECT udf(a) FROM test_having HAVING udf(min(a)) < udf(max(a))
-- !query 16 schema
struct<>
-- !query 16 output
-147,16 +147,16 grouping expressions sequence is empty, and 'default.test_having.`a`' is not an
-- !query 17
-SELECT 1 AS one FROM test_having HAVING a > 1
+SELECT 1 AS one FROM test_having HAVING udf(a) > 1
-- !query 17 schema
struct<>
-- !query 17 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`a`' given input columns: [one]; line 1 pos 40
+cannot resolve '`a`' given input columns: [one]; line 1 pos 44
-- !query 18
-SELECT 1 AS one FROM test_having HAVING 1 > 2
+SELECT 1 AS one FROM test_having HAVING udf(udf(1) > udf(2))
-- !query 18 schema
struct<one:int>
-- !query 18 output
-164,7 +164,7 struct<one:int>
-- !query 19
-SELECT 1 AS one FROM test_having HAVING 1 < 2
+SELECT 1 AS one FROM test_having HAVING udf(udf(1) < udf(2))
-- !query 19 schema
struct<one:int>
-- !query 19 output
-172,7 +172,7 struct<one:int>
-- !query 20
-SELECT 1 AS one FROM test_having WHERE 1/a = 1 HAVING 1 < 2
+SELECT 1 AS one FROM test_having WHERE 1/udf(a) = 1 HAVING 1 < 2
-- !query 20 schema
struct<one:int>
-- !query 20 output
```
</p>
</details>
## How was this patch tested?
by:
```bash
sudo SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-select_having.sql"
```
Closes#25161 from shivusondur/jira28390.
Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
https://github.com/apache/spark/pull/7355 add support casting between IntervalType and StringType for scala interface:
```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.catalyst.expressions._
Cast(Literal("interval 3 month 1 hours"), CalendarIntervalType).eval()
res0: Any = interval 3 months 1 hours
```
But SQL interface does not support it:
```sql
scala> spark.sql("SELECT CAST('interval 3 month 1 hour' AS interval)").show
org.apache.spark.sql.catalyst.parser.ParseException:
DataType interval is not supported.(line 1, pos 41)
== SQL ==
SELECT CAST('interval 3 month 1 hour' AS interval)
-----------------------------------------^^^
at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitPrimitiveDataType$1(AstBuilder.scala:1931)
at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:108)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:1909)
at org.apache.spark.sql.catalyst.parser.AstBuilder.visitPrimitiveDataType(AstBuilder.scala:52)
...
```
This PR add supports accepting the `interval` keyword in the schema string. So that SQL interface can support this feature.
## How was this patch tested?
unit tests
Closes#25189 from wangyum/SPARK-28435.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR proposes to use `InheritableThreadLocal` instead of `ThreadLocal` for current epoch in `EpochTracker`. Python UDF needs threads to write out to and read it from Python processes and when there are new threads, previously set epoch is lost.
After this PR, Python UDFs can be used at Structured Streaming with the continuous mode.
## How was this patch tested?
The test cases were written on the top of https://github.com/apache/spark/pull/24945.
Unit tests were added.
Manual tests.
Closes#24946 from HyukjinKwon/SPARK-27234.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR adds some tests converted from 'pgSQL/select_implicit.sql' to test UDFs
<details><summary>Diff comparing to 'pgSQL/select_implicit.sql'</summary>
<p>
```diff
... diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/select_implicit.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-select_implicit.sql.out
index 0675820..e6a5995 100755
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/select_implicit.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-select_implicit.sql.out
-91,9 +91,11 struct<>
-- !query 11
-SELECT c, count(*) FROM test_missing_target GROUP BY test_missing_target.c ORDER BY c
+SELECT udf(c), udf(count(*)) FROM test_missing_target GROUP BY
+test_missing_target.c
+ORDER BY udf(c)
-- !query 11 schema
-struct<c:string,count(1):bigint>
+struct<CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 11 output
ABAB 2
BBBB 2
-104,9 +106,10 cccc 2
-- !query 12
-SELECT count(*) FROM test_missing_target GROUP BY test_missing_target.c ORDER BY c
+SELECT udf(count(*)) FROM test_missing_target GROUP BY test_missing_target.c
+ORDER BY udf(c)
-- !query 12 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 12 output
2
2
-117,18 +120,18 struct<count(1):bigint>
-- !query 13
-SELECT count(*) FROM test_missing_target GROUP BY a ORDER BY b
+SELECT udf(count(*)) FROM test_missing_target GROUP BY a ORDER BY udf(b)
-- !query 13 schema
struct<>
-- !query 13 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`b`' given input columns: [count(1)]; line 1 pos 61
+cannot resolve '`b`' given input columns: [CAST(udf(cast(count(1) as string)) AS BIGINT)]; line 1 pos 70
-- !query 14
-SELECT count(*) FROM test_missing_target GROUP BY b ORDER BY b
+SELECT udf(count(*)) FROM test_missing_target GROUP BY b ORDER BY udf(b)
-- !query 14 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 14 output
1
2
-137,10 +140,10 struct<count(1):bigint>
-- !query 15
-SELECT test_missing_target.b, count(*)
- FROM test_missing_target GROUP BY b ORDER BY b
+SELECT udf(test_missing_target.b), udf(count(*))
+ FROM test_missing_target GROUP BY b ORDER BY udf(b)
-- !query 15 schema
-struct<b:int,count(1):bigint>
+struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 15 output
1 1
2 2
-149,9 +152,9 struct<b:int,count(1):bigint>
-- !query 16
-SELECT c FROM test_missing_target ORDER BY a
+SELECT udf(c) FROM test_missing_target ORDER BY udf(a)
-- !query 16 schema
-struct<c:string>
+struct<CAST(udf(cast(c as string)) AS STRING):string>
-- !query 16 output
XXXX
ABAB
-166,9 +169,9 CCCC
-- !query 17
-SELECT count(*) FROM test_missing_target GROUP BY b ORDER BY b desc
+SELECT udf(count(*)) FROM test_missing_target GROUP BY b ORDER BY udf(b) desc
-- !query 17 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 17 output
4
3
-177,17 +180,17 struct<count(1):bigint>
-- !query 18
-SELECT count(*) FROM test_missing_target ORDER BY 1 desc
+SELECT udf(count(*)) FROM test_missing_target ORDER BY udf(1) desc
-- !query 18 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 18 output
10
-- !query 19
-SELECT c, count(*) FROM test_missing_target GROUP BY 1 ORDER BY 1
+SELECT udf(c), udf(count(*)) FROM test_missing_target GROUP BY 1 ORDER BY 1
-- !query 19 schema
-struct<c:string,count(1):bigint>
+struct<CAST(udf(cast(c as string)) AS STRING):string,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 19 output
ABAB 2
BBBB 2
-198,18 +201,18 cccc 2
-- !query 20
-SELECT c, count(*) FROM test_missing_target GROUP BY 3
+SELECT udf(c), udf(count(*)) FROM test_missing_target GROUP BY 3
-- !query 20 schema
struct<>
-- !query 20 output
org.apache.spark.sql.AnalysisException
-GROUP BY position 3 is not in select list (valid range is [1, 2]); line 1 pos 53
+GROUP BY position 3 is not in select list (valid range is [1, 2]); line 1 pos 63
-- !query 21
-SELECT count(*) FROM test_missing_target x, test_missing_target y
- WHERE x.a = y.a
- GROUP BY b ORDER BY b
+SELECT udf(count(*)) FROM test_missing_target x, test_missing_target y
+ WHERE udf(x.a) = udf(y.a)
+ GROUP BY b ORDER BY udf(b)
-- !query 21 schema
struct<>
-- !query 21 output
-218,10 +221,10 Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 10
-- !query 22
-SELECT a, a FROM test_missing_target
- ORDER BY a
+SELECT udf(a), udf(a) FROM test_missing_target
+ ORDER BY udf(a)
-- !query 22 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
-- !query 22 output
0 0
1 1
-236,10 +239,10 struct<a:int,a:int>
-- !query 23
-SELECT a/2, a/2 FROM test_missing_target
- ORDER BY a/2
+SELECT udf(udf(a)/2), udf(udf(a)/2) FROM test_missing_target
+ ORDER BY udf(udf(a)/2)
-- !query 23 schema
-struct<(a div 2):int,(a div 2):int>
+struct<CAST(udf(cast((cast(udf(cast(a as string)) as int) div 2) as string)) AS INT):int,CAST(udf(cast((cast(udf(cast(a as string)) as int) div 2) as string)) AS INT):int>
-- !query 23 output
0 0
0 0
-254,10 +257,10 struct<(a div 2):int,(a div 2):int>
-- !query 24
-SELECT a/2, a/2 FROM test_missing_target
- GROUP BY a/2 ORDER BY a/2
+SELECT udf(a/2), udf(a/2) FROM test_missing_target
+ GROUP BY a/2 ORDER BY udf(a/2)
-- !query 24 schema
-struct<(a div 2):int,(a div 2):int>
+struct<CAST(udf(cast((a div 2) as string)) AS INT):int,CAST(udf(cast((a div 2) as string)) AS INT):int>
-- !query 24 output
0 0
1 1
-267,11 +270,11 struct<(a div 2):int,(a div 2):int>
-- !query 25
-SELECT x.b, count(*) FROM test_missing_target x, test_missing_target y
- WHERE x.a = y.a
- GROUP BY x.b ORDER BY x.b
+SELECT udf(x.b), udf(count(*)) FROM test_missing_target x, test_missing_target y
+ WHERE udf(x.a) = udf(y.a)
+ GROUP BY x.b ORDER BY udf(x.b)
-- !query 25 schema
-struct<b:int,count(1):bigint>
+struct<CAST(udf(cast(b as string)) AS INT):int,CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 25 output
1 1
2 2
-280,11 +283,11 struct<b:int,count(1):bigint>
-- !query 26
-SELECT count(*) FROM test_missing_target x, test_missing_target y
- WHERE x.a = y.a
- GROUP BY x.b ORDER BY x.b
+SELECT udf(count(*)) FROM test_missing_target x, test_missing_target y
+ WHERE udf(x.a) = udf(y.a)
+ GROUP BY x.b ORDER BY udf(x.b)
-- !query 26 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 26 output
1
2
-293,22 +296,22 struct<count(1):bigint>
-- !query 27
-SELECT a%2, count(b) FROM test_missing_target
+SELECT a%2, udf(count(udf(b))) FROM test_missing_target
GROUP BY test_missing_target.a%2
-ORDER BY test_missing_target.a%2
+ORDER BY udf(test_missing_target.a%2)
-- !query 27 schema
-struct<(a % 2):int,count(b):bigint>
+struct<(a % 2):int,CAST(udf(cast(count(cast(udf(cast(b as string)) as int)) as string)) AS BIGINT):bigint>
-- !query 27 output
0 5
1 5
-- !query 28
-SELECT count(c) FROM test_missing_target
+SELECT udf(count(c)) FROM test_missing_target
GROUP BY lower(test_missing_target.c)
-ORDER BY lower(test_missing_target.c)
+ORDER BY udf(lower(test_missing_target.c))
-- !query 28 schema
-struct<count(c):bigint>
+struct<CAST(udf(cast(count(c) as string)) AS BIGINT):bigint>
-- !query 28 output
2
3
-317,18 +320,18 struct<count(c):bigint>
-- !query 29
-SELECT count(a) FROM test_missing_target GROUP BY a ORDER BY b
+SELECT udf(count(udf(a))) FROM test_missing_target GROUP BY a ORDER BY udf(b)
-- !query 29 schema
struct<>
-- !query 29 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`b`' given input columns: [count(a)]; line 1 pos 61
+cannot resolve '`b`' given input columns: [CAST(udf(cast(count(cast(udf(cast(a as string)) as int)) as string)) AS BIGINT)]; line 1 pos 75
-- !query 30
-SELECT count(b) FROM test_missing_target GROUP BY b/2 ORDER BY b/2
+SELECT udf(count(b)) FROM test_missing_target GROUP BY b/2 ORDER BY udf(b/2)
-- !query 30 schema
-struct<count(b):bigint>
+struct<CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 30 output
1
5
-336,10 +339,10 struct<count(b):bigint>
-- !query 31
-SELECT lower(test_missing_target.c), count(c)
- FROM test_missing_target GROUP BY lower(c) ORDER BY lower(c)
+SELECT udf(lower(test_missing_target.c)), udf(count(udf(c)))
+ FROM test_missing_target GROUP BY lower(c) ORDER BY udf(lower(c))
-- !query 31 schema
-struct<lower(c):string,count(c):bigint>
+struct<CAST(udf(cast(lower(c) as string)) AS STRING):string,CAST(udf(cast(count(cast(udf(cast(c as string)) as string)) as string)) AS BIGINT):bigint>
-- !query 31 output
abab 2
bbbb 3
-348,9 +351,9 xxxx 1
-- !query 32
-SELECT a FROM test_missing_target ORDER BY upper(d)
+SELECT udf(a) FROM test_missing_target ORDER BY udf(upper(udf(d)))
-- !query 32 schema
-struct<a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
-- !query 32 output
0
1
-365,19 +368,19 struct<a:int>
-- !query 33
-SELECT count(b) FROM test_missing_target
- GROUP BY (b + 1) / 2 ORDER BY (b + 1) / 2 desc
+SELECT udf(count(b)) FROM test_missing_target
+ GROUP BY (b + 1) / 2 ORDER BY udf((b + 1) / 2) desc
-- !query 33 schema
-struct<count(b):bigint>
+struct<CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 33 output
7
3
-- !query 34
-SELECT count(x.a) FROM test_missing_target x, test_missing_target y
- WHERE x.a = y.a
- GROUP BY b/2 ORDER BY b/2
+SELECT udf(count(udf(x.a))) FROM test_missing_target x, test_missing_target y
+ WHERE udf(x.a) = udf(y.a)
+ GROUP BY b/2 ORDER BY udf(b/2)
-- !query 34 schema
struct<>
-- !query 34 output
-386,11 +389,12 Reference 'b' is ambiguous, could be: x.b, y.b.; line 3 pos 10
-- !query 35
-SELECT x.b/2, count(x.b) FROM test_missing_target x, test_missing_target y
- WHERE x.a = y.a
- GROUP BY x.b/2 ORDER BY x.b/2
+SELECT udf(x.b/2), udf(count(udf(x.b))) FROM test_missing_target x,
+test_missing_target y
+ WHERE udf(x.a) = udf(y.a)
+ GROUP BY x.b/2 ORDER BY udf(x.b/2)
-- !query 35 schema
-struct<(b div 2):int,count(b):bigint>
+struct<CAST(udf(cast((b div 2) as string)) AS INT):int,CAST(udf(cast(count(cast(udf(cast(b as string)) as int)) as string)) AS BIGINT):bigint>
-- !query 35 output
0 1
1 5
-398,14 +402,14 struct<(b div 2):int,count(b):bigint>
-- !query 36
-SELECT count(b) FROM test_missing_target x, test_missing_target y
- WHERE x.a = y.a
+SELECT udf(count(udf(b))) FROM test_missing_target x, test_missing_target y
+ WHERE udf(x.a) = udf(y.a)
GROUP BY x.b/2
-- !query 36 schema
struct<>
-- !query 36 output
org.apache.spark.sql.AnalysisException
-Reference 'b' is ambiguous, could be: x.b, y.b.; line 1 pos 13
+Reference 'b' is ambiguous, could be: x.b, y.b.; line 1 pos 21
-- !query 37
```
</p>
</details>
## How was this patch tested?
Tested as Guided in SPARK-27921
Closes#25233 from Udbhav30/master.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Change the format of the build command in the README to start with a `./` prefix
./build/mvn -DskipTests clean package
This increases stylistic consistency across the README- all the other commands have a `./` prefix. Having a visible `./` prefix also makes it clear to the user that the shell command requires the current working directory to be at the repository root.
## How was this patch tested?
README.md was reviewed both in raw markdown and in the Github rendered landing page for stylistic consistency.
Closes#25231 from Mister-Meeseeks/master.
Lead-authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com>
Co-authored-by: Mister-Meeseeks <douglas.colkitt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The bug fixed by https://github.com/apache/spark/pull/24886 is caused by Hive's `loadDynamicPartitions`. It's better to keep the fix surgical and put it right before we call `loadDynamicPartitions`.
This also makes the fix safer, instead of analyzing all the callers of `saveAsHiveFile` and proving that they are safe.
## How was this patch tested?
N/A
Closes#25234 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
query plan was designed to be immutable, but sometimes we do allow it to carry mutable states, because of the complexity of the SQL system. One example is `TreeNodeTag`. It's a state of `TreeNode` and can be carried over during copy and transform. The adaptive execution framework relies on it to link the logical and physical plans.
This leads to a problem: when we get `QueryExecution#analyzed`, the plan can be changed unexpectedly because it's mutable. I hit a real issue in https://github.com/apache/spark/pull/25107 : I use `TreeNodeTag` to carry dataset id in logical plans. However, the analyzed plan ends up with many duplicated dataset id tags in different nodes. It turns out that, the optimizer transforms the logical plan and add the tag to more nodes.
For example, the logical plan is `SubqueryAlias(Filter(...))`, and I expect only the `SubqueryAlais` has the dataset id tag. However, the optimizer removes `SubqueryAlias` and carries over the dataset id tag to `Filter`. When I go back to the analyzed plan, both `SubqueryAlias` and `Filter` has the dataset id tag, which breaks my assumption.
Since now query plan is mutable, I think it's better to limit the life cycle of a query plan instance. We can clone the query plan between analyzer, optimizer and planner, so that the life cycle is limited in one stage.
## How was this patch tested?
new test
Closes#25111 from cloud-fan/clone.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This PR change `CalendarIntervalType`'s readable string representation from `calendarinterval` to `interval`.
## How was this patch tested?
Existing UT
Closes#25225 from wangyum/SPARK-28469.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Fix CSV datasource to throw `com.univocity.parsers.common.TextParsingException` with large size message, which will make log output consume large disk space.
This issue is troublesome when sometimes we need parse CSV with large size column.
This PR proposes to set CSV parser/writer settings by `setErrorContentLength(1000)` to limit the error message length.
## How was this patch tested?
Manually.
```
val s = "a" * 40 * 1000000
Seq(s).toDF.write.mode("overwrite").csv("/tmp/bogdan/es4196.csv")
spark.read .option("maxCharsPerColumn", 30000000) .csv("/tmp/bogdan/es4196.csv").count
```
**Before:**
The thrown message will include error content of about 30MB size (The column size exceed the max value 30MB, so the error content include the whole parsed content, so it is 30MB).
**After:**
The thrown message will include error content like "...aaa...aa" (the number of 'a' is 1024), i.e. limit the content size to be 1024.
Closes#25184 from WeichenXu123/limit_csv_exception_size.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
New function `make_date()` takes 3 columns `year`, `month` and `day`, and makes new column of the `DATE` type. If values in the input columns are `null` or out of valid ranges, the function returns `null`. Valid ranges are:
- `year` - `[1, 9999]`
- `month` - `[1, 12]`
- `day` - `[1, 31]`
Also constructed date must be valid otherwise `make_date` returns `null`.
The function is implemented similarly to `make_date` in PostgreSQL: https://www.postgresql.org/docs/11/functions-datetime.html to maintain feature parity with it.
Here is an example:
```sql
select make_date(2013, 7, 15);
2013-07-15
```
## How was this patch tested?
Added new tests to `DateExpressionsSuite`.
Closes#25210 from MaxGekk/make_date-timestamp.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR adds some tests converted from `group-by.sql` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'group-by.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
index 3a5df254f2..0118c05b1d 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-by.sql.out
-13,26 +13,26 struct<>
-- !query 1
-SELECT a, COUNT(b) FROM testData
+SELECT udf(a), udf(COUNT(b)) FROM testData
-- !query 1 schema
struct<>
-- !query 1 output
org.apache.spark.sql.AnalysisException
-grouping expressions sequence is empty, and 'testdata.`a`' is not an aggregate function. Wrap '(count(testdata.`b`) AS `count(b)`)' in windowing function(s) or wrap 'testdata.`a`' in first() (or first_value) if you don't care which value you get.;
+grouping expressions sequence is empty, and 'testdata.`a`' is not an aggregate function. Wrap '(CAST(udf(cast(count(b) as string)) AS BIGINT) AS `CAST(udf(cast(count(b) as string)) AS BIGINT)`)' in windowing function(s) or wrap 'testdata.`a`' in first() (or first_value) if you don't care which value you get.;
-- !query 2
-SELECT COUNT(a), COUNT(b) FROM testData
+SELECT COUNT(udf(a)), udf(COUNT(b)) FROM testData
-- !query 2 schema
-struct<count(a):bigint,count(b):bigint>
+struct<count(CAST(udf(cast(a as string)) AS INT)):bigint,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 2 output
7 7
-- !query 3
-SELECT a, COUNT(b) FROM testData GROUP BY a
+SELECT udf(a), COUNT(udf(b)) FROM testData GROUP BY a
-- !query 3 schema
-struct<a:int,count(b):bigint>
+struct<CAST(udf(cast(a as string)) AS INT):int,count(CAST(udf(cast(b as string)) AS INT)):bigint>
-- !query 3 output
1 2
2 2
-41,7 +41,7 NULL 1
-- !query 4
-SELECT a, COUNT(b) FROM testData GROUP BY b
+SELECT udf(a), udf(COUNT(udf(b))) FROM testData GROUP BY b
-- !query 4 schema
struct<>
-- !query 4 output
-50,9 +50,9 expression 'testdata.`a`' is neither present in the group by, nor is it an aggre
-- !query 5
-SELECT COUNT(a), COUNT(b) FROM testData GROUP BY a
+SELECT COUNT(udf(a)), COUNT(udf(b)) FROM testData GROUP BY udf(a)
-- !query 5 schema
-struct<count(a):bigint,count(b):bigint>
+struct<count(CAST(udf(cast(a as string)) AS INT)):bigint,count(CAST(udf(cast(b as string)) AS INT)):bigint>
-- !query 5 output
0 1
2 2
-61,15 +61,15 struct<count(a):bigint,count(b):bigint>
-- !query 6
-SELECT 'foo', COUNT(a) FROM testData GROUP BY 1
+SELECT 'foo', COUNT(udf(a)) FROM testData GROUP BY 1
-- !query 6 schema
-struct<foo:string,count(a):bigint>
+struct<foo:string,count(CAST(udf(cast(a as string)) AS INT)):bigint>
-- !query 6 output
foo 7
-- !query 7
-SELECT 'foo' FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo' FROM testData WHERE a = 0 GROUP BY udf(1)
-- !query 7 schema
struct<foo:string>
-- !query 7 output
-77,25 +77,25 struct<foo:string>
-- !query 8
-SELECT 'foo', APPROX_COUNT_DISTINCT(a) FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo', udf(APPROX_COUNT_DISTINCT(udf(a))) FROM testData WHERE a = 0 GROUP BY 1
-- !query 8 schema
-struct<foo:string,approx_count_distinct(a):bigint>
+struct<foo:string,CAST(udf(cast(approx_count_distinct(cast(udf(cast(a as string)) as int), 0.05, 0, 0) as string)) AS BIGINT):bigint>
-- !query 8 output
-- !query 9
-SELECT 'foo', MAX(STRUCT(a)) FROM testData WHERE a = 0 GROUP BY 1
+SELECT 'foo', MAX(STRUCT(udf(a))) FROM testData WHERE a = 0 GROUP BY 1
-- !query 9 schema
-struct<foo:string,max(named_struct(a, a)):struct<a:int>>
+struct<foo:string,max(named_struct(col1, CAST(udf(cast(a as string)) AS INT))):struct<col1:int>>
-- !query 9 output
-- !query 10
-SELECT a + b, COUNT(b) FROM testData GROUP BY a + b
+SELECT udf(a + b), udf(COUNT(b)) FROM testData GROUP BY a + b
-- !query 10 schema
-struct<(a + b):int,count(b):bigint>
+struct<CAST(udf(cast((a + b) as string)) AS INT):int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 10 output
2 1
3 2
-105,7 +105,7 NULL 1
-- !query 11
-SELECT a + 2, COUNT(b) FROM testData GROUP BY a + 1
+SELECT udf(a + 2), udf(COUNT(b)) FROM testData GROUP BY a + 1
-- !query 11 schema
struct<>
-- !query 11 output
-114,37 +114,35 expression 'testdata.`a`' is neither present in the group by, nor is it an aggre
-- !query 12
-SELECT a + 1 + 1, COUNT(b) FROM testData GROUP BY a + 1
+SELECT udf(a + 1 + 1), udf(COUNT(b)) FROM testData GROUP BY udf(a + 1)
-- !query 12 schema
-struct<((a + 1) + 1):int,count(b):bigint>
+struct<>
-- !query 12 output
-3 2
-4 2
-5 2
-NULL 1
+org.apache.spark.sql.AnalysisException
+expression 'testdata.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
-- !query 13
-SELECT SKEWNESS(a), KURTOSIS(a), MIN(a), MAX(a), AVG(a), VARIANCE(a), STDDEV(a), SUM(a), COUNT(a)
+SELECT SKEWNESS(udf(a)), udf(KURTOSIS(a)), udf(MIN(a)), MAX(udf(a)), udf(AVG(udf(a))), udf(VARIANCE(a)), STDDEV(udf(a)), udf(SUM(a)), udf(COUNT(a))
FROM testData
-- !query 13 schema
-struct<skewness(CAST(a AS DOUBLE)):double,kurtosis(CAST(a AS DOUBLE)):double,min(a):int,max(a):int,avg(a):double,var_samp(CAST(a AS DOUBLE)):double,stddev_samp(CAST(a AS DOUBLE)):double,sum(a):bigint,count(a):bigint>
+struct<skewness(CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double,CAST(udf(cast(kurtosis(cast(a as double)) as string)) AS DOUBLE):double,CAST(udf(cast(min(a) as string)) AS INT):int,max(CAST(udf(cast(a as string)) AS INT)):int,CAST(udf(cast(avg(cast(cast(udf(cast(a as string)) as int) as bigint)) as string)) AS DOUBLE):double,CAST(udf(cast(var_samp(cast(a as double)) as string)) AS DOUBLE):double,stddev_samp(CAST(CAST(udf(cast(a as string)) AS INT) AS DOUBLE)):double,CAST(udf(cast(sum(cast(a as bigint)) as string)) AS BIGINT):bigint,CAST(udf(cast(count(a) as string)) AS BIGINT):bigint>
-- !query 13 output
-0.2723801058145729 -1.5069204152249134 1 3 2.142857142857143 0.8095238095238094 0.8997354108424372 15 7
-- !query 14
-SELECT COUNT(DISTINCT b), COUNT(DISTINCT b, c) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY a
+SELECT COUNT(DISTINCT udf(b)), udf(COUNT(DISTINCT b, c)) FROM (SELECT 1 AS a, 2 AS b, 3 AS c) GROUP BY a
-- !query 14 schema
-struct<count(DISTINCT b):bigint,count(DISTINCT b, c):bigint>
+struct<count(DISTINCT CAST(udf(cast(b as string)) AS INT)):bigint,CAST(udf(cast(count(distinct b, c) as string)) AS BIGINT):bigint>
-- !query 14 output
1 1
-- !query 15
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k
+SELECT a AS k, COUNT(udf(b)) FROM testData GROUP BY k
-- !query 15 schema
-struct<k:int,count(b):bigint>
+struct<k:int,count(CAST(udf(cast(b as string)) AS INT)):bigint>
-- !query 15 output
1 2
2 2
-153,21 +151,21 NULL 1
-- !query 16
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k HAVING k > 1
+SELECT a AS k, udf(COUNT(b)) FROM testData GROUP BY k HAVING k > 1
-- !query 16 schema
-struct<k:int,count(b):bigint>
+struct<k:int,CAST(udf(cast(count(b) as string)) AS BIGINT):bigint>
-- !query 16 output
2 2
3 2
-- !query 17
-SELECT COUNT(b) AS k FROM testData GROUP BY k
+SELECT udf(COUNT(b)) AS k FROM testData GROUP BY k
-- !query 17 schema
struct<>
-- !query 17 output
org.apache.spark.sql.AnalysisException
-aggregate functions are not allowed in GROUP BY, but found count(testdata.`b`);
+aggregate functions are not allowed in GROUP BY, but found CAST(udf(cast(count(b) as string)) AS BIGINT);
-- !query 18
-180,7 +178,7 struct<>
-- !query 19
-SELECT k AS a, COUNT(v) FROM testDataHasSameNameWithAlias GROUP BY a
+SELECT k AS a, udf(COUNT(udf(v))) FROM testDataHasSameNameWithAlias GROUP BY a
-- !query 19 schema
struct<>
-- !query 19 output
-197,32 +195,32 spark.sql.groupByAliases false
-- !query 21
-SELECT a AS k, COUNT(b) FROM testData GROUP BY k
+SELECT a AS k, udf(COUNT(udf(b))) FROM testData GROUP BY k
-- !query 21 schema
struct<>
-- !query 21 output
org.apache.spark.sql.AnalysisException
-cannot resolve '`k`' given input columns: [testdata.a, testdata.b]; line 1 pos 47
+cannot resolve '`k`' given input columns: [testdata.a, testdata.b]; line 1 pos 57
-- !query 22
-SELECT a, COUNT(1) FROM testData WHERE false GROUP BY a
+SELECT a, COUNT(udf(1)) FROM testData WHERE false GROUP BY a
-- !query 22 schema
-struct<a:int,count(1):bigint>
+struct<a:int,count(CAST(udf(cast(1 as string)) AS INT)):bigint>
-- !query 22 output
-- !query 23
-SELECT COUNT(1) FROM testData WHERE false
+SELECT udf(COUNT(1)) FROM testData WHERE false
-- !query 23 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 23 output
0
-- !query 24
-SELECT 1 FROM (SELECT COUNT(1) FROM testData WHERE false) t
+SELECT 1 FROM (SELECT udf(COUNT(1)) FROM testData WHERE false) t
-- !query 24 schema
struct<1:int>
-- !query 24 output
-232,7 +230,7 struct<1:int>
-- !query 25
SELECT 1 from (
SELECT 1 AS z,
- MIN(a.x)
+ udf(MIN(a.x))
FROM (select 1 as x) a
WHERE false
) b
-244,32 +242,32 struct<1:int>
-- !query 26
-SELECT corr(DISTINCT x, y), corr(DISTINCT y, x), count(*)
+SELECT corr(DISTINCT x, y), udf(corr(DISTINCT y, x)), count(*)
FROM (VALUES (1, 1), (2, 2), (2, 2)) t(x, y)
-- !query 26 schema
-struct<corr(DISTINCT CAST(x AS DOUBLE), CAST(y AS DOUBLE)):double,corr(DISTINCT CAST(y AS DOUBLE), CAST(x AS DOUBLE)):double,count(1):bigint>
+struct<corr(DISTINCT CAST(x AS DOUBLE), CAST(y AS DOUBLE)):double,CAST(udf(cast(corr(distinct cast(y as double), cast(x as double)) as string)) AS DOUBLE):double,count(1):bigint>
-- !query 26 output
1.0 1.0 3
-- !query 27
-SELECT 1 FROM range(10) HAVING true
+SELECT udf(1) FROM range(10) HAVING true
-- !query 27 schema
-struct<1:int>
+struct<CAST(udf(cast(1 as string)) AS INT):int>
-- !query 27 output
1
-- !query 28
-SELECT 1 FROM range(10) HAVING MAX(id) > 0
+SELECT udf(udf(1)) FROM range(10) HAVING MAX(id) > 0
-- !query 28 schema
-struct<1:int>
+struct<CAST(udf(cast(cast(udf(cast(1 as string)) as int) as string)) AS INT):int>
-- !query 28 output
1
-- !query 29
-SELECT id FROM range(10) HAVING id > 0
+SELECT udf(id) FROM range(10) HAVING id > 0
-- !query 29 schema
struct<>
-- !query 29 output
-291,33 +289,33 struct<>
-- !query 31
-SELECT every(v), some(v), any(v) FROM test_agg WHERE 1 = 0
+SELECT udf(every(v)), udf(some(v)), any(v) FROM test_agg WHERE 1 = 0
-- !query 31 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<CAST(udf(cast(every(v) as string)) AS BOOLEAN):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
-- !query 31 output
NULL NULL NULL
-- !query 32
-SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 4
+SELECT udf(every(udf(v))), some(v), any(v) FROM test_agg WHERE k = 4
-- !query 32 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<CAST(udf(cast(every(cast(udf(cast(v as string)) as boolean)) as string)) AS BOOLEAN):boolean,some(v):boolean,any(v):boolean>
-- !query 32 output
NULL NULL NULL
-- !query 33
-SELECT every(v), some(v), any(v) FROM test_agg WHERE k = 5
+SELECT every(v), udf(some(v)), any(v) FROM test_agg WHERE k = 5
-- !query 33 schema
-struct<every(v):boolean,some(v):boolean,any(v):boolean>
+struct<every(v):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
-- !query 33 output
false true true
-- !query 34
-SELECT k, every(v), some(v), any(v) FROM test_agg GROUP BY k
+SELECT k, every(v), udf(some(v)), any(v) FROM test_agg GROUP BY k
-- !query 34 schema
-struct<k:int,every(v):boolean,some(v):boolean,any(v):boolean>
+struct<k:int,every(v):boolean,CAST(udf(cast(some(v) as string)) AS BOOLEAN):boolean,any(v):boolean>
-- !query 34 output
1 false true true
2 true true true
-327,9 +325,9 struct<k:int,every(v):boolean,some(v):boolean,any(v):boolean>
-- !query 35
-SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) = false
+SELECT udf(k), every(v) FROM test_agg GROUP BY k HAVING every(v) = false
-- !query 35 schema
-struct<k:int,every(v):boolean>
+struct<CAST(udf(cast(k as string)) AS INT):int,every(v):boolean>
-- !query 35 output
1 false
3 false
-337,16 +335,16 struct<k:int,every(v):boolean>
-- !query 36
-SELECT k, every(v) FROM test_agg GROUP BY k HAVING every(v) IS NULL
+SELECT k, udf(every(v)) FROM test_agg GROUP BY k HAVING every(v) IS NULL
-- !query 36 schema
-struct<k:int,every(v):boolean>
+struct<k:int,CAST(udf(cast(every(v) as string)) AS BOOLEAN):boolean>
-- !query 36 output
4 NULL
-- !query 37
SELECT k,
- Every(v) AS every
+ udf(Every(v)) AS every
FROM test_agg
WHERE k = 2
AND v IN (SELECT Any(v)
-360,7 +358,7 struct<k:int,every:boolean>
-- !query 38
-SELECT k,
+SELECT udf(udf(k)),
Every(v) AS every
FROM test_agg
WHERE k = 2
-369,45 +367,45 WHERE k = 2
WHERE k = 1)
GROUP BY k
-- !query 38 schema
-struct<k:int,every:boolean>
+struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,every:boolean>
-- !query 38 output
-- !query 39
-SELECT every(1)
+SELECT every(udf(1))
-- !query 39 schema
struct<>
-- !query 39 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'every(1)' due to data type mismatch: Input to function 'every' should have been boolean, but it's [int].; line 1 pos 7
+cannot resolve 'every(CAST(udf(cast(1 as string)) AS INT))' due to data type mismatch: Input to function 'every' should have been boolean, but it's [int].; line 1 pos 7
-- !query 40
-SELECT some(1S)
+SELECT some(udf(1S))
-- !query 40 schema
struct<>
-- !query 40 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'some(1S)' due to data type mismatch: Input to function 'some' should have been boolean, but it's [smallint].; line 1 pos 7
+cannot resolve 'some(CAST(udf(cast(1 as string)) AS SMALLINT))' due to data type mismatch: Input to function 'some' should have been boolean, but it's [smallint].; line 1 pos 7
-- !query 41
-SELECT any(1L)
+SELECT any(udf(1L))
-- !query 41 schema
struct<>
-- !query 41 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'any(1L)' due to data type mismatch: Input to function 'any' should have been boolean, but it's [bigint].; line 1 pos 7
+cannot resolve 'any(CAST(udf(cast(1 as string)) AS BIGINT))' due to data type mismatch: Input to function 'any' should have been boolean, but it's [bigint].; line 1 pos 7
-- !query 42
-SELECT every("true")
+SELECT udf(every("true"))
-- !query 42 schema
struct<>
-- !query 42 output
org.apache.spark.sql.AnalysisException
-cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7
+cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 11
-- !query 43
-428,9 +426,9 struct<k:int,v:boolean,every(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST
-- !query 44
-SELECT k, v, some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
+SELECT k, udf(udf(v)), some(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
-- !query 44 schema
-struct<k:int,v:boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
+struct<k:int,CAST(udf(cast(cast(udf(cast(v as string)) as boolean) as string)) AS BOOLEAN):boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
-- !query 44 output
1 false false
1 true true
-445,9 +443,9 struct<k:int,v:boolean,some(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST R
-- !query 45
-SELECT k, v, any(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
+SELECT udf(udf(k)), v, any(v) OVER (PARTITION BY k ORDER BY v) FROM test_agg
-- !query 45 schema
-struct<k:int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
+struct<CAST(udf(cast(cast(udf(cast(k as string)) as int) as string)) AS INT):int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW):boolean>
-- !query 45 output
1 false false
1 true true
-462,17 +460,17 struct<k:int,v:boolean,any(v) OVER (PARTITION BY k ORDER BY v ASC NULLS FIRST RA
-- !query 46
-SELECT count(*) FROM test_agg HAVING count(*) > 1L
+SELECT udf(count(*)) FROM test_agg HAVING count(*) > 1L
-- !query 46 schema
-struct<count(1):bigint>
+struct<CAST(udf(cast(count(1) as string)) AS BIGINT):bigint>
-- !query 46 output
10
-- !query 47
-SELECT k, max(v) FROM test_agg GROUP BY k HAVING max(v) = true
+SELECT k, udf(max(v)) FROM test_agg GROUP BY k HAVING max(v) = true
-- !query 47 schema
-struct<k:int,max(v):boolean>
+struct<k:int,CAST(udf(cast(max(v) as string)) AS BOOLEAN):boolean>
-- !query 47 output
1 true
2 true
-480,7 +478,7 struct<k:int,max(v):boolean>
-- !query 48
-SELECT * FROM (SELECT COUNT(*) AS cnt FROM test_agg) WHERE cnt > 1L
+SELECT * FROM (SELECT udf(COUNT(*)) AS cnt FROM test_agg) WHERE cnt > 1L
-- !query 48 schema
struct<cnt:bigint>
-- !query 48 output
-488,7 +486,7 struct<cnt:bigint>
-- !query 49
-SELECT count(*) FROM test_agg WHERE count(*) > 1L
+SELECT udf(count(*)) FROM test_agg WHERE count(*) > 1L
-- !query 49 schema
struct<>
-- !query 49 output
-500,7 +498,7 Invalid expressions: [count(1)];
-- !query 50
-SELECT count(*) FROM test_agg WHERE count(*) + 1L > 1L
+SELECT udf(count(*)) FROM test_agg WHERE count(*) + 1L > 1L
-- !query 50 schema
struct<>
-- !query 50 output
-512,7 +510,7 Invalid expressions: [count(1)];
-- !query 51
-SELECT count(*) FROM test_agg WHERE k = 1 or k = 2 or count(*) + 1L > 1L or max(k) > 1
+SELECT udf(count(*)) FROM test_agg WHERE k = 1 or k = 2 or count(*) + 1L > 1L or max(k) > 1
-- !query 51 schema
struct<>
-- !query 51 output
```
</p>
</details>
## How was this patch tested?
Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
Verified pandas & pyarrow versions:
```$python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import pyarrow
>>> pyarrow.__version__
'0.14.0'
>>> pandas.__version__
'0.24.2'
```
From the sql output it seems that sql statements are evaluated correctly given that udf returns a string and may change results as Null will be returned as None and will be counted in returned values.
Closes#25098 from skonto/group-by.sql.
Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Because `Encoder` is not thread safe, the user cannot reuse an `Encoder` in multiple `Dataset`s. However, creating an `Encoder` for a complicated class is slow due to Scala Reflection. To eliminate the cost of Scala Reflection, right now I usually use the private API `ExpressionEncoder.copy` as follows:
```scala
object FooEncoder {
private lazy val _encoder: ExpressionEncoder[Foo] = ExpressionEncoder[Foo]()
implicit def encoder: ExpressionEncoder[Foo] = _encoder.copy()
}
```
This PR proposes a new method `makeCopy` in `Encoder` so that the above codes can be rewritten using public APIs.
```scala
object FooEncoder {
private lazy val _encoder: Encoder[Foo] = Encoders.product[Foo]()
implicit def encoder: Encoder[Foo] = _encoder.makeCopy
}
```
The method name is consistent with `TreeNode.makeCopy`.
## How was this patch tested?
Jenkins
Closes#25209 from zsxwing/encoder-copy.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Implements the `REPLACE TABLE` and `REPLACE TABLE AS SELECT` logical plans. `REPLACE TABLE` is now a valid operation in spark-sql provided that the tables being modified are managed by V2 catalogs.
This also introduces an atomic mix-in that table catalogs can choose to implement. Table catalogs can now implement `TransactionalTableCatalog`. The semantics of this API are that table creation and replacement can be "staged" and then "committed".
On the execution of `REPLACE TABLE AS SELECT`, `REPLACE TABLE`, and `CREATE TABLE AS SELECT`, if the catalog implements transactional operations, the physical plan will use said functionality. Otherwise, these operations fall back on non-atomic variants. For `REPLACE TABLE` in particular, the usage of non-atomic operations can unfortunately lead to inconsistent state.
## How was this patch tested?
Unit tests - multiple additions to `DataSourceV2SQLSuite`.
Closes#24798 from mccheah/spark-27724.
Authored-by: mcheah <mcheah@palantir.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
When a `ScalaUDF` returns a value which overflows, currently it returns null regardless of the value of the config `spark.sql.decimalOperations.nullOnOverflow`.
The PR makes it respect the above-mentioned config and behave accordingly.
## How was this patch tested?
added UT
Closes#25144 from mgaido91/SPARK-28369.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This pr proposes to add a prefix '*' to non-nullable attribute names in PlanTestBase.comparePlans failures. In the current master, nullability mismatches might generate the same error message for left/right logical plans like this;
```
// This failure message was extracted from #24765
- constraints should be inferred from aliased literals *** FAILED ***
== FAIL: Plans do not match ===
!'Join Inner, (two#0 = a#0) 'Join Inner, (two#0 = a#0)
:- Filter (isnotnull(a#0) AND (2 <=> a#0)) :- Filter (isnotnull(a#0) AND (2 <=> a#0))
: +- LocalRelation <empty>, [a#0, b#0, c#0] : +- LocalRelation <empty>, [a#0, b#0, c#0]
+- Project [2 AS two#0] +- Project [2 AS two#0]
+- LocalRelation <empty>, [a#0, b#0, c#0] +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145)
```
With this pr, this error message is changed to one below;
```
- constraints should be inferred from aliased literals *** FAILED ***
== FAIL: Plans do not match ===
!'Join Inner, (*two#0 = a#0) 'Join Inner, (*two#0 = *a#0)
:- Filter (isnotnull(a#0) AND (2 <=> a#0)) :- Filter (isnotnull(a#0) AND (2 <=> a#0))
: +- LocalRelation <empty>, [a#0, b#0, c#0] : +- LocalRelation <empty>, [a#0, b#0, c#0]
+- Project [2 AS two#0] +- Project [2 AS two#0]
+- LocalRelation <empty>, [a#0, b#0, c#0] +- LocalRelation <empty>, [a#0, b#0, c#0] (PlanTest.scala:145)
```
## How was this patch tested?
N/A
Closes#25213 from maropu/MarkForNullability.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This pr is to remove the unnecessary test in DataFrameSuite.
## How was this patch tested?
N/A
Closes#25216 from maropu/SPARK-28189-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The optimize rule `PushDownPredicate` has been combined into `PushDownPredicates`, update the comment that references the old rule.
## How was this patch tested?
N/A
Closes#25207 from jiangxb1987/comment.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR adds some tests converted from `inline-table.sql` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'inline-table.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/inline-table.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-inline-table.sql.out
index 4e80f0bda5..2cf24e50c8 100644
--- a/sql/core/src/test/resources/sql-tests/results/inline-table.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-inline-table.sql.out
-3,33 +3,33
-- !query 0
-select * from values ("one", 1)
+select udf(col1), udf(col2) from values ("one", 1)
-- !query 0 schema
-struct<col1:string,col2:int>
+struct<CAST(udf(cast(col1 as string)) AS STRING):string,CAST(udf(cast(col2 as string)) AS INT):int>
-- !query 0 output
one 1
-- !query 1
-select * from values ("one", 1) as data
+select udf(col1), udf(udf(col2)) from values ("one", 1) as data
-- !query 1 schema
-struct<col1:string,col2:int>
+struct<CAST(udf(cast(col1 as string)) AS STRING):string,CAST(udf(cast(cast(udf(cast(col2 as string)) as int) as string)) AS INT):int>
-- !query 1 output
one 1
-- !query 2
-select * from values ("one", 1) as data(a, b)
+select udf(a), b from values ("one", 1) as data(a, b)
-- !query 2 schema
-struct<a:string,b:int>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:int>
-- !query 2 output
one 1
-- !query 3
-select * from values 1, 2, 3 as data(a)
+select udf(a) from values 1, 2, 3 as data(a)
-- !query 3 schema
-struct<a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
-- !query 3 output
1
2
-37,9 +37,9 struct<a:int>
-- !query 4
-select * from values ("one", 1), ("two", 2), ("three", null) as data(a, b)
+select udf(a), b from values ("one", 1), ("two", 2), ("three", null) as data(a, b)
-- !query 4 schema
-struct<a:string,b:int>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:int>
-- !query 4 output
one 1
three NULL
-47,107 +47,107 two 2
-- !query 5
-select * from values ("one", null), ("two", null) as data(a, b)
+select udf(a), b from values ("one", null), ("two", null) as data(a, b)
-- !query 5 schema
-struct<a:string,b:null>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:null>
-- !query 5 output
one NULL
two NULL
-- !query 6
-select * from values ("one", 1), ("two", 2L) as data(a, b)
+select udf(a), b from values ("one", 1), ("two", 2L) as data(a, b)
-- !query 6 schema
-struct<a:string,b:bigint>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:bigint>
-- !query 6 output
one 1
two 2
-- !query 7
-select * from values ("one", 1 + 0), ("two", 1 + 3L) as data(a, b)
+select udf(udf(a)), udf(b) from values ("one", 1 + 0), ("two", 1 + 3L) as data(a, b)
-- !query 7 schema
-struct<a:string,b:bigint>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as string) as string)) AS STRING):string,CAST(udf(cast(b as string)) AS BIGINT):bigint>
-- !query 7 output
one 1
two 4
-- !query 8
-select * from values ("one", array(0, 1)), ("two", array(2, 3)) as data(a, b)
+select udf(a), b from values ("one", array(0, 1)), ("two", array(2, 3)) as data(a, b)
-- !query 8 schema
-struct<a:string,b:array<int>>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:array<int>>
-- !query 8 output
one [0,1]
two [2,3]
-- !query 9
-select * from values ("one", 2.0), ("two", 3.0D) as data(a, b)
+select udf(a), b from values ("one", 2.0), ("two", 3.0D) as data(a, b)
-- !query 9 schema
-struct<a:string,b:double>
+struct<CAST(udf(cast(a as string)) AS STRING):string,b:double>
-- !query 9 output
one 2.0
two 3.0
-- !query 10
-select * from values ("one", rand(5)), ("two", 3.0D) as data(a, b)
+select udf(a), b from values ("one", rand(5)), ("two", 3.0D) as data(a, b)
-- !query 10 schema
struct<>
-- !query 10 output
org.apache.spark.sql.AnalysisException
-cannot evaluate expression rand(5) in inline table definition; line 1 pos 29
+cannot evaluate expression rand(5) in inline table definition; line 1 pos 37
-- !query 11
-select * from values ("one", 2.0), ("two") as data(a, b)
+select udf(a), udf(b) from values ("one", 2.0), ("two") as data(a, b)
-- !query 11 schema
struct<>
-- !query 11 output
org.apache.spark.sql.AnalysisException
-expected 2 columns but found 1 columns in row 1; line 1 pos 14
+expected 2 columns but found 1 columns in row 1; line 1 pos 27
-- !query 12
-select * from values ("one", array(0, 1)), ("two", struct(1, 2)) as data(a, b)
+select udf(a), udf(b) from values ("one", array(0, 1)), ("two", struct(1, 2)) as data(a, b)
-- !query 12 schema
struct<>
-- !query 12 output
org.apache.spark.sql.AnalysisException
-incompatible types found in column b for inline table; line 1 pos 14
+incompatible types found in column b for inline table; line 1 pos 27
-- !query 13
-select * from values ("one"), ("two") as data(a, b)
+select udf(a), udf(b) from values ("one"), ("two") as data(a, b)
-- !query 13 schema
struct<>
-- !query 13 output
org.apache.spark.sql.AnalysisException
-expected 2 columns but found 1 columns in row 0; line 1 pos 14
+expected 2 columns but found 1 columns in row 0; line 1 pos 27
-- !query 14
-select * from values ("one", random_not_exist_func(1)), ("two", 2) as data(a, b)
+select udf(a), udf(b) from values ("one", random_not_exist_func(1)), ("two", 2) as data(a, b)
-- !query 14 schema
struct<>
-- !query 14 output
org.apache.spark.sql.AnalysisException
-Undefined function: 'random_not_exist_func'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 29
+Undefined function: 'random_not_exist_func'. This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 42
-- !query 15
-select * from values ("one", count(1)), ("two", 2) as data(a, b)
+select udf(a), udf(b) from values ("one", count(1)), ("two", 2) as data(a, b)
-- !query 15 schema
struct<>
-- !query 15 output
org.apache.spark.sql.AnalysisException
-cannot evaluate expression count(1) in inline table definition; line 1 pos 29
+cannot evaluate expression count(1) in inline table definition; line 1 pos 42
-- !query 16
-select * from values (timestamp('1991-12-06 00:00:00.0'), array(timestamp('1991-12-06 01:00:00.0'), timestamp('1991-12-06 12:00:00.0'))) as data(a, b)
+select udf(a), b from values (timestamp('1991-12-06 00:00:00.0'), array(timestamp('1991-12-06 01:00:00.0'), timestamp('1991-12-06 12:00:00.0'))) as data(a, b)
-- !query 16 schema
-struct<a:timestamp,b:array<timestamp>>
+struct<CAST(udf(cast(a as string)) AS TIMESTAMP):timestamp,b:array<timestamp>>
-- !query 16 output
1991-12-06 00:00:00 [1991-12-06 01:00:00.0,1991-12-06 12:00:00.0]
```
</p>
</details>
## How was this patch tested?
Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
Closes#25124 from imback82/inline-table-sql.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR adds some tests converted from group-analytics.sql to test UDFs. Please see contribution guide of this umbrella ticket - SPARK-27921.
<details><summary>Diff comparing to 'group-analytics.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
index 31e9e08e2c..3439a05727 100644
--- a/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-group-analytics.sql.out
-13,9 +13,9 struct<>
-- !query 1
-SELECT a + b, b, udf(SUM(a - b)) FROM testData GROUP BY a + b, b WITH CUBE
+SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH CUBE
-- !query 1 schema
-struct<(a + b):int,b:int,CAST(udf(cast(sum(cast((a - b) as bigint)) as string)) AS BIGINT):bigint>
+struct<(a + b):int,b:int,sum((a - b)):bigint>
-- !query 1 output
2 1 0
2 NULL 0
-33,9 +33,9 NULL NULL 3
-- !query 2
-SELECT a, udf(b), SUM(b) FROM testData GROUP BY a, b WITH CUBE
+SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH CUBE
-- !query 2 schema
-struct<a:int,CAST(udf(cast(b as string)) AS INT):int,sum(b):bigint>
+struct<a:int,b:int,sum(b):bigint>
-- !query 2 output
1 1 1
1 2 2
-52,9 +52,9 NULL NULL 9
-- !query 3
-SELECT udf(a + b), b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP
+SELECT a + b, b, SUM(a - b) FROM testData GROUP BY a + b, b WITH ROLLUP
-- !query 3 schema
-struct<CAST(udf(cast((a + b) as string)) AS INT):int,b:int,sum((a - b)):bigint>
+struct<(a + b):int,b:int,sum((a - b)):bigint>
-- !query 3 output
2 1 0
2 NULL 0
-70,9 +70,9 NULL NULL 3
-- !query 4
-SELECT a, b, udf(SUM(b)) FROM testData GROUP BY a, b WITH ROLLUP
+SELECT a, b, SUM(b) FROM testData GROUP BY a, b WITH ROLLUP
-- !query 4 schema
-struct<a:int,b:int,CAST(udf(cast(sum(cast(b as bigint)) as string)) AS BIGINT):bigint>
+struct<a:int,b:int,sum(b):bigint>
-- !query 4 output
1 1 1
1 2 2
-97,7 +97,7 struct<>
-- !query 6
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY udf(course), year
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY ROLLUP(course, year) ORDER BY course, year
-- !query 6 schema
struct<course:string,year:int,sum(earnings):bigint>
-- !query 6 output
-111,7 +111,7 dotNET 2013 48000
-- !query 7
-SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, udf(year)
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY CUBE(course, year) ORDER BY course, year
-- !query 7 schema
struct<course:string,year:int,sum(earnings):bigint>
-- !query 7 output
-127,9 +127,9 dotNET 2013 48000
-- !query 8
-SELECT course, udf(year), SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year)
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course, year)
-- !query 8 schema
-struct<course:string,CAST(udf(cast(year as string)) AS INT):int,sum(earnings):bigint>
+struct<course:string,year:int,sum(earnings):bigint>
-- !query 8 output
Java NULL 50000
NULL 2012 35000
-138,26 +138,26 dotNET NULL 63000
-- !query 9
-SELECT course, year, udf(SUM(earnings)) FROM courseSales GROUP BY course, year GROUPING SETS(course)
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(course)
-- !query 9 schema
-struct<course:string,year:int,CAST(udf(cast(sum(cast(earnings as bigint)) as string)) AS BIGINT):bigint>
+struct<course:string,year:int,sum(earnings):bigint>
-- !query 9 output
Java NULL 50000
dotNET NULL 63000
-- !query 10
-SELECT udf(course), year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year)
+SELECT course, year, SUM(earnings) FROM courseSales GROUP BY course, year GROUPING SETS(year)
-- !query 10 schema
-struct<CAST(udf(cast(course as string)) AS STRING):string,year:int,sum(earnings):bigint>
+struct<course:string,year:int,sum(earnings):bigint>
-- !query 10 output
NULL 2012 35000
NULL 2013 78000
-- !query 11
-SELECT course, udf(SUM(earnings)) AS sum FROM courseSales
-GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, udf(sum)
+SELECT course, SUM(earnings) AS sum FROM courseSales
+GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum
-- !query 11 schema
struct<course:string,sum:bigint>
-- !query 11 output
-173,7 +173,7 dotNET 63000
-- !query 12
SELECT course, SUM(earnings) AS sum, GROUPING_ID(course, earnings) FROM courseSales
-GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY udf(course), sum
+GROUP BY course, earnings GROUPING SETS((), (course), (course, earnings)) ORDER BY course, sum
-- !query 12 schema
struct<course:string,sum:bigint,grouping_id(course, earnings):int>
-- !query 12 output
-188,10 +188,10 dotNET 63000 1
-- !query 13
-SELECT udf(course), udf(year), GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales
+SELECT course, year, GROUPING(course), GROUPING(year), GROUPING_ID(course, year) FROM courseSales
GROUP BY CUBE(course, year)
-- !query 13 schema
-struct<CAST(udf(cast(course as string)) AS STRING):string,CAST(udf(cast(year as string)) AS INT):int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int>
+struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint,grouping_id(course, year):int>
-- !query 13 output
Java 2012 0 0 0
Java 2013 0 0 0
-205,7 +205,7 dotNET NULL 0 1 1
-- !query 14
-SELECT course, udf(year), GROUPING(course) FROM courseSales GROUP BY course, year
+SELECT course, year, GROUPING(course) FROM courseSales GROUP BY course, year
-- !query 14 schema
struct<>
-- !query 14 output
-214,7 +214,7 grouping() can only be used with GroupingSets/Cube/Rollup;
-- !query 15
-SELECT course, udf(year), GROUPING_ID(course, year) FROM courseSales GROUP BY course, year
+SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY course, year
-- !query 15 schema
struct<>
-- !query 15 output
-223,7 +223,7 grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 16
-SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, udf(year)
+SELECT course, year, grouping__id FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year
-- !query 16 schema
struct<course:string,year:int,grouping__id:int>
-- !query 16 output
-240,7 +240,7 NULL NULL 3
-- !query 17
SELECT course, year FROM courseSales GROUP BY CUBE(course, year)
-HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, udf(year)
+HAVING GROUPING(year) = 1 AND GROUPING_ID(course, year) > 0 ORDER BY course, year
-- !query 17 schema
struct<course:string,year:int>
-- !query 17 output
-250,7 +250,7 dotNET NULL
-- !query 18
-SELECT course, udf(year) FROM courseSales GROUP BY course, year HAVING GROUPING(course) > 0
+SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING(course) > 0
-- !query 18 schema
struct<>
-- !query 18 output
-259,7 +259,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 19
-SELECT course, udf(udf(year)) FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0
+SELECT course, year FROM courseSales GROUP BY course, year HAVING GROUPING_ID(course) > 0
-- !query 19 schema
struct<>
-- !query 19 output
-268,9 +268,9 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 20
-SELECT udf(course), year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0
+SELECT course, year FROM courseSales GROUP BY CUBE(course, year) HAVING grouping__id > 0
-- !query 20 schema
-struct<CAST(udf(cast(course as string)) AS STRING):string,year:int>
+struct<course:string,year:int>
-- !query 20 output
Java NULL
NULL 2012
-281,7 +281,7 dotNET NULL
-- !query 21
SELECT course, year, GROUPING(course), GROUPING(year) FROM courseSales GROUP BY CUBE(course, year)
-ORDER BY GROUPING(course), GROUPING(year), course, udf(year)
+ORDER BY GROUPING(course), GROUPING(year), course, year
-- !query 21 schema
struct<course:string,year:int,grouping(course):tinyint,grouping(year):tinyint>
-- !query 21 output
-298,7 +298,7 NULL NULL 1 1
-- !query 22
SELECT course, year, GROUPING_ID(course, year) FROM courseSales GROUP BY CUBE(course, year)
-ORDER BY GROUPING(course), GROUPING(year), course, udf(year)
+ORDER BY GROUPING(course), GROUPING(year), course, year
-- !query 22 schema
struct<course:string,year:int,grouping_id(course, year):int>
-- !query 22 output
-314,7 +314,7 NULL NULL 3
-- !query 23
-SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING(course)
+SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING(course)
-- !query 23 schema
struct<>
-- !query 23 output
-323,7 +323,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 24
-SELECT course, udf(year) FROM courseSales GROUP BY course, udf(year) ORDER BY GROUPING_ID(course)
+SELECT course, year FROM courseSales GROUP BY course, year ORDER BY GROUPING_ID(course)
-- !query 24 schema
struct<>
-- !query 24 output
-332,7 +332,7 grouping()/grouping_id() can only be used with GroupingSets/Cube/Rollup;
-- !query 25
-SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, udf(course), year
+SELECT course, year FROM courseSales GROUP BY CUBE(course, year) ORDER BY grouping__id, course, year
-- !query 25 schema
struct<course:string,year:int>
-- !query 25 output
-348,7 +348,7 NULL NULL
-- !query 26
-SELECT udf(a + b) AS k1, udf(b) AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2)
+SELECT a + b AS k1, b AS k2, SUM(a - b) FROM testData GROUP BY CUBE(k1, k2)
-- !query 26 schema
struct<k1:int,k2:int,sum((a - b)):bigint>
-- !query 26 output
-368,7 +368,7 NULL NULL 3
-- !query 27
-SELECT udf(udf(a + b)) AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b)
+SELECT a + b AS k, b, SUM(a - b) FROM testData GROUP BY ROLLUP(k, b)
-- !query 27 schema
struct<k:int,b:int,sum((a - b)):bigint>
-- !query 27 output
-386,9 +386,9 NULL NULL 3
-- !query 28
-SELECT udf(a + b), udf(udf(b)) AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k)
+SELECT a + b, b AS k, SUM(a - b) FROM testData GROUP BY a + b, k GROUPING SETS(k)
-- !query 28 schema
-struct<CAST(udf(cast((a + b) as string)) AS INT):int,k:int,sum((a - b)):bigint>
+struct<(a + b):int,k:int,sum((a - b)):bigint>
-- !query 28 output
NULL 1 3
NULL 2 0
```
</p>
</details>
## How was this patch tested?
Tested as guided in SPARK-27921.
Verified pandas & pyarrow versions:
```$python3
Python 3.6.8 (default, Jan 14 2019, 11:02:34)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
>>> import pyarrow
>>> pyarrow.__version__
'0.14.0'
>>> pandas.__version__
'0.24.2'
```
From the sql output it seems that sql statements are evaluated correctly given that udf returns a string and may change results as Null will be returned as None and will be counted in returned values.
Closes#25196 from skonto/group-analytics.sql.
Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR removes a few hardware-dependent assertions which can cause a failure in `aarch64`.
**x86_64**
```
rootdonotdel-openlab-allinone-l00242678:/home/ubuntu# uname -a
Linux donotdel-openlab-allinone-l00242678 4.4.0-154-generic #181-Ubuntu SMP Tue Jun 25 05:29:03 UTC
2019 x86_64 x86_64 x86_64 GNU/Linux
scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res0: Int = -4194304
scala> floatToRawIntBits(Float.NaN)
res1: Int = 2143289344
```
**aarch64**
```
[rootarm-huangtianhua spark]# uname -a
Linux arm-huangtianhua 4.14.0-49.el7a.aarch64 #1 SMP Tue Apr 10 17:22:26 UTC 2018 aarch64 aarch64 aarch64 GNU/Linux
scala> import java.lang.Float.floatToRawIntBits
import java.lang.Float.floatToRawIntBits
scala> floatToRawIntBits(0.0f/0.0f)
res1: Int = 2143289344
scala> floatToRawIntBits(Float.NaN)
res2: Int = 2143289344
```
## How was this patch tested?
Pass the Jenkins (This removes the test coverage).
Closes#25186 from huangtianhua/special-test-case-for-aarch64.
Authored-by: huangtianhua <huangtianhua@huawei.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
SPARK-28199 (#24996) hid implementations of Triggers into `private[sql]` and encourage end users to use `Trigger.xxx` methods instead.
As I got some post review comment on 7548a8826d (r34366934) we could remove annotations which are meant to be used with public API.
## How was this patch tested?
N/A
Closes#25200 from HeartSaVioR/SPARK-28199-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR adds some tests converted from `join-empty-relation.sql` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'join-empty-relation.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/join-empty-relation.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-join-empty-relation.sql.out
index 857073a827..e79d01fb14 100644
--- a/sql/core/src/test/resources/sql-tests/results/join-empty-relation.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-join-empty-relation.sql.out
-27,111 +27,111 struct<>
-- !query 3
-SELECT * FROM t1 INNER JOIN empty_table
+SELECT udf(t1.a), udf(empty_table.a) FROM t1 INNER JOIN empty_table ON (udf(t1.a) = udf(udf(empty_table.a)))
-- !query 3 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
-- !query 3 output
-- !query 4
-SELECT * FROM t1 CROSS JOIN empty_table
+SELECT udf(t1.a), udf(udf(empty_table.a)) FROM t1 CROSS JOIN empty_table ON (udf(udf(t1.a)) = udf(empty_table.a))
-- !query 4 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
-- !query 4 output
-- !query 5
-SELECT * FROM t1 LEFT OUTER JOIN empty_table
+SELECT udf(udf(t1.a)), empty_table.a FROM t1 LEFT OUTER JOIN empty_table ON (udf(t1.a) = udf(empty_table.a))
-- !query 5 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int,a:int>
-- !query 5 output
1 NULL
-- !query 6
-SELECT * FROM t1 RIGHT OUTER JOIN empty_table
+SELECT udf(t1.a), udf(empty_table.a) FROM t1 RIGHT OUTER JOIN empty_table ON (udf(t1.a) = udf(empty_table.a))
-- !query 6 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
-- !query 6 output
-- !query 7
-SELECT * FROM t1 FULL OUTER JOIN empty_table
+SELECT udf(t1.a), empty_table.a FROM t1 FULL OUTER JOIN empty_table ON (udf(t1.a) = udf(empty_table.a))
-- !query 7 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,a:int>
-- !query 7 output
1 NULL
-- !query 8
-SELECT * FROM t1 LEFT SEMI JOIN empty_table
+SELECT udf(udf(t1.a)) FROM t1 LEFT SEMI JOIN empty_table ON (udf(t1.a) = udf(udf(empty_table.a)))
-- !query 8 schema
-struct<a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
-- !query 8 output
-- !query 9
-SELECT * FROM t1 LEFT ANTI JOIN empty_table
+SELECT udf(t1.a) FROM t1 LEFT ANTI JOIN empty_table ON (udf(t1.a) = udf(empty_table.a))
-- !query 9 schema
-struct<a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
-- !query 9 output
1
-- !query 10
-SELECT * FROM empty_table INNER JOIN t1
+SELECT udf(empty_table.a), udf(t1.a) FROM empty_table INNER JOIN t1 ON (udf(udf(empty_table.a)) = udf(t1.a))
-- !query 10 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
-- !query 10 output
-- !query 11
-SELECT * FROM empty_table CROSS JOIN t1
+SELECT udf(empty_table.a), udf(udf(t1.a)) FROM empty_table CROSS JOIN t1 ON (udf(empty_table.a) = udf(udf(t1.a)))
-- !query 11 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
-- !query 11 output
-- !query 12
-SELECT * FROM empty_table LEFT OUTER JOIN t1
+SELECT udf(udf(empty_table.a)), udf(t1.a) FROM empty_table LEFT OUTER JOIN t1 ON (udf(empty_table.a) = udf(t1.a))
-- !query 12 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int,CAST(udf(cast(a as string)) AS INT):int>
-- !query 12 output
-- !query 13
-SELECT * FROM empty_table RIGHT OUTER JOIN t1
+SELECT empty_table.a, udf(t1.a) FROM empty_table RIGHT OUTER JOIN t1 ON (udf(empty_table.a) = udf(t1.a))
-- !query 13 schema
-struct<a:int,a:int>
+struct<a:int,CAST(udf(cast(a as string)) AS INT):int>
-- !query 13 output
NULL 1
-- !query 14
-SELECT * FROM empty_table FULL OUTER JOIN t1
+SELECT empty_table.a, udf(udf(t1.a)) FROM empty_table FULL OUTER JOIN t1 ON (udf(empty_table.a) = udf(t1.a))
-- !query 14 schema
-struct<a:int,a:int>
+struct<a:int,CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
-- !query 14 output
NULL 1
-- !query 15
-SELECT * FROM empty_table LEFT SEMI JOIN t1
+SELECT udf(udf(empty_table.a)) FROM empty_table LEFT SEMI JOIN t1 ON (udf(empty_table.a) = udf(udf(t1.a)))
-- !query 15 schema
-struct<a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
-- !query 15 output
-- !query 16
-SELECT * FROM empty_table LEFT ANTI JOIN t1
+SELECT empty_table.a FROM empty_table LEFT ANTI JOIN t1 ON (udf(empty_table.a) = udf(t1.a))
-- !query 16 schema
struct<a:int>
-- !query 16 output
-139,56 +139,56 struct<a:int>
-- !query 17
-SELECT * FROM empty_table INNER JOIN empty_table
+SELECT udf(empty_table.a) FROM empty_table INNER JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(udf(empty_table2.a)))
-- !query 17 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
-- !query 17 output
-- !query 18
-SELECT * FROM empty_table CROSS JOIN empty_table
+SELECT udf(udf(empty_table.a)) FROM empty_table CROSS JOIN empty_table AS empty_table2 ON (udf(udf(empty_table.a)) = udf(empty_table2.a))
-- !query 18 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
-- !query 18 output
-- !query 19
-SELECT * FROM empty_table LEFT OUTER JOIN empty_table
+SELECT udf(empty_table.a) FROM empty_table LEFT OUTER JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(empty_table2.a))
-- !query 19 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
-- !query 19 output
-- !query 20
-SELECT * FROM empty_table RIGHT OUTER JOIN empty_table
+SELECT udf(udf(empty_table.a)) FROM empty_table RIGHT OUTER JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(udf(empty_table2.a)))
-- !query 20 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
-- !query 20 output
-- !query 21
-SELECT * FROM empty_table FULL OUTER JOIN empty_table
+SELECT udf(empty_table.a) FROM empty_table FULL OUTER JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(empty_table2.a))
-- !query 21 schema
-struct<a:int,a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
-- !query 21 output
-- !query 22
-SELECT * FROM empty_table LEFT SEMI JOIN empty_table
+SELECT udf(udf(empty_table.a)) FROM empty_table LEFT SEMI JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(empty_table2.a))
-- !query 22 schema
-struct<a:int>
+struct<CAST(udf(cast(cast(udf(cast(a as string)) as int) as string)) AS INT):int>
-- !query 22 output
-- !query 23
-SELECT * FROM empty_table LEFT ANTI JOIN empty_table
+SELECT udf(empty_table.a) FROM empty_table LEFT ANTI JOIN empty_table AS empty_table2 ON (udf(empty_table.a) = udf(empty_table2.a))
-- !query 23 schema
-struct<a:int>
+struct<CAST(udf(cast(a as string)) AS INT):int>
-- !query 23 output
```
</p>
</details>
## How was this patch tested?
Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
Closes#25127 from imback82/join-empty-relation-sql.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Performance issue using explode was found when a complex field contains huge array is to get duplicated as the number of exploded array elements. Given example:
```scala
val df = spark.sparkContext.parallelize(Seq(("1",
Array.fill(M)({
val i = math.random
(i.toString, (i + 1).toString, (i + 2).toString, (i + 3).toString)
})))).toDF("col", "arr")
.selectExpr("col", "struct(col, arr) as st")
.selectExpr("col", "st.col as col1", "explode(st.arr) as arr_col")
```
The explode causes `st` to be duplicated as many as the exploded elements.
Benchmarks it:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU 2.20GHz
[info] generate big nested struct array: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off 52668 53162 699 0.0 877803.4 1.0X
[info] generate big nested struct array wholestage on 47261 49093 1125 0.0 787690.2 1.1X
[info]
```
The query plan:
```
== Physical Plan ==
Project [col#508, st#512.col AS col1#515, arr_col#519]
+- Generate explode(st#512.arr), [col#508, st#512], false, [arr_col#519]
+- Project [_1#503 AS col#508, named_struct(col, _1#503, arr, _2#504) AS st#512]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._1, true, false) AS _1#503, mapobjects(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), if (isnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._3, true, false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue84, MapObjects_loopIsNull84, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#504]
+- Scan[obj#534]
```
This patch takes nested column pruning approach to prune unnecessary nested fields. It adds a projection of the needed nested fields as aliases on the child of `Generate`, and substitutes them by alias attributes on the projection on top of `Generate`.
Benchmarks it after the change:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_202-b08 on Mac OS X 10.14.4
[info] Intel(R) Core(TM) i7-8750H CPU 2.20GHz
[info] generate big nested struct array: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] generate big nested struct array wholestage off 311 331 28 0.2 5188.6 1.0X
[info] generate big nested struct array wholestage on 297 312 15 0.2 4947.3 1.0X
[info]
```
The query plan:
```
== Physical Plan ==
Project [col#592, _gen_alias_608#608 AS col1#599, arr_col#603]
+- Generate explode(st#596.arr), [col#592, _gen_alias_608#608], false, [arr_col#603]
+- Project [_1#587 AS col#592, named_struct(col, _1#587, arr, _2#588) AS st#596, _1#587 AS _gen_alias_608#608]
+- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(in
put[0, scala.Tuple2, true]))._1, true, false) AS _1#587, mapobjects(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4),
if (isnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))) null else named_struct(_1, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._1, true, false), _2, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._2, true, false), _3, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._3, true, false), _4, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(lambdavariable(MapObjects_loopValue102, MapObjects_loopIsNull102, ObjectType(class scala.Tuple4), true))._4, true, false)), knownnotnull(assertnotnull(input[0, scala.Tuple2, true]))._2, None) AS _2#588]
+- Scan[obj#586]
```
This behavior is controlled by a SQL config `spark.sql.optimizer.expression.nestedPruning.enabled`.
## How was this patch tested?
Added benchmark.
Closes#24637 from viirya/SPARK-27707.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR adds some tests converted from ```except.sql``` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'except.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/except.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out
index c9b712d4d2..27ca7ea226 100644
--- a/sql/core/src/test/resources/sql-tests/results/except.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-except.sql.out
-30,16 +30,16 struct<>
-- !query 2
-SELECT * FROM t1 EXCEPT SELECT * FROM t2
+SELECT udf(k), udf(v) FROM t1 EXCEPT SELECT udf(k), udf(v) FROM t2
-- !query 2 schema
-struct<k:string,v:int>
+struct<CAST(udf(cast(k as string)) AS STRING):string,CAST(udf(cast(v as string)) AS INT):int>
-- !query 2 output
three 3
two 2
-- !query 3
-SELECT * FROM t1 EXCEPT SELECT * FROM t1 where v <> 1 and v <> 2
+SELECT * FROM t1 EXCEPT SELECT * FROM t1 where udf(v) <> 1 and v <> udf(2)
-- !query 3 schema
struct<k:string,v:int>
-- !query 3 output
-49,7 +49,7 two 2
-- !query 4
-SELECT * FROM t1 where v <> 1 and v <> 22 EXCEPT SELECT * FROM t1 where v <> 2 and v >= 3
+SELECT * FROM t1 where udf(v) <> 1 and v <> udf(22) EXCEPT SELECT * FROM t1 where udf(v) <> 2 and v >= udf(3)
-- !query 4 schema
struct<k:string,v:int>
-- !query 4 output
-59,7 +59,7 two 2
-- !query 5
SELECT t1.* FROM t1, t2 where t1.k = t2.k
EXCEPT
-SELECT t1.* FROM t1, t2 where t1.k = t2.k and t1.k != 'one'
+SELECT t1.* FROM t1, t2 where t1.k = t2.k and t1.k != udf('one')
-- !query 5 schema
struct<k:string,v:int>
-- !query 5 output
-68,7 +68,7 one NULL
-- !query 6
-SELECT * FROM t2 where v >= 1 and v <> 22 EXCEPT SELECT * FROM t1
+SELECT * FROM t2 where v >= udf(1) and udf(v) <> 22 EXCEPT SELECT * FROM t1
-- !query 6 schema
struct<k:string,v:int>
-- !query 6 output
-77,9 +77,9 one 5
-- !query 7
-SELECT (SELECT min(k) FROM t2 WHERE t2.k = t1.k) min_t2 FROM t1
+SELECT (SELECT min(udf(k)) FROM t2 WHERE t2.k = t1.k) min_t2 FROM t1
MINUS
-SELECT (SELECT min(k) FROM t2) abs_min_t2 FROM t1 WHERE t1.k = 'one'
+SELECT (SELECT udf(min(k)) FROM t2) abs_min_t2 FROM t1 WHERE t1.k = udf('one')
-- !query 7 schema
struct<min_t2:string>
-- !query 7 output
-90,16 +90,17 two
-- !query 8
SELECT t1.k
FROM t1
-WHERE t1.v <= (SELECT max(t2.v)
+WHERE t1.v <= (SELECT udf(max(udf(t2.v)))
FROM t2
- WHERE t2.k = t1.k)
+ WHERE udf(t2.k) = udf(t1.k))
MINUS
SELECT t1.k
FROM t1
-WHERE t1.v >= (SELECT min(t2.v)
+WHERE udf(t1.v) >= (SELECT min(udf(t2.v))
FROM t2
WHERE t2.k = t1.k)
-- !query 8 schema
-struct<k:string>
+struct<>
-- !query 8 output
-two
+java.lang.UnsupportedOperationException
+Cannot evaluate expression: udf(cast(null as string))
```
</p>
</details>
## How was this patch tested?
Tested as guided in [SPARK-27921.](https://issues.apache.org/jira/browse/SPARK-27921)
Closes#25101 from huaxingao/spark-28277.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR adds some tests converted from ```outer-join.sql``` to test UDFs. Please see contribution guide of this umbrella ticket - [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
<details><summary>Diff comparing to 'outer-join.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/outer-join.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-outer-join.sql.out
index 5db3bae5d0..819f786070 100644
--- a/sql/core/src/test/resources/sql-tests/results/outer-join.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-outer-join.sql.out
-24,17 +24,17 struct<>
-- !query 2
SELECT
- (SUM(COALESCE(t1.int_col1, t2.int_col0))),
- ((COALESCE(t1.int_col1, t2.int_col0)) * 2)
+ (udf(SUM(udf(COALESCE(t1.int_col1, t2.int_col0))))),
+ (udf(COALESCE(t1.int_col1, t2.int_col0)) * 2)
FROM t1
RIGHT JOIN t2
- ON (t2.int_col0) = (t1.int_col1)
-GROUP BY GREATEST(COALESCE(t2.int_col1, 109), COALESCE(t1.int_col1, -449)),
+ ON udf(t2.int_col0) = udf(t1.int_col1)
+GROUP BY udf(GREATEST(COALESCE(udf(t2.int_col1), 109), COALESCE(t1.int_col1, udf(-449)))),
COALESCE(t1.int_col1, t2.int_col0)
-HAVING (SUM(COALESCE(t1.int_col1, t2.int_col0)))
- > ((COALESCE(t1.int_col1, t2.int_col0)) * 2)
+HAVING (udf(SUM(COALESCE(udf(t1.int_col1), udf(t2.int_col0)))))
+ > (udf(COALESCE(t1.int_col1, t2.int_col0)) * 2)
-- !query 2 schema
-struct<sum(coalesce(int_col1, int_col0)):bigint,(coalesce(int_col1, int_col0) * 2):int>
+struct<CAST(udf(cast(sum(cast(cast(udf(cast(coalesce(int_col1, int_col0) as string)) as int) as bigint)) as string)) AS BIGINT):bigint,(CAST(udf(cast(coalesce(int_col1, int_col0) as string)) AS INT) * 2):int>
-- !query 2 output
-367 -734
-507 -1014
-70,10 +70,10 spark.sql.crossJoin.enabled true
SELECT *
FROM (
SELECT
- COALESCE(t2.int_col1, t1.int_col1) AS int_col
+ udf(COALESCE(udf(t2.int_col1), udf(t1.int_col1))) AS int_col
FROM t1
LEFT JOIN t2 ON false
-) t where (t.int_col) is not null
+) t where (udf(t.int_col)) is not null
-- !query 6 schema
struct<int_col:int>
-- !query 6 output
```
</p>
</details>
## How was this patch tested?
Tested as guided in [SPARK-27921](https://issues.apache.org/jira/browse/SPARK-27921).
Closes#25103 from huaxingao/spark-28285.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
This PR adds some tests converted from 'udaf.sql' to test UDFs
<details><summary>Diff comparing to 'udaf.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/udaf.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/udf-udaf.sql.out
index f4455bb717..e1747f4667 100644
--- a/sql/core/src/test/resources/sql-tests/results/udaf.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/udf-udaf.sql.out
-3,6 +3,8
-- !query 0
+-- This test file was converted from udaf.sql.
+
CREATE OR REPLACE TEMPORARY VIEW t1 AS SELECT * FROM VALUES
(1), (2), (3), (4)
as t1(int_col1)
-21,15 +23,15 struct<>
-- !query 2
-SELECT default.myDoubleAvg(int_col1) as my_avg from t1
+SELECT default.myDoubleAvg(udf(int_col1)) as my_avg, udf(default.myDoubleAvg(udf(int_col1))) as my_avg2, udf(default.myDoubleAvg(int_col1)) as my_avg3 from t1
-- !query 2 schema
-struct<my_avg:double>
+struct<my_avg:double,my_avg2:double,my_avg3:double>
-- !query 2 output
-102.5
+102.5 102.5 102.5
-- !query 3
-SELECT default.myDoubleAvg(int_col1, 3) as my_avg from t1
+SELECT default.myDoubleAvg(udf(int_col1), udf(3)) as my_avg from t1
-- !query 3 schema
struct<>
-- !query 3 output
-46,12 +48,12 struct<>
-- !query 5
-SELECT default.udaf1(int_col1) as udaf1 from t1
+SELECT default.udaf1(udf(int_col1)) as udaf1, udf(default.udaf1(udf(int_col1))) as udaf2, udf(default.udaf1(int_col1)) as udaf3 from t1
-- !query 5 schema
struct<>
-- !query 5 output
org.apache.spark.sql.AnalysisException
-Can not load class 'test.non.existent.udaf' when registering the function 'default.udaf1', please make sure it is on the classpath; line 1 pos 7
+Can not load class 'test.non.existent.udaf' when registering the function 'default.udaf1', please make sure it is on the classpath; line 1 pos 94
-- !query 6
```
</p>
</details>
## How was this patch tested?
Tested as guided in SPARK-27921.
Closes#25194 from vinodkc/br_Fix_SPARK-27921_3.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
The `DateTimeUtils.timestampAddInterval` method was rewritten by using Java 8 time API. To add months and microseconds, I used the `plusMonths()` and `plus()` methods of `ZonedDateTime`. Also the signature of `timestampAddInterval()` was changed to accept an `ZoneId` instance instead of `TimeZone`. Using `ZoneId` allows to avoid the conversion `TimeZone` -> `ZoneId` on every invoke of `timestampAddInterval()`.
## How was this patch tested?
By existing test suites `DateExpressionsSuite`, `TypeCoercionSuite` and `CollectionExpressionsSuite`.
Closes#25173 from MaxGekk/timestamp-add-interval.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>