Commit graph

7483 commits

Author SHA1 Message Date
yangjie01 1be9e7e40b [SPARK-33801][CORE][SQL] Fix compilation warnings about 'Unicode escapes in triple quoted strings are deprecated'
### What changes were proposed in this pull request?
There are total 15 compilation warnings about `Unicode escapes in triple quoted strings are deprecated` in Spark code now:
```
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2930: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2931: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2932: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2933: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2934: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2935: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2936: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2937: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala:82: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:32: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:79: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:97: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:101: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:76: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:83: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
```

This PR tries to fix these warnings.
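As an illustrative sketch (not the exact lines changed by this patch), the fix amounts to replacing a Unicode escape inside a triple-quoted string with the literal character, or keeping the escape in a regular string or char literal, where it is still supported:

```scala
// Before: the \u escape inside a triple-quoted string is resolved by the
// scanner, which is the behavior being deprecated.
val withEscape = """a\u0009b"""

// After: use the literal character (a tab here), or keep the escape in a
// single-quoted string / char literal where it remains supported.
val withLiteral = "a\tb"
val withCharEscape = "a" + '\u0009' + "b"
```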

### Why are the changes needed?
Cleanup compilation warnings about `Unicode escapes in triple quoted strings are deprecated`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30926 from LuciferYang/SPARK-33801.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 15:29:09 +09:00
yangjie01 e6f019836c [SPARK-33532][SQL] Add comments to an unreachable branch in SpecificParquetRecordReaderBase.initialize method
### What changes were proposed in this pull request?
This PR mainly adds a comment to the `rowGroupOffsets != null` branch in `SpecificParquetRecordReaderBase.initialize(InputSplit, TaskAttemptContext)` to indicate that Spark's Parquet read path will not enter this branch after SPARK-13883 and SPARK-13989. The branch is not deleted because PARQUET-131 wants to move `SpecificParquetRecordReaderBase` into the parquet-mr project.

### Why are the changes needed?
Add a useful comment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30484 from LuciferYang/SPARK-33532.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 14:07:50 +09:00
yangjie01 37ae0a6086 [SPARK-33560][TEST-MAVEN][BUILD] Add "unused-import" check to Maven compilation process
### What changes were proposed in this pull request?

Similar to SPARK-33441, this PR adds an `unused-import` check to the Maven compilation process. After this PR, unused imports will trigger a Maven compilation error.

For the Scala 2.13 profile, this PR also leaves a TODO (SPARK-33499), similar to SPARK-33441, because `scala.language.higherKinds` no longer needs to be imported explicitly since Scala 2.13.1.

### Why are the changes needed?
Let the Maven build also report unused imports as compilation errors.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Local manual test: add an unused import intentionally to trigger a Maven compilation error.
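For example, a minimal sketch of the kind of file used for the manual test (any unused import would do):

```scala
// With unused-import checks treated as errors, the Maven build is expected
// to reject this file because ListBuffer is imported but never used.
import scala.collection.mutable.ListBuffer

object UnusedImportDemo {
  def main(args: Array[String]): Unit = println("ListBuffer is never referenced")
}
```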

Closes #30784 from LuciferYang/SPARK-33560.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-26 17:40:19 -06:00
kozakana 2553d53dc8 [SPARK-33897][SQL] Can't set option 'cross' in join method
### What changes were proposed in this pull request?

[The PySpark documentation](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) says "Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti."
However, I get the following error when I set the cross option.

```
scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b")))
df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C")))
df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = "cross").show()
java.lang.IllegalArgumentException: requirement failed: Unsupported using join type Cross
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:1025)
  ... 53 elided
```

### Why are the changes needed?

The documentation says the cross option can be set, but when I try to set it, I get a java.lang.IllegalArgumentException.

### Does this PR introduce _any_ user-facing change?

With this fix, the behavior will match the documentation.
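A hedged usage sketch (assuming the fix is applied and run in spark-shell, where `spark` is available):

```scala
// With the fix, the documented "cross" join type is accepted by Dataset.join
// together with usingColumns instead of throwing IllegalArgumentException.
val df1 = spark.createDataFrame(Seq((1, "a"), (2, "b")))
val df2 = spark.createDataFrame(Seq((1, "A"), (2, "B"), (3, "C")))
df1.join(df2, Seq("_1"), "cross").show()
```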

### How was this patch tested?

There is already a test for [JoinTypes](1b9fd67904/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/JoinTypesTest.scala), but I can't find a test for the join option itself.

Closes #30803 from kozakana/allow_cross_option.

Authored-by: kozakana <goki727@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-26 16:30:50 +09:00
angerszhu 10b6466e91 [SPARK-33084][CORE][SQL] Add jar support ivy path
### What changes were proposed in this pull request?
Support ADD JAR with an Ivy path.

### Why are the changes needed?
Since application submission already supports Ivy, ADD JAR can now support Ivy as well.

### Does this PR introduce _any_ user-facing change?
Users can add a jar with SQL like:
```
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=true
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=false
```

Core API:
```
sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=true")
sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=false")
```

#### Doc Update snapshot
![image](https://user-images.githubusercontent.com/46485123/101227738-de451200-36d3-11eb-813d-78a8b879da4f.png)

### How was this patch tested?
Added UT

Closes #29966 from AngersZhuuuu/support-add-jar-ivy.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-25 09:07:48 +09:00
Takeshi Yamamuro 65a9ac2ff4 [SPARK-30027][SQL] Support codegen for aggregate filters in HashAggregateExec
### What changes were proposed in this pull request?

This pr intends to support code generation for `HashAggregateExec` with filters.

Quick benchmark results:
```
$ ./bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v

scala> spark.range(100000000).selectExpr("id % 3 as k1", "id % 5 as k2", "rand() as v1", "rand() as v2").write.saveAsTable("t")
scala> sql("SELECT k1, k2, AVG(v1) FILTER (WHERE v2 > 0.5) FROM t GROUP BY k1, k2").write.format("noop").mode("overwrite").save()

>> Before this PR
Elapsed time: 16.170697619s

>> After this PR
Elapsed time: 6.7825313s
```

The query above is compiled into the code below:

```
...
/* 285 */   private void agg_doAggregate_avg_0(boolean agg_exprIsNull_2_0, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_0, double agg_expr_2_0) throws java.io.IOException {
/* 286 */     // evaluate aggregate function for avg
/* 287 */     boolean agg_isNull_10 = true;
/* 288 */     double agg_value_12 = -1.0;
/* 289 */     boolean agg_isNull_11 = agg_unsafeRowAggBuffer_0.isNullAt(0);
/* 290 */     double agg_value_13 = agg_isNull_11 ?
/* 291 */     -1.0 : (agg_unsafeRowAggBuffer_0.getDouble(0));
/* 292 */     if (!agg_isNull_11) {
/* 293 */       agg_agg_isNull_12_0 = true;
/* 294 */       double agg_value_14 = -1.0;
/* 295 */       do {
/* 296 */         if (!agg_exprIsNull_2_0) {
/* 297 */           agg_agg_isNull_12_0 = false;
/* 298 */           agg_value_14 = agg_expr_2_0;
/* 299 */           continue;
/* 300 */         }
/* 301 */
/* 302 */         if (!false) {
/* 303 */           agg_agg_isNull_12_0 = false;
/* 304 */           agg_value_14 = 0.0D;
/* 305 */           continue;
/* 306 */         }
/* 307 */
/* 308 */       } while (false);
/* 309 */
/* 310 */       agg_isNull_10 = false; // resultCode could change nullability.
/* 311 */
/* 312 */       agg_value_12 = agg_value_13 + agg_value_14;
/* 313 */
/* 314 */     }
/* 315 */     boolean agg_isNull_15 = false;
/* 316 */     long agg_value_17 = -1L;
/* 317 */     if (!false && agg_exprIsNull_2_0) {
/* 318 */       boolean agg_isNull_18 = agg_unsafeRowAggBuffer_0.isNullAt(1);
/* 319 */       long agg_value_20 = agg_isNull_18 ?
/* 320 */       -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
/* 321 */       agg_isNull_15 = agg_isNull_18;
/* 322 */       agg_value_17 = agg_value_20;
/* 323 */     } else {
/* 324 */       boolean agg_isNull_19 = true;
/* 325 */       long agg_value_21 = -1L;
/* 326 */       boolean agg_isNull_20 = agg_unsafeRowAggBuffer_0.isNullAt(1);
/* 327 */       long agg_value_22 = agg_isNull_20 ?
/* 328 */       -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
/* 329 */       if (!agg_isNull_20) {
/* 330 */         agg_isNull_19 = false; // resultCode could change nullability.
/* 331 */
/* 332 */         agg_value_21 = agg_value_22 + 1L;
/* 333 */
/* 334 */       }
/* 335 */       agg_isNull_15 = agg_isNull_19;
/* 336 */       agg_value_17 = agg_value_21;
/* 337 */     }
/* 338 */     // update unsafe row buffer
/* 339 */     if (!agg_isNull_10) {
/* 340 */       agg_unsafeRowAggBuffer_0.setDouble(0, agg_value_12);
/* 341 */     } else {
/* 342 */       agg_unsafeRowAggBuffer_0.setNullAt(0);
/* 343 */     }
/* 344 */
/* 345 */     if (!agg_isNull_15) {
/* 346 */       agg_unsafeRowAggBuffer_0.setLong(1, agg_value_17);
/* 347 */     } else {
/* 348 */       agg_unsafeRowAggBuffer_0.setNullAt(1);
/* 349 */     }
/* 350 */   }
...
```

### Why are the changes needed?

For high performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27019 from maropu/AggregateFilterCodegen.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-24 14:44:16 -08:00
Kent Yao 29cca68e9e [SPARK-33892][SQL] Display char/varchar in DESC and SHOW CREATE TABLE
### What changes were proposed in this pull request?

Display char/varchar in
  - DESC table
  - DESC column
  - SHOW CREATE TABLE

### Why are the changes needed?

Show the correct definition to users.

### Does this PR introduce _any_ user-facing change?

Yes, char/varchar columns will print as char/varchar instead of string.

### How was this patch tested?

new tests

Closes #30908 from yaooqinn/SPARK-33892.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:56:02 +00:00
Max Gekk 54a67842e6 [SPARK-33881][SQL][TESTS] Check null and empty string as partition values in DS v1 and v2 tests
### What changes were proposed in this pull request?
Add tests to check handling `null` and `''` (empty string) as partition values in commands `SHOW PARTITIONS`, `ALTER TABLE .. ADD PARTITION`, `ALTER TABLE .. DROP PARTITION`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
```

Closes #30893 from MaxGekk/partition-value-empty-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:54:53 +00:00
gengjiaan 3e9821edfd [SPARK-33443][SQL] LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
### What changes were proposed in this pull request?
Mainstream databases support `[ IGNORE NULLS | RESPECT NULLS ]` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE`,
but the current implementation of `LEAD`/`LAG` does not support this syntax.

**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/LEAD.html#GUID-0A0481F1-E98F-4535-A739-FCCA8D1B5B77

**Presto**
https://prestodb.io/docs/current/functions/window.html

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LEAD.html

**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm

**Teradata**
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/lead.html
https://docs.snowflake.com/en/sql-reference/functions/lag.html
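A hedged sketch of the syntax this PR enables (run from spark-shell; the table `events` and its columns are hypothetical):

```scala
// IGNORE NULLS skips null values when looking ahead/behind;
// RESPECT NULLS (the default) keeps them.
spark.sql("""
  SELECT id,
         LEAD(v) IGNORE NULLS  OVER (ORDER BY id) AS next_non_null_v,
         LAG(v)  RESPECT NULLS OVER (ORDER BY id) AS prev_v
  FROM events
""").show()
```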

### Why are the changes needed?
Supporting `[ IGNORE NULLS | RESPECT NULLS ]` for `LEAD`/`LAG` is very useful.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Jenkins test.

Closes #30387 from beliefer/SPARK-33443.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:13:48 +00:00
Yuming Wang 32d4a2b062 [SPARK-33861][SQL] Simplify conditional in predicate
### What changes were proposed in this pull request?

This PR simplifies conditionals in predicates; after this change we can push the filter down to the data source:

Expression | After simplify
-- | --
IF(cond, trueVal, false)                   | AND(cond, trueVal)
IF(cond, trueVal, true)                    | OR(NOT(cond), trueVal)
IF(cond, false, falseVal)                  | AND(NOT(cond), falseVal)
IF(cond, true, falseVal)                   | OR(cond, falseVal)
CASE WHEN cond THEN trueVal ELSE false END | AND(cond, trueVal)
CASE WHEN cond THEN trueVal END            | AND(cond, trueVal)
CASE WHEN cond THEN trueVal ELSE null END  | AND(cond, trueVal)
CASE WHEN cond THEN trueVal ELSE true END  | OR(NOT(cond), trueVal)
CASE WHEN cond THEN false ELSE elseVal END | AND(NOT(cond), elseVal)
CASE WHEN cond THEN false END              | false
CASE WHEN cond THEN true ELSE elseVal END  | OR(cond, elseVal)
CASE WHEN cond THEN true END               | cond
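To make the first row concrete, a hedged sketch (assuming a spark-shell session and a hypothetical table `t1(id BIGINT)`):

```scala
// IF(cond, trueVal, false) in a predicate is equivalent to cond AND trueVal,
// and the rewritten conjunction can be pushed down to the data source.
spark.sql("SELECT id FROM t1 WHERE IF(id > 2, id < 5, false)").explain()
// After this change, the plan should be equivalent to:
spark.sql("SELECT id FROM t1 WHERE id > 2 AND id < 5").explain()
```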

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30865 from wangyum/SPARK-33861.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:10:28 +00:00
Yuanjian Li 86c1cfc579 [SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API
### What changes were proposed in this pull request?
Follow-up work for #30521: document the following behaviors in the API doc:

- Document the effects when configurations (provider/partitionBy) conflict with the existing table.
- Document the lack of functionality for creating a v2 table, and guide users to ensure a table is created beforehand to avoid an unintended/insufficient table being created.

### Why are the changes needed?
We don't have full support for the V2 table created through this API yet. (TODO: SPARK-33638)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Document only.

Closes #30885 from xuanyuanking/SPARK-33659.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-24 12:44:37 +09:00
Takuya UESHIN 5c9b421c37 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends
### What changes were proposed in this pull request?

This is a retry of #30177.

This is not a complete fix; a complete fix would take a long time (#30242).
As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases.

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
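A simplified sketch of what such a context-aware wrapper can look like (not necessarily the exact class added by this PR):

```scala
import org.apache.spark.TaskContext

// Stop pulling from the parent iterator once the task has completed or been
// interrupted, so a lagging consumer thread cannot touch freed resources.
class ContextAwareIteratorSketch[T](context: TaskContext, delegate: Iterator[T])
    extends Iterator[T] {
  override def hasNext: Boolean =
    !context.isCompleted() && !context.isInterrupted() && delegate.hasNext
  override def next(): T = delegate.next()
}
```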

### Why are the changes needed?

Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

E.g.,:

```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because the Python evaluation consumes the parent iterator in a separate thread and keeps consuming data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause a segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and manually.

Closes #30899 from ueshin/issues/SPARK-33277/context_aware_iterator.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-23 14:48:01 -08:00
Yuming Wang 7ffcfcf7db [SPARK-33847][SQL] Simplify CaseWhen if elseValue is None
### What changes were proposed in this pull request?

1. Enhance `ReplaceNullWithFalseInPredicate` to replace a `None` elseValue inside `CaseWhen` with `FalseLiteral` if all branches are `FalseLiteral`. The use case is:
```sql
create table t1 using parquet as select id from range(10);
explain select id from t1 where (CASE WHEN id = 1 THEN 'a' WHEN id = 3 THEN 'b' end) = 'c';
```

Before this pr:
```
== Physical Plan ==
*(1) Filter CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

```

After this pr:
```
== Physical Plan ==
LocalTableScan <empty>, [id#1L]
```

2. Enhance `SimplifyConditionals` to simplify `CaseWhen` if elseValue is `None` and all branch outputs are null.

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30852 from wangyum/SPARK-33847.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 14:35:46 +00:00
Max Gekk 303df64b46 [SPARK-33889][SQL] Fix NPE from SHOW PARTITIONS on V2 tables
### What changes were proposed in this pull request?
In `ShowPartitionsExec.run()`, check whether a row returned by `listPartitionIdentifiers()` contains a `null` field, and if so convert it to `"null"`.
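A minimal sketch of the idea (the helper name and signature are illustrative, not the patched code):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.DataType

// Render a partition value as text, mapping a null field to the literal
// string "null" instead of dereferencing it and hitting an NPE.
def renderPartitionValue(row: InternalRow, dataType: DataType, ordinal: Int): String =
  if (row.isNullAt(ordinal)) "null" else row.get(ordinal, dataType).toString
```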

### Why are the changes needed?
Because `SHOW PARTITIONS` throws NPE on V2 table with `null` partition values.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added new UT to `v2.ShowPartitionsSuite`.

Closes #30904 from MaxGekk/fix-npe-show-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 14:34:01 +00:00
Max Gekk cc23581e26 [SPARK-33858][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests
### What changes were proposed in this pull request?
1. Move the `ALTER TABLE .. RENAME PARTITION` parsing tests to `AlterTableRenamePartitionParserSuite`
2. Place the v1 tests for `ALTER TABLE .. RENAME PARTITION` from `DDLSuite` to `v1.AlterTableRenamePartitionSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` to `v2.AlterTableRenamePartitionSuite`, so, the tests will run for V1, Hive V1 and V2 DS.

### Why are the changes needed?
- The unification will allow running common `ALTER TABLE .. RENAME PARTITION` tests for DSv1, Hive DSv1, and DSv2.
- We can detect missing features and differences between the DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #30863 from MaxGekk/unify-rename-partition-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 12:19:07 +00:00
ulysses-you f421c172d9 [SPARK-33497][SQL] Override maxRows in some LogicalPlan
### What changes were proposed in this pull request?

This PR aims to override the `maxRows` method in the following `LogicalPlan`s:
* `ReturnAnswer`
* `Join`
* `Range`
* `Sample`
* `RepartitionOperation`
* `Deduplicate`
* `LocalRelation`
* `Window`

### Why are the changes needed?

1. Logically, we know the max-rows info for these `LogicalPlan`s.
2. Before this PR, some `LogicalPlan`s already provided max rows, so we can eliminate limits in more cases if we expand the coverage.
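An illustrative sketch (not the actual Spark rule) of why knowing `maxRows` helps eliminate limits:

```scala
// If a child plan can report an upper bound on its row count, a LIMIT that is
// at least that bound is redundant and can be removed by the optimizer.
def limitIsRedundant(limit: Long, childMaxRows: Option[Long]): Boolean =
  childMaxRows.exists(_ <= limit)

limitIsRedundant(100L, Some(10L)) // true: LIMIT 100 over a <=10-row child is a no-op
limitIsRedundant(5L, None)        // false: row count unknown, keep the limit
```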

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30443 from ulysses-you/SPARK-33497.

Lead-authored-by: ulysses-you <youxiduo@weidian.com>
Co-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 09:20:49 +00:00
Max Gekk 34bfb3a31d [SPARK-33787][SQL] Allow partition purge for v2 tables
### What changes were proposed in this pull request?
1. Add new methods `purgePartition()`/`purgePartitions()` to the interfaces `SupportsPartitionManagement`/`SupportsAtomicPartitionManagement`.
2. The default implementations of the new methods throw `UnsupportedOperationException`.
3. Add tests for new methods to `SupportsPartitionManagementSuite`/`SupportsAtomicPartitionManagementSuite`.
4. Add `ALTER TABLE .. DROP PARTITION` tests for DS v1 and v2.

Closes #30776
Closes #30821

### Why are the changes needed?
Currently, the `PURGE` option that a user can set in `ALTER TABLE .. DROP PARTITION` is completely ignored. We should pass this flag to the catalog implementation so the catalog can decide how to handle it.
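A hedged sketch of the default implementation mentioned in point 2 of the list above (signatures simplified; the real interfaces use catalog identifier types):

```scala
import org.apache.spark.sql.catalyst.InternalRow

// A catalog that does not support purging keeps the default, which rejects the
// operation instead of silently ignoring the PURGE flag.
trait PartitionManagementSketch {
  def dropPartition(ident: InternalRow): Boolean
  def purgePartition(ident: InternalRow): Boolean =
    throw new UnsupportedOperationException("Partition purge is not supported")
}
```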

### Does this PR introduce _any_ user-facing change?
The changes can impact the behavior of `ALTER TABLE .. DROP PARTITION` for v2 tables.

### How was this patch tested?
By running the affected test suites, for instance:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30886 from MaxGekk/purge-partition.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 09:09:48 +00:00
Kent Yao 2287f56a3e [SPARK-33879][SQL] Char Varchar values fails w/ match error as partition columns
### What changes were proposed in this pull request?

```sql
spark-sql> select * from t10 where c0='abcd';
20/12/22 15:43:38 ERROR SparkSQLDriver: Failed in [select * from t10 where c0='abcd']
scala.MatchError: CharType(10) (of class org.apache.spark.sql.types.CharType)
	at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:815)
	at org.apache.spark.sql.catalyst.expressions.CastBase.cast$lzycompute(Cast.scala:842)
	at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:842)
	at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:844)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
	at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.$anonfun$toRow$2(interface.scala:164)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at org.apache.spark.sql.types.StructType.map(StructType.scala:102)
	at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.toRow(interface.scala:158)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3(ExternalCatalogUtils.scala:157)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3$adapted(ExternalCatalogUtils.scala:156)
```
c0 is a partition column; the query fails in the partition pruning rule.

In this PR, we replace char/varchar with string type before the CAST happens.
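A self-contained sketch of the replacement step (illustrative only; the actual patch likely uses a shared utility):

```scala
import org.apache.spark.sql.types._

// CastBase has no branch for CharType/VarcharType, so map them to StringType
// before a partition value is cast during partition pruning.
def replaceCharVarcharWithString(dt: DataType): DataType = dt match {
  case CharType(_) | VarcharType(_) => StringType
  case other => other
}
```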

### Why are the changes needed?

Bugfix; see the case above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Yes, new tests

Closes #30887 from yaooqinn/SPARK-33879.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 16:14:27 +09:00
Takeshi Yamamuro ea37717f7c [SPARK-32106][SQL][FOLLOWUP] Fix flaky tests in transform.sql
### What changes were proposed in this pull request?

This PR intends to fix flaky GitHub Actions (GA) tests below in `transform.sql` (this flakiness does not seem to happen in the Jenkins tests):
- https://github.com/apache/spark/runs/1592987501
- https://github.com/apache/spark/runs/1593196242
- https://github.com/apache/spark/runs/1595496305
- https://github.com/apache/spark/runs/1596309555

This is because the error message differs between test runs in GA (the error message seems to be truncated non-deterministically), e.g.:
```
# https://github.com/apache/spark/runs/1592987501
Expected "...h status 127. Error:[ /bin/bash: some_non_existent_command: command not found]", but got "...h status 127. Error:[]" Result did not match for query #2

# https://github.com/apache/spark/runs/1593196242
Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2
```
Though the root cause of this non-deterministic behaviour happening only in GA is not clear, the test throws SparkException consistently even in GA. So, this PR proposes to make the test just check that the exception is thrown when running it.
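A sketch of the relaxed check (assuming a ScalaTest suite with the usual `sql` helper):

```scala
import org.apache.spark.SparkException

// Only assert that the failure surfaces as a SparkException, instead of
// comparing against the (sometimes truncated) error message text.
intercept[SparkException] {
  sql("SELECT TRANSFORM(a) USING 'some_non_existent_command' FROM t").collect()
}
```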

This PR comes from the dongjoon-hyun comment: https://github.com/apache/spark/pull/29414/files#r547414513

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #30896 from maropu/SPARK-32106-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 13:50:05 +09:00
Wenchen Fan ec1560af25 [SPARK-33364][SQL][FOLLOWUP] Refine the catalog v2 API to purge a table
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30267

Inspired by https://github.com/apache/spark/pull/30886, it's better to have 2 methods `def dropTable` and `def purgeTable`, than `def dropTable(ident)` and `def dropTable(ident, purge)`.

### Why are the changes needed?

1. Make the APIs orthogonal. Previously, `def dropTable(ident, purge)` called `def dropTable(ident)` and was a superset.
2. Simplify the catalog implementation a little bit. Now the `if (purge) ... else ...` check is done on the Spark side.
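A hedged sketch of the orthogonal shape and the Spark-side dispatch (identifier type simplified to String):

```scala
// Two independent methods instead of a boolean flag on one method.
trait TableCatalogSketch {
  def dropTable(ident: String): Boolean
  def purgeTable(ident: String): Boolean =
    throw new UnsupportedOperationException("Purge table is not supported")
}

// The purge decision is made once on the Spark side.
def dropTableExec(catalog: TableCatalogSketch, ident: String, purge: Boolean): Boolean =
  if (purge) catalog.purgeTable(ident) else catalog.dropTable(ident)
```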

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

existing tests

Closes #30890 from cloud-fan/purgeTable.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 11:47:13 +09:00
Erik Krogen 303b8c8773 [SPARK-23862][SQL] Support Java enums from Scala Dataset API
### What changes were proposed in this pull request?
Add support for Java Enums (`java.lang.Enum`) from the Scala typed Dataset APIs. This involves adding an implicit for `Encoder` creation in `SQLImplicits`, and updating `ScalaReflection` to handle Java Enums on the serialization and deserialization pathways.

Enums are mapped to a `StringType` which is just the name of the Enum value.

### Why are the changes needed?
In [SPARK-21255](https://issues.apache.org/jira/browse/SPARK-21255), support for (de)serialization of Java Enums was added, but only when called from Java code. It is common for Scala code to rely on Java libraries that are outside the Scala developer's control. Today, if there is a dependency on some Java code which defines an Enum, it is necessary to define a corresponding Scala class. This change brings closer feature parity between the Scala and Java APIs.

### Does this PR introduce _any_ user-facing change?
Yes, previously something like:
```
val ds = Seq(MyJavaEnum.VALUE1, MyJavaEnum.VALUE2).toDS
// or
val ds = Seq(CaseClass(MyJavaEnum.VALUE1), CaseClass(MyJavaEnum.VALUE2)).toDS
```
would fail. Now, it will succeed.

### How was this patch tested?
Additional unit tests are added in `DatasetSuite`. Tests include validating top-level enums, enums inside of case classes, enums inside of arrays, and validating that the Enum is stored as the expected string.

Closes #30877 from xkrogen/xkrogen-SPARK-23862-scalareflection-java-enums.

Lead-authored-by: Erik Krogen <xkrogen@apache.org>
Co-authored-by: Fangshi Li <fli@linkedin.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-22 09:55:33 -08:00
Kent Yao 6da5cdf1db [SPARK-33876][SQL] Add length-check for reading char/varchar from tables w/ a external location
### What changes were proposed in this pull request?
This PR adds a length check to the existing ApplyCharPadding rule. Tables will have external locations when users execute
SET LOCATION or CREATE TABLE ... LOCATION. If the location contains over-length values, we should fail on read.

### Why are the changes needed?

```sql
spark-sql> INSERT INTO t2 VALUES ('1', 'b12345');
Time taken: 0.141 seconds
spark-sql> alter table t set location '/tmp/hive_one/t2';
Time taken: 0.095 seconds
spark-sql> select * from t;
1 b1234
```
The above case should fail rather than implicitly apply truncation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new tests

Closes #30882 from yaooqinn/SPARK-33876.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 14:24:12 +00:00
Max Gekk 84bf07bbd7 [SPARK-33878][SQL][TESTS] Fix resolving of spark_catalog in v1 Hive catalog tests
### What changes were proposed in this pull request?
1. Recognize `spark_catalog` as the default session catalog in the checks of `TestHiveQueryExecution`.
2. Move v2 and v1 in-memory catalog test `"SPARK-33305: DROP TABLE should also invalidate cache"` to the common trait `command/DropTableSuiteBase`, and run it with v1 Hive external catalog.

### Why are the changes needed?
To run in-memory catalog tests in the Hive catalog.

### Does this PR introduce _any_ user-facing change?
No, the changes affect only tests.

### How was this patch tested?
By running the affected test suites for `DROP TABLE`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableSuite"
```

Closes #30883 from MaxGekk/fix-spark_catalog-hive-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 12:37:16 +00:00
Anton Okolnychyi 7bbcbb84c2 [SPARK-33784][SQL] Rename dataSourceRewriteRules batch
### What changes were proposed in this pull request?

This PR tries to rename `dataSourceRewriteRules` into something more generic.

### Why are the changes needed?

These changes are needed to address the post-review discussion [here](https://github.com/apache/spark/pull/30558#discussion_r533885837).

### Does this PR introduce _any_ user-facing change?

Yes but the changes haven't been released yet.

### How was this patch tested?

Existing tests.

Closes #30808 from aokolnychyi/spark-33784.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 08:29:22 +00:00
Anton Okolnychyi 2562183987 [SPARK-33808][SQL] DataSource V2: Build logical writes in the optimizer
### What changes were proposed in this pull request?

This PR adds logic to build logical writes introduced in SPARK-33779.

Note: This PR contains a subset of changes discussed in PR #29066.

### Why are the changes needed?

These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #30806 from aokolnychyi/spark-33808.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 08:23:56 +00:00
yangjie01 b88745565b [SPARK-33700][SQL] Avoid file meta reading when enableFilterPushDown is true and filters is empty for Orc
### What changes were proposed in this pull request?
ORC supports the filter pushdown optimization, but this optimization reads file metadata from external storage even if the filter list is empty.

This PR adds an extra `filters.nonEmpty` check when `spark.sql.orc.filterPushdown` is true.
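A minimal sketch of the guard (names are assumptions, not the patched method):

```scala
import org.apache.spark.sql.sources.Filter

// Only read the ORC file footer and build a search argument when pushdown is
// enabled AND there is at least one filter worth pushing.
def shouldReadFileMetaForPushdown(filterPushDownEnabled: Boolean, filters: Seq[Filter]): Boolean =
  filterPushDownEnabled && filters.nonEmpty
```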

### Why are the changes needed?
The ORC filter pushdown operation should only be triggered when `filters.nonEmpty` is true.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30663 from LuciferYang/pushdownfilter-when-filter-nonempty.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 20:24:23 -08:00
Kent Yao f5fd10b1bc [SPARK-33834][SQL] Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
### What changes were proposed in this pull request?

Verify ALTER TABLE CHANGE COLUMN with char and varchar and avoid unexpected changes.
For v1 tables, changing the type is not allowed; we fix a regression that used the replaced string type instead of the original char/varchar type when altering char/varchar columns.

For v2 tables, the valid changes are: char/varchar to string, char(x) to char(x), and char(x)/varchar(x) to varchar(y) if x <= y.
All other changes are invalid.
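A self-contained sketch of the v2 rule above (illustrative only, not the actual analyzer check):

```scala
import org.apache.spark.sql.types._

// Valid: char/varchar -> string, char(x) -> char(x),
// and char(x)/varchar(x) -> varchar(y) with x <= y.
def canAlterCharVarchar(from: DataType, to: DataType): Boolean = (from, to) match {
  case (CharType(_) | VarcharType(_), StringType) => true
  case (CharType(x), CharType(y))                 => x == y
  case (CharType(x), VarcharType(y))              => x <= y
  case (VarcharType(x), VarcharType(y))           => x <= y
  case _                                          => false
}
```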

### Why are the changes needed?

Verify ALTER TABLE CHANGE COLUMN with char and varchar and avoid unexpected changes.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

New tests

Closes #30833 from yaooqinn/SPARK-33834.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 03:07:26 +00:00
angerszhu 7466031632 [SPARK-32106][SQL] Implement script transform in sql/core
### What changes were proposed in this pull request?

 * Implement `SparkScriptTransformationExec` based on `BaseScriptTransformationExec`
 * Implement `SparkScriptTransformationWriterThread` based on `BaseScriptTransformationWriterThread` for writing data
 * Add rule `SparkScripts` to support converting a script-transformation LogicalPlan to a SparkPlan in Spark SQL (without Hive mode)
 * Add `SparkScriptTransformationSuite` to test Spark-specific cases
 * Add tests in `SQLQueryTestSuite`

And we will close #29085.

### Why are the changes needed?
Support using script transform without Hive.

### Does this PR introduce _any_ user-facing change?
Users can use script transformation without Hive in no-serde mode.
Such as:
**Default no serde**
```
SELECT TRANSFORM(a, b, c)
USING 'cat' AS (a int, b string, c long)
FROM testData
```
**No serde with specified ROW FORMAT DELIMITED**
```
SELECT TRANSFORM(a, b, c)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'null'
USING 'cat' AS (a, b, c)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\u0004'
MAP KEYS TERMINATED BY '\u0005'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'NULL'
FROM testData
```

### How was this patch tested?
Added UT

Closes #29414 from AngersZhuuuu/SPARK-32106-MINOR.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-22 11:37:59 +09:00
Yuming Wang 1c77605682 [SPARK-33848][SQL] Push the UnaryExpression into (if / case) branches
### What changes were proposed in this pull request?

This PR pushes the `UnaryExpression` into (if / case) branches. The use case is:
```sql
create table t1 using parquet as select id from range(10);
explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN '2' end) > 3;
```

Before this pr:
```
== Physical Plan ==
*(1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

```

After this pr:
```
== Physical Plan ==
LocalTableScan <empty>, [id#1L]
```

This change can also improve this case:
a78d6ce376/sql/core/src/test/resources/tpcds/q62.sql (L5-L22)

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30853 from wangyum/SPARK-33848.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 10:25:23 -08:00
Takeshi Yamamuro 69aa727ff4 [SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions
### What changes were proposed in this pull request?

This PR proposes to fill missing group tags and re-categorize all the group tags for built-in functions.
New groups below are added in this PR:
 - binary_funcs
 - bitwise_funcs
 - collection_funcs
 - predicate_funcs
 - conditional_funcs
 - conversion_funcs
 - csv_funcs
 - generator_funcs
 - hash_funcs
 - lambda_funcs
 - math_funcs
 - misc_funcs
 - string_funcs
 - struct_funcs
 - xml_funcs

A basic policy to re-categorize functions is that functions in the same file are categorized into the same group. For example, all the functions in `hash.scala` are categorized into `hash_funcs`. But, there are some exceptional/ambiguous cases when categorizing them. Here are some special notes:
 - All the aggregate functions are categorized into `agg_funcs`.
 - `array_funcs` and `map_funcs` are sub-groups of `collection_funcs`. For example, `array_contains` is used only for arrays, so it is assigned to `array_funcs`. On the other hand, `reverse` is used for both arrays and strings, so it is assigned to `collection_funcs`.
 - Some functions logically belong to multiple groups. In this case, these functions are categorized based on the file that they belong to. For example, `schema_of_csv` can be grouped into both `csv_funcs` and `struct_funcs` in terms of input types, but it is assigned to `csv_funcs` because it belongs to the `csvExpressions.scala` file that holds the other CSV-related functions.
 - Functions in `nullExpressions.scala`, `complexTypeCreator.scala`, `randomExpressions.scala`, and `regexExpressions.scala` are categorized based on their functionalities. For example:
   - `isnull` in `nullExpressions`  is assigned to `predicate_funcs` because this is a predicate function.
   - `array` in `complexTypeCreator.scala` is assigned to `array_funcs` based on its output type (The other functions in `array_funcs` are categorized based on their input types though).
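For reference, a hypothetical illustration of how a group tag is attached to an expression (the annotation field values are assumptions based on this description, and the class is a placeholder rather than one of the real expressions):

```scala
import org.apache.spark.sql.catalyst.expressions.ExpressionDescription

// A placeholder class carrying a group tag; real expressions such as those
// in hash.scala would be tagged with group = "hash_funcs".
@ExpressionDescription(
  usage = "_FUNC_(expr) - Illustrative usage string only.",
  group = "hash_funcs",
  since = "3.1.0")
class GroupTagPlaceholder
```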

A category list (after this PR) is as follows (the list below includes the exprs that already have a group tag in the current master):
|group|name|class|
|-----|----|-----|
|agg_funcs|any|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|approx_count_distinct|org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus|
|agg_funcs|approx_percentile|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile|
|agg_funcs|avg|org.apache.spark.sql.catalyst.expressions.aggregate.Average|
|agg_funcs|bit_and|org.apache.spark.sql.catalyst.expressions.aggregate.BitAndAgg|
|agg_funcs|bit_or|org.apache.spark.sql.catalyst.expressions.aggregate.BitOrAgg|
|agg_funcs|bit_xor|org.apache.spark.sql.catalyst.expressions.aggregate.BitXorAgg|
|agg_funcs|bool_and|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd|
|agg_funcs|bool_or|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|collect_list|org.apache.spark.sql.catalyst.expressions.aggregate.CollectList|
|agg_funcs|collect_set|org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet|
|agg_funcs|corr|org.apache.spark.sql.catalyst.expressions.aggregate.Corr|
|agg_funcs|count_if|org.apache.spark.sql.catalyst.expressions.aggregate.CountIf|
|agg_funcs|count_min_sketch|org.apache.spark.sql.catalyst.expressions.aggregate.CountMinSketchAgg|
|agg_funcs|count|org.apache.spark.sql.catalyst.expressions.aggregate.Count|
|agg_funcs|covar_pop|org.apache.spark.sql.catalyst.expressions.aggregate.CovPopulation|
|agg_funcs|covar_samp|org.apache.spark.sql.catalyst.expressions.aggregate.CovSample|
|agg_funcs|cube|org.apache.spark.sql.catalyst.expressions.Cube|
|agg_funcs|every|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd|
|agg_funcs|first_value|org.apache.spark.sql.catalyst.expressions.aggregate.First|
|agg_funcs|first|org.apache.spark.sql.catalyst.expressions.aggregate.First|
|agg_funcs|grouping_id|org.apache.spark.sql.catalyst.expressions.GroupingID|
|agg_funcs|grouping|org.apache.spark.sql.catalyst.expressions.Grouping|
|agg_funcs|kurtosis|org.apache.spark.sql.catalyst.expressions.aggregate.Kurtosis|
|agg_funcs|last_value|org.apache.spark.sql.catalyst.expressions.aggregate.Last|
|agg_funcs|last|org.apache.spark.sql.catalyst.expressions.aggregate.Last|
|agg_funcs|max_by|org.apache.spark.sql.catalyst.expressions.aggregate.MaxBy|
|agg_funcs|max|org.apache.spark.sql.catalyst.expressions.aggregate.Max|
|agg_funcs|mean|org.apache.spark.sql.catalyst.expressions.aggregate.Average|
|agg_funcs|min_by|org.apache.spark.sql.catalyst.expressions.aggregate.MinBy|
|agg_funcs|min|org.apache.spark.sql.catalyst.expressions.aggregate.Min|
|agg_funcs|percentile_approx|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile|
|agg_funcs|percentile|org.apache.spark.sql.catalyst.expressions.aggregate.Percentile|
|agg_funcs|rollup|org.apache.spark.sql.catalyst.expressions.Rollup|
|agg_funcs|skewness|org.apache.spark.sql.catalyst.expressions.aggregate.Skewness|
|agg_funcs|some|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|stddev_pop|org.apache.spark.sql.catalyst.expressions.aggregate.StddevPop|
|agg_funcs|stddev_samp|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|stddev|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|std|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|sum|org.apache.spark.sql.catalyst.expressions.aggregate.Sum|
|agg_funcs|var_pop|org.apache.spark.sql.catalyst.expressions.aggregate.VariancePop|
|agg_funcs|var_samp|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp|
|agg_funcs|variance|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp|
|array_funcs|array_contains|org.apache.spark.sql.catalyst.expressions.ArrayContains|
|array_funcs|array_distinct|org.apache.spark.sql.catalyst.expressions.ArrayDistinct|
|array_funcs|array_except|org.apache.spark.sql.catalyst.expressions.ArrayExcept|
|array_funcs|array_intersect|org.apache.spark.sql.catalyst.expressions.ArrayIntersect|
|array_funcs|array_join|org.apache.spark.sql.catalyst.expressions.ArrayJoin|
|array_funcs|array_max|org.apache.spark.sql.catalyst.expressions.ArrayMax|
|array_funcs|array_min|org.apache.spark.sql.catalyst.expressions.ArrayMin|
|array_funcs|array_position|org.apache.spark.sql.catalyst.expressions.ArrayPosition|
|array_funcs|array_remove|org.apache.spark.sql.catalyst.expressions.ArrayRemove|
|array_funcs|array_repeat|org.apache.spark.sql.catalyst.expressions.ArrayRepeat|
|array_funcs|array_union|org.apache.spark.sql.catalyst.expressions.ArrayUnion|
|array_funcs|arrays_overlap|org.apache.spark.sql.catalyst.expressions.ArraysOverlap|
|array_funcs|arrays_zip|org.apache.spark.sql.catalyst.expressions.ArraysZip|
|array_funcs|array|org.apache.spark.sql.catalyst.expressions.CreateArray|
|array_funcs|flatten|org.apache.spark.sql.catalyst.expressions.Flatten|
|array_funcs|sequence|org.apache.spark.sql.catalyst.expressions.Sequence|
|array_funcs|shuffle|org.apache.spark.sql.catalyst.expressions.Shuffle|
|array_funcs|slice|org.apache.spark.sql.catalyst.expressions.Slice|
|array_funcs|sort_array|org.apache.spark.sql.catalyst.expressions.SortArray|
|bitwise_funcs|&|org.apache.spark.sql.catalyst.expressions.BitwiseAnd|
|bitwise_funcs|^|org.apache.spark.sql.catalyst.expressions.BitwiseXor|
|bitwise_funcs|bit_count|org.apache.spark.sql.catalyst.expressions.BitwiseCount|
|bitwise_funcs|shiftrightunsigned|org.apache.spark.sql.catalyst.expressions.ShiftRightUnsigned|
|bitwise_funcs|shiftright|org.apache.spark.sql.catalyst.expressions.ShiftRight|
|bitwise_funcs|~|org.apache.spark.sql.catalyst.expressions.BitwiseNot|
|collection_funcs|cardinality|org.apache.spark.sql.catalyst.expressions.Size|
|collection_funcs|concat|org.apache.spark.sql.catalyst.expressions.Concat|
|collection_funcs|reverse|org.apache.spark.sql.catalyst.expressions.Reverse|
|collection_funcs|size|org.apache.spark.sql.catalyst.expressions.Size|
|conditional_funcs|coalesce|org.apache.spark.sql.catalyst.expressions.Coalesce|
|conditional_funcs|ifnull|org.apache.spark.sql.catalyst.expressions.IfNull|
|conditional_funcs|if|org.apache.spark.sql.catalyst.expressions.If|
|conditional_funcs|nanvl|org.apache.spark.sql.catalyst.expressions.NaNvl|
|conditional_funcs|nullif|org.apache.spark.sql.catalyst.expressions.NullIf|
|conditional_funcs|nvl2|org.apache.spark.sql.catalyst.expressions.Nvl2|
|conditional_funcs|nvl|org.apache.spark.sql.catalyst.expressions.Nvl|
|conditional_funcs|when|org.apache.spark.sql.catalyst.expressions.CaseWhen|
|conversion_funcs|bigint|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|binary|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|boolean|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|cast|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|date|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|decimal|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|double|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|float|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|int|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|smallint|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|string|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|timestamp|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|tinyint|org.apache.spark.sql.catalyst.expressions.Cast|
|csv_funcs|from_csv|org.apache.spark.sql.catalyst.expressions.CsvToStructs|
|csv_funcs|schema_of_csv|org.apache.spark.sql.catalyst.expressions.SchemaOfCsv|
|csv_funcs|to_csv|org.apache.spark.sql.catalyst.expressions.StructsToCsv|
|datetime_funcs|add_months|org.apache.spark.sql.catalyst.expressions.AddMonths|
|datetime_funcs|current_date|org.apache.spark.sql.catalyst.expressions.CurrentDate|
|datetime_funcs|current_timestamp|org.apache.spark.sql.catalyst.expressions.CurrentTimestamp|
|datetime_funcs|current_timezone|org.apache.spark.sql.catalyst.expressions.CurrentTimeZone|
|datetime_funcs|date_add|org.apache.spark.sql.catalyst.expressions.DateAdd|
|datetime_funcs|date_format|org.apache.spark.sql.catalyst.expressions.DateFormatClass|
|datetime_funcs|date_from_unix_date|org.apache.spark.sql.catalyst.expressions.DateFromUnixDate|
|datetime_funcs|date_part|org.apache.spark.sql.catalyst.expressions.DatePart|
|datetime_funcs|date_sub|org.apache.spark.sql.catalyst.expressions.DateSub|
|datetime_funcs|date_trunc|org.apache.spark.sql.catalyst.expressions.TruncTimestamp|
|datetime_funcs|datediff|org.apache.spark.sql.catalyst.expressions.DateDiff|
|datetime_funcs|dayofmonth|org.apache.spark.sql.catalyst.expressions.DayOfMonth|
|datetime_funcs|dayofweek|org.apache.spark.sql.catalyst.expressions.DayOfWeek|
|datetime_funcs|dayofyear|org.apache.spark.sql.catalyst.expressions.DayOfYear|
|datetime_funcs|day|org.apache.spark.sql.catalyst.expressions.DayOfMonth|
|datetime_funcs|extract|org.apache.spark.sql.catalyst.expressions.Extract|
|datetime_funcs|from_unixtime|org.apache.spark.sql.catalyst.expressions.FromUnixTime|
|datetime_funcs|from_utc_timestamp|org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp|
|datetime_funcs|hour|org.apache.spark.sql.catalyst.expressions.Hour|
|datetime_funcs|last_day|org.apache.spark.sql.catalyst.expressions.LastDay|
|datetime_funcs|make_date|org.apache.spark.sql.catalyst.expressions.MakeDate|
|datetime_funcs|make_interval|org.apache.spark.sql.catalyst.expressions.MakeInterval|
|datetime_funcs|make_timestamp|org.apache.spark.sql.catalyst.expressions.MakeTimestamp|
|datetime_funcs|minute|org.apache.spark.sql.catalyst.expressions.Minute|
|datetime_funcs|months_between|org.apache.spark.sql.catalyst.expressions.MonthsBetween|
|datetime_funcs|month|org.apache.spark.sql.catalyst.expressions.Month|
|datetime_funcs|next_day|org.apache.spark.sql.catalyst.expressions.NextDay|
|datetime_funcs|now|org.apache.spark.sql.catalyst.expressions.Now|
|datetime_funcs|quarter|org.apache.spark.sql.catalyst.expressions.Quarter|
|datetime_funcs|second|org.apache.spark.sql.catalyst.expressions.Second|
|datetime_funcs|timestamp_micros|org.apache.spark.sql.catalyst.expressions.MicrosToTimestamp|
|datetime_funcs|timestamp_millis|org.apache.spark.sql.catalyst.expressions.MillisToTimestamp|
|datetime_funcs|timestamp_seconds|org.apache.spark.sql.catalyst.expressions.SecondsToTimestamp|
|datetime_funcs|to_date|org.apache.spark.sql.catalyst.expressions.ParseToDate|
|datetime_funcs|to_timestamp|org.apache.spark.sql.catalyst.expressions.ParseToTimestamp|
|datetime_funcs|to_unix_timestamp|org.apache.spark.sql.catalyst.expressions.ToUnixTimestamp|
|datetime_funcs|to_utc_timestamp|org.apache.spark.sql.catalyst.expressions.ToUTCTimestamp|
|datetime_funcs|trunc|org.apache.spark.sql.catalyst.expressions.TruncDate|
|datetime_funcs|unix_date|org.apache.spark.sql.catalyst.expressions.UnixDate|
|datetime_funcs|unix_micros|org.apache.spark.sql.catalyst.expressions.UnixMicros|
|datetime_funcs|unix_millis|org.apache.spark.sql.catalyst.expressions.UnixMillis|
|datetime_funcs|unix_seconds|org.apache.spark.sql.catalyst.expressions.UnixSeconds|
|datetime_funcs|unix_timestamp|org.apache.spark.sql.catalyst.expressions.UnixTimestamp|
|datetime_funcs|weekday|org.apache.spark.sql.catalyst.expressions.WeekDay|
|datetime_funcs|weekofyear|org.apache.spark.sql.catalyst.expressions.WeekOfYear|
|datetime_funcs|year|org.apache.spark.sql.catalyst.expressions.Year|
|generator_funcs|explode_outer|org.apache.spark.sql.catalyst.expressions.Explode|
|generator_funcs|explode|org.apache.spark.sql.catalyst.expressions.Explode|
|generator_funcs|inline_outer|org.apache.spark.sql.catalyst.expressions.Inline|
|generator_funcs|inline|org.apache.spark.sql.catalyst.expressions.Inline|
|generator_funcs|posexplode_outer|org.apache.spark.sql.catalyst.expressions.PosExplode|
|generator_funcs|posexplode|org.apache.spark.sql.catalyst.expressions.PosExplode|
|generator_funcs|stack|org.apache.spark.sql.catalyst.expressions.Stack|
|hash_funcs|crc32|org.apache.spark.sql.catalyst.expressions.Crc32|
|hash_funcs|hash|org.apache.spark.sql.catalyst.expressions.Murmur3Hash|
|hash_funcs|md5|org.apache.spark.sql.catalyst.expressions.Md5|
|hash_funcs|sha1|org.apache.spark.sql.catalyst.expressions.Sha1|
|hash_funcs|sha2|org.apache.spark.sql.catalyst.expressions.Sha2|
|hash_funcs|sha|org.apache.spark.sql.catalyst.expressions.Sha1|
|hash_funcs|xxhash64|org.apache.spark.sql.catalyst.expressions.XxHash64|
|json_funcs|from_json|org.apache.spark.sql.catalyst.expressions.JsonToStructs|
|json_funcs|get_json_object|org.apache.spark.sql.catalyst.expressions.GetJsonObject|
|json_funcs|json_array_length|org.apache.spark.sql.catalyst.expressions.LengthOfJsonArray|
|json_funcs|json_object_keys|org.apache.spark.sql.catalyst.expressions.JsonObjectKeys|
|json_funcs|json_tuple|org.apache.spark.sql.catalyst.expressions.JsonTuple|
|json_funcs|schema_of_json|org.apache.spark.sql.catalyst.expressions.SchemaOfJson|
|json_funcs|to_json|org.apache.spark.sql.catalyst.expressions.StructsToJson|
|lambda_funcs|aggregate|org.apache.spark.sql.catalyst.expressions.ArrayAggregate|
|lambda_funcs|array_sort|org.apache.spark.sql.catalyst.expressions.ArraySort|
|lambda_funcs|exists|org.apache.spark.sql.catalyst.expressions.ArrayExists|
|lambda_funcs|filter|org.apache.spark.sql.catalyst.expressions.ArrayFilter|
|lambda_funcs|forall|org.apache.spark.sql.catalyst.expressions.ArrayForAll|
|lambda_funcs|map_filter|org.apache.spark.sql.catalyst.expressions.MapFilter|
|lambda_funcs|map_zip_with|org.apache.spark.sql.catalyst.expressions.MapZipWith|
|lambda_funcs|transform_keys|org.apache.spark.sql.catalyst.expressions.TransformKeys|
|lambda_funcs|transform_values|org.apache.spark.sql.catalyst.expressions.TransformValues|
|lambda_funcs|transform|org.apache.spark.sql.catalyst.expressions.ArrayTransform|
|lambda_funcs|zip_with|org.apache.spark.sql.catalyst.expressions.ZipWith|
|map_funcs|element_at|org.apache.spark.sql.catalyst.expressions.ElementAt|
|map_funcs|map_concat|org.apache.spark.sql.catalyst.expressions.MapConcat|
|map_funcs|map_entries|org.apache.spark.sql.catalyst.expressions.MapEntries|
|map_funcs|map_from_arrays|org.apache.spark.sql.catalyst.expressions.MapFromArrays|
|map_funcs|map_from_entries|org.apache.spark.sql.catalyst.expressions.MapFromEntries|
|map_funcs|map_keys|org.apache.spark.sql.catalyst.expressions.MapKeys|
|map_funcs|map_values|org.apache.spark.sql.catalyst.expressions.MapValues|
|map_funcs|map|org.apache.spark.sql.catalyst.expressions.CreateMap|
|map_funcs|str_to_map|org.apache.spark.sql.catalyst.expressions.StringToMap|
|math_funcs|%|org.apache.spark.sql.catalyst.expressions.Remainder|
|math_funcs|*|org.apache.spark.sql.catalyst.expressions.Multiply|
|math_funcs|+|org.apache.spark.sql.catalyst.expressions.Add|
|math_funcs|-|org.apache.spark.sql.catalyst.expressions.Subtract|
|math_funcs|/|org.apache.spark.sql.catalyst.expressions.Divide|
|math_funcs|abs|org.apache.spark.sql.catalyst.expressions.Abs|
|math_funcs|acosh|org.apache.spark.sql.catalyst.expressions.Acosh|
|math_funcs|acos|org.apache.spark.sql.catalyst.expressions.Acos|
|math_funcs|asinh|org.apache.spark.sql.catalyst.expressions.Asinh|
|math_funcs|asin|org.apache.spark.sql.catalyst.expressions.Asin|
|math_funcs|atan2|org.apache.spark.sql.catalyst.expressions.Atan2|
|math_funcs|atanh|org.apache.spark.sql.catalyst.expressions.Atanh|
|math_funcs|atan|org.apache.spark.sql.catalyst.expressions.Atan|
|math_funcs|bin|org.apache.spark.sql.catalyst.expressions.Bin|
|math_funcs|bround|org.apache.spark.sql.catalyst.expressions.BRound|
|math_funcs|cbrt|org.apache.spark.sql.catalyst.expressions.Cbrt|
|math_funcs|ceiling|org.apache.spark.sql.catalyst.expressions.Ceil|
|math_funcs|ceil|org.apache.spark.sql.catalyst.expressions.Ceil|
|math_funcs|conv|org.apache.spark.sql.catalyst.expressions.Conv|
|math_funcs|cosh|org.apache.spark.sql.catalyst.expressions.Cosh|
|math_funcs|cos|org.apache.spark.sql.catalyst.expressions.Cos|
|math_funcs|cot|org.apache.spark.sql.catalyst.expressions.Cot|
|math_funcs|degrees|org.apache.spark.sql.catalyst.expressions.ToDegrees|
|math_funcs|div|org.apache.spark.sql.catalyst.expressions.IntegralDivide|
|math_funcs|expm1|org.apache.spark.sql.catalyst.expressions.Expm1|
|math_funcs|exp|org.apache.spark.sql.catalyst.expressions.Exp|
|math_funcs|e|org.apache.spark.sql.catalyst.expressions.EulerNumber|
|math_funcs|factorial|org.apache.spark.sql.catalyst.expressions.Factorial|
|math_funcs|floor|org.apache.spark.sql.catalyst.expressions.Floor|
|math_funcs|greatest|org.apache.spark.sql.catalyst.expressions.Greatest|
|math_funcs|hex|org.apache.spark.sql.catalyst.expressions.Hex|
|math_funcs|hypot|org.apache.spark.sql.catalyst.expressions.Hypot|
|math_funcs|least|org.apache.spark.sql.catalyst.expressions.Least|
|math_funcs|ln|org.apache.spark.sql.catalyst.expressions.Log|
|math_funcs|log10|org.apache.spark.sql.catalyst.expressions.Log10|
|math_funcs|log1p|org.apache.spark.sql.catalyst.expressions.Log1p|
|math_funcs|log2|org.apache.spark.sql.catalyst.expressions.Log2|
|math_funcs|log|org.apache.spark.sql.catalyst.expressions.Logarithm|
|math_funcs|mod|org.apache.spark.sql.catalyst.expressions.Remainder|
|math_funcs|negative|org.apache.spark.sql.catalyst.expressions.UnaryMinus|
|math_funcs|pi|org.apache.spark.sql.catalyst.expressions.Pi|
|math_funcs|pmod|org.apache.spark.sql.catalyst.expressions.Pmod|
|math_funcs|positive|org.apache.spark.sql.catalyst.expressions.UnaryPositive|
|math_funcs|power|org.apache.spark.sql.catalyst.expressions.Pow|
|math_funcs|pow|org.apache.spark.sql.catalyst.expressions.Pow|
|math_funcs|radians|org.apache.spark.sql.catalyst.expressions.ToRadians|
|math_funcs|randn|org.apache.spark.sql.catalyst.expressions.Randn|
|math_funcs|random|org.apache.spark.sql.catalyst.expressions.Rand|
|math_funcs|rand|org.apache.spark.sql.catalyst.expressions.Rand|
|math_funcs|rint|org.apache.spark.sql.catalyst.expressions.Rint|
|math_funcs|round|org.apache.spark.sql.catalyst.expressions.Round|
|math_funcs|shiftleft|org.apache.spark.sql.catalyst.expressions.ShiftLeft|
|math_funcs|signum|org.apache.spark.sql.catalyst.expressions.Signum|
|math_funcs|sign|org.apache.spark.sql.catalyst.expressions.Signum|
|math_funcs|sinh|org.apache.spark.sql.catalyst.expressions.Sinh|
|math_funcs|sin|org.apache.spark.sql.catalyst.expressions.Sin|
|math_funcs|sqrt|org.apache.spark.sql.catalyst.expressions.Sqrt|
|math_funcs|tanh|org.apache.spark.sql.catalyst.expressions.Tanh|
|math_funcs|tan|org.apache.spark.sql.catalyst.expressions.Tan|
|math_funcs|unhex|org.apache.spark.sql.catalyst.expressions.Unhex|
|math_funcs|width_bucket|org.apache.spark.sql.catalyst.expressions.WidthBucket|
|misc_funcs|assert_true|org.apache.spark.sql.catalyst.expressions.AssertTrue|
|misc_funcs|current_catalog|org.apache.spark.sql.catalyst.expressions.CurrentCatalog|
|misc_funcs|current_database|org.apache.spark.sql.catalyst.expressions.CurrentDatabase|
|misc_funcs|input_file_block_length|org.apache.spark.sql.catalyst.expressions.InputFileBlockLength|
|misc_funcs|input_file_block_start|org.apache.spark.sql.catalyst.expressions.InputFileBlockStart|
|misc_funcs|input_file_name|org.apache.spark.sql.catalyst.expressions.InputFileName|
|misc_funcs|java_method|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection|
|misc_funcs|monotonically_increasing_id|org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID|
|misc_funcs|raise_error|org.apache.spark.sql.catalyst.expressions.RaiseError|
|misc_funcs|reflect|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection|
|misc_funcs|spark_partition_id|org.apache.spark.sql.catalyst.expressions.SparkPartitionID|
|misc_funcs|typeof|org.apache.spark.sql.catalyst.expressions.TypeOf|
|misc_funcs|uuid|org.apache.spark.sql.catalyst.expressions.Uuid|
|misc_funcs|version|org.apache.spark.sql.catalyst.expressions.SparkVersion|
|predicate_funcs|!|org.apache.spark.sql.catalyst.expressions.Not|
|predicate_funcs|<=>|org.apache.spark.sql.catalyst.expressions.EqualNullSafe|
|predicate_funcs|<=|org.apache.spark.sql.catalyst.expressions.LessThanOrEqual|
|predicate_funcs|<|org.apache.spark.sql.catalyst.expressions.LessThan|
|predicate_funcs|==|org.apache.spark.sql.catalyst.expressions.EqualTo|
|predicate_funcs|=|org.apache.spark.sql.catalyst.expressions.EqualTo|
|predicate_funcs|>=|org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual|
|predicate_funcs|>|org.apache.spark.sql.catalyst.expressions.GreaterThan|
|predicate_funcs|and|org.apache.spark.sql.catalyst.expressions.And|
|predicate_funcs|in|org.apache.spark.sql.catalyst.expressions.In|
|predicate_funcs|isnan|org.apache.spark.sql.catalyst.expressions.IsNaN|
|predicate_funcs|isnotnull|org.apache.spark.sql.catalyst.expressions.IsNotNull|
|predicate_funcs|isnull|org.apache.spark.sql.catalyst.expressions.IsNull|
|predicate_funcs|like|org.apache.spark.sql.catalyst.expressions.Like|
|predicate_funcs|not|org.apache.spark.sql.catalyst.expressions.Not|
|predicate_funcs|or|org.apache.spark.sql.catalyst.expressions.Or|
|predicate_funcs|regexp_like|org.apache.spark.sql.catalyst.expressions.RLike|
|predicate_funcs|rlike|org.apache.spark.sql.catalyst.expressions.RLike|
|string_funcs|ascii|org.apache.spark.sql.catalyst.expressions.Ascii|
|string_funcs|base64|org.apache.spark.sql.catalyst.expressions.Base64|
|string_funcs|bit_length|org.apache.spark.sql.catalyst.expressions.BitLength|
|string_funcs|char_length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|character_length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|char|org.apache.spark.sql.catalyst.expressions.Chr|
|string_funcs|chr|org.apache.spark.sql.catalyst.expressions.Chr|
|string_funcs|concat_ws|org.apache.spark.sql.catalyst.expressions.ConcatWs|
|string_funcs|decode|org.apache.spark.sql.catalyst.expressions.Decode|
|string_funcs|elt|org.apache.spark.sql.catalyst.expressions.Elt|
|string_funcs|encode|org.apache.spark.sql.catalyst.expressions.Encode|
|string_funcs|find_in_set|org.apache.spark.sql.catalyst.expressions.FindInSet|
|string_funcs|format_number|org.apache.spark.sql.catalyst.expressions.FormatNumber|
|string_funcs|format_string|org.apache.spark.sql.catalyst.expressions.FormatString|
|string_funcs|initcap|org.apache.spark.sql.catalyst.expressions.InitCap|
|string_funcs|instr|org.apache.spark.sql.catalyst.expressions.StringInstr|
|string_funcs|lcase|org.apache.spark.sql.catalyst.expressions.Lower|
|string_funcs|left|org.apache.spark.sql.catalyst.expressions.Left|
|string_funcs|length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|levenshtein|org.apache.spark.sql.catalyst.expressions.Levenshtein|
|string_funcs|locate|org.apache.spark.sql.catalyst.expressions.StringLocate|
|string_funcs|lower|org.apache.spark.sql.catalyst.expressions.Lower|
|string_funcs|lpad|org.apache.spark.sql.catalyst.expressions.StringLPad|
|string_funcs|ltrim|org.apache.spark.sql.catalyst.expressions.StringTrimLeft|
|string_funcs|octet_length|org.apache.spark.sql.catalyst.expressions.OctetLength|
|string_funcs|overlay|org.apache.spark.sql.catalyst.expressions.Overlay|
|string_funcs|parse_url|org.apache.spark.sql.catalyst.expressions.ParseUrl|
|string_funcs|position|org.apache.spark.sql.catalyst.expressions.StringLocate|
|string_funcs|printf|org.apache.spark.sql.catalyst.expressions.FormatString|
|string_funcs|regexp_extract_all|org.apache.spark.sql.catalyst.expressions.RegExpExtractAll|
|string_funcs|regexp_extract|org.apache.spark.sql.catalyst.expressions.RegExpExtract|
|string_funcs|regexp_replace|org.apache.spark.sql.catalyst.expressions.RegExpReplace|
|string_funcs|repeat|org.apache.spark.sql.catalyst.expressions.StringRepeat|
|string_funcs|replace|org.apache.spark.sql.catalyst.expressions.StringReplace|
|string_funcs|right|org.apache.spark.sql.catalyst.expressions.Right|
|string_funcs|rpad|org.apache.spark.sql.catalyst.expressions.StringRPad|
|string_funcs|rtrim|org.apache.spark.sql.catalyst.expressions.StringTrimRight|
|string_funcs|sentences|org.apache.spark.sql.catalyst.expressions.Sentences|
|string_funcs|soundex|org.apache.spark.sql.catalyst.expressions.SoundEx|
|string_funcs|space|org.apache.spark.sql.catalyst.expressions.StringSpace|
|string_funcs|split|org.apache.spark.sql.catalyst.expressions.StringSplit|
|string_funcs|substring_index|org.apache.spark.sql.catalyst.expressions.SubstringIndex|
|string_funcs|substring|org.apache.spark.sql.catalyst.expressions.Substring|
|string_funcs|substr|org.apache.spark.sql.catalyst.expressions.Substring|
|string_funcs|translate|org.apache.spark.sql.catalyst.expressions.StringTranslate|
|string_funcs|trim|org.apache.spark.sql.catalyst.expressions.StringTrim|
|string_funcs|ucase|org.apache.spark.sql.catalyst.expressions.Upper|
|string_funcs|unbase64|org.apache.spark.sql.catalyst.expressions.UnBase64|
|string_funcs|upper|org.apache.spark.sql.catalyst.expressions.Upper|
|struct_funcs|named_struct|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct|
|struct_funcs|struct|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct|
|window_funcs|cume_dist|org.apache.spark.sql.catalyst.expressions.CumeDist|
|window_funcs|dense_rank|org.apache.spark.sql.catalyst.expressions.DenseRank|
|window_funcs|lag|org.apache.spark.sql.catalyst.expressions.Lag|
|window_funcs|lead|org.apache.spark.sql.catalyst.expressions.Lead|
|window_funcs|nth_value|org.apache.spark.sql.catalyst.expressions.NthValue|
|window_funcs|ntile|org.apache.spark.sql.catalyst.expressions.NTile|
|window_funcs|percent_rank|org.apache.spark.sql.catalyst.expressions.PercentRank|
|window_funcs|rank|org.apache.spark.sql.catalyst.expressions.Rank|
|window_funcs|row_number|org.apache.spark.sql.catalyst.expressions.RowNumber|
|xml_funcs|xpath_boolean|org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean|
|xml_funcs|xpath_double|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble|
|xml_funcs|xpath_float|org.apache.spark.sql.catalyst.expressions.xml.XPathFloat|
|xml_funcs|xpath_int|org.apache.spark.sql.catalyst.expressions.xml.XPathInt|
|xml_funcs|xpath_long|org.apache.spark.sql.catalyst.expressions.xml.XPathLong|
|xml_funcs|xpath_number|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble|
|xml_funcs|xpath_short|org.apache.spark.sql.catalyst.expressions.xml.XPathShort|
|xml_funcs|xpath_string|org.apache.spark.sql.catalyst.expressions.xml.XPathString|
|xml_funcs|xpath|org.apache.spark.sql.catalyst.expressions.xml.XPathList|

Closes #30040

NOTE: The original author of this PR is tanelk, so credit should go to tanelk.

### Why are the changes needed?

For better documents.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a test to check if exprs have a group tag in `ExpressionInfoSuite`.

Closes #30867 from maropu/pr30040.

Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 04:24:04 -08:00
Wenchen Fan b4bea1aa89 [SPARK-28863][SQL][FOLLOWUP] Make sure optimized plan will not be re-analyzed
### What changes were proposed in this pull request?

It's a known issue that re-analyzing an optimized plan can lead to various problems. We made several attempts to prevent it, but the current solution `AlreadyOptimized` is still not 100% safe, as people can inject catalyst rules that call the analyzer directly.

This PR proposes a simpler and safer idea: we set the `analyzed` flag to true after optimization, and the analyzer skips processing plans whose `analyzed` flag is true.
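
A minimal, self-contained sketch of the idea (this is not Spark's actual `AnalysisHelper`/analyzer code; all names are illustrative):

```scala
// Toy model of the approach: the optimizer marks the plan as analyzed, and the
// analyzer returns early for any plan that already carries the flag.
class Plan {
  private var analyzedFlag = false
  def markAnalyzed(): Unit = { analyzedFlag = true }
  def isAnalyzed: Boolean = analyzedFlag
}

object Analyzer {
  def execute(plan: Plan): Plan =
    if (plan.isAnalyzed) {
      plan                       // skip plans that were already analyzed/optimized
    } else {
      // ... run analysis rules ...
      plan.markAnalyzed()
      plan
    }
}

object Optimizer {
  def execute(plan: Plan): Plan = {
    // ... run optimizer rules ...
    plan.markAnalyzed()          // mark the optimized plan so it is never re-analyzed
    plan
  }
}
```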

### Why are the changes needed?

make the code simpler and safer

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests.

Closes #30777 from cloud-fan/ds.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-21 20:59:33 +09:00
Max Gekk cdd1752ad1 [SPARK-33862][SQL] Throw PartitionAlreadyExistsException if the target partition exists while renaming
### What changes were proposed in this pull request?
Throw `PartitionAlreadyExistsException` from `ALTER TABLE .. RENAME TO PARTITION` for a table from Hive V1 External Catalog in the case when the target partition already exists.
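
For illustration, a hedged end-to-end example of the new behavior (the table name and partition values are made up; assumes a Hive-enabled session):

```scala
spark.sql("CREATE TABLE part_t (id INT) PARTITIONED BY (p INT)")
spark.sql("ALTER TABLE part_t ADD PARTITION (p = 1) PARTITION (p = 2)")
// The target partition p = 2 already exists, so after this change the rename below
// throws PartitionAlreadyExistsException instead of surfacing an internal Hive error.
spark.sql("ALTER TABLE part_t PARTITION (p = 1) RENAME TO PARTITION (p = 2)")
```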

### Why are the changes needed?
1. To have the same behavior for the V1 In-Memory and Hive External catalogs.
2. To avoid propagating Hive's internal exceptions to users.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the partition renaming command throws `PartitionAlreadyExistsException` for tables from the Hive catalog.

### How was this patch tested?
Added new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveCatalogedDDLSuite"
```

Closes #30866 from MaxGekk/throw-PartitionAlreadyExistsException.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 03:37:30 -08:00
Kousuke Saruta f4e1069bb8
[SPARK-33853][SQL] EXPLAIN CODEGEN and BenchmarkQueryTest don't show subquery code
### What changes were proposed in this pull request?

This PR fixes an issue where `EXPLAIN CODEGEN` and `BenchmarkQueryTest` don't show the corresponding code for subqueries.

The following example is about `EXPLAIN CODEGEN`.
```
spark.conf.set("spark.sql.adaptive.enabled", "false")
val df = spark.range(1, 100)
df.createTempView("df")
spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")

scala> spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 (maxMethodCodeSize:55; maxConstantPoolSize:97(0.15% used); numInnerClasses:0) ==
*(1) Project [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L]
:  +- Subquery scalar-subquery#3, [id=#24]
:     +- *(2) HashAggregate(keys=[], functions=[min(id#0L)], output=[v#2L])
:        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#20]
:           +- *(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
:              +- *(1) Range (1, 100, step=1, splits=12)
+- *(1) Scan OneRowRelation[]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private scala.collection.Iterator rdd_input_0;
/* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 011 */
/* 012 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
/* 013 */     this.references = references;
/* 014 */   }
/* 015 */
/* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 017 */     partitionIndex = index;
/* 018 */     this.inputs = inputs;
/* 019 */     rdd_input_0 = inputs[0];
/* 020 */     project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 021 */
/* 022 */   }
/* 023 */
/* 024 */   private void project_doConsume_0() throws java.io.IOException {
/* 025 */     // common sub-expressions
/* 026 */
/* 027 */     project_mutableStateArray_0[0].reset();
/* 028 */
/* 029 */     if (false) {
/* 030 */       project_mutableStateArray_0[0].setNullAt(0);
/* 031 */     } else {
/* 032 */       project_mutableStateArray_0[0].write(0, 1L);
/* 033 */     }
/* 034 */     append((project_mutableStateArray_0[0].getRow()));
/* 035 */
/* 036 */   }
/* 037 */
/* 038 */   protected void processNext() throws java.io.IOException {
/* 039 */     while ( rdd_input_0.hasNext()) {
/* 040 */       InternalRow rdd_row_0 = (InternalRow) rdd_input_0.next();
/* 041 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 042 */       project_doConsume_0();
/* 043 */       if (shouldStop()) return;
/* 044 */     }
/* 045 */   }
/* 046 */
/* 047 */ }
```

After this change, the corresponding code for subqueries is shown.
```
Found 3 WholeStageCodegen subtrees.
== Subtree 1 / 3 (maxMethodCodeSize:282; maxConstantPoolSize:206(0.31% used); numInnerClasses:0) ==
*(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
+- *(1) Range (1, 100, step=1, splits=12)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean agg_initAgg_0;
/* 010 */   private boolean agg_bufIsNull_0;
/* 011 */   private long agg_bufValue_0;
/* 012 */   private boolean range_initRange_0;
/* 013 */   private long range_nextIndex_0;
/* 014 */   private TaskContext range_taskContext_0;
/* 015 */   private InputMetrics range_inputMetrics_0;
/* 016 */   private long range_batchEnd_0;
/* 017 */   private long range_numElementsTodo_0;
/* 018 */   private boolean agg_agg_isNull_2_0;
/* 019 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[3];
/* 020 */
/* 021 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
/* 022 */     this.references = references;
/* 023 */   }
/* 024 */
/* 025 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 026 */     partitionIndex = index;
/* 027 */     this.inputs = inputs;
/* 028 */
/* 029 */     range_taskContext_0 = TaskContext.get();
/* 030 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 031 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 032 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 033 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 034 */
/* 035 */   }
/* 036 */
/* 037 */   private void agg_doAggregateWithoutKey_0() throws java.io.IOException {
/* 038 */     // initialize aggregation buffer
/* 039 */     agg_bufIsNull_0 = true;
/* 040 */     agg_bufValue_0 = -1L;
/* 041 */
/* 042 */     // initialize Range
/* 043 */     if (!range_initRange_0) {
/* 044 */       range_initRange_0 = true;
/* 045 */       initRange(partitionIndex);
/* 046 */     }
/* 047 */
/* 048 */     while (true) {
/* 049 */       if (range_nextIndex_0 == range_batchEnd_0) {
/* 050 */         long range_nextBatchTodo_0;
/* 051 */         if (range_numElementsTodo_0 > 1000L) {
/* 052 */           range_nextBatchTodo_0 = 1000L;
/* 053 */           range_numElementsTodo_0 -= 1000L;
/* 054 */         } else {
/* 055 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
/* 056 */           range_numElementsTodo_0 = 0;
/* 057 */           if (range_nextBatchTodo_0 == 0) break;
/* 058 */         }
/* 059 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 060 */       }
/* 061 */
/* 062 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
/* 063 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
/* 064 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
/* 065 */
/* 066 */         agg_doConsume_0(range_value_0);
/* 067 */
/* 068 */         // shouldStop check is eliminated
/* 069 */       }
/* 070 */       range_nextIndex_0 = range_batchEnd_0;
/* 071 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
/* 072 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
/* 073 */       range_taskContext_0.killTaskIfInterrupted();
/* 074 */     }
/* 075 */
/* 076 */   }
/* 077 */
/* 078 */   private void initRange(int idx) {
/* 079 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
/* 080 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(12L);
/* 081 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(99L);
/* 082 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
/* 083 */     java.math.BigInteger start = java.math.BigInteger.valueOf(1L);
/* 084 */     long partitionEnd;
/* 085 */
/* 086 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
/* 087 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 088 */       range_nextIndex_0 = Long.MAX_VALUE;
/* 089 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 090 */       range_nextIndex_0 = Long.MIN_VALUE;
/* 091 */     } else {
/* 092 */       range_nextIndex_0 = st.longValue();
/* 093 */     }
/* 094 */     range_batchEnd_0 = range_nextIndex_0;
/* 095 */
/* 096 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
/* 097 */     .multiply(step).add(start);
/* 098 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 099 */       partitionEnd = Long.MAX_VALUE;
/* 100 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 101 */       partitionEnd = Long.MIN_VALUE;
/* 102 */     } else {
/* 103 */       partitionEnd = end.longValue();
/* 104 */     }
/* 105 */
/* 106 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
/* 107 */       java.math.BigInteger.valueOf(range_nextIndex_0));
/* 108 */     range_numElementsTodo_0  = startToEnd.divide(step).longValue();
/* 109 */     if (range_numElementsTodo_0 < 0) {
/* 110 */       range_numElementsTodo_0 = 0;
/* 111 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
/* 112 */       range_numElementsTodo_0++;
/* 113 */     }
/* 114 */   }
/* 115 */
/* 116 */   private void agg_doConsume_0(long agg_expr_0_0) throws java.io.IOException {
/* 117 */     // do aggregate
/* 118 */     // common sub-expressions
/* 119 */
/* 120 */     // evaluate aggregate functions and update aggregation buffers
/* 121 */
/* 122 */     agg_agg_isNull_2_0 = true;
/* 123 */     long agg_value_2 = -1L;
/* 124 */
/* 125 */     if (!agg_bufIsNull_0 && (agg_agg_isNull_2_0 ||
/* 126 */         agg_value_2 > agg_bufValue_0)) {
/* 127 */       agg_agg_isNull_2_0 = false;
/* 128 */       agg_value_2 = agg_bufValue_0;
/* 129 */     }
/* 130 */
/* 131 */     if (!false && (agg_agg_isNull_2_0 ||
/* 132 */         agg_value_2 > agg_expr_0_0)) {
/* 133 */       agg_agg_isNull_2_0 = false;
/* 134 */       agg_value_2 = agg_expr_0_0;
/* 135 */     }
/* 136 */
/* 137 */     agg_bufIsNull_0 = agg_agg_isNull_2_0;
/* 138 */     agg_bufValue_0 = agg_value_2;
/* 139 */
/* 140 */   }
/* 141 */
/* 142 */   protected void processNext() throws java.io.IOException {
/* 143 */     while (!agg_initAgg_0) {
/* 144 */       agg_initAgg_0 = true;
/* 145 */       long agg_beforeAgg_0 = System.nanoTime();
/* 146 */       agg_doAggregateWithoutKey_0();
/* 147 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* aggTime */).add((System.nanoTime() - agg_beforeAgg_0) / 1000000);
/* 148 */
/* 149 */       // output the result
/* 150 */
/* 151 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1);
/* 152 */       range_mutableStateArray_0[2].reset();
/* 153 */
/* 154 */       range_mutableStateArray_0[2].zeroOutNullBytes();
/* 155 */
/* 156 */       if (agg_bufIsNull_0) {
/* 157 */         range_mutableStateArray_0[2].setNullAt(0);
/* 158 */       } else {
/* 159 */         range_mutableStateArray_0[2].write(0, agg_bufValue_0);
/* 160 */       }
/* 161 */       append((range_mutableStateArray_0[2].getRow()));
/* 162 */     }
/* 163 */   }
/* 164 */
/* 165 */ }
```

### Why are the changes needed?

For better debuggability.

### Does this PR introduce _any_ user-facing change?

Yes. After this change, users can see subquery code by `EXPLAIN CODEGEN`.

### How was this patch tested?

New test.

Closes #30859 from sarutak/explain-codegen-subqueries.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 03:29:00 -08:00
Max Gekk b313a1e9e6 [SPARK-33849][SQL][TESTS] Unify v1 and v2 DROP TABLE tests
### What changes were proposed in this pull request?
1. Move the `DROP TABLE` parsing tests to `DropTableParserSuite`
2. Place the v1 tests for `DROP TABLE` from `DDLSuite` and the v2 tests from `DataSourceV2SQLSuite` in the common trait `DropTableSuiteBase`, so the tests run for V1, Hive V1 and V2 data sources (a rough sketch of the layout follows below).
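
A rough sketch of the resulting layout (only `DropTableSuiteBase` comes from the description above; the suite names and test body are illustrative):

```scala
import org.scalatest.funsuite.AnyFunSuite

// Common DROP TABLE tests live in one trait; each catalog flavor mixes it in.
trait DropTableSuiteBase extends AnyFunSuite {
  def catalog: String   // e.g. "spark_catalog" for v1/Hive, a test catalog for v2

  test(s"basic DROP TABLE in $catalog") {
    // shared assertions against `catalog` would go here
  }
}

class V1DropTableSuite extends DropTableSuiteBase { override def catalog: String = "spark_catalog" }
class V2DropTableSuite extends DropTableSuiteBase { override def catalog: String = "testcat" }
```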

### Why are the changes needed?
- The unification allows running common `DROP TABLE` tests for DSv1, Hive DSv1, and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableSuite"
```

Closes #30854 from MaxGekk/unify-drop-table-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-21 08:34:12 +00:00
Terry Kim 1c7b79c057 [SPARK-33856][SQL] Migrate ALTER TABLE ... RENAME TO PARTITION to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER TABLE ... RENAME TO PARTITION` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ALTER TABLE ... RENAME TO PARTITION` is not supported for v2 tables.

### Why are the changes needed?

The PR makes the resolution behavior consistent. For example,
```
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)") // works fine assuming id=1 exists.
```
, but after this PR:
```
sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)")
org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RENAME TO PARTITION' expects a table; line 1 pos 0
```
, which is the consistent behavior with other commands.

### Does this PR introduce _any_ user-facing change?

After this PR, `ALTER TABLE` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`.

### How was this patch tested?

Updated existing tests.

Closes #30862 from imback82/alter_table_rename_partition_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-21 04:58:56 +00:00
Kousuke Saruta 3c8be3983c [SPARK-33850][SQL][FOLLOWUP] Improve and cleanup the test code
### What changes were proposed in this pull request?

This PR mainly improves and cleans up the test code introduced in #30855 based on the comment.
The test code is actually taken from another test, `explain formatted - check presence of subquery in case of DPP`, so this PR cleans that code too (removed an unnecessary `withTable`).

### Why are the changes needed?

To keep the test code clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`ExplainSuite` passes.

Closes #30861 from sarutak/followup-SPARK-33850.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-21 09:40:42 +09:00
Kousuke Saruta 70da86a085
[SPARK-33850][SQL] EXPLAIN FORMATTED doesn't show the plan for subqueries if AQE is enabled
### What changes were proposed in this pull request?

This PR fixes an issue where, when AQE is enabled, EXPLAIN FORMATTED doesn't show the plan for subqueries.

```scala
val df = spark.range(1, 100)
df.createTempView("df")
spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("FORMATTED")

== Physical Plan ==
AdaptiveSparkPlan (3)
+- Project (2)
 +- Scan OneRowRelation (1)

(1) Scan OneRowRelation
Output: []
Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0)

(2) Project
Output [1]: [Subquery subquery#3, [id=#20] AS scalarsubquery()#5L]
Input: []

(3) AdaptiveSparkPlan
Output [1]: [scalarsubquery()#5L]
Arguments: isFinalPlan=false
```

After this change, the plan for the subquery is shown.
```scala
== Physical Plan ==
* Project (2)
+- * Scan OneRowRelation (1)

(1) Scan OneRowRelation [codegen id : 1]
Output: []
Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0)

(2) Project [codegen id : 1]
Output [1]: [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L]
Input: []

===== Subqueries =====

Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#3, [id=#24]
* HashAggregate (6)
+- Exchange (5)
   +- * HashAggregate (4)
      +- * Range (3)

(3) Range [codegen id : 1]
Output [1]: [id#0L]
Arguments: Range (1, 100, step=1, splits=Some(12))

(4) HashAggregate [codegen id : 1]
Input [1]: [id#0L]
Keys: []
Functions [1]: [partial_min(id#0L)]
Aggregate Attributes [1]: [min#7L]
Results [1]: [min#8L]

(5) Exchange
Input [1]: [min#8L]
Arguments: SinglePartition, ENSURE_REQUIREMENTS, [id=#20]

(6) HashAggregate [codegen id : 2]
Input [1]: [min#8L]
Keys: []
Functions [1]: [min(id#0L)]
Aggregate Attributes [1]: [min(id#0L)#4L]
Results [1]: [min(id#0L)#4L AS v#2L]
```

### Why are the changes needed?

For better debuggability.

### Does this PR introduce _any_ user-facing change?

Yes. Users can see the formatted plan for subqueries.

### How was this patch tested?

New test.

Closes #30855 from sarutak/fix-aqe-explain.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-19 14:10:20 -08:00
Ammar Al-Batool 37c4cd8f05 [MINOR][DOCS] Fix typos in ScalaDocs for DataStreamWriter#foreachBatch
The title is pretty self-explanatory.

### What changes were proposed in this pull request?

Fixing typos in the docs for `foreachBatch` functions.

### Why are the changes needed?

To fix typos in JavaDoc/ScalaDoc.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A; this is a documentation-only change.

Closes #30782 from ammar1x/patch-1.

Lead-authored-by: Ammar Al-Batool <ammar.albatool@gmail.com>
Co-authored-by: Ammar Al-Batool <ammar.al-batool@disneystreaming.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-19 14:53:40 -06:00
Terry Kim 06075d849e
[SPARK-33829][SQL] Renaming v2 tables should recreate the cache
### What changes were proposed in this pull request?

Currently, renaming v2 tables does not invalidate/recreate the cache, leading to incorrect behavior (the cache not being used) when v2 tables are renamed. This PR fixes that behavior.
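
A hedged illustration of the expected user-visible effect (the `testcat` catalog and table names are made up):

```scala
// Assumes a v2 catalog registered under the name `testcat`.
spark.sql("CACHE TABLE testcat.ns.old_t")
spark.sql("ALTER TABLE testcat.ns.old_t RENAME TO ns.new_t")
// Before this change the old cache entry was left behind and scans of the renamed
// table did not use it; now the rename re-creates the cache for the new name.
spark.sql("SELECT * FROM testcat.ns.new_t").collect()
```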

### Why are the changes needed?

Fixing a bug since the cache associated with the renamed table is not being cleaned up/recreated.

### Does this PR introduce _any_ user-facing change?

Yes, now when a v2 table is renamed, the cache is correctly updated.

### How was this patch tested?

Added a new test

Closes #30825 from imback82/rename_recreate_cache_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-19 08:32:58 -08:00
Wenchen Fan de234eec8f [SPARK-33812][SQL] Split the histogram column stats when saving to hive metastore as table property
### What changes were proposed in this pull request?

Hive metastore has a limitation on table property length. To work around it, Spark splits the schema JSON string into several parts when saving it to the Hive metastore as table properties. We need to do the same for histogram column stats, as they can get very big.

This PR refactors the table property splitting code, so that we can share it between the schema json string and histogram column stats.
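
A minimal sketch of the shared splitting technique being refactored (the property key layout and threshold handling here are made up, not the exact Spark helper):

```scala
// Split a long value into numbered chunk properties plus a part count, and read it back.
def putLargeProperty(key: String, value: String, threshold: Int): Map[String, String] = {
  val parts = value.grouped(threshold).toSeq
  val chunks = parts.zipWithIndex.map { case (chunk, i) => s"$key.part.$i" -> chunk }
  (chunks :+ (s"$key.numParts" -> parts.size.toString)).toMap
}

def getLargeProperty(key: String, props: Map[String, String]): String = {
  val numParts = props(s"$key.numParts").toInt
  (0 until numParts).map(i => props(s"$key.part.$i")).mkString
}
```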

### Why are the changes needed?

To be able to analyze table when histogram data is big.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test and new tests

Closes #30809 from cloud-fan/cbo.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-19 14:35:28 +09:00
gengjiaan 6dca2e5d35 [SPARK-33599][SQL] Group exception messages in catalyst/analysis
### What changes were proposed in this pull request?
This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis`.
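
An illustrative sketch of the grouping pattern (class and method names are made up; not the actual Spark error classes):

```scala
// Error construction is centralized in one object so that wording stays
// consistent across call sites in the analysis code.
class AnalysisError(msg: String) extends Exception(msg)

object AnalysisErrors {
  def tableOrViewNotFound(name: String): AnalysisError =
    new AnalysisError(s"Table or view not found: $name")

  def unsupportedOperation(op: String): AnalysisError =
    new AnalysisError(s"Operation not allowed: $op")
}

// A call site then raises: throw AnalysisErrors.tableOrViewNotFound(identifier)
```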

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #30717 from beliefer/SPARK-33599.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-18 14:12:35 +00:00
gengjiaan f239128802 [SPARK-33597][SQL] Support REGEXP_LIKE for consistent with mainstream databases
### What changes were proposed in this pull request?
Many mainstream databases support the regex function `REGEXP_LIKE`.
Currently, Spark supports `RLike`, so we just need to add a new alias `REGEXP_LIKE` for it (a small usage sketch follows the list of references below).
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19
**Presto**
https://prestodb.io/docs/current/functions/regexp.html
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_____5
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html
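
A small usage sketch of the new alias (the literal strings are arbitrary examples):

```scala
// regexp_like(str, regexp) behaves like str RLIKE regexp.
spark.sql("SELECT regexp_like('Spark SQL', 'S.*ark')").show()   // true: the pattern matches a substring
spark.sql("SELECT regexp_like('Spark SQL', '^QL$')").show()     // false: the anchored pattern does not match
```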

**Additional modifications**

1. Because the test case named `check outputs of expression examples` in ExpressionInfoSuite executes the example SQL of built-in functions, the SQL below would be executed:
`SELECT '%SystemDrive%\Users\John' regexp_like '%SystemDrive%\\Users.*'`
But Spark SQL does not support this syntax yet.
2. Another reason: `SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';` is SQL operator syntax, not the use case for the function `RLike`.
For the above reasons, this PR changes the example SQL of `RLike`.

### Why are the changes needed?
Make the behavior of Spark SQL consistent with mainstream databases.

### Does this PR introduce _any_ user-facing change?
Yes, the new `REGEXP_LIKE` function alias becomes available to users.

### How was this patch tested?
Jenkins test

Closes #30543 from beliefer/SPARK-33597.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-18 13:47:31 +00:00
angerszhu 0603913c66
[SPARK-33593][SQL] Vector reader got incorrect data with binary partition value
### What changes were proposed in this pull request?

Currently, when the Parquet vectorized reader is enabled, using a binary type as a partition column returns an incorrect value, as shown in the UT below:
```scala
test("Parquet vector reader incorrect with binary partition value") {
  Seq(false, true).foreach(tag => {
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
      withTable("t1") {
        sql(
          """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
            | USING PARQUET PARTITIONED BY (part)""".stripMargin)
        sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
        if (tag) {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", ""))
        } else {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", "Spark SQL"))
        }
      }
    }
  })
}
```

### Why are the changes needed?
Fix data incorrect issue

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30824 from AngersZhuuuu/SPARK-33593.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-18 00:01:13 -08:00
Terry Kim 0f1a18370a [SPARK-33817][SQL] CACHE TABLE uses a logical plan when caching a query to avoid creating a dataframe
### What changes were proposed in this pull request?

This PR proposes to update `CACHE TABLE` to use a `LogicalPlan` when caching a query to avoid creating a `DataFrame` as suggested here: https://github.com/apache/spark/pull/30743#discussion_r543123190

For reference, `UNCACHE TABLE` also uses `LogicalPlan`: 0c12900120/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala (L91-L98)

### Why are the changes needed?

To avoid creating an unnecessary dataframe and make it consistent with `uncacheQuery` used in `UNCACHE TABLE`.

### Does this PR introduce _any_ user-facing change?

No, just internal changes.

### How was this patch tested?

Existing tests since this is an internal refactoring change.

Closes #30815 from imback82/cache_with_logical_plan.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-18 04:30:15 +00:00
Takeshi Yamamuro 51ef4430dc
[SPARK-33822][SQL] Use the CastSupport.cast method in HashJoin
### What changes were proposed in this pull request?

This PR intends to fix a bug that throws an unsupported-operation exception when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is now enabled by default via SPARK-33679](031c5ef280)):
```
java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
  at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  ...
```

I've checked the AQE code and found that `EnsureRequirements` wrongly puts a `BroadcastExchange` on top of a `BroadcastQueryStage` in the `reOptimize` phase, as follows:
```
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#2183]
  +- BroadcastQueryStage 2
    +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#1963]
```
The root cause is that a `Cast` in the required child distribution does not have a `timeZoneId` set (`timeZoneId=None`), while the `Cast` in `child.outputPartitioning` has it. This difference can make the distribution requirement check fail in `EnsureRequirements`:
1e85707738/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (L47-L50)

The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there.
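
As a rough, self-contained illustration of why the check failed (a toy case class, not the real catalyst `Cast`):

```scala
object CastMismatchDemo extends App {
  // Two otherwise-identical casts compare unequal when only one carries a time zone.
  case class Cast(child: String, dataType: String, timeZoneId: Option[String] = None)

  val required = Cast("d_date_sk", "bigint")                              // built without a time zone
  val produced = Cast("d_date_sk", "bigint", Some("America/Los_Angeles")) // built from the session conf
  assert(required != produced)   // EnsureRequirements sees a mismatch and adds another exchange

  val viaCastSupport = Cast("d_date_sk", "bigint", Some("America/Los_Angeles"))
  assert(viaCastSupport == produced) // constructing through the conf-aware helper makes them match
}
```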

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked that q5 passed.

Closes #30818 from maropu/BugfixInAQE.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-17 16:16:05 -08:00
allisonwang-db 1e85707738 [SPARK-33697][SQL] RemoveRedundantProjects should require column ordering by default
### What changes were proposed in this pull request?
This PR changes the rule `RemoveRedundantProjects` so that, instead of passing column ordering requirements from parent nodes by default, it always requires column ordering regardless of the parents' requirements unless otherwise specified. More specifically, instead of excluding a few nodes like GenerateExec and UnionExec that are known to require their children's columns to be ordered, the rule now includes a whitelist of nodes that are allowed to pass through the ordering requirements from their parents.
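
A conceptual sketch of the new default (the allow-list contents are illustrative, not the rule's actual list):

```scala
// Only allow-listed nodes pass their parent's ordering requirement through to
// children; every other node now requires the child's column order by default.
def childRequiresOrdering(parentRequiresOrdering: Boolean, nodeName: String): Boolean = {
  val passThroughNodes = Set("ProjectExec", "FilterExec", "SortExec") // illustrative subset
  if (passThroughNodes.contains(nodeName)) parentRequiresOrdering else true
}
```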

### Why are the changes needed?
Currently, this rule passes through ordering requirements from parents directly to children except for a few excluded nodes. This incorrectly removes the necessary project nodes below a UnionExec since it is not excluded. An earlier PR also fixed a similar issue for GenerateExec (SPARK-32861). In order to prevent similar issues, the rule should be changed to always require column ordering except for a few specific nodes that we know for sure can pass through the requirements.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #30659 from allisonwang-db/spark-33697-remove-project-union.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-17 05:47:44 +00:00
Terry Kim 0c19497222 [SPARK-33815][SQL] Migrate ALTER TABLE ... SET [SERDE|SERDEPROPERTIES] to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]` is not supported for v2 tables.

### Why are the changes needed?

The PR makes the resolution behavior consistent. For example,
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("ALTER TABLE t SET SERDE 'serdename'") // works fine
```
, but after this PR:
```
sql("ALTER TABLE t SET SERDE 'serdename'")
org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table; line 1 pos 0
```
, which is the consistent behavior with other commands.

### Does this PR introduce _any_ user-facing change?

After this PR, `t` in the above example is resolved to a temp view first instead of `spark_catalog.test.t`.

### How was this patch tested?

Updated existing tests.

Closes #30813 from imback82/alter_table_serde_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-17 05:25:51 +00:00
Kent Yao 728a1298af [SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions
### What changes were proposed in this pull request?

It seems to be a very popular pattern for people to use a DISTRIBUTE BY clause with a literal to coalesce partitions in pure SQL data processing.

For example
```
insert into table src select * from values (1), (2), (3) t(a) distribute by 1
```

Users may want the final output to be one single data file, but the reality is not always so. Spark will always create a file for partition 0 whether it contains data or not, so when all the data goes to a partition with index > 0, there will always be 2 files, with part-00000 empty. On top of that, a lot of unnecessary empty tasks will be launched.

When users repeat the insert statement daily, hourly, or minutely, it causes small file issues.

```
spark-sql> set spark.sql.shuffle.partitions=3;drop table if exists test2;create table test2 using parquet as select * from values (1), (2), (3) t(a) distribute by 1;

$ tree /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ -s
/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/
├── [          0]  _SUCCESS
├── [        298]  part-00000-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
└── [        426]  part-00001-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
```

To avoid this, there are some options you can take.

1. use `distribute by null` to let the data go to partition 0
2. set spark.sql.adaptive.enabled to true for Spark to automatically coalesce
3. use hints instead of `distribute by`
4. set spark.sql.shuffle.partitions to 1

In this PR, we set the partition number to 1 in this particular case.
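
A minimal sketch of the decision (simplified; only `Expression.foldable` is real catalyst API, the rest is illustrative):

```scala
import org.apache.spark.sql.catalyst.expressions.Expression

// If every DISTRIBUTE BY expression is foldable (e.g. the literal 1), all rows
// hash to the same value, so a single shuffle partition is enough.
def repartitionNum(distributeExprs: Seq[Expression], defaultNumPartitions: Int): Int =
  if (distributeExprs.nonEmpty && distributeExprs.forall(_.foldable)) 1
  else defaultNumPartitions
```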

### Why are the changes needed?

1. avoid small file issues
2. avoid unnecessary empty tasks when no adaptive execution

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #30800 from yaooqinn/SPARK-33806.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-16 14:09:28 -08:00
Terry Kim 8666d1c39c [SPARK-33800][SQL] Remove command name in AnalysisException message when a relation is not resolved
### What changes were proposed in this pull request?

Based on the discussion https://github.com/apache/spark/pull/30743#discussion_r543124594, this PR proposes to remove the command name in AnalysisException message when a relation is not resolved.

For some of the commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier, when the identifier cannot be resolved, the exception will be something like `Table or view not found for 'SHOW TBLPROPERTIES': badtable`. The command name (`SHOW TBLPROPERTIES` in this case) should be dropped to be consistent with other existing commands.

### Why are the changes needed?

To make the exception message consistent.

### Does this PR introduce _any_ user-facing change?

Yes, the exception message will be changed from
```
Table or view not found for 'SHOW TBLPROPERTIES': badtable
```
to
```
Table or view not found: badtable
```
for commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier.

### How was this patch tested?

Updated existing tests.

Closes #30794 from imback82/remove_cmd_from_exception_msg.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-16 15:56:50 +00:00
Kent Yao 205d8e40bc [SPARK-32991][SQL] [FOLLOWUP] Reset command relies on session initials first
### What changes were proposed in this pull request?

As a follow-up of https://github.com/apache/spark/pull/30045, we modify the RESET command here to respect the per-session initial configs first and then fall back to the `SharedState` conf, which lets each session maintain a different copy of initial configs for resetting.
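
An illustrative sketch of the intended behavior (the conf key and values are examples only, not the PR's test code):

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .master("local[2]")
  .config("spark.sql.shuffle.partitions", "5")   // treated here as a session initial config
  .getOrCreate()

session.sql("SET spark.sql.shuffle.partitions=100")
session.sql("RESET")
// Expected after this change: reading the conf back gives the session initial ("5")
// rather than the SharedState/global default ("200").
println(session.conf.get("spark.sql.shuffle.partitions"))
```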

### Why are the changes needed?

To make the RESET command saner.
### Does this PR introduce _any_ user-facing change?

Yes, RESET will respect the session's initial configs first instead of always going back to the system defaults.

### How was this patch tested?

add new tests

Closes #30642 from yaooqinn/SPARK-32991-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-16 14:36:38 +00:00