Commit graph

4885 commits

Author SHA1 Message Date
Max Gekk 8b97b19ffa [SPARK-33706][SQL] Require fully specified partition identifier in partitionExists()
### What changes were proposed in this pull request?
1. Check that the partition identifier passed to `SupportsPartitionManagement.partitionExists()` is fully specified (specifies all values of partition fields).
2. Remove the custom implementation of `partitionExists()` from `InMemoryPartitionTable`, and re-use the default implementation from `SupportsPartitionManagement`.

### Why are the changes needed?
The method is supposed to check the existence of one partition, but currently it can return `true` for a partially specified partition. This can lead to incorrect command behavior; for instance, commands could modify or place data in the middle of a partition path.
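
A hedged sketch of the fully-specified check described in item 1 above (names and the error message are illustrative, not the actual implementation):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Reject a partition identifier unless it carries a value for every partition field,
// then check whether listing partitions with that identifier returns anything.
def partitionExistsSketch(
    partitionSchema: StructType,
    listMatchingPartitions: InternalRow => Array[InternalRow],
    ident: InternalRow): Boolean = {
  require(ident.numFields == partitionSchema.length,
    s"Partition identifier has ${ident.numFields} values but the partition schema " +
      s"has ${partitionSchema.length} fields")
  listMatchingPartitions(ident).nonEmpty
}
```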

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running existing test suites:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *SupportsPartitionManagementSuite"
```

Closes #30667 from MaxGekk/check-len-partitionExists.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-11 12:48:40 +00:00
Terry Kim 8f5db716fa [SPARK-33654][SQL] Migrate CACHE TABLE to use UnresolvedRelation to resolve identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `CACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in Analyzer as discussed https://github.com/apache/spark/pull/30403/files#r532360022.

### Why are the changes needed?

To resolve the table in the analyzer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #30598 from imback82/cache_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-11 12:39:58 +00:00
Kousuke Saruta 8377aca60a [SPARK-33527][SQL][FOLLOWUP] Fix the scala 2.13 build failure
### What changes were proposed in this pull request?

This PR fixes the Scala 2.13 build failure introduced by #30479.

### Why are the changes needed?

To pass Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Should be done by GitHub Actions.

Closes #30727 from sarutak/fix-scala213-build-failure.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-11 01:53:41 -08:00
Josh Soref c05f6f98b6 [MINOR][SQL] Spelling: enabled - legacy_setops_precedence_enbled
### What changes were proposed in this pull request?

Replace `legacy_setops_precedence_enbled` with `legacy_setops_precedence_enabled`

Alternatively, `legacy_setops_precedence_enabled` could be added, and `legacy_setops_precedence_enbled` retained, and if set the code could honor it and warn about the deprecated spelling.

### Why are the changes needed?

`enabled` is misspelled in `legacy_setops_precedence_enbled`

### Does this PR introduce _any_ user-facing change?

Yes.

It would break current consumers.
Examples include:
* https://www.programmersought.com/article/87752082924/
* 125d873c38/fugue_sql/_antlr/fugue_sqlLexer.py
* https://github.com/search?q=legacy_setops_precedence_enbled&type=code

### How was this patch tested?

It's been included in #30323 for a while (and is now split out here)

Closes #30677 from jsoref/spelling-enabled.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-11 06:49:45 +00:00
gengjiaan 24d7e45d31 [SPARK-33527][SQL] Extend the function of decode so as consistent with mainstream databases
### What changes were proposed in this pull request?
In Spark, `decode(bin, charset)` decodes the first argument using the character set given by the second argument.

Unfortunately, this is NOT what any other SQL vendor understands `DECODE` to do.
`DECODE` is generally shorthand for a simple case expression:

```
SELECT DECODE(c1, 1, 'Hello', 2, 'World', '!') FROM (VALUES (1), (2), (3)) AS T(c1)
=>
(Hello),
(World)
(!)
```
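
A hedged equivalence sketch (assuming a running SparkSession `spark`): the DECODE call above behaves like this plain CASE expression.

```scala
spark.sql("""
  SELECT CASE c1 WHEN 1 THEN 'Hello' WHEN 2 THEN 'World' ELSE '!' END
  FROM VALUES (1), (2), (3) AS T(c1)
""").show()
```
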
Several mainstream databases support this syntax:
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/DECODE.html#GUID-39341D91-3442-4730-BD34-D3CF5D4701CE
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/String/DECODE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CString%20Functions%7C_____10
**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1447.htm
**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_DECODE_expression.html
**Pig**
https://pig.apache.org/docs/latest/api/org/apache/pig/piggybank/evaluation/decode/Decode.html
**Teradata**
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/jtCpCycpEaXESG4d63kMjg
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/decode.html

### Why are the changes needed?
It makes Spark's `decode` consistent with mainstream databases.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Jenkins test.

Closes #30479 from beliefer/SPARK-33527.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-11 05:52:33 +00:00
Linhong Liu 1554977670 [SPARK-33692][SQL] View should use captured catalog and namespace to lookup function
### What changes were proposed in this pull request?
Use the view's captured catalog and namespace to look up functions, so the functions
referenced by a view won't be overridden by a newly created function with the same name
but a different database or function type (i.e., a temporary function).

### Why are the changes needed?
Bug fix: without this PR, changing the database or creating a temporary function with
the same name may cause a failure when querying a view.
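
A hypothetical repro of the problem (function, view, and table names are made up for illustration):

```scala
spark.sql("CREATE FUNCTION up AS 'org.example.MyUpper'")         // permanent function, illustrative class
spark.sql("CREATE VIEW v AS SELECT up(name) FROM people")        // the view captures catalog/namespace
spark.sql("CREATE TEMPORARY FUNCTION up AS 'org.example.Other'") // same name, temporary
spark.sql("SELECT * FROM v") // with this fix, `v` still uses the function captured at creation time
```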

### Does this PR introduce _any_ user-facing change?
Yes, bug fix.

### How was this patch tested?
newly added and existing test cases.

Closes #30662 from linhongliu-db/SPARK-33692.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-10 09:14:07 +00:00
gengjiaan cef28c2c51 [SPARK-32670][SQL][FOLLOWUP] Group exception messages in Catalyst Analyzer in one file
### What changes were proposed in this pull request?
This PR follows up https://github.com/apache/spark/pull/29497.
That PR only gave an example of grouping the `AnalysisException`s thrown in the Analyzer into `QueryCompilationErrors`.
This PR groups the remaining `AnalysisException`s into `QueryCompilationErrors`.

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #30564 from beliefer/SPARK-32670-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-10 08:38:24 +00:00
Terry Kim b112e2bfa6 [SPARK-33714][SQL] Migrate ALTER VIEW ... SET/UNSET TBLPROPERTIES commands to use UnresolvedView to resolve the identifier
### What changes were proposed in this pull request?

This PR adds `allowTemp` flag to `UnresolvedView` so that `Analyzer` can check whether to resolve temp views or not.

This PR also migrates `ALTER VIEW ... SET/UNSET TBLPROPERTIES` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To use `UnresolvedView` for view resolution.

One benefit is that the exception message is better for `ALTER VIEW ... SET/UNSET TBLPROPERTIES`. Before, if a temp view was passed, you would just get a `NoSuchTableException` with `Table or view 'tmpView' not found in database 'default'`. With this PR, you get a more descriptive exception message: `tmpView is a temp view. ALTER VIEW ... SET TBLPROPERTIES expects a permanent view`.

### Does this PR introduce _any_ user-facing change?

The exception message changes as described above.

### How was this patch tested?

Updated existing tests.

Closes #30676 from imback82/alter_view_set_unset_properties.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-10 05:18:34 +00:00
Max Gekk af37c7f411 [SPARK-33558][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. ADD PARTITION tests
### What changes were proposed in this pull request?
1. Move the `ALTER TABLE .. ADD PARTITION` parsing tests to `AlterTableAddPartitionParserSuite`
2. Move the v1 tests for `ALTER TABLE .. ADD PARTITION` from `DDLSuite` and the v2 tests from `AlterTablePartitionV2SQLSuite` into the common trait `AlterTableAddPartitionSuiteBase`, so the tests run for the V1, Hive V1, and V2 datasources.

### Why are the changes needed?
- The unification allows running the common `ALTER TABLE .. ADD PARTITION` tests for DSv1, Hive DSv1, and DSv2.
- We can detect missing features and differences between the DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes #30685 from MaxGekk/unify-alter-table-add-partition-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-10 04:54:52 +00:00
Anton Okolnychyi fa9ce1d4e8
[SPARK-33722][SQL] Handle DELETE in ReplaceNullWithFalseInPredicate
### What changes were proposed in this pull request?

This PR adds `DeleteFromTable` to supported plans in `ReplaceNullWithFalseInPredicate`.

### Why are the changes needed?

This change allows Spark to optimize delete conditions like we optimize filters.
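
A hedged illustration of the kind of rewrite this enables (assuming a SparkSession `spark` and a hypothetical v2 table `t` that supports deletes):

```scala
// The NULL literal in the predicate position is replaced with FALSE, so the condition
// IF(id > 0, NULL, FALSE) folds to FALSE and the DELETE can be planned as a no-op.
spark.sql("DELETE FROM t WHERE IF(id > 0, NULL, FALSE)")
```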

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR extends the existing test cases to also cover `DeleteFromTable`.

Closes #30688 from aokolnychyi/spark-33722.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-09 11:42:54 -08:00
HyukjinKwon b5399d4ef1 [SPARK-33071][SPARK-33536][SQL][FOLLOW-UP] Rename deniedMetadataKeys to nonInheritableMetadataKeys in Alias
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing.

### Why are the changes needed?

To make it easier to maintain and read.

### Does this PR introduce _any_ user-facing change?

No. This is rather a code cleanup.

### How was this patch tested?

Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them.

Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-09 20:26:18 +09:00
Terry Kim 29fed23ba1 [SPARK-33703][SQL] Migrate MSCK REPAIR TABLE to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `MSCK REPAIR TABLE` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `MSCK REPAIR TABLE` is not supported for v2 tables.

### Why are the changes needed?

The PR makes the resolution behavior consistent. For example, before this PR:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("MSCK REPAIR TABLE t") // works fine
```
After this PR:
```
sql("MSCK REPAIR TABLE t")
org.apache.spark.sql.AnalysisException: t is a temp view. 'MSCK REPAIR TABLE' expects a table; line 1 pos 0
```
This is consistent with the behavior of other commands.

### Does this PR introduce _any_ user-facing change?

After this PR, `MSCK REPAIR TABLE t` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`.

### How was this patch tested?

Updated existing tests.

Closes #30664 from imback82/repair_table_V2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-09 05:06:37 +00:00
Wenchen Fan 6fd234503c
[SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++
### What changes were proposed in this pull request?

Currently, Spark treats 0.0 and -0.0 as semantically equal, while still retaining the difference between them so that users can see -0.0 when displaying the data set.

The comparison expressions in Spark take care of the special floating-point numbers and implement the correct semantics. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating-point numbers before comparing them in these places:
1. GROUP BY
2. join keys
3. window partition keys

This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++
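
A quick illustration of the fixed case (assuming a spark-shell session where `spark` and its implicits are available):

```scala
import spark.implicits._

// 0.0 and -0.0 must be normalized before feeding HyperLogLog++, so the approximate
// distinct count should be 1, not 2.
Seq(0.0, -0.0, 0.0).toDF("d").selectExpr("approx_count_distinct(d)").show()
```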

### Why are the changes needed?

Fix the query result

### Does this PR introduce _any_ user-facing change?

Yes, the result of HyperLogLog++ becomes correct now.

### How was this patch tested?

a new test case, and a few more test cases that pass before this PR to improve test coverage.

Closes #30673 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-08 11:41:35 -08:00
Terry Kim c05ee06f5b [SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier
### What changes were proposed in this pull request?

This PR introduces `UnresolvedView` in the resolution framework to resolve the identifier.

This PR then migrates `DROP VIEW` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To use `UnresolvedView` for view resolution. Note that there is no resolution behavior change with this PR.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated existing tests.

Closes #30636 from imback82/drop_view_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-08 14:07:58 +00:00
Max Gekk 2b30dde249 [SPARK-33688][SQL] Migrate SHOW TABLE EXTENDED to new resolution framework
### What changes were proposed in this pull request?
1. Remove old statement `ShowTableStatement`
2. Introduce new command `ShowTableExtended` for  `SHOW TABLE EXTENDED`.

This PR is the first step of new V2 implementation of `SHOW TABLE EXTENDED`, see SPARK-33393.

### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: SPARK-29900.

### Does this PR introduce _any_ user-facing change?
The changes should not affect V1 tables. For V2, Spark outputs the error:
```
SHOW TABLE EXTENDED is not supported for v2 tables.
```

### How was this patch tested?
By running `SHOW TABLE EXTENDED` tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #30645 from MaxGekk/show-table-extended-statement.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-08 12:08:22 +00:00
luluorta 99613cd581 [SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar
### What changes were proposed in this pull request?
The `LikeSimplification` rule does not work correctly in many cases where the pattern contains escape characters, for example:

`SELECT s LIKE 'm%aca' ESCAPE '%' FROM t`
`SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t`

For simplicity, this PR simply skips the rule if `pattern` contains any `escapeChar`.
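
A minimal sketch of the guard, assuming the rule can see the pattern string and the escape character (not the actual rule code):

```scala
// Bail out of LikeSimplification whenever the escape character occurs in the pattern,
// since patterns like 'm%aca' ESCAPE '%' cannot be rewritten to startsWith/endsWith/contains.
def canSimplifyLike(pattern: String, escapeChar: Char): Boolean =
  !pattern.contains(escapeChar)
```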

### Why are the changes needed?
It can produce corrupt results.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added Unit test.

Closes #30625 from luluorta/SPARK-33677.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-08 20:45:25 +09:00
Dongjoon Hyun 031c5ef280
[SPARK-33679][SQL] Enable spark.sql.adaptive.enabled by default
### What changes were proposed in this pull request?

This PR aims to enable `spark.sql.adaptive.enabled` by default for Apache Spark **3.2.0**.

### Why are the changes needed?

By switching the default for Apache Spark 3.2, the whole community can focus more seriously on stabilizing this feature across various situations.

### Does this PR introduce _any_ user-facing change?

Yes, but this is an improvement and it's supposed to have no bugs.

### How was this patch tested?

Pass the CIs.

Closes #30628 from dongjoon-hyun/SPARK-33679.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 23:10:35 -08:00
Terry Kim 5aefc49b0f [SPARK-33664][SQL] Migrate ALTER TABLE ... RENAME TO to use UnresolvedTableOrView to resolve identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER [TABLE|VIEW] ... RENAME TO` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

To use `UnresolvedTableOrView` for table/view resolution. Note that `AlterTableRenameCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated existing tests.

Closes #30610 from imback82/rename_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-08 03:54:16 +00:00
Wenchen Fan c0874ba9f1
[SPARK-33480][SQL][FOLLOWUP] do not expose user data in error message
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30412. This PR updates the error message of char/varchar table insertion length check, to not expose user data.

### Why are the changes needed?

This is risky to expose user data in the error message, especially the string data, as it may contain sensitive data.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes #30653 from cloud-fan/minor2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 13:35:37 -08:00
Wenchen Fan 6aff215077 [SPARK-33693][SQL] deprecate spark.sql.hive.convertCTAS
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30554. Now that we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS.

### Why are the changes needed?

It's confusing to have two configs when one completely covers the other.

### Does this PR introduce _any_ user-facing change?

No, it's a deprecation, not a removal.

### How was this patch tested?

N/A

Closes #30651 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-07 10:50:31 -08:00
Josh Soref c62b84a043 [MINOR] Spelling sql not core
### What changes were proposed in this pull request?

This PR intends to fix typos in the sub-modules:
* `sql/catalyst`
* `sql/hive-thriftserver`
* `sql/hive`

Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618

NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)

### Why are the changes needed?

Misspelled words make it harder to read / understand content.

### Does this PR introduce _any_ user-facing change?

There are various fixes to documentation, etc...

### How was this patch tested?

No testing was performed

Closes #30532 from jsoref/spelling-sql-not-core.

Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-07 08:40:29 -06:00
Kent Yao da72b87374 [SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results
### What changes were proposed in this pull request?

In this PR, we propose to narrow the use cases of the char/varchar data types, which are invalid now or will become invalid later.

### Why are the changes needed?
1. udf
```scala
scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2))

scala> spark.sql("select abcd()").show
scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212)
  at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198)
  at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
  at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
  at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171)
  at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606)
  ... 47 elided
```

2. spark.createDataframe

```
scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show
+--------------------+
|                   c|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
|rich set of highe...|
|MLlib for machine...|
|and Structured St...|
|                    |
|<https://spark.ap...|
|                    |
|[![Jenkins Build]...|
|[![AppVeyor Build...|
|[![PySpark Covera...|
|                    |
|                    |
```

3. reader.schema

```
scala> spark.read.schema("a varchar(2)").text("./README.md").show(100)
+--------------------+
|                   a|
+--------------------+
|      # Apache Spark|
|                    |
|Spark is a unifie...|
|high-level APIs i...|
|supports general ...|
```
4. etc

### Does this PR introduce _any_ user-facing change?

No, we intend to avoid potential breaking changes.

### How was this patch tested?

new tests

Closes #30586 from yaooqinn/SPARK-33641.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-07 13:40:15 +00:00
Yuming Wang 1e0c006748 [SPARK-33617][SQL] Add default parallelism configuration for Spark SQL queries
### What changes were proposed in this pull request?

This PR adds a default parallelism configuration (`spark.sql.default.parallelism`) for Spark SQL and makes it effective for `LocalTableScan`.

### Why are the changes needed?

Avoid generating small files for INSERT INTO TABLE from VALUES, for example:
```sql
CREATE TABLE t1(id int) USING parquet;
INSERT INTO TABLE t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8);
```

Before this pr:
```
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00000-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00001-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00002-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00003-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00004-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00005-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00006-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root 421 Dec  1 01:54 part-00007-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet
-rw-r--r-- 1 root root   0 Dec  1 01:54 _SUCCESS
```

After this pr and set `spark.sql.files.minPartitionNum` to 1:
```
-rw-r--r-- 1 root root 452 Dec  1 01:59 part-00000-6de50c79-e305-4f8d-b6ae-39f46b2619c6-c000.snappy.parquet
-rw-r--r-- 1 root root   0 Dec  1 01:59 _SUCCESS
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30559 from wangyum/SPARK-33617.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Yuming Wang <yumwang@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-07 21:36:52 +09:00
Max Gekk 26c0493318 [SPARK-33676][SQL] Require exact matching of partition spec to the schema in V2 ALTER TABLE .. ADD/DROP PARTITION
### What changes were proposed in this pull request?
Check that partitions specs passed to v2 `ALTER TABLE .. ADD/DROP PARTITION` exactly match to the partition schema (all partition fields from the schema are specified in partition specs).

### Why are the changes needed?
1. To have the same behavior as V1 `ALTER TABLE .. ADD/DROP PARTITION`, which outputs the error:
```sql
spark-sql> create table tab1 (id int, a int, b int) using parquet partitioned by (a, b);
spark-sql> ALTER TABLE tab1 ADD PARTITION (A='9');
Error in query: Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`default`.`tab1`';
```
2. To prevent future errors caused by not fully specified partition specs.

### Does this PR introduce _any_ user-facing change?
Yes. The V2 implementation of `ALTER TABLE .. ADD/DROP PARTITION` outputs the same error as the V1 commands.

### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
```

Closes #30624 from MaxGekk/add-partition-full-spec.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-07 08:14:36 +00:00
Chao Sun e857e06452
[SPARK-33652][SQL] DSv2: DeleteFrom should refresh cache
### What changes were proposed in this pull request?

This changes `DeleteFromTableExec` to also refresh caches referencing the original table, by passing the `refreshCache` callback to the class. Note that in order to construct the callback, I have to change `DataSourceV2ScanRelation` to contain a `DataSourceV2Relation` instead of a `Table`.

### Why are the changes needed?

Currently, DSv2 `DELETE FROM` doesn't refresh caches. This could lead to correctness issues if the stale cache is queried later.

### Does this PR introduce _any_ user-facing change?

Yes. Now delete from table in v2 also refreshes cache.

### How was this patch tested?

Added a test case.

Closes #30597 from sunchao/SPARK-33652.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-06 01:14:22 -08:00
Wenchen Fan 1b4e35d1a8
[SPARK-33651][SQL] Allow CREATE EXTERNAL TABLE with LOCATION for data source tables
### What changes were proposed in this pull request?

This PR removes the restriction and allows CREATE EXTERNAL TABLE with LOCATION for data source tables. It also moves the check from the analyzer rule `ResolveSessionCatalog` to `SessionCatalog`, so that v2 session catalog can overwrite it.

### Why are the changes needed?

It's an unnecessary behavior difference that a Hive serde table can be created with `CREATE EXTERNAL TABLE` if LOCATION is present, while data source tables don't allow `CREATE EXTERNAL TABLE` at all.

### Does this PR introduce _any_ user-facing change?

Yes, now `CREATE EXTERNAL TABLE ... USING ... LOCATION ...` is allowed.
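
A hedged illustration of a statement that is now accepted (assuming a SparkSession `spark`; the schema and path are made up):

```scala
spark.sql("CREATE EXTERNAL TABLE t (id INT) USING parquet LOCATION '/tmp/ext_t'")
```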

### How was this patch tested?

new tests

Closes #30595 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 16:48:31 -08:00
Dongjoon Hyun de9818f043
[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT
### What changes were proposed in this pull request?

This PR aims to update `master` branch version to 3.2.0-SNAPSHOT.

### Why are the changes needed?

Start to prepare Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Pass the CIs.

Closes #30606 from dongjoon-hyun/SPARK-3.2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-04 14:10:42 -08:00
Wenchen Fan acc211d2cf [SPARK-33141][SQL][FOLLOW-UP] Store the max nested view depth in AnalysisContext
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30289. It removes the hack in `View.effectiveSQLConf`, by putting the max nested view depth in `AnalysisContext`. Then we don't get the max nested view depth from the active SQLConf, which keeps changing during nested view resolution.

### Why are the changes needed?

remove hacks.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

If I just remove the hack, `SimpleSQLViewSuite.restrict the nested level of a view` fails. With this fix, it passes again.

Closes #30575 from cloud-fan/view.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-04 14:01:15 +00:00
Jungtaek Lim (HeartSaVioR) 233a8494c8 [SPARK-27237][SS] Introduce State schema validation among query restart
## What changes were proposed in this pull request?

Please refer to the description of [SPARK-27237](https://issues.apache.org/jira/browse/SPARK-27237) for the rationale behind this patch.

This patch proposes to introduce state schema validation by storing the key schema and value schema in a `schema` file (on the first run) and verifying that new key and value schemas for state are compatible with the existing ones. To be clear about the definition of "compatible": a state schema is "compatible" when the number of fields is the same and the data type for each field is the same - Spark has always allowed renaming a field.
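
A minimal sketch of the compatibility rule described above (not the actual validator): same number of fields and same data type per field, while field names may differ.

```scala
import org.apache.spark.sql.types.StructType

def schemasCompatible(existing: StructType, current: StructType): Boolean =
  existing.length == current.length &&
    existing.fields.map(_.dataType).sameElements(current.fields.map(_.dataType))
```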

This patch will prevent running a query that has an incompatible state schema, which reduces the chance of non-deterministic behavior (renaming a field is arguably also a smell of semantic incompatibility, but end users may simply rename a field, so we can't tell) and provides a more informative error message.

## How was this patch tested?

Added UTs.

Closes #24173 from HeartSaVioR/SPARK-27237.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 19:33:11 +09:00
Max Gekk 94c144bdd0 [SPARK-33571][SQL][DOCS] Add a ref to INT96 config from the doc for spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read
### What changes were proposed in this pull request?
For the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead`, improve their descriptions by:
1. Explicitly document which Parquet types those configs influence
2. Refer to the corresponding configs for `INT96`

### Why are the changes needed?
To avoid user confusion like that reported in SPARK-33571, and to make the config descriptions more precise.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`.

Closes #30596 from MaxGekk/clarify-rebase-docs.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 16:26:07 +09:00
Gengliang Wang e8380665c7 [SPARK-33658][SQL] Suggest using Datetime conversion functions for invalid ANSI casting
### What changes were proposed in this pull request?

Suggest users using Datetime conversion functions in the error message of invalid ANSI explicit casting.

### Why are the changes needed?

In ANSI mode, explicit cast between DateTime types and Numeric types is not allowed.
As of now, we have introduced new functions `UNIX_SECONDS`/`UNIX_MILLIS`/`UNIX_MICROS`/`UNIX_DATE`/`DATE_FROM_UNIX_DATE`, we can show suggestions to users so that they can complete these type conversions precisely and easily in ANSI mode.

### Does this PR introduce _any_ user-facing change?

Yes, better error messages

### How was this patch tested?

Unit test

Closes #30603 from gengliangwang/improveErrorMsgOfExplicitCast.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-04 16:24:41 +09:00
Linhong Liu e02324f2dd [SPARK-33142][SPARK-33647][SQL] Store SQL text for SQL temp view
### What changes were proposed in this pull request?
Currently, in Spark, a temp view is saved as its analyzed logical plan, while a permanent view
is kept in the HMS with its original SQL text. As a result, permanent and temporary views behave
differently in some cases. In this PR we store the SQL text for temporary views in order
to unify the behavior between permanent and temporary views.

### Why are the changes needed?
to unify the behavior between permanent and temporary views

### Does this PR introduce _any_ user-facing change?
Yes. With this PR, the temporary view will be re-analyzed when it's referenced, so if the
underlying data source changes, the view will also be updated.
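
A hedged illustration of the effect (table and view names are made up):

```scala
spark.sql("CREATE TEMPORARY VIEW v AS SELECT * FROM t")
spark.sql("ALTER TABLE t ADD COLUMNS (extra INT)")
spark.sql("SELECT * FROM v") // the view is re-analyzed from its SQL text, so the new column appears
```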

### How was this patch tested?
existing and newly added test cases

Closes #30567 from linhongliu-db/SPARK-33142.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-04 06:48:49 +00:00
Gengliang Wang 29e415deac [SPARK-33649][SQL][DOC] Improve the doc of spark.sql.ansi.enabled
### What changes were proposed in this pull request?

Improve the documentation of SQL configuration `spark.sql.ansi.enabled`

### Why are the changes needed?

As there are more and more new features under the SQL configuration `spark.sql.ansi.enabled`, we should make the following clearer:
1. what exactly it is
2. where users can find all the features of the ANSI mode
3. whether all the features come exactly from the SQL standard

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

It's just doc change.

Closes #30593 from gengliangwang/reviseAnsiDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-12-04 10:58:41 +08:00
Max Gekk 85949588b7 [SPARK-33650][SQL] Fix the error from ALTER TABLE .. ADD/DROP PARTITION for non-supported partition management table
### What changes were proposed in this pull request?
In the PR, I propose to change the order of post-analysis checks for the `ALTER TABLE .. ADD/DROP PARTITION` command, and perform the general check (does the table support partition management at all) before specific checks.

### Why are the changes needed?
The error message for the table which doesn't support partition management can mislead users:
```java
PartitionSpecs are not resolved;;
'AlterTableAddPartition [UnresolvedPartitionSpec(Map(id -> 1),None)], false
+- ResolvedTable org.apache.spark.sql.connector.InMemoryTableCatalog2fd64b11, ns1.ns2.tbl, org.apache.spark.sql.connector.InMemoryTable5d3ff859
```
because it says nothing about the root cause of the issue.

### Does this PR introduce _any_ user-facing change?
Yes. After the change, the error message will be:
```
Table ns1.ns2.tbl can not alter partitions
```

### How was this patch tested?
By running the affected test suite `AlterTablePartitionV2SQLSuite`.

Closes #30594 from MaxGekk/check-order-AlterTablePartition.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-03 16:43:15 -08:00
Anton Okolnychyi aa13e207c9
[SPARK-33623][SQL] Add canDeleteWhere to SupportsDelete
### What changes were proposed in this pull request?

This PR provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time.

### Why are the changes needed?

The only way to support delete statements right now is to implement ``SupportsDelete``. According to its Javadoc, that interface is meant for cases when we can delete data without much effort (e.g. like deleting a complete partition in a Hive table).

This PR actually provides us with a way to check if a data source is going to reject the delete via `deleteWhere` at planning time instead of just getting an exception during execution. In the future, we can use this functionality to decide whether Spark should rewrite this delete and execute a distributed query or it can just pass a set of filters.

Consider an example of a partitioned Hive table. If we have a delete predicate like `part_col = '2020'`, we can just drop the matching partition to satisfy this delete. In this case, the data source should return `true` from `canDeleteWhere` and use the filters it accepts in `deleteWhere` to drop the partition. I consider this as a delete without significant effort. At the same time, if we have a delete predicate like `id = 10`, Hive tables would not be able to execute this delete using a metadata only operation without rewriting files. In that case, the data source should return `false` from `canDeleteWhere` and we should use a more sophisticated row-level API to find out which records should be removed (the API is yet to be discussed, but we need this PR as a basis).
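
A rough sketch of that contract (it mirrors the `canDeleteWhere`/`deleteWhere` pair rather than implementing the full `SupportsDelete` interface; `part_col` is the hypothetical partition column):

```scala
import org.apache.spark.sql.sources.Filter

trait DeletableSketch {
  // Called at planning time: return false if deleteWhere would reject these filters.
  def canDeleteWhere(filters: Array[Filter]): Boolean
  def deleteWhere(filters: Array[Filter]): Unit
}

class PartitionOnlyDeletes extends DeletableSketch {
  override def canDeleteWhere(filters: Array[Filter]): Boolean =
    filters.forall(_.references.sameElements(Array("part_col"))) // metadata-only deletes
  override def deleteWhere(filters: Array[Filter]): Unit = {
    // drop the matching partitions here
  }
}
```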

If we decide to support subqueries and all delete use cases by simply extending the existing API, this will mean all data sources will have to implement a lot of Spark logic to determine which records changed. I don't think we want to go that way as the Spark logic to determine which records should be deleted is independent of the underlying data source. So the assumption is that Spark will execute a plan to find which records must be deleted for data sources that return `false` from `canDeleteWhere`.
### Does this PR introduce _any_ user-facing change?

Yes but it is backward compatible.

### How was this patch tested?

This PR comes with a new test.

Closes #30562 from aokolnychyi/spark-33623.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-03 09:12:30 -08:00
Wenchen Fan 0706e64c49 [SPARK-30098][SQL] Add a configuration to use default datasource as provider for CREATE TABLE command
### What changes were proposed in this pull request?

For the CREATE TABLE [AS SELECT] command, create a native Parquet table if neither USING nor STORED AS is specified and `spark.sql.legacy.createHiveTableByDefault` is false.
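
A hedged illustration of the behavior with the config turned off (assuming a SparkSession `spark`; the table name is made up, and setting the config at runtime this way is an assumption here):

```scala
spark.conf.set("spark.sql.legacy.createHiveTableByDefault", "false")
spark.sql("CREATE TABLE t (id INT)")                // no USING / STORED AS clause
spark.sql("DESCRIBE TABLE EXTENDED t").show(false)  // expected to report the parquet provider
```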

This is a retry after we unify the CREATE TABLE syntax. It partially reverts d2bec5e265

This PR allows `CREATE EXTERNAL TABLE` when `LOCATION` is present. This was not allowed for data source tables before, which was an unnecessary behavior difference from Hive tables.

### Why are the changes needed?

Changing from Hive text table to native Parquet table has many benefits:
1. be consistent with `DataFrameWriter.saveAsTable`.
2. better performance
3. better support for nested types (Hive text table doesn't work well with nested types, e.g. `insert into t values struct(null)` actually inserts a null value not `struct(null)` if `t` is a Hive text table, which leads to wrong result)
4. better interoperability as Parquet is a more popular open file format.

### Does this PR introduce _any_ user-facing change?

No by default. If the config is set, the behavior change is described below:

Behavior-wise, the change is very small as the native Parquet table is also Hive-compatible. All the Spark DDL commands that works for hive tables also works for native Parquet tables, with two exceptions: `ALTER TABLE SET [SERDE | SERDEPROPERTIES]` and `LOAD DATA`.

char/varchar behavior has been taken care by https://github.com/apache/spark/pull/30412, and there is no behavior difference between data source and hive tables.

One potential issue is `CREATE TABLE ... LOCATION ...` while users want to directly access the files later. It's more like a corner case and the legacy config should be good enough.

Another potential issue is users may use Spark to create the table and then use Hive to add partitions with different serde. This is not allowed for Spark native tables.

### How was this patch tested?

Re-enable the tests

Closes #30554 from cloud-fan/create-table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 15:24:44 +00:00
luluorta 512fb32b38 [SPARK-26218][SQL][FOLLOW UP] Fix the corner case of codegen when casting float to Integer
### What changes were proposed in this pull request?
This is a followup of [#27151](https://github.com/apache/spark/pull/27151). It fixes the same issue for the codegen path.

### Why are the changes needed?
It can produce corrupt results.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added Unit test.

Closes #30585 from luluorta/SPARK-26218.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 14:58:56 +00:00
Gengliang Wang ff13f574e6 [SPARK-20044][SQL] Add new function DATE_FROM_UNIX_DATE and UNIX_DATE
### What changes were proposed in this pull request?

Add new functions DATE_FROM_UNIX_DATE and UNIX_DATE for conversion between Date type and Numeric types.
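
A quick illustration (assuming a SparkSession `spark`; days are counted from 1970-01-01):

```scala
spark.sql("SELECT DATE_FROM_UNIX_DATE(1)").show()      // 1970-01-02
spark.sql("SELECT UNIX_DATE(DATE'1970-01-02')").show() // 1
```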

### Why are the changes needed?

1. Explicit conversion between Date type and Numeric types is disallowed in ANSI mode. We need to provide new functions for users to complete the conversion.

2. We have introduced new functions from Bigquery for conversion between Timestamp type and Numeric types: TIMESTAMP_SECONDS, TIMESTAMP_MILLIS, TIMESTAMP_MICROS , UNIX_SECONDS, UNIX_MILLIS, and UNIX_MICROS. It makes sense to add functions for conversion between Date type and Numeric types as well.

### Does this PR introduce _any_ user-facing change?

Yes, two new datetime functions are added.

### How was this patch tested?

Unit tests

Closes #30588 from gengliangwang/dateToNumber.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-03 14:04:08 +00:00
Gengliang Wang b76c6b759c
[SPARK-33627][SQL] Add new function UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS
### What changes were proposed in this pull request?

As https://github.com/apache/spark/pull/28534 added functions from [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions) for converting numbers to timestamps, this PR adds the functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamps to numbers.
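
A quick illustration (assuming a SparkSession `spark`; the `Z` suffix pins the literal to UTC):

```scala
spark.sql("SELECT UNIX_SECONDS(TIMESTAMP'1970-01-01 00:00:01Z')").show() // 1
spark.sql("SELECT UNIX_MILLIS(TIMESTAMP'1970-01-01 00:00:01Z')").show()  // 1000
spark.sql("SELECT UNIX_MICROS(TIMESTAMP'1970-01-01 00:00:01Z')").show()  // 1000000
```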

### Why are the changes needed?

1. Symmetry of the conversion functions
2. Casting the timestamp type to numeric types is disallowed in ANSI mode, so we should provide functions for users to complete the conversion.

### Does this PR introduce _any_ user-facing change?

3 new functions UNIX_SECONDS, UNIX_MILLIS and UNIX_MICROS for converting timestamp to long type.

### How was this patch tested?

Unit tests.

Closes #30566 from gengliangwang/timestampLong.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-02 12:44:39 -08:00
yi.wu a082f4600b [SPARK-33071][SPARK-33536][SQL] Avoid changing dataset_id of LogicalPlan in join() to not break DetectAmbiguousSelfJoin
### What changes were proposed in this pull request?

Currently, `join()` uses `withPlan(logicalPlan)` for convenience, to call some Dataset functions. But it makes the `dataset_id` inconsistent between the `logicalPlan` and the original `Dataset` (because `withPlan(logicalPlan)` creates a new Dataset with a new id and resets the `dataset_id` to the new id of the `logicalPlan`). As a result, it breaks the rule `DetectAmbiguousSelfJoin`.

In this PR, we propose to drop the usage of `withPlan` but use the `logicalPlan` directly so its `dataset_id` doesn't change.

Besides, this PR also removes the related metadata (`DATASET_ID_KEY`, `COL_POS_KEY`) when an `Alias` constructs its own metadata, because the `Alias` is no longer a reference column after converting to an `Attribute`. To achieve that, we add a new field, `deniedMetadataKeys`, to indicate the metadata that needs to be removed.

### Why are the changes needed?

The query below returns a wrong result, while it should throw an ambiguous self-join exception instead:

```scala
val emp1 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop"),
  TestData(4, "IT")).toDS()
val emp2 = Seq[TestData](
  TestData(1, "sales"),
  TestData(2, "personnel"),
  TestData(3, "develop")).toDS()
val emp3 = emp1.join(emp2, emp1("key") === emp2("key")).select(emp1("*"))
emp1.join(emp3, emp1.col("key") === emp3.col("key"), "left_outer")
  .select(emp1.col("*"), emp3.col("key").as("e2")).show()

// wrong result
+---+---------+---+
|key|    value| e2|
+---+---------+---+
|  1|    sales|  1|
|  2|personnel|  2|
|  3|  develop|  3|
|  4|       IT|  4|
+---+---------+---+
```
This PR fixes the wrong behaviour.

### Does this PR introduce _any_ user-facing change?

Yes, users hit the exception instead of the wrong result after this PR.

### How was this patch tested?

Added a new unit test.

Closes #30488 from Ngone51/fix-self-join.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-02 17:51:22 +00:00
xuewei.linxuewei 58583f7c3f [SPARK-33619][SQL] Fix GetMapValueUtil code generation error
### What changes were proposed in this pull request?

Fix a codegen bug introduced by SPARK-33460:

```
GetMapValueUtil

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

SHOULD BE

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");"""
```

The reason SPARK-33460 failed to detect this bug via UT is that `checkExceptionInExpression` did not work as expected: unlike `checkEvaluation`, it did not try to evaluate the expression in BOTH `CODEGEN_ONLY` and `NO_CODEGEN` modes. This PR also fixes that test bug.

### Why are the changes needed?

Bug Fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Add UT and Existing UT.

Closes #30560 from leanken/leanken-SPARK-33619.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-02 16:10:45 +00:00
HyukjinKwon df8d3f1bf7 [SPARK-33544][SQL][FOLLOW-UP] Rename NoSideEffect to NoThrow and clarify the documentation more
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/30504. It proposes:

- Rename `NoSideEffect` to `NoThrow`, and use `Expression.deterministic` together where it is used.
- Clarify, in the docs in the expressions, that it means they don't throw exceptions

### Why are the changes needed?

`NoSideEffect` virtually means that `Expression.eval` does not throw an exception, and the expressions are deterministic.
It's best to be explicit so `NoThrow` was proposed -  I looked if there's a similar name to represent this concept and borrowed the name of [nothrow](https://clang.llvm.org/docs/AttributeReference.html#nothrow).
For determinism, we already have a way to note it under `Expression.deterministic`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually ran the existing unittests written.

Closes #30570 from HyukjinKwon/SPARK-33544.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-02 16:03:08 +00:00
Dongjoon Hyun 290aa02179 [SPARK-33618][CORE] Use hadoop-client instead of hadoop-client-api to make hadoop-aws work
### What changes were proposed in this pull request?

This reverts commit SPARK-33212 (cb3fa6c936) mostly with three exceptions:
1. `SparkSubmitUtils` was updated recently by SPARK-33580
2. `resource-managers/yarn/pom.xml` was updated recently by SPARK-33104 to add `hadoop-yarn-server-resourcemanager` test dependency.
3. Adjust `com.fasterxml.jackson.module:jackson-module-jaxb-annotations` dependency in K8s module which is updated recently by SPARK-33471.

### Why are the changes needed?

According to [HADOOP-16080](https://issues.apache.org/jira/browse/HADOOP-16080) since Apache Hadoop 3.1.1, `hadoop-aws` doesn't work with `hadoop-client-api`. It fails at write operation like the following.

**1. Spark distribution with `-Phadoop-cloud`**

```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY
20/11/30 23:01:24 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Spark context available as 'sc' (master = local[*], app id = local-1606806088715).
Spark session available as 'spark'.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.1.0-SNAPSHOT
      /_/

Using Scala version 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_272)
Type in expressions to have them evaluated.
Type :help for more information.

scala> spark.read.parquet("s3a://dongjoon/users.parquet").show
20/11/30 23:01:34 WARN MetricsConfig: Cannot locate configuration: tried hadoop-metrics2-s3a-file-system.properties,hadoop-metrics2.properties
+------+--------------+----------------+
|  name|favorite_color|favorite_numbers|
+------+--------------+----------------+
|Alyssa|          null|  [3, 9, 15, 20]|
|   Ben|           red|              []|
+------+--------------+----------------+

scala> Seq(1).toDF.write.parquet("s3a://dongjoon/out.parquet")
20/11/30 23:02:14 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 2)/ 1]
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
```

**2. Spark distribution without `-Phadoop-cloud`**
```scala
$ bin/spark-shell --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY -c spark.eventLog.enabled=true -c spark.eventLog.dir=s3a://dongjoon/spark-events/ --packages org.apache.hadoop:hadoop-aws:3.2.0,org.apache.hadoop:hadoop-common:3.2.0
...
java.lang.NoSuchMethodError: org.apache.hadoop.util.SemaphoredDelegatingExecutor.<init>(Lcom/google/common/util/concurrent/ListeningExecutorService;IZ)V
  at org.apache.hadoop.fs.s3a.S3AFileSystem.create(S3AFileSystem.java:772)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #30508 from dongjoon-hyun/SPARK-33212-REVERT.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-02 18:23:48 +09:00
Cheng Su 51ebcd95a5 [SPARK-32863][SS] Full outer stream-stream join
### What changes were proposed in this pull request?

This PR adds full outer stream-stream join. The implementation of full outer join is as follows (a rough per-row sketch appears after the list):
* For a left-side input row, check if there's a match in the right-side state store.
  * If there's a match, output the joined row; otherwise output nothing. Put the row in the left-side state store.
* For a right-side input row, check if there's a match in the left-side state store.
  * If there's a match, output the joined row; otherwise output nothing. Put the row in the right-side state store.
* State store eviction: evict rows from the left/right state store below the watermark, and output rows that never matched before (a combination of left outer and right outer join).
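
A schematic per-input-row sketch of the steps above (not the real streaming operator; `matches`, `store`, and `join` stand in for the other side's state store lookup, this side's state store put, and the row join):

```scala
def processInputRow[Row](
    row: Row,
    matches: Row => Seq[Row],      // look up matches in the other side's state store
    store: Row => Unit,            // save the row in this side's state store
    join: (Row, Row) => Row): Seq[Row] = {
  store(row)
  matches(row).map(other => join(row, other)) // emit joined rows only when a match exists
}
// On eviction below the watermark, rows that never matched are emitted with nulls on
// the other side, combining left-outer and right-outer behavior.
```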

### Why are the changes needed?

Enable more use cases for spark stream-stream join.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.

Closes #30395 from c21/stream-foj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-02 10:17:00 +09:00
Thomas Graves f71f34572d [SPARK-33544][SQL] Optimize size of CreateArray/CreateMap to be the size of its children
### What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-32295 added an optimization to insert a filter for not-null and size > 0 when using inner explode/inline. This is fine in most cases, but the extra filter is not needed if the explode is over a CreateArray that doesn't use Literals (the Literal case is already handled). In that case you know the values aren't null and the array has a known size, and the empty-array case is already handled.

The not-null check is already optimized out because CreateArray and CreateMap are not nullable, which leaves the size > 0 check. To handle that, this PR makes ConstantFolding optimize the size check to the number of children of the array or map. That makes it a literal, which is then ultimately optimized out.
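
A hedged illustration (assuming a spark-shell session): the implicit `size(...) > 0` filter inserted for the inner explode folds to a literal comparison and disappears from the optimized plan.

```scala
import org.apache.spark.sql.functions._

val df = spark.range(3).select(explode(array(col("id"), col("id") + 1)))
df.explain(true) // the optimized plan should no longer carry a size(...) > 0 filter
```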

### Why are the changes needed?
remove unneeded filter

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
Unit tests added and manually tested various cases

Closes #30504 from tgravescs/SPARK-33544.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-02 09:50:02 +09:00
Anton Okolnychyi c24f2b2d6a
[SPARK-33612][SQL] Add dataSourceRewriteRules batch to Optimizer
### What changes were proposed in this pull request?

This PR adds a new batch to the optimizer for executing rules that rewrite plans for data sources.

### Why are the changes needed?

Right now, we have a special place in the optimizer where we construct v2 scans. As time shows, we need more rewrite rules that would be executed after the operator optimization and before any stats-related rules for v2 tables. Not all rules will be specific to reads. One option is to rename the current batch into something more generic but it would require changing quite some places. That's why it seems better to introduce a new batch and use it for all rewrites. The name is generic so that we don't limit ourselves to v2 data sources only.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The change is trivial and SPARK-23889 will depend on it.

Closes #30558 from aokolnychyi/spark-33612.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-01 09:27:46 -08:00
Anton Okolnychyi 478fb7f528 [SPARK-33608][SQL] Handle DELETE/UPDATE/MERGE in PullupCorrelatedPredicates
### What changes were proposed in this pull request?

This PR adds logic to handle DELETE/UPDATE/MERGE plans in `PullupCorrelatedPredicates`.
### Why are the changes needed?

Right now, `PullupCorrelatedPredicates` applies only to filters and unary nodes. As a result, correlated predicates in DELETE/UPDATE/MERGE are not rewritten.
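
A hedged illustration (hypothetical v2 tables `t` and `s` that support deletes): the correlated EXISTS predicate in this DELETE is now pulled up and rewritten the same way it would be in an ordinary filter.

```scala
spark.sql("DELETE FROM t WHERE EXISTS (SELECT 1 FROM s WHERE s.id = t.id)")
```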

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The PR adds 3 new test cases.

Closes #30555 from aokolnychyi/spark-33608.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 14:11:01 +00:00
Prakhar Jain cf4ad212b1 [SPARK-33503][SQL] Refactor SortOrder class to allow multiple childrens
### What changes were proposed in this pull request?
This is a followup of #30302. As part of this PR, the sameOrderExpressions set is made part of the children of the SortOrder node, so that they don't need the special handling done in #30302.

### Why are the changes needed?
sameOrderExpressions should get the same treatment as a child, so making them part of the children makes it easy to transform them.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30430 from prakharjain09/SPARK-33400-sortorder-refactor.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-01 21:13:27 +09:00
gengjiaan 9273d4250d [SPARK-33045][SQL][FOLLOWUP] Support built-in function like_any and fix StackOverflowError issue
### What changes were proposed in this pull request?
Spark already supports the `LIKE ANY` syntax, but it will throw a `StackOverflowError` if there are many elements (more than 14378 elements). We should implement a built-in function for `LIKE ANY` to fix this issue.

Why can the stack overflow happen in the current approach?
The current approach uses reduceLeft to connect each `Like(e, p)`, which makes the call depth of the thread too large, causing `StackOverflowError` problems.

Why does the fix in this PR avoid the error?
This PR implements a built-in function for `LIKE ANY`, which avoids the deep expression nesting and therefore the issue.
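
A quick illustration (assuming a SparkSession `spark`): the pattern list now maps to a single built-in expression instead of a reduceLeft chain of `Like` disjunctions, so long pattern lists no longer blow the stack.

```scala
spark.sql("SELECT 'foobar' LIKE ANY ('%foo%', '%bar%', 'baz%')").show() // true
```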

### Why are the changes needed?
1. Fix the `StackOverflowError` issue.
2. Support the built-in function `like_any`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #30465 from beliefer/SPARK-33045-like_any-bak.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-01 11:48:30 +00:00
Gabor Somogyi e5bb2937f6 [SPARK-32032][SS] Avoid infinite wait in driver because of KafkaConsumer.poll(long) API
### What changes were proposed in this pull request?
Deprecated `KafkaConsumer.poll(long)` API calls may cause an infinite wait in the driver. In this PR I've added a new `AdminClient` based offset fetching which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to reach the newly added functionality. The Structured Streaming migration guide contains more information on what migration considerations must be made. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details.
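
Opting in to the new behavior is a single config flip (the flag defaults to `true`, i.e. the deprecated `KafkaConsumer.poll(long)` path):

```scala
spark.conf.set("spark.sql.streaming.kafka.useDeprecatedOffsetFetching", "false")
```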

The PR contains the following changes:
* Added `AdminClient` based offset fetching
* GroupId prefix feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* GroupId override feature removed from driver but only in `AdminClient` based approach  (`AdminClient` doesn't need any GroupId)
* Additional unit tests
* Code comment changes
* Minor bugfixes here and there
* Removed the Kafka auto topic creation feature, but only in the `AdminClient` based approach (please see the doc for the rationale). In short, it's very hidden, probably never used in production, and error prone.
* Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration`

### Why are the changes needed?
Driver may hang forever.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.
Cluster test with simple Kafka topic to another topic query.
Documentation:
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #29729 from gaborgsomogyi/SPARK-32032.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-12-01 20:34:00 +09:00