5688 commits

Author SHA1 Message Date
Angerszhuuuu 568ad6aa44 [SPARK-36637][SQL] Provide proper error message when use undefined window frame
### What changes were proposed in this pull request?
The following two cases of using an undefined window frame should produce a proper error message.

1. Using an undefined window frame with a window function:
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
The original error message is
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
This is confusing, because the query does use a window frame `w`; the actual problem is that `w` is not defined.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```

2. Using an undefined window frame with an aggregate function:
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
The original error message is
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, the conversion to a global aggregate should skip `UnresolvedWindowExpression`.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```

### Why are the changes needed?
Provide proper error message

### Does this PR introduce _any_ user-facing change?
Yes, error messages are improved as described above.

### How was this patch tested?
Added UT

Closes #33892 from AngersZhuuuu/SPARK-36637.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-02 22:32:31 +08:00
Kazuyuki Tanimura 799a0116a8 [SPARK-36607][SQL] Support BooleanType in UnwrapCastInBinaryComparison
### What changes were proposed in this pull request?
This PR proposes to add `BooleanType` support to the `UnwrapCastInBinaryComparison` optimizer, which currently supports `NumericType` only.

The main idea is to treat `BooleanType` as a 1-bit integer so that we can utilize all optimizations already defined in `UnwrapCastInBinaryComparison`.

This work is an extension of SPARK-24994 and SPARK-32858

### Why are the changes needed?
Without this PR, Spark cannot properly optimize the filter in the following case:
```
SELECT * FROM t WHERE boolean_field = 2
```
The above query creates a filter of `cast(boolean_field, int) = 2`. The cast prevents the filter from being pushed down. In contrast, this PR creates a `false` filter and returns early, since no rows can match anyway (empty result).
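
The "1-bit integer" idea can be sketched in plain Python. This is only a toy model, not the code of `UnwrapCastInBinaryComparison`: the function name `unwrap_boolean_equality` is hypothetical, only the `=` comparison is covered, and NULL handling is deliberately ignored.

```python
from typing import Optional


def unwrap_boolean_equality(literal: int) -> Optional[bool]:
    """Rewrite `cast(boolean_field AS int) = literal` without the cast.

    Returns the boolean value that `boolean_field` must equal, or None when
    no boolean can match, i.e. the whole filter folds to `false`.
    """
    # A boolean cast to int can only be 0 (false) or 1 (true).
    if literal == 0:
        return False
    if literal == 1:
        return True
    # Outside the 1-bit range: the comparison can never be true.
    return None


print(unwrap_boolean_equality(1))  # corresponds to `boolean_field = true`
print(unwrap_boolean_equality(2))  # no match possible, filter is `false`
```

Once the cast is gone, the comparison is a plain predicate on `boolean_field`, which is the form that can be pushed down.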

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passed existing tests
```
build/sbt "catalyst/test"
build/sbt "sql/test"
```
Added unit tests
```
build/sbt "catalyst/testOnly *UnwrapCastInBinaryComparisonSuite   -- -z SPARK-36607"
build/sbt "sql/testOnly *UnwrapCastInComparisonEndToEndSuite  -- -z SPARK-36607"
```

Closes #33865 from kazuyukitanimura/SPARK-36607.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-01 14:27:30 +08:00
Bo Zhang e33cdfb317 [SPARK-36533][SS] Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches
### What changes were proposed in this pull request?

This change creates a new type of Trigger: Trigger.AvailableNow for streaming queries. It is like Trigger.Once, which processes all available data and then stops the query, but with better scalability since data can be processed in multiple batches instead of one.

To achieve this, this change proposes a new interface `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of streaming queries with Trigger.AvailableNow, to let the source record the offset for the current latest data at the time (a.k.a. the target offset for the query). The source should then behave as if there is no new data coming in after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called.

This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`.

For other sources that do not implement `SupportsTriggerAvailableNow`, this change also provides a new class `FakeLatestOffsetSupportsTriggerAvailableNow`, which wraps the sources and makes them support Trigger.AvailableNow, by overriding their `latestOffset` method to always return the latest offset at the beginning of the query.
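
The contract described above (record a target offset once, then never report anything beyond it) can be illustrated with a small plain-Python model. It is a sketch under assumptions, not Spark code: `CappedSource`, its integer offsets, and the `max_per_batch` limit are all hypothetical stand-ins for a real source with admission control.

```python
class CappedSource:
    """Toy source whose offsets are plain integers (an assumption)."""

    def __init__(self) -> None:
        self._available = 0   # offset of the newest data that has arrived
        self._target = None   # target offset, fixed when the query starts

    def add_data(self, count: int) -> None:
        self._available += count

    def prepare_for_trigger_available_now(self) -> None:
        # Record the latest offset at the beginning of the query.
        self._target = self._available

    def latest_offset(self, start: int, max_per_batch: int) -> int:
        # Never report an offset beyond the recorded target, and respect
        # the per-batch limit so the data is split into several batches.
        limit = self._available if self._target is None else self._target
        return min(start + max_per_batch, limit)


source = CappedSource()
source.add_data(10)
source.prepare_for_trigger_available_now()
source.add_data(5)  # arrives after the query started, so it is ignored

batches = []
start = 0
while True:
    end = source.latest_offset(start, max_per_batch=4)
    if end == start:  # nothing left below the target offset: stop the query
        break
    batches.append((start, end))
    start = end
print(batches)
```

The ten records that existed at query start are consumed in three batches, and the five late records are left for a later run.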

### Why are the changes needed?

Currently, streaming queries with Trigger.Once always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited; beyond that limit, the Spark driver runs out of memory.

### Does this PR introduce _any_ user-facing change?

Users will be able to use Trigger.AvailableNow (to process all available data then stop the streaming query) with this change.

### How was this patch tested?

Added unit tests.

Closes #33763 from bozhang2820/new-trigger.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-01 15:02:21 +09:00
Hyukjin Kwon 4ed2dab5ee [SPARK-36608][SQL] Support TimestampNTZ in Arrow
### What changes were proposed in this pull request?

This PR proposes to add the support of `TimestampNTZType` in Arrow APIs.
Now, Arrow can write `TimestampNTZType` as Timestamp with `null` timezone in Arrow.

### Why are the changes needed?

To complete the support of `TimestampNTZType` in Apache Spark.

### Does this PR introduce _any_ user-facing change?

Yes, the Arrow APIs (`ArrowColumnVector`) can now write `TimestampNTZType`

### How was this patch tested?

Unit tests were added.

Closes #33875 from HyukjinKwon/SPARK-36608-arrow.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-01 10:23:42 +09:00
Gengliang Wang 8a52ad9f82 [SPARK-36606][DOCS][TESTS] Enhance the docs and tests of try_add/try_divide
### What changes were proposed in this pull request?

The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval

And, the `try_divide` function allows the following inputs:

- number, number
- interval, number

However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.

### Why are the changes needed?

Improve documentation and tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)

Closes #33861 from gengliangwang/enhanceTryDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-29 10:30:04 +09:00
Gengliang Wang e650d06ba9 [SPARK-36597][DOCS] Fix issues in SQL function docs
### What changes were proposed in this pull request?

* the functions make_dt_interval and make_ym_interval should make it clear that some of the fields are optional
* remove the `|` symbol from the doc of `bit_get` https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#bit_get
* Address one missing comment in https://github.com/apache/spark/pull/33824#discussion_r695405699

### Why are the changes needed?

Improve the documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build doc and preview:
![image](https://user-images.githubusercontent.com/1097932/130996918-8c1fff88-ef5a-434b-8445-df7140bad3ba.png)
![image](https://user-images.githubusercontent.com/1097932/130996954-0ced28e7-fb90-4fcc-857e-6ccc31dc3c09.png)

![image](https://user-images.githubusercontent.com/1097932/130955106-5ae32dfc-6e89-4e28-bb8a-6c1b5213051c.png)

![image](https://user-images.githubusercontent.com/1097932/130922351-2f0f262d-5624-4d08-ba83-dfa3ed0b646b.png)

Closes #33847 from gengliangwang/auditSQLDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-27 13:29:34 +08:00
Jungtaek Lim bc32144a91 [SPARK-36595][SQL][SS][DOCS] Document window & session_window function in SQL API doc
### What changes were proposed in this pull request?

This PR proposes to document `window` & `session_window` function in SQL API doc page.

Screenshot of functions:

> window

![스크린샷 2021-08-26 오후 6 34 58](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)

> session_window

![스크린샷 2021-08-26 오후 6 35 19](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)

### Why are the changes needed?

Description is missing in both `window` / `session_window` functions for SQL API page.

### Does this PR introduce _any_ user-facing change?

Yes, the description of `window` / `session_window` functions will be available in SQL API page.

### How was this patch tested?

Only doc changes.

Closes #33846 from HeartSaVioR/SPARK-36595.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-27 12:39:09 +09:00
Wenchen Fan 72d6d64835 [SPARK-36587][SQL] Migrate CreateNamespaceStatement to v2 command framework
### What changes were proposed in this pull request?

This PR migrates CreateNamespaceStatement to the v2 command framework. Two new logical plans `UnresolvedObjectName` and `ResolvedObjectName` are introduced to support these CreateXXXStatements.

### Why are the changes needed?

Avoid duplicated code

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33835 from cloud-fan/ddl.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 20:36:04 +08:00
Gengliang Wang 1a42aa5bd4 [SPARK-36457][DOCS] Review and fix issues in Scala/Java API docs
### What changes were proposed in this pull request?

Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:

- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc

### Why are the changes needed?

Improve API docs

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33824 from gengliangwang/auditDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-26 12:59:18 +08:00
Pablo Langa 14622fcec8 [SPARK-36488][SQL] Improve error message with quotedRegexColumnNames
### What changes were proposed in this pull request?

When `spark.sql.parser.quotedRegexColumnNames=true` and a pattern is used in a place where it is not allowed, the error message is a little confusing:

```
scala> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true")

scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)")
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'divide'
```
This PR attempts to improve the error message
```
scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)")
org.apache.spark.sql.AnalysisException: Invalid usage of regular expression in expression 'divide'
```

### Why are the changes needed?

To clarify the error message with this option active

### Does this PR introduce _any_ user-facing change?

Yes, it changes the error message.

### How was this patch tested?

Unit testing and manual testing

Closes #33802 from planga82/feature/spark36488_improve_error_message.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 11:33:40 +08:00
Max Gekk 159ff9fd14 [SPARK-36590][SQL] Convert special timestamp_ntz values in the session time zone
### What changes were proposed in this pull request?
In the PR, I propose to use the session time zone (see the SQL config `spark.sql.session.timeZone`) instead of the JVM default time zone when converting special timestamp_ntz strings such as "today", "tomorrow" and so on.

### Why are the changes needed?
The current implementation is based on the system time zone, which contradicts other functions/classes that use the session time zone. For example, Spark doesn't respect the user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```

### Does this PR introduce _any_ user-facing change?
Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```
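
The effect of the session time zone on `timestamp_ntz'now'` can be reproduced with the Python standard library. This is a sketch, not Spark's conversion code: `special_now_ntz` is a hypothetical helper, the instant is made up for illustration, and fixed UTC offsets stand in for the named zones.

```python
from datetime import datetime, timedelta, timezone


def special_now_ntz(instant: datetime, session_zone: timezone) -> datetime:
    """Wall-clock time of `instant` in the session zone, with the zone dropped."""
    return instant.astimezone(session_zone).replace(tzinfo=None)


# Fixed offsets stand in for the named zones on this date (an assumption
# made to avoid depending on a time zone database).
amsterdam = timezone(timedelta(hours=2))     # Europe/Amsterdam in summer
los_angeles = timezone(timedelta(hours=-7))  # America/Los_Angeles in summer

instant = datetime(2021, 8, 25, 16, 48, 5, tzinfo=timezone.utc)
print(special_now_ntz(instant, amsterdam))    # 2021-08-25 18:48:05
print(special_now_ntz(instant, los_angeles))  # 2021-08-25 09:48:05
```

The same instant yields different zone-less timestamps, which is why the choice between the JVM default zone and the session zone is visible to users.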

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```

Closes #33838 from MaxGekk/fix-ts_ntz-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 10:09:18 +08:00
Gengliang Wang 18143fb426 [SPARK-36585][SQL][DOCS] Support setting "since" version in FunctionRegistry
### What changes were proposed in this pull request?

Spark 3.2.0 includes two new functions `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation, the since versions of both functions are `1.0.0` since they are based on the expression `RLike`:

- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like

This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`

### Why are the changes needed?

Correct the SQL doc

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually

Closes #33834 from gengliangwang/allowSQLFunVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-25 22:32:20 +08:00
Max Gekk df0ec56723 [SPARK-36567][SQL] Support foldable special datetime strings by CAST
### What changes were proposed in this pull request?
In the PR, I propose to add a new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz with literals of those types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.
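
As a rough illustration of the folding step, the plain-Python sketch below maps a special string to a concrete date, or reports that the cast should be left alone. It is not the `SpecialDatetimeValues` rule itself: `fold_special_date` and `SPECIAL_DATE_OFFSETS` are hypothetical names, only dates are handled, and only the three values named above are recognized.

```python
from datetime import date, timedelta
from typing import Optional

# Day offsets relative to the current date for each special value.
SPECIAL_DATE_OFFSETS = {"yesterday": -1, "today": 0, "tomorrow": 1}


def fold_special_date(text: str, current_date: date) -> Optional[date]:
    """Return a date literal for a special string, or None to keep the cast."""
    offset = SPECIAL_DATE_OFFSETS.get(text)
    if offset is None:
        return None
    return current_date + timedelta(days=offset)


current = date(2021, 8, 24)
print(fold_special_date("today", current))       # 2021-08-24
print(fold_special_date("tomorrow", current))    # 2021-08-25
print(fold_special_date("2021-01-01", current))  # None: an ordinary cast
```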

### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance, this works in Spark 3.1:
```sql
select ts_col > 'now';
```
but the query currently fails, and users have to use a typed timestamp literal:
```sql
select ts_col > timestamp'now';
```

### Does this PR introduce _any_ user-facing change?
No. The previous release 3.1 already supported the feature, until it was removed by https://github.com/apache/spark/pull/32714.

### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running the new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```

Closes #33816 from MaxGekk/foldable-datetime-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-25 14:08:59 +08:00
Hyukjin Kwon 93cec49212 [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move the distributed-sequence index implementation into the SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage the optimizations of the SQL engine and avoid an unnecessary shuffle when creating the default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests were added. Also, this PR will run all unit tests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:02:53 +09:00
PengLei 3e32ea17db [SPARK-36336][SQL] Add new exception of base exception used in QueryExecutionErrors
### What changes were proposed in this pull request?
When we refactor the query execution errors to use error classes in QueryExecutionErrors, we need to define exceptions that mix SparkThrowable into a base Exception type,
following the example of [SparkArithmeticException](f90eb6a5db/core/src/main/scala/org/apache/spark/SparkException.scala (L75)).

Add SparkXXXException as follows:
- `SparkClassNotFoundException`
- `SparkConcurrentModificationException`
- `SparkDateTimeException`
- `SparkFileAlreadyExistsException`
- `SparkFileNotFoundException`
- `SparkNoSuchMethodException`
- `SparkIndexOutOfBoundsException`
- `SparkIOException`
- `SparkSecurityException`
- `SparkSQLException`
- `SparkSQLFeatureNotSupportedException`

Refactor some exceptions in QueryExecutionErrors to use error classes and the new exception types, so that the new exceptions are exercised by tests.

Some were already added by this [PR](https://github.com/apache/spark/pull/33538), as follows:

- `SparkUnsupportedOperationException`
- `SparkIllegalStateException`
- `SparkNumberFormatException`
- `SparkIllegalArgumentException`
- `SparkArrayIndexOutOfBoundsException`
- `SparkNoSuchElementException`

### Why are the changes needed?
[SPARK-36336](https://issues.apache.org/jira/browse/SPARK-36336)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #33573 from Peng-Lei/SPARK-36336.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 09:39:28 +09:00
Gengliang Wang 5b4c216478 [SPARK-35535][SQL][FOLLOWUP] Move LocalScan to Catalyst package
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32678. It moves `LocalScan` from SQL core package to Catalyst package.

### Why are the changes needed?

There are two packages for `org.apache.spark.sql.connector`
SQL Core: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/connector
Catalyst: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector

As `LocalScan` doesn't depend on the classes of SQL Core, we should move it to catalyst.

### Does this PR introduce _any_ user-facing change?

No, the trait is not released yet.

### How was this patch tested?

Existing UT.

Closes #33826 from gengliangwang/moveLocalScan.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:23:50 -07:00
Huaxin Gao cd2342691d [SPARK-34952][SQL][FOLLOWUP] Move aggregates to a separate package
### What changes were proposed in this pull request?
Add `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there.

### Why are the changes needed?
Right now these aggregates are under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. It looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters. It will look strange that filters have their own package, but aggregates don't.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33815 from huaxingao/agg_package.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-23 15:31:13 -07:00
yangjie01 1ccb06ca8c Revert "[SPARK-34309][BUILD][CORE][SQL][K8S] Use Caffeine instead of Guava Cache"
### What changes were proposed in this pull request?
This PR reverts the changes of SPARK-34309, which include:

- https://github.com/apache/spark/pull/31517
- https://github.com/apache/spark/pull/33772

### Why are the changes needed?

1. No real performance improvement in Spark
2. Added an additional dependency

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass Jenkins or GitHub Actions.

Closes #33784 from LuciferYang/revert-caffeine.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:36:15 +09:00
Yesheng Ma 5c0762b5d2 [SPARK-36448][SQL] Exceptions in NoSuchItemException.scala have to be case classes
### What changes were proposed in this pull request?
Change all exceptions in NoSuchItemException.scala to case classes.

### Why are the changes needed?
Exceptions in NoSuchItemException.scala are not case classes. This causes issues because Analyzer's `executeAndCheck` method always calls the `copy` method on the exception. However, since these exceptions are not case classes, the `copy` call was always delegated to `AnalysisException::copy`, which is not the specialized version.
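
The failure mode can be shown with a loose Python analogy. This is not Scala and not Spark's class hierarchy: `AnalysisError`, `NoSuchTableError`, `NoSuchDatabaseError` and `with_line` are hypothetical names, with `with_line` playing the role of `copy`.

```python
from typing import Optional


class AnalysisError(Exception):
    """Stand-in for a base exception that offers a copy-style method."""

    def __init__(self, message: str, line: Optional[int] = None) -> None:
        super().__init__(message)
        self.message = message
        self.line = line

    def with_line(self, line: int) -> "AnalysisError":
        # Always builds the base type, like a copy defined only on the base.
        return AnalysisError(self.message, line)


class NoSuchTableError(AnalysisError):
    """Subclass without its own copy: the specific type is lost."""


class NoSuchDatabaseError(AnalysisError):
    """Subclass with a specialized copy: the specific type is preserved."""

    def with_line(self, line: int) -> "NoSuchDatabaseError":
        return NoSuchDatabaseError(self.message, line)


lost = NoSuchTableError("table t not found").with_line(3)
kept = NoSuchDatabaseError("database d not found").with_line(3)
print(type(lost).__name__, type(kept).__name__)
```

In Scala, turning the exceptions into case classes generates the specialized `copy` automatically, which corresponds to the second subclass here.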

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #33673 from yeshengm/SPARK-36448.

Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-20 20:16:30 +08:00
gengjiaan 462aa7cd3c [SPARK-36428][TESTS][FOLLOWUP] Revert mistake change to DateExpressionsSuite
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/33775 committed debug code by mistake.
This PR reverts the test path.

### Why are the changes needed?
Remove the debug code.

### Does this PR introduce _any_ user-facing change?
No.
It only adjusts a test.

### How was this patch tested?
Revert non-ansi test path.

Closes #33787 from beliefer/SPARK-36428-followup2.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 21:33:21 +08:00
Shixiong Zhu ea4919801a [SPARK-36519][SS] Store RocksDB format version in the checkpoint for streaming queries
### What changes were proposed in this pull request?

RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint so that it gives us more information to provide the rollback guarantee when we upgrade to a RocksDB version that may introduce an incompatible change in a new Spark version.

A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. The user then hits some bug and decides to roll back, but in the old Spark version, the old RocksDB version cannot read the new format.

In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This will ensure the user can roll back their Spark version if needed.

We also provide a config `spark.sql.streaming.stateStore.rocksdb.formatVersion` for users who don't need to roll back their Spark versions to override the format version specified in the checkpoint.

### Why are the changes needed?

Provide the Spark version rollback guarantee for streaming queries when a new RocksDB introduces an incompatible format change.

### Does this PR introduce _any_ user-facing change?

No. RocksDB state store is a new feature in Spark 3.2, which has not been released yet.

### How was this patch tested?

The new unit tests.

Closes #33749 from zsxwing/SPARK-36519.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-19 00:23:40 -07:00
Max Gekk 1235bd29f0 [SPARK-36536][SQL] Use CAST for datetime in CSV/JSON by default
### What changes were proposed in this pull request?
In the PR, I propose to split the `dateFormat` and `timestampFormat` options in CSV/JSON datasources to:
- In write (`dateFormatInWrite`/`timestampFormatInWrite`). The CSV/JSON datasources use it when formatting dates/timestamps. If a user doesn't initialize it, it is set to a default value.
- In read (`dateFormatInRead`/`timestampFormatInRead`). The datasources use it when parsing input date/timestamp strings. If a user doesn't set it, it stays uninitialized (None), and CAST is used to parse the input date/timestamp strings.
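
The read-side fallback can be pictured with a small plain-Python sketch. It only mirrors the decision described above and is not the datasource code: `parse_date` is a hypothetical helper, the pattern uses Python's `strptime` syntax rather than Spark's pattern letters, and ISO parsing is merely a stand-in for the CAST logic.

```python
from datetime import date, datetime
from typing import Optional


def parse_date(text: str, date_format_in_read: Optional[str] = None) -> date:
    """Parse with the explicit pattern if one is set, else fall back."""
    if date_format_in_read is None:
        # No pattern set: use a built-in conversion. ISO parsing is only a
        # stand-in here for Spark's CAST logic.
        return date.fromisoformat(text.strip())
    # Pattern set by the user: parse strictly against it.
    return datetime.strptime(text, date_format_in_read).date()


print(parse_date("2021-08-19"))              # fallback path
print(parse_date("19/08/2021", "%d/%m/%Y"))  # explicit pattern path
```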

### Why are the changes needed?
This should improve user experience with Spark SQL, and make the default parsing behavior more flexible.

### Does this PR introduce _any_ user-facing change?
Potentially, it can.

### How was this patch tested?
By existing test suites, and by new tests that are added to `JsonSuite` and to `CSVSuite`:
```
$ build/sbt "sql/test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "sql/test:testOnly *JsonFunctionsSuite"
$ build/sbt "sql/test:testOnly *CSVv1Suite"
$ build/sbt "sql/test:testOnly *JsonV2Suite"
```

Closes #33769 from MaxGekk/split-datetime-ds-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-19 09:30:50 +03:00
Kousuke Saruta c458edb77e [SPARK-36371][SQL] Support raw string literal
### What changes were proposed in this pull request?

This PR proposes to support raw string literals, in which `\` does not escape any character.
A raw string literal has the form `r"..."` or `r'...'`, like the syntax BigQuery and Python support.

Actually, there is no standard way to represent raw string literals.

In PostgreSQL, any special character isn't escaped by \ unless a string literal starts with E prefix.
https://www.postgresql.org/docs/13/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS

In MySQL, special characters in a string literal are not escaped if NO_BACKSLASH_ESCAPES is enabled.
https://dev.mysql.com/doc/refman/8.0/en/string-literals.html

In MsSQLServer, any special character isn't escaped by \ but STRING_ESCAPE function can escape such characters.
https://docs.microsoft.com/en-us/sql/t-sql/functions/string-escape-transact-sql?view=sql-server-ver15

But BigQuery supports `r"..."` and `r'...'` forms so this PR proposes this syntax for the purpose.
https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#literals

### Why are the changes needed?

In the current master, it is sometimes confusing to represent JSON and regex in a string literal if they contain backslashes.
For example, in JSON, `\` needs to be escaped as follows.
```
{"a": "\\"}
```
But if the JSON above is represented in a string literal, two more `\` are needed because the string literal also requires `\` to be escaped.
```
SELECT from_json('{"a": "\\\\"}', 'a string')
{"a":"\"}
```
With the raw string literal, we can represent such JSON as follows.
```
SELECT from_json(r'{"a": "\\"}', 'a string')
{"a":"\"}
```
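
Since the proposed syntax follows Python's, the same effect can be checked in Python itself. This shows Python's raw literals only, as an analogy; it does not exercise Spark's parser.

```python
import json

escaped = '{"a": "\\\\"}'  # regular literal: each JSON backslash is doubled
raw = r'{"a": "\\"}'       # raw literal: the JSON text is written as-is

print(escaped == raw)           # both literals denote the same string
print(len(json.loads(raw)["a"]))  # the parsed value is a single backslash
```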

### Does this PR introduce _any_ user-facing change?

No. This PR just extends the existing syntax.

### How was this patch tested?

Added new test.
I also confirmed that the modified document is successfully built with `SKIP_API=1 bundle exec jekyll build`.
![raw_string_literal](https://user-images.githubusercontent.com/4736016/129223184-3bb4e206-f40f-42d2-a128-0cd8fc83b9c9.png)

Closes #33599 from sarutak/raw-string-literal.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-19 11:37:32 +08:00
gengjiaan 707eefa3c7 [SPARK-36428][SQL][FOLLOWUP] Simplify the implementation of make_timestamp
### What changes were proposed in this pull request?
The implementation in https://github.com/apache/spark/pull/33665 made `make_timestamp` accept an integer type as the seconds parameter.
This PR lets `make_timestamp` accept `decimal(16,6)` as the type of the seconds parameter; since casting an integer to `decimal(16,6)` is safe, the code can be simplified.

### Why are the changes needed?
Simplify `make_timestamp`.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New tests.

Closes #33775 from beliefer/SPARK-36428-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-18 22:57:06 +08:00
Gengliang Wang 8bfb4f1e72 Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"
### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)

### Why are the changes needed?

It turns out that many users are using the group by alias feature. Spark has its own precedence rule for when alias names conflict with column names in the GROUP BY clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.
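
The precedence rule can be sketched as a tiny name resolver in plain Python. It is only a model of the rule as stated above, not the analyzer's implementation; `resolve_group_by_name` and the sample query in the comment are hypothetical.

```python
def resolve_group_by_name(name, table_columns, select_aliases):
    """Resolve a GROUP BY name: table columns win over SELECT aliases."""
    if name in table_columns:
        return ("column", name)
    if name in select_aliases:
        return ("alias", select_aliases[name])
    raise KeyError(f"cannot resolve {name!r}")


# Models: SELECT a + 1 AS a, b * 2 AS c FROM t GROUP BY <name>
table_columns = {"a", "b"}
select_aliases = {"a": "a + 1", "c": "b * 2"}

print(resolve_group_by_name("a", table_columns, select_aliases))  # the column
print(resolve_group_by_name("c", table_columns, select_aliases))  # the alias
```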

As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-17 20:23:49 +08:00
Max Gekk 82a31508af [SPARK-36524][SQL] Common class for ANSI interval types
### What changes were proposed in this pull request?
Add new type `AnsiIntervalType` to `AbstractDataType.scala`, and extend it by `YearMonthIntervalType` and by `DayTimeIntervalType`

### Why are the changes needed?
To improve code maintenance. The change will allow replacing checks of both `YearMonthIntervalType` and `DayTimeIntervalType` with a single check of `AnsiIntervalType`. For instance, we can replace
```scala
    case _: YearMonthIntervalType | _: DayTimeIntervalType => false
```
by
```scala
    case _: AnsiIntervalType => false
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing test suites.

Closes #33753 from MaxGekk/ansi-interval-type-trait.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-17 12:27:56 +03:00
Gengliang Wang 26d6b952dc [SPARK-36521][SQL] Disallow comparison between Interval and String
### What changes were proposed in this pull request?

Disallow comparison between Interval and String in the default type coercion rules.

### Why are the changes needed?

If a binary comparison contains an interval type and a string type, we can't decide which
interval type the string should be promoted to. There are many possible interval
types, such as year interval, month interval, day interval, hour interval, etc.

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33750 from gengliangwang/disallowCom.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 22:41:14 +03:00
Max Gekk f620996142 [SPARK-36418][SQL] Use CAST in parsing of dates/timestamps with default pattern
### What changes were proposed in this pull request?
In the PR, I propose to use the `CAST` logic when the pattern is not specified in `DateFormatter` or `TimestampFormatter`. In particular, invoke `DateTimeUtils.stringToTimestampAnsi()` or `stringToDateAnsi()` in that case.

### Why are the changes needed?
1. This can improve the user experience with Spark SQL by making the default date/timestamp parsers more flexible and tolerant of their inputs.
2. We make the default case consistent with the behavior of the `CAST` expression, which makes the implementation more consistent.

### Does this PR introduce _any_ user-facing change?
The changes shouldn't introduce behavior changes in regular cases, but they can affect corner cases. The new implementation is able to parse more dates/timestamps by default. For instance, the old (current) date parser can recognize dates only in the format **yyyy-MM-dd**, but the new one can handle:
   * `[+-]yyyy*`
   * `[+-]yyyy*-[m]m`
   * `[+-]yyyy*-[m]m-[d]d`
   * `[+-]yyyy*-[m]m-[d]d `
   * `[+-]yyyy*-[m]m-[d]d *`
   * `[+-]yyyy*-[m]m-[d]dT*`

Similarly for timestamps. The old (current) timestamp formatter is able to parse timestamps only in the format **yyyy-MM-dd HH:mm:ss** by default, but the new implementation can handle:
   * `[+-]yyyy*`
   * `[+-]yyyy*-[m]m`
   * `[+-]yyyy*-[m]m-[d]d`
   * `[+-]yyyy*-[m]m-[d]d `
   * `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
   * `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
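
To make the date shapes listed above concrete, here is a minimal plain-Python sketch of a lenient matcher. It is an illustration only and not Spark's parser: the regex, the function name `parse_lenient_date`, the reading of `yyyy*` as "four or more digits", and the defaulting of a missing month or day to 1 are all assumptions of this sketch.

```python
import re
from typing import Optional, Tuple

# Hypothetical regex modelling the listed date shapes; not Spark's implementation.
# Optional sign, a year of 4+ digits, then optional -month, -day and a trailing
# " ..." or "T..." part that is ignored.
_DATE = re.compile(r"([+-]?)(\d{4,})(?:-(\d{1,2})(?:-(\d{1,2})(?:[ T].*)?)?)?")

def parse_lenient_date(s: str) -> Optional[Tuple[int, int, int]]:
    """Return (year, month, day), or None if the string has none of the shapes."""
    m = _DATE.fullmatch(s)
    if m is None:
        return None
    sign, year, month, day = m.groups()
    y = -int(year) if sign == "-" else int(year)
    mo = int(month) if month else 1  # missing month defaults to January (assumption)
    d = int(day) if day else 1       # missing day defaults to the 1st (assumption)
    # Coarse range check only; month lengths and leap years are not validated here.
    if not (1 <= mo <= 12 and 1 <= d <= 31):
        return None
    return (y, mo, d)
```

With this sketch, inputs such as `2021`, `2021-8` and `2021-08-16T10:00:00` are all accepted, whereas a strict **yyyy-MM-dd** matcher would accept none of them.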

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *ImageFileFormatSuite"
$ build/sbt "test:testOnly *ParquetV2PartitionDiscoverySuite"
```

Closes #33709 from MaxGekk/datetime-cast-default-pattern.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-16 23:29:33 +08:00
Wenchen Fan f4b31c6068 [SPARK-36498][SQL] Reorder inner fields of the input query in byName V2 write
### What changes were proposed in this pull request?

Today, when we write data to a v2 table in byName mode, we only reorder the top-level columns, not inner struct fields. This doesn't make sense, as Spark should treat inner struct fields as first-class citizens (e.g. nested column pruning, filter pushdown with nested columns).

This PR improves `TableOutputResolver` to reorder inner fields as well.
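
The idea of by-name reordering at every nesting level can be sketched in plain Python. This is a simplified model and not the `TableOutputResolver` code: structs are modelled as dicts, the schema as a dict mapping each field name to `None` (leaf) or a nested schema, and the function name `reorder_by_name` is hypothetical.

```python
def reorder_by_name(value: dict, schema: dict) -> dict:
    """Reorder `value` to follow the field order of `schema`, recursing into structs.

    `schema` maps a field name to None (leaf column) or to a nested schema dict.
    """
    if set(value) != set(schema):
        raise ValueError(f"field mismatch: {sorted(value)} vs {sorted(schema)}")
    out = {}
    for name, sub_schema in schema.items():  # dicts keep insertion order
        field = value[name]
        out[name] = reorder_by_name(field, sub_schema) if sub_schema is not None else field
    return out
```

For example, writing `{"b": {"y": 2, "x": 1}, "a": 0}` into a table declared as `a, b<x, y>` yields fields in the order `a, b` at the top level and `x, y` inside `b`.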

### Why are the changes needed?

better user-experience

### Does this PR introduce _any_ user-facing change?

yes, more queries are allowed to write to v2 tables.

### How was this patch tested?

new test

Closes #33728 from cloud-fan/reorder.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-16 15:08:08 +08:00
Kousuke Saruta 9b9db5a8a0 [SPARK-36491][SQL] Make from_json/to_json to handle timestamp_ntz type properly
### What changes were proposed in this pull request?

This PR fixes an issue that `from_json` and `to_json` cannot handle `timestamp_ntz` type properly.
In the current master, `from_json`/`to_json` can handle the `timestamp` type as follows.
```
SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP");
{"a":2021-11-23 11:22:33}
```
```
SELECT to_json(map("a", TIMESTAMP"2021-11-23 11:22:33"));
{"a":"2021-11-23T11:22:33.000+09:00"}
```
But they cannot handle `timestamp_ntz` type properly.
```
SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ");
21/08/12 16:16:00 ERROR SparkSQLDriver: Failed in [SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ")]
java.lang.Exception: Unsupported type: timestamp_ntz
        at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:777)
        at org.apache.spark.sql.catalyst.json.JacksonParser.makeConverter(JacksonParser.scala:339)
        at org.apache.spark.sql.catalyst.json.JacksonParser.$anonfun$makeConverter$17(JacksonParser.scala:313)
```
```
SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"));
21/08/12 16:14:07 ERROR SparkSQLDriver: Failed in [SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"))]
java.lang.RuntimeException: Failed to convert value 1637666553000000 (class of class java.lang.Long) with the type of TimestampNTZType to JSON.
        at org.apache.spark.sql.errors.QueryExecutionErrors$.failToConvertValueToJsonError(QueryExecutionErrors.scala:294)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$25(JacksonGenerator.scala:201)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$25$adapted(JacksonGenerator.scala:199)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeMapData(JacksonGenerator.scala:253)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$write$3(JacksonGenerator.scala:293)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeObject(JacksonGenerator.scala:206)
        at org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:292)
```
### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33742 from sarutak/json-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 10:03:22 +03:00
Liang-Chi Hsieh 8b8d91cf64 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration, which determines the gap duration by looking at the current data. For example, in our use case, the gap may differ depending on a certain column in the input rows.
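
As a rough illustration of what a per-row gap means, the sketch below sessionizes events in plain Python. It is not the Spark implementation: the function name `sessionize` is hypothetical, each event is assumed to open a window `[ts, ts + gap)`, and treating windows as half-open (so windows that merely touch do not merge) is an assumption of this sketch.

```python
def sessionize(events):
    """Merge per-event windows [ts, ts + gap) into sessions.

    `events` is an iterable of (timestamp, gap) pairs; the gap may differ per event.
    Returns a list of (start, end) sessions in timestamp order.
    """
    sessions = []
    for ts, gap in sorted(events):
        start, end = ts, ts + gap
        if sessions and start < sessions[-1][1]:
            # Overlaps the open session: extend it if this event reaches further.
            sessions[-1][1] = max(sessions[-1][1], end)
        else:
            sessions.append([start, end])
    return [tuple(s) for s in sessions]
```

With events `(0, 5)`, `(3, 10)` and `(20, 2)`, the second event falls inside the first window and stretches the session to 13, while the third starts a new session.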

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:00 +09:00
Pablo Langa a9ab41ad56 [SPARK-35320][SQL] Align error message for unsupported key types in MapType in Json reader
### What changes were proposed in this pull request?

This PR is related to https://github.com/apache/spark/pull/33525.
The purpose is to align error messages between the function from_json and the Json reader for unsupported key types in MapType.
Current behavior:
```
scala> spark.read.schema(StructType(Seq(StructField("col", MapType(IntegerType, StringType))))).json(Seq("""{"1": "test"}""").toDS()).show
+----+
| col|
+----+
|null|
+----+

```
```
scala> Seq("""{"1": "test"}""").toDF("col").write.json("/tmp/jsontests1234")

scala> spark.read.schema(StructType(Seq(StructField("col", MapType(IntegerType, StringType))))).json("/tmp/jsontests1234").show
+----+
| col|
+----+
|null|
+----+
```
With this change, an AnalysisException with the message `"Input schema $schema can only contain StringType as a key type for a MapType."` will be thrown.
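
The kind of up-front schema validation described here can be sketched in plain Python. This is a simplified model, not the Spark check: the tuple-based schema encoding and the function name `check_map_keys` are hypothetical, and a `ValueError` stands in for the `AnalysisException`.

```python
# Hypothetical schema model: a leaf is a type name such as "string";
# composites are ("map", key, value), ("array", element) or ("struct", {name: type}).
def check_map_keys(schema) -> None:
    """Raise ValueError if any map in `schema` has a key type other than "string"."""
    if isinstance(schema, str):
        return
    kind = schema[0]
    if kind == "map":
        _, key, value = schema
        if key != "string":
            raise ValueError(
                f"Input schema {schema} can only contain StringType as a key type for a MapType."
            )
        check_map_keys(value)
    elif kind == "array":
        check_map_keys(schema[1])
    elif kind == "struct":
        for field_type in schema[1].values():
            check_map_keys(field_type)
```

Failing fast during validation replaces the silent `null` result shown in the examples above with an explicit error.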

### Why are the changes needed?

It's more consistent to align the behavior

### Does this PR introduce _any_ user-facing change?

Yes, now an Exception will be thrown

### How was this patch tested?

Unit testing, manual testing

Closes #33672 from planga82/feature/spark35320_improve_error_message_reader.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-15 10:31:57 +09:00
Gengliang Wang ecdea91602 [SPARK-36508][SQL] ANSI type coercion: disallow binary operations between Interval and String literal
### What changes were proposed in this pull request?

If a binary operation contains an interval type and a string literal, we can't decide which interval type the string literal should be promoted to. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc.
The affected binary operations for Interval include:
- Add
- Subtract
- Comparisons

Note that `Interval Multiply/Divide StringLiteral` is valid, as these are not binary operators (the left and right operands are not of the same type). This PR also adds tests for them.

### Why are the changes needed?

Avoid ambiguous implicit casting of string literals to interval types.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion is not released yet.

### How was this patch tested?

New tests.

Closes #33737 from gengliangwang/disallowStringAndInterval.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-14 10:45:05 +08:00
yangjie01 1da1e33a49 [SPARK-36495][SQL] Use type match to simplify methods in CatalystTypeConverter
### What changes were proposed in this pull request?

The `CatalystTypeConverter.toCatalyst` method uses `isInstanceOf + asInstanceOf` for type conversion. The main change of this PR is to use a type match to simplify this process.

`CatalystTypeConverters.createToCatalystConverter` method has a similar pattern.

### Why are the changes needed?
Code simplification

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Add a new case to `ScalaReflectionSuite` to add the coverage of the `case None` branch of `CatalystTypeConverter#toCatalyst` method

Closes #33722 from LuciferYang/SPARK-36495.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 19:24:22 +09:00
Kousuke Saruta 7fd34548b1 [SPARK-36490][SQL] Make from_csv/to_csv to handle timestamp_ntz type properly
### What changes were proposed in this pull request?

This PR fixes an issue that `from_csv` and `to_csv` cannot handle `timestamp_ntz` type properly.
In the current master, to_csv/from_csv can handle the timestamp type as follows.
```
SELECT to_csv(struct(TIMESTAMP"2021-11-23 11:22:33"));
2021-11-23T11:22:33.000+09:00
```
```
SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP");
{"a":2021-11-23 11:22:33}
```

But they cannot handle timestamp_ntz type properly.
```
SELECT to_csv(struct(TIMESTAMP_NTZ"2021-11-23 11:22:33"));
-- 2021-11-23T11:22:33.000 is expected.
1637666553000000
```
```
SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP_NTZ");
21/08/12 16:12:49 ERROR SparkSQLDriver: Failed in [SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP_NTZ")]
java.lang.Exception: Unsupported type: timestamp_ntz
        at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:777)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:234)
        at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33719 from sarutak/csv-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-13 12:08:53 +03:00
gengjiaan 7d82336734 [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
### What changes were proposed in this pull request?
With ANSI mode, `SELECT make_timestamp(1, 1, 1, 1, 1, 1)` fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly cast to DECIMAL(8,6) under ANSI mode.

```
org.apache.spark.sql.AnalysisException
cannot resolve 'make_timestamp(1, 1, 1, 1, 1, 1)' due to data type mismatch: argument 6 requires decimal(8,6) type, however, '1' is of int type.; line 1 pos 7
```

We should update the function `make_timestamp` to allow integer type 'seconds' parameter.

### Why are the changes needed?
Make `make_timestamp` accept an integer as the 'seconds' parameter.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`make_timestamp` now accepts an integer as the 'seconds' parameter.

### How was this patch tested?
New tests.

Closes #33665 from beliefer/SPARK-36428.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 13:13:02 +08:00
Maryann Xue 29b1e394c6 [SPARK-36447][SQL] Avoid inlining non-deterministic With-CTEs
### What changes were proposed in this pull request?
This PR fixes an existing correctness issue where a non-deterministic With-CTE can be executed multiple times producing different results, by deferring the inline of With-CTE to after the analysis stage. This fix also provides the future opportunity of performance improvement by executing deterministic With-CTEs only once in some circumstances.

The major changes include:
1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, `WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved version of `With`, only that: 1) `WithCTE` is a multi-children logical node so that most logical rules can automatically apply to CTE defs; 2) In the main query and each subquery, there can only be at most one `WithCTE`, which means nested With-CTEs are combined.
2. Changed `CTESubstitution` rule so that if NOT in legacy mode, CTE defs will not be inlined immediately, but rather transformed into a `CTERelationRef` per reference.
3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update `CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) `InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only ONE reference; 3) `UpdateCTERelationStats` - to update stats for `CTERelationRef`s that are not inlined.
4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an independent shuffle with round-robin partitioning so that such CTEs will only be materialized once and different references will later be a shuffle reuse.

A current limitation is that With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.

### Why are the changes needed?
This is a correctness issue. A non-deterministic CTE should produce the same output regardless of how many times it is referenced/used in a query, while under the current implementation there is no such guarantee, which can lead to incorrect query results.
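
The problem can be reproduced outside Spark with a toy model. In the plain-Python sketch below (function names hypothetical, not Spark code), a CTE body is a zero-argument function; "inlining" re-evaluates it at every reference, while "materializing" evaluates it once and shares the result.

```python
import random

def query_with_inlined_cte(cte_body):
    """Model of inlining: every reference to the CTE re-evaluates its body."""
    return (cte_body(), cte_body())

def query_with_materialized_cte(cte_body):
    """Model of deferred inlining: the body runs once, both references share it."""
    result = cte_body()
    return (result, result)

# A non-deterministic CTE body, standing in for something like SELECT rand().
rng = random.Random(0)
non_deterministic_cte = rng.random
```

Calling `query_with_inlined_cte(non_deterministic_cte)` returns two different values for what the query author wrote as a single CTE, whereas `query_with_materialized_cte` always returns a consistent pair.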

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UTs.
Regenerated golden files for TPCDS plan stability tests. There is NO change to the `simplified.txt` files, the only differences are expression IDs.

Closes #33671 from maryannxue/spark-36447.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 11:37:02 +08:00
Dongjoon Hyun e8e5785f02 [SPARK-36502][SQL] Remove jaxb-api from sql/catalyst module
### What changes were proposed in this pull request?

This PR aims to remove `jaxb-api` usage from `sql/catalyst` module.

### Why are the changes needed?

We only use `DatatypeConverter.parseHexBinary` and `DatatypeConverter.printHexBinary` twice.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33732 from dongjoon-hyun/SPARK-36502.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 12:31:09 +09:00
Gengliang Wang d4466d55ca [SPARK-36497][SQL] Support Interval add/subtract NULL
### What changes were proposed in this pull request?

Currently, `null + interval` will become `cast(cast(null as timestamp) + interval) as null`. This is unexpected behavior, and the result should not be of null type.
This weird behavior applies to `null - interval`, `interval + null`, and `interval - null` as well.
To change it, I propose to cast the null to the same data type as the other operand in the add/subtract:
```
null + interval => cast(null as interval) + interval
null - interval => cast(null as interval) - interval
interval + null => interval + cast(null as interval)
interval - null => interval - cast(null as interval)
```

### Why are the changes needed?

Change the confusing behavior of `Interval +/- NULL` and `NULL +/- Interval`

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33727 from gengliangwang/intervalTypeCoercion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 11:10:32 +08:00
Gengliang Wang 48e333af54 [SPARK-36445][SQL][FOLLOWUP] ANSI type coercion: revisit promoting string literals in datetime expressions
### What changes were proposed in this pull request?

1. Promote more string literals in subtractions. In the ANSI type coercion rule, we already promoted
```
string - timestamp => cast(string as timestamp) - timestamp
```
This PR is to promote the following string literals:
```
string - date => cast(string as date) - date
date - string => date - cast(string as date)
timestamp - string => timestamp - cast(string as timestamp)
```
It is very straightforward to cast the string literal as the data type of the other side in the subtraction.

2. Merge the string promotion logic from the rule `StringLiteralCoercion`:
```
date_sub(date, string) => date_sub(date, cast(string as int))
date_add(date, string) => date_add(date, cast(string as int))
```

### Why are the changes needed?

1. Promote the string literal in the subtraction to the data type of the other side. This is straightforward and consistent with PostgreSQL.
2. Centralize all the string literal promotion in the ANSI type coercion rule.

### Does this PR introduce _any_ user-facing change?

No, the new ANSI type coercion rules are not released yet.

### How was this patch tested?

Existing UT

Closes #33724 from gengliangwang/datetimeTypeCoercion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-13 01:02:34 +08:00
Gengliang Wang 3029e62a82 [SPARK-36445][SQL] ANSI type coercion rule for date time operations
### What changes were proposed in this pull request?

Implement a new rule for the date-time operations in the ANSI type coercion system:
1. Date will be converted to Timestamp when it is in a subtraction with Timestamp.
2. Promote string literals in date_add/date_sub/time_add

### Why are the changes needed?

Currently the type coercion rule `DateTimeOperations` doesn't match the design of the ANSI type coercion system:
1. For date_add/date_sub, if the input is timestamp type, Spark should not convert it into date type since date type is narrower than the timestamp type.
2. For date_add/date_sub/time_add, a string value can be implicitly cast to date/timestamp only when it is a literal.

Thus, we need to have a new rule for the date-time operations in the ANSI type coercion system.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion rules are not released yet.

### How was this patch tested?

New UT

Closes #33666 from gengliangwang/datetimeOp.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-11 11:55:45 +08:00
Jungtaek Lim ed60aaa9f1 [SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window
### What changes were proposed in this pull request?

This PR proposes to prohibit update mode in streaming aggregation with session window.

UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the implementation of iterator to support outputs of update mode.

This PR also cleans up test code via deduplicating.

### Why are the changes needed?

The semantics of "update" mode for session window based streaming aggregation are quite unclear.

For normal streaming aggregation, Spark will provide the outputs which can be "upsert"ed based on the grouping key. This is based on the fact that the grouping key won't change.

This doesn't hold true for session window based streaming aggregation, as the session range is changing.

If end users leverage their knowledge about streaming aggregation, they will consider the key as grouping key + session (since they'll specify these things in groupBy), and it is highly likely that an existing row is not updated (overwritten), so they end up with different rows.

If end users consider the key as the grouping key only, there's a small chance for them to upsert the session correctly, though only the last updated session will be stored, so it won't work with event time processing, in which there could be multiple active sessions.

### Does this PR introduce _any_ user-facing change?

No, as we haven't released this feature.

### How was this patch tested?

Updated tests.

Closes #33689 from HeartSaVioR/SPARK-36463.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-11 10:45:52 +09:00
gengjiaan 186815be1c [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported
### What changes were proposed in this pull request?
Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`.
```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>>
-- !query output
{"t":null}
```

```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```

We should make `from_json` throw an exception too.
This PR addresses the discussion below:
https://github.com/apache/spark/pull/33640#discussion_r682862523

### Why are the changes needed?
Make the behavior of `from_json` more reasonable.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`from_json` now throws an exception when we set `spark.sql.timestampType=TIMESTAMP_NTZ`.

### How was this patch tested?
Tests updated.

Closes #33684 from beliefer/SPARK-36429-new.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 22:52:20 +08:00
Angerszhuuuu 89d8a4eacf [SPARK-36431][SQL] Support TypeCoercion of ANSI intervals with different fields
### What changes were proposed in this pull request?
 Support TypeCoercion of ANSI intervals with different fields

### Why are the changes needed?
 Support TypeCoercion of ANSI intervals with different fields

### Does this PR introduce _any_ user-facing change?
After this PR, users can
 - use comparison functions with different fields of DayTimeIntervalType/YearMonthIntervalType, such as `INTERVAL '1' YEAR` > `INTERVAL '11' MONTH`
 - use different fields of ANSI interval types in collection functions, such as `array(INTERVAL '1' YEAR, INTERVAL '11' MONTH)`
 - use different fields of ANSI interval types in `coalesce`, etc.
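
One way to picture coercion between interval types with different fields is as widening to a type that spans both. The plain-Python sketch below models this for year-month intervals; it is an illustration under stated assumptions, not the Spark rule itself. The field encoding, the widening rule (coarsest start field, finest end field) and the names `common_year_month_type` and `to_months` are assumptions of this sketch.

```python
# Year-month interval fields, ordered from coarse to fine (a modelling assumption).
YEAR, MONTH = 0, 1

def common_year_month_type(a, b):
    """Widen two (start_field, end_field) interval types to one covering both."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def to_months(years: int = 0, months: int = 0) -> int:
    """Normalize a year-month interval value to a month count for comparison."""
    return years * 12 + months
```

In this model, `INTERVAL '1' YEAR` and `INTERVAL '11' MONTH` widen to a year-to-month type, and the comparison then reduces to `12 > 11` months.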

### How was this patch tested?
Added UT

Closes #33661 from AngersZhuuuu/SPARK-SPARK-36431.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-10 14:22:31 +03:00
Cheng Pan 7f56b73cad [SPARK-36466][SQL] Table in unloaded catalog referenced by view should load correctly
### What changes were proposed in this pull request?

Retain `spark.sql.catalog.*` confs when resolving view.

### Why are the changes needed?

Currently, if a view in the default catalog references a table in another catalog (e.g. jdbc), `org.apache.spark.sql.AnalysisException: Table or view not found: cat.t` will be thrown on accessing the view if the catalog has not been loaded yet.

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

Add UT.

Closes #33692 from pan3793/SPARK-36466.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 17:31:21 +08:00
Terry Kim e1a5d94117 [SPARK-36449][SQL] v2 ALTER TABLE REPLACE COLUMNS should check duplicates for the user specified columns
### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it.

### Why are the changes needed?

To check the duplicate columns during analysis.

### Does this PR introduce _any_ user-facing change?

Yes, now the above command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```
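
A duplicate check of this kind can be sketched in a few lines of plain Python. This is not the Spark analyzer code: the function name `check_duplicate_columns` is hypothetical, a `ValueError` stands in for the `AnalysisException`, and comparing names case-insensitively by default is an assumption of this sketch.

```python
from collections import Counter

def check_duplicate_columns(names, case_sensitive: bool = False) -> None:
    """Raise ValueError if the user-specified column list contains duplicates."""
    keys = list(names) if case_sensitive else [n.lower() for n in names]
    duplicates = [name for name, count in Counter(keys).items() if count > 1]
    if duplicates:
        quoted = ", ".join(f"`{name}`" for name in duplicates)
        raise ValueError(
            f"Found duplicate column(s) in the user specified columns: {quoted}"
        )
```

Running the check during analysis means the error no longer depends on how each catalog implementation handles duplicated names.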

### How was this patch tested?

Added new unit tests

Closes #33676 from imback82/replace_cols_duplicates.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 13:20:29 +08:00
Mick Jermsurawong 33c6d1168c [SPARK-20384][SQL] Support value class in nested schema for Dataset
### What changes were proposed in this pull request?

- This PR revisits https://github.com/apache/spark/pull/22309 and [SPARK-20384](https://issues.apache.org/jira/browse/SPARK-20384), solving the original problem while additionally preventing a backward-compatibility break in the schema of a top-level `AnyVal` value class.
- Why did the previous PR break compatibility? We currently support top-level value classes just like any other case class; the field of the underlying type is present in the schema. This means any dataframe SQL filtering on this expects the field name to be present. The previous PR changed this schema and would have broken current usage. See test `"schema for case class that is a value class"`. This PR keeps the schema.
- We already support collections of value classes prior to this change, but not a case class with a nested value class. This means the schema of these collection classes shouldn't change either, to prevent breakage.
- However, what we can change without breaking anything is the schema of a nested value class, which currently fails due to the compile problem, so its schema isn't actually valid today. After the change, the schema of this nested value class is flattened.
- With this PR, there's flattening only for nested value classes (new), but not for top-level and collection classes (existing behavior).
- This PR revisits https://github.com/apache/spark/pull/27153 by handling tuple `Tuple2[AnyVal, AnyVal]`, which is a constructor ("nested class") but is a generic type, so it should not be flattened, behaving similarly to `Seq[AnyVal]`.

### Why are the changes needed?

- Currently, nested value classes aren't supported. This is because the generated code treats the `AnyVal` class in its unwrapped form, but we encode the type as the wrapped case class. This results in a compile error in the generated code.
For example,
For a given `AnyVal` wrapper and its root-level class container
```
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)
```
The problematic part of generated code:
```
    private InternalRow If_1(InternalRow i) {
        boolean isNull_42 = i.isNullAt(0);
        // 1) ******** The root-level case class we care
        org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
            null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
        if (isNull_42) {
            throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
        }
        boolean isNull_39 = true;
        // 2) ******** We specify its member to be unwrapped case class extending `AnyVal`
        org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
        if (!false) {

            isNull_39 = false;
            if (!isNull_39) {
                // 3) ******** ERROR: `c()` compiled however is of type `int` and thus we see error
                value_43 = value_46.c();
            }
        }
```
We get this error: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```

From the [doc](https://docs.scala-lang.org/overviews/core/value-classes.html) on value classes, given `class Wrapper(val underlying: Int) extends AnyVal`:
1) "The type at compile time is `Wrapper`, but at runtime, the representation is an `Int`". This implies that when our struct has a field of a value class, the generated code should support the underlying type during runtime execution.
2) `Wrapper` "must be instantiated... when a value class is used as a type argument". This implies that `scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper]` will still contain `Wrapper` as-is at runtime instead of `Int`.

### Does this PR introduce _any_ user-facing change?

- Yes, this will allow support for the nested value class.

### How was this patch tested?

- Added unit tests to illustrate
  - raw schema
  - projection
  - round-trip encode/decode

Closes #33205 from mickjermsurawong-stripe/SPARK-20384-2.

Lead-authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-08-09 08:47:35 -05:00
Wenchen Fan 9a539d5846 [SPARK-36430][SQL] Adaptively calculate the target size when coalescing shuffle partitions in AQE
### What changes were proposed in this pull request?

This PR fixes a performance regression introduced in https://github.com/apache/spark/pull/33172

Before #33172, the target size was adaptively calculated based on the default parallelism of the Spark cluster. Sometimes it was very small, and #33172 set a minimum partition size to fix perf issues. Sometimes the calculated size was reasonable, such as dozens of MBs.

After #33172, we no longer calculate the target size adaptively, and by default always coalesce the partitions into 1 MB. This can cause a perf regression if the adaptively calculated size is reasonable.

This PR brings back the code that adaptively calculates the target size based on the default parallelism of the Spark cluster.
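
As a rough illustration of the idea (not the actual Spark code), the target size can be derived from the total shuffle size and the default parallelism, then clamped between a minimum partition size and an advisory size. The function name, parameter names, and the exact clamping order below are assumptions made for this sketch:

```python
def adaptive_target_size(total_shuffle_bytes, default_parallelism,
                         advisory_size_bytes, min_partition_size_bytes):
    """Return a coalescing target size in bytes (illustrative sketch only)."""
    if default_parallelism <= 0:
        raise ValueError("default_parallelism must be positive")
    # Size each partition would have if the shuffle data were spread
    # evenly over the cluster's default parallelism (ceiling division).
    size_per_task = -(-total_shuffle_bytes // default_parallelism)
    # Assumed clamping: never exceed the advisory size, never go below
    # the minimum partition size.
    return max(min_partition_size_bytes,
               min(advisory_size_bytes, size_per_task))


MB = 1024 * 1024

# A tiny shuffle falls back to the minimum partition size, while a
# moderately sized one keeps its adaptively calculated target.
small = adaptive_target_size(10 * MB, 200, 64 * MB, 1 * MB)
medium = adaptive_target_size(10 * 1024 * MB, 200, 64 * MB, 1 * MB)
```

With these hypothetical inputs, `small` is clamped up to the minimum, while `medium` lands at a few dozen MB, which is the case the regression affected.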

### Why are the changes needed?

fix perf regression

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33655 from cloud-fan/minor.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 17:25:55 +08:00
ulysses-you bb6f65acca [SPARK-36424][SQL] Support eliminate limits in AQE Optimizer
### What changes were proposed in this pull request?

* override the maxRows method in `LogicalQueryStage`
* add rule `EliminateLimits` in `AQEOptimizer`

### Why are the changes needed?

In ad-hoc scenarios, we always add a limit to the query if the user specifies no limit value, but not every limit is necessary.

With the power of AQE, we can eliminate limits using runtime statistics.
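
The idea can be sketched on a toy plan tree: once a stage has materialized and its row count is known, a limit above it is redundant whenever the known maximum row count does not exceed the limit. The `Stage`, `Scan`, and `Limit` classes and the `eliminate_limits` function are hypothetical stand-ins for this sketch, not Spark classes:

```python
from dataclasses import dataclass


@dataclass
class Scan:
    """A leaf whose row count is unknown (hypothetical)."""
    max_rows = None  # class attribute, not a dataclass field


@dataclass
class Stage:
    """A materialized stage whose row count is known at runtime (hypothetical)."""
    row_count: int

    @property
    def max_rows(self):
        return self.row_count


@dataclass
class Limit:
    n: int
    child: object

    @property
    def max_rows(self):
        child_max = self.child.max_rows
        return self.n if child_max is None else min(self.n, child_max)


def eliminate_limits(plan):
    """Drop every Limit whose child can never produce more rows than the limit."""
    if isinstance(plan, Limit):
        child = eliminate_limits(plan.child)
        child_max = child.max_rows
        if child_max is not None and child_max <= plan.n:
            return child  # the limit cannot cut anything off
        return Limit(plan.n, child)
    return plan


# A limit of 1000 over a stage that produced only 10 rows is removed;
# a limit over a scan of unknown size has to stay.
pruned = eliminate_limits(Limit(1000, Stage(10)))
kept = eliminate_limits(Limit(5, Scan()))
```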

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

add test

Closes #33651 from ulysses-you/SPARK-36424.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:51:51 +08:00
Angerszhuuuu e051a540a1 [SPARK-36352][SQL] Spark should check result plan's output schema name
### What changes were proposed in this pull request?
Spark should check result plan's output schema name

### Why are the changes needed?
In the current code, some optimizer rules may change a plan's output schema: we always use semantic equality to check the output, and that check does not catch a change in the output column names.
For example, with SchemaPruning, given the plan
```
Project[a, B]
|--Scan[A, b, c]
```
the original output schema is `a, B`. After SchemaPruning, it becomes
```
Project[A, b]
|--Scan[A, b]
```
This changes the plan's schema. When we use CTAS, the table schema is the same as the query plan's output.
Since the schema has changed, it is no longer consistent with the original SQL, so we need to check the final result plan's schema against the original plan's schema.
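
A minimal sketch of such a check follows, using plain lists of column names in place of real plans. Both function names are hypothetical, and the case-insensitive comparison is only a simplified stand-in for semantic equality:

```python
def same_output_semantically(original_names, optimized_names):
    """Simplified stand-in for a semantic check: ignores the case of names."""
    return ([n.lower() for n in original_names]
            == [n.lower() for n in optimized_names])


def check_output_names(original_names, optimized_names):
    """Fail if the optimized plan's output names differ from the original's."""
    if original_names != optimized_names:
        raise ValueError(
            "Output schema names changed: "
            f"{original_names} -> {optimized_names}"
        )


original = ["a", "B"]   # Project[a, B]
optimized = ["A", "b"]  # Project[A, b] after pruning

# The relaxed comparison accepts the rewritten plan...
accepted = same_output_semantically(original, optimized)

# ...while the exact name check reports the mismatch.
try:
    check_output_names(original, optimized)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
```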

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #33583 from AngersZhuuuu/SPARK-36352.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:47:56 +08:00