### What changes were proposed in this pull request?
Two case of using undefined window frame as below should provide proper error message
1. For case using undefined window frame with window function
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
origin error message is
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
It's confused that in use use a window frame `w` but it's not defined.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```
2. For case using undefined window frame with aggregation function
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
origin error message is
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, when convert GlobalAggregate, should skip UnresolvedWindowExpression
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```
### Why are the changes needed?
Provide proper error message
### Does this PR introduce _any_ user-facing change?
Yes, error messages are improved as described in desc
### How was this patch tested?
Added UT
Closes#33892 from AngersZhuuuu/SPARK-36637.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to add `BooleanType` support to the `UnwrapCastInBinaryComparison` optimizer that is currently supports `NumericType` only.
The main idea is to treat `BooleanType` as 1 bit integer so that we can utilize all optimizations already defined in `UnwrapCastInBinaryComparison`.
This work is an extension of SPARK-24994 and SPARK-32858
### Why are the changes needed?
Current implementation of Spark without this PR cannot properly optimize the filter for the following case
```
SELECT * FROM t WHERE boolean_field = 2
```
The above query creates a filter of `cast(boolean_field, int) = 2`. The casting prevents from pushing down the filter. In contrast, this PR creates a `false` filter and returns early as there cannot be such a matching rows anyway (empty results.)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passed existing tests
```
build/sbt "catalyst/test"
build/sbt "sql/test"
```
Added unit tests
```
build/sbt "catalyst/testOnly *UnwrapCastInBinaryComparisonSuite -- -z SPARK-36607"
build/sbt "sql/testOnly *UnwrapCastInComparisonEndToEndSuite -- -z SPARK-36607"
```
Closes#33865 from kazuyukitanimura/SPARK-36607.
Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This change creates a new type of Trigger: Trigger.AvailableNow for streaming queries. It is like Trigger.Once, which process all available data then stop the query, but with better scalability since data can be processed in multiple batches instead of one.
To achieve this, this change proposes a new interface `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of streaming queries with Trigger.AvailableNow, to let the source record the offset for the current latest data at the time (a.k.a. the target offset for the query). The source should then behave as if there is no new data coming in after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called.
This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`.
For other sources that does not implement `SupportsTriggerAvailableNow`, this change also provides a new class `FakeLatestOffsetSupportsTriggerAvailableNow`, which wraps the sources and makes them support Trigger.AvailableNow, by overriding their `latestOffset` method to always return the latest offset at the beginning of the query.
### Why are the changes needed?
Currently streaming queries with Trigger.Once will always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or Spark driver will run out of memory.
### Does this PR introduce _any_ user-facing change?
Users will be able to use Trigger.AvailableNow (to process all available data then stop the streaming query) with this change.
### How was this patch tested?
Added unit tests.
Closes#33763 from bozhang2820/new-trigger.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to add the support of `TimestampNTZType` in Arrow APIs.
Now, Arrow can write `TimestampNTZType` as Timestamp with `null` timezone in Arrow.
### Why are the changes needed?
To complete the support of `TimestampNTZType` in Apache Spark.
### Does this PR introduce _any_ user-facing change?
Yes, the Arrow APIs (`ArrowColumnVector`) can now write `TimestampNTZType`
### How was this patch tested?
Unittests were added.
Closes#33875 from HyukjinKwon/SPARK-36608-arrow.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval
And, the `try_divide` function allows the following inputs:
- number, number
- interval, number
However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.
### Why are the changes needed?
Improve documentation and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)
Closes#33861 from gengliangwang/enhanceTryDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to document `window` & `session_window` function in SQL API doc page.
Screenshot of functions:
> window
![스크린샷 2021-08-26 오후 6 34 58](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)
> session_window
![스크린샷 2021-08-26 오후 6 35 19](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)
### Why are the changes needed?
Description is missing in both `window` / `session_window` functions for SQL API page.
### Does this PR introduce _any_ user-facing change?
Yes, the description of `window` / `session_window` functions will be available in SQL API page.
### How was this patch tested?
Only doc changes.
Closes#33846 from HeartSaVioR/SPARK-36595.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR migrates CreateNamespaceStatement to the v2 command framework. Two new logical plans `UnresolvedObjectName` and `ResolvedObjectName` are introduced to support these CreateXXXStatements.
### Why are the changes needed?
Avoid duplicated code
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#33835 from cloud-fan/ddl.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc
### Why are the changes needed?
Improve API docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes#33824 from gengliangwang/auditDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
When `spark.sql.parser.quotedRegexColumnNames=true` and a pattern is used in a place where is not allowed the message is a little bit confusing
```
scala> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true")
scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)")
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'divide'
```
This PR attempts to improve the error message
```
scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)")
org.apache.spark.sql.AnalysisException: Invalid usage of regular expression in expression 'divide'
```
### Why are the changes needed?
To clarify the error message with this option active
### Does this PR introduce _any_ user-facing change?
Yes, change the error message
### How was this patch tested?
Unit testing and manual testing
Closes#33802 from planga82/feature/spark36488_improve_error_message.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to use the session time zone ( see the SQL config `spark.sql.session.timeZone`) instead of JVM default time zone while converting of special timestamp_ntz strings such as "today", "tomorrow" and so on.
### Why are the changes needed?
Current implementation is based on the system time zone, and it controverses to other functions/classes that use the session time zone. For example, Spark doesn't respects user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```
### Does this PR introduce _any_ user-facing change?
Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```
### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```
Closes#33838 from MaxGekk/fix-ts_ntz-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark 3.2.0 includes two new functions `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation. the since versions of both functions are `1.0.0` since they are based on the expression `RLike`:
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like
This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`
### Why are the changes needed?
Correct the SQL doc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually
Closes#33834 from gengliangwang/allowSQLFunVersion.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.
### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance,
at Spark 3.1:
```sql
select ts_col > 'now';
```
but the query fails at the moment, and users have to use typed timestamp literal:
```sql
select ts_col > timestamp'now';
```
### Does this PR introduce _any_ user-facing change?
No. Previous release 3.1 has supported the feature already till it was removed by https://github.com/apache/spark/pull/32714.
### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```
Closes#33816 from MaxGekk/foldable-datetime-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.
```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```
**Before:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
+- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
+- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
+- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
+- Project [id#37L]
+- Filter atleastnnonnulls(1, id#37L)
+- Scan ExistingRDD[__index_level_0__#36L,id#37L]
# ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```
**After:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
+- HashAggregate(keys=[id#258L], functions=[count(1)])
+- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
+- Filter atleastnnonnulls(1, id#258L)
+- Range (0, 10, step=1, splits=16)
# ^^^ Removed the Spark job execution for `zipWithIndex`
```
### Why are the changes needed?
To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.
Closes#33807 from HyukjinKwon/SPARK-36559.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When we refactor the query execution errors to use error classes in QueryExecutionErrors, we need define some exception that mix SparkThrowable into a base Exception type.
according the example [SparkArithmeticException](f90eb6a5db/core/src/main/scala/org/apache/spark/SparkException.scala (L75))
Add SparkXXXException as follows:
- `SparkClassNotFoundException`
- `SparkConcurrentModificationException`
- `SparkDateTimeException`
- `SparkFileAlreadyExistsException`
- `SparkFileNotFoundException`
- `SparkNoSuchMethodException`
- `SparkIndexOutOfBoundsException`
- `SparkIOException`
- `SparkSecurityException`
- `SparkSQLException`
- `SparkSQLFeatureNotSupportedException`
Refactor some exceptions in QueryExecutionErrors to use error classes and new exception for testing new exception
Some added by [PR](https://github.com/apache/spark/pull/33538) as follows:
- `SparkUnsupportedOperationException`
- `SparkIllegalStateException`
- `SparkNumberFormatException`
- `SparkIllegalArgumentException`
- `SparkArrayIndexOutOfBoundsException`
- `SparkNoSuchElementException`
### Why are the changes needed?
[SPARK-36336](https://issues.apache.org/jira/browse/SPARK-36336)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existed ut test
Closes#33573 from Peng-Lei/SPARK-36336.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there.
### Why are the changes needed?
Right now these aggregates are under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. It looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters. It will look strange that filters have their own package, but aggregates don't.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#33815 from huaxingao/agg_package.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This pr revert the change of SPARK-34309, includes:
- https://github.com/apache/spark/pull/31517
- https://github.com/apache/spark/pull/33772
### Why are the changes needed?
1. No really performance improvement in Spark
2. Added an additional dependency
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#33784 from LuciferYang/revert-caffeine.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change all exceptions in NoSuchItemException.scala to case classes.
### Why are the changes needed?
Exceptions in NoSuchItemException.scala are not case classes. This is causing issues because in Analyzer's executeAndCheck method always calls the `copy` method on the exception. However, since these exceptions are not case classes, the `copy` method was always delegated to `AnalysisException::copy`, which is not the specialized version.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs.
Closes#33673 from yeshengm/SPARK-36448.
Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/33775 commits the debug code mistakely.
This PR revert the test path.
### Why are the changes needed?
Revoke debug code.
### Does this PR introduce _any_ user-facing change?
'No'.
Just adjust test.
### How was this patch tested?
Revert non-ansi test path.
Closes#33787 from beliefer/SPARK-36428-followup2.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint so that it would give us more information to provide the rollback guarantee when we upgrade the RocksDB version that may introduce incompatible change in a new Spark version.
A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decide to rollback. But in the old Spark version, the old RocksDB version cannot read the new format.
In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This will ensure the user can rollback their Spark version if needed.
We also provide a config `spark.sql.streaming.stateStore.rocksdb.formatVersion` for users who don't need to rollback their Spark versions to overwrite the format version specified in the checkpoint.
### Why are the changes needed?
Provide the Spark version rollback guarantee for streaming queries when a new RocksDB introduces an incompatible format change.
### Does this PR introduce _any_ user-facing change?
No. RocksDB state store is a new feature in Spark 3.2, which has not yet released.
### How was this patch tested?
The new unit tests.
Closes#33749 from zsxwing/SPARK-36519.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to split the `dateFormat` and `timestampFormat` options in CSV/JSON datasources to:
- In write (`dateFormatInWrite`/`timestampFormatInWrite`). CSV/JSON datasource will use it in formatting of dates/timestamps. If an user doesn't initialise it, it will be set to a default value.
- In read (`dateFormatInRead`/`timestampFormatInRead`). The datasources will use it while parsing of input dates/timestamps strings. If an user doesn't set it, we will keep it as uninitialized (None), and use CAST to parse the input dates/timestamps strings.
### Why are the changes needed?
This should improve user experience with Spark SQL, and make the default parsing behavior more flexible.
### Does this PR introduce _any_ user-facing change?
Potentially, it can.
### How was this patch tested?
By existing test suites, and by new tests that are added to `JsonSuite` and to `CSVSuite`:
```
$ build/sbt "sql/test:testOnly *CSVLegacyTimeParserSuite"
$ build/sbt "sql/test:testOnly *JsonFunctionsSuite"
$ build/sbt "sql/test:testOnly *CSVv1Suite"
$ build/sbt "sql/test:testOnly *JsonV2Suite"
```
Closes#33769 from MaxGekk/split-datetime-ds-options.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to support raw string literal which escape no character using `\`.
The raw string literal is the form of `r"..."` or `r'...'` like the syntax BigQuery and Python supports.
Actually, there is no standard way to represent raw string literals.
In PostgreSQL, any special character isn't escaped by \ unless a string literal starts with E prefix.
https://www.postgresql.org/docs/13/sql-syntax-lexical.html#SQL-SYNTAX-CONSTANTS
In MySQL, special characters in a string literal are not escaped if NO_BACKSLASH_ESCAPES is enabled.
https://dev.mysql.com/doc/refman/8.0/en/string-literals.html
In MsSQLServer, any special character isn't escaped by \ but STRING_ESCAPE function can escape such characters.
https://docs.microsoft.com/en-us/sql/t-sql/functions/string-escape-transact-sql?view=sql-server-ver15
But BigQuery supports `r"..."` and `r'...'` forms so this PR proposes this syntax for the purpose.
https://cloud.google.com/bigquery/docs/reference/standard-sql/lexical#literals
### Why are the changes needed?
In the current master, sometimes it's too confusable to represent JSON and regex in a string literal if they contain backslash.
For example, in JSON, `\` needs to be escaped like as follows.
```
{"a": "\\"}
```
But, if the JSON above is represented in a string literal, further two `\` are needed because string literal also requires `\` to be escaped.
```
SELECT from_json('{"a": "\\\\"}', 'a string')
{"a":"\"}
```
With the raw string literal, we can represent such JSON like as follows.
```
SELECT from_json(r'{"a": "\\"}', 'a string')
{"a":"\"}
```
### Does this PR introduce _any_ user-facing change?
No. This PR just extends the existing syntax.
### How was this patch tested?
Added new test.
I also confirmed that the modified document is successfully built with `SKIP_API=1 bundle exec jekyll build`.
![raw_string_literal](https://user-images.githubusercontent.com/4736016/129223184-3bb4e206-f40f-42d2-a128-0cd8fc83b9c9.png)
Closes#33599 from sarutak/raw-string-literal.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The implement of https://github.com/apache/spark/pull/33665 make `make_timestamp` could accepts integer type as the seconds parameter.
This PR let `make_timestamp` accepts `decimal(16,6)` type as the seconds parameter and cast integer to `decimal(16,6)` is safe, so we can simplify the code.
### Why are the changes needed?
Simplify `make_timestamp`.
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
New tests.
Closes#33775 from beliefer/SPARK-36428-followup.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)
### Why are the changes needed?
It turns out that many users are using the group by alias feature. Spark has its precedence rule when alias names conflict with column names in Group by clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.
As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.
### Does this PR introduce _any_ user-facing change?
No, the feature is not released yet.
### How was this patch tested?
Unit tests
Closes#33758 from gengliangwang/revertGroupByAlias.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add new type `AnsiIntervalType` to `AbstractDataType.scala`, and extend it by `YearMonthIntervalType` and by `DayTimeIntervalType`
### Why are the changes needed?
To improve code maintenance. The change will allow to replace checking of both `YearMonthIntervalType` and `DayTimeIntervalType` by a check of `AnsiIntervalType`, for instance:
```scala
case _: YearMonthIntervalType | _: DayTimeIntervalType => false
```
by
```scala
case _: AnsiIntervalType => false
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By existing test suites.
Closes#33753 from MaxGekk/ansi-interval-type-trait.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Disallow comparison between Interval and String in the default type coercion rules.
### Why are the changes needed?
If a binary comparison contains interval type and string type, we can't decide which
interval type the string should be promoted as. There are many possible interval
types, such as year interval, month interval, day interval, hour interval, etc.
### Does this PR introduce _any_ user-facing change?
No, the new interval type is not released yet.
### How was this patch tested?
Existing UT
Closes#33750 from gengliangwang/disallowCom.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to use the `CAST` logic when the pattern is not specified in `DateFormatter` or `TimestampFormatter`. In particular, invoke the `DateTimeUtils.stringToTimestampAnsi()` or `stringToDateAnsi()` in the case.
### Why are the changes needed?
1. This can improve user experience with Spark SQL by making the default date/timestamp parsers more flexible and tolerant to their inputs.
2. We make the default case consistent to the behavior of the `CAST` expression which makes implementation more consistent.
### Does this PR introduce _any_ user-facing change?
The changes shouldn't introduce behavior change in regular cases but it can influence on corner cases. New implementation is able to parse more dates/timestamps by default. For instance, old (current) date parses can recognize dates only in the format **yyyy-MM-dd** but new one can handle:
* `[+-]yyyy*`
* `[+-]yyyy*-[m]m`
* `[+-]yyyy*-[m]m-[d]d`
* `[+-]yyyy*-[m]m-[d]d `
* `[+-]yyyy*-[m]m-[d]d *`
* `[+-]yyyy*-[m]m-[d]dT*`
Similarly for timestamps. The old (current) timestamp formatter is able to parse timestamps only in the format **yyyy-MM-dd HH:mm:ss** by default, but new implementation can handle:
* `[+-]yyyy*`
* `[+-]yyyy*-[m]m`
* `[+-]yyyy*-[m]m-[d]d`
* `[+-]yyyy*-[m]m-[d]d `
* `[+-]yyyy*-[m]m-[d]d [h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
* `[+-]yyyy*-[m]m-[d]dT[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
* `[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
* `T[h]h:[m]m:[s]s.[ms][ms][ms][us][us][us][zone_id]`
### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *ImageFileFormatSuite"
$ build/sbt "test:testOnly *ParquetV2PartitionDiscoverySuite"
```
Closes#33709 from MaxGekk/datetime-cast-default-pattern.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Today, when we write data to a v2 table with byName mode, we only reorder the top-level columns, not inner struct fields. This doesn't make sense as Spark should treat inner struct fields as the first-class citizen (e.g. nested column pruning, filter pushdown with nested columns).
This PR improves `TableOutputResolver` to reorder inner fields as well.
### Why are the changes needed?
better user-experience
### Does this PR introduce _any_ user-facing change?
yes, more queries are allowed to write to v2 tables.
### How was this patch tested?
new test
Closes#33728 from cloud-fan/reorder.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes an issue that `from_json` and `to_json` cannot handle `timestamp_ntz` type properly.
In the current master, `from_json`/`to_json` can handle `timestamp` type like as follows.
```
SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP");
{"a":2021-11-23 11:22:33}
```
```
SELECT to_json(map("a", TIMESTAMP"2021-11-23 11:22:33"));
{"a":"2021-11-23T11:22:33.000+09:00"}
```
But they cannot handle `timestamp_ntz` type properly.
```
SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ");
21/08/12 16:16:00 ERROR SparkSQLDriver: Failed in [SELECT from_json('{"a":"2021-11-23 11:22:33"}', "a TIMESTAMP_NTZ")]
java.lang.Exception: Unsupported type: timestamp_ntz
at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:777)
at org.apache.spark.sql.catalyst.json.JacksonParser.makeConverter(JacksonParser.scala:339)
at org.apache.spark.sql.catalyst.json.JacksonParser.$anonfun$makeConverter$17(JacksonParser.scala:313)
```
```
SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"));
21/08/12 16:14:07 ERROR SparkSQLDriver: Failed in [SELECT to_json(map("a", TIMESTAMP_NTZ"2021-11-23 11:22:33"))]
java.lang.RuntimeException: Failed to convert value 1637666553000000 (class of class java.lang.Long) with the type of TimestampNTZType to JSON.
at org.apache.spark.sql.errors.QueryExecutionErrors$.failToConvertValueToJsonError(QueryExecutionErrors.scala:294)
at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$25(JacksonGenerator.scala:201)
at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$makeWriter$25$adapted(JacksonGenerator.scala:199)
at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeMapData(JacksonGenerator.scala:253)
at org.apache.spark.sql.catalyst.json.JacksonGenerator.$anonfun$write$3(JacksonGenerator.scala:293)
at org.apache.spark.sql.catalyst.json.JacksonGenerator.writeObject(JacksonGenerator.scala:206)
at org.apache.spark.sql.catalyst.json.JacksonGenerator.write(JacksonGenerator.scala:292)
```
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#33742 from sarutak/json-ntz.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This patch supports dynamic gap duration in session window.
### Why are the changes needed?
The gap duration used in session window for now is a static value. To support more complex usage, it is better to support dynamic gap duration which determines the gap duration by looking at the current data. For example, in our usecase, we may have different gap by looking at the certain column in the input rows.
### Does this PR introduce _any_ user-facing change?
Yes, users can specify dynamic gap duration.
### How was this patch tested?
Modified existing tests and new test.
Closes#33691 from viirya/dynamic-session-window-gap.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR is related with https://github.com/apache/spark/pull/33525.
The purpose is to align error messages between the function from_json and the Json reader for unsupported key types in MapType.
Current behavior:
```
scala> spark.read.schema(StructType(Seq(StructField("col", MapType(IntegerType, StringType))))).json(Seq("""{"1": "test"}""").toDS()).show
+----+
| col|
+----+
|null|
+----+
```
```
scala> Seq("""{"1": "test"}""").toDF("col").write.json("/tmp/jsontests1234")
scala> spark.read.schema(StructType(Seq(StructField("col", MapType(IntegerType, StringType))))).json("/tmp/jsontests1234").show
+----+
| col|
+----+
|null|
+----+
```
With this change, an AnalysisException with the message `"Input schema $schema can only contain StringType as a key type for a MapType."` wil be thrown
### Why are the changes needed?
It's more consistent to align the behavior
### Does this PR introduce _any_ user-facing change?
Yes, now an Exception will be thrown
### How was this patch tested?
Unit testing, manual testing
Closes#33672 from planga82/feature/spark35320_improve_error_message_reader.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
If a binary operation contains interval type and string literal, we can't decide which interval type the string literal should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc.
The related binary operation for Interval contains
- Add
- Subtract
- Comparisions
Note that `Interval Multiple/Divide StringLiteral` is valid as them is not binary operators(the left and right are not of the same type). This PR also add tests for them.
### Why are the changes needed?
Avoid ambiguously implicit casting string literals to interval types.
### Does this PR introduce _any_ user-facing change?
No, the ANSI type coercion is not released yet.
### How was this patch tested?
New tests.
Closes#33737 from gengliangwang/disallowStringAndInterval.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
`CatalystTypeConverter.toCatalyst` method use `isInstanceOf + asInstanceOf` for type conversion, the main change of this pr is use type match to simplify this process.
`CatalystTypeConverters.createToCatalystConverter` method has a similar pattern.
### Why are the changes needed?
Code simplification
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Add a new case to `ScalaReflectionSuite` to add the coverage of the `case None` branch of `CatalystTypeConverter#toCatalyst` method
Closes#33722 from LuciferYang/SPARK-36495.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that `from_csv` and `to_csv` cannot handle `timestamp_ntz` type properly.
In the current master, to_csv/from_csv can handle timestamp type like as follows.
```
SELECT to_csv(struct(TIMESTAMP"2021-11-23 11:22:33"));
2021-11-23T11:22:33.000+09:00
```
```
SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP");
{"a":2021-11-23 11:22:33}
```
But they cannot handle timestamp_ntz type properly.
```
SELECT to_csv(struct(TIMESTAMP_NTZ"2021-11-23 11:22:33"));
-- 2021-11-23T11:22:33.000 is expected.
1637666553000000
```
```
SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP_NTZ");
21/08/12 16:12:49 ERROR SparkSQLDriver: Failed in [SELECT from_csv("2021-11-23 11:22:33", "a TIMESTAMP_NTZ")]
java.lang.Exception: Unsupported type: timestamp_ntz
at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:777)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:234)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#33719 from sarutak/csv-ntz.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
With ANSI mode, `SELECT make_timestamp(1, 1, 1, 1, 1, 1)` fails, because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly casted to DECIMAL(8,6) under ANSI mode.
```
org.apache.spark.sql.AnalysisException
cannot resolve 'make_timestamp(1, 1, 1, 1, 1, 1)' due to data type mismatch: argument 6 requires decimal(8,6) type, however, '1' is of int type.; line 1 pos 7
```
We should update the function `make_timestamp` to allow integer type 'seconds' parameter.
### Why are the changes needed?
Make `make_timestamp` could accepts integer as 'seconds' parameter.
### Does this PR introduce _any_ user-facing change?
'Yes'.
`make_timestamp` could accepts integer as 'seconds' parameter.
### How was this patch tested?
New tests.
Closes#33665 from beliefer/SPARK-36428.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes an existing correctness issue where a non-deterministic With-CTE can be executed multiple times producing different results, by deferring the inline of With-CTE to after the analysis stage. This fix also provides the future opportunity of performance improvement by executing deterministic With-CTEs only once in some circumstances.
The major changes include:
1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, `WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved version of `With`, only that: 1) `WithCTE` is a multi-children logical node so that most logical rules can automatically apply to CTE defs; 2) In the main query and each subquery, there can only be at most one `WithCTE`, which means nested With-CTEs are combined.
2. Changed `CTESubstitution` rule so that if NOT in legacy mode, CTE defs will not be inlined immediately, but rather transformed into a `CTERelationRef` per reference.
3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update `CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) `InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only ONE reference; 3) `UpdateCTERelationStats` - to update stats for `CTERelationRef`s that are not inlined.
4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an independent shuffle with round-robin partitioning so that such CTEs will only be materialized once and different references will later be a shuffle reuse.
A current limitation is that With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.
### Why are the changes needed?
This is a correctness issue. Non-deterministic CTEs should produce the same output regardless of how many times it is referenced/used in query, while under the current implementation there is no such guarantee and would lead to incorrect query results.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added UTs.
Regenerated golden files for TPCDS plan stability tests. There is NO change to the `simplified.txt` files, the only differences are expression IDs.
Closes#33671 from maryannxue/spark-36447.
Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to remove `jaxb-api` usage from `sql/catalyst` module.
### Why are the changes needed?
We only use `DatatypeConverter.parseHexBinary` and `DatatypeConverter.printHexBinary` twice.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes#33732 from dongjoon-hyun/SPARK-36502.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, `null + interval` will become `cast(cast(null as timestamp) + interval) as null`. This is a unexpected behavior and the result should not be of null type.
This weird behavior applies to `null - interval`, `interval + null`, `interval - null` as well.
To change it, I propose to cast the null as the same data type of the other element in the add/subtract:
```
null + interval => cast(null as interval) + interval
null - interval => cast(null as interval) - interval
interval + null=> interval + cast(null as interval)
interval - null => interval - cast(null as interval)
```
### Why are the changes needed?
Change the confusing behavior of `Interval +/- NULL` and `NULL +/- Interval`
### Does this PR introduce _any_ user-facing change?
No, the new interval type is not released yet.
### How was this patch tested?
Existing UT
Closes#33727 from gengliangwang/intervalTypeCoercion.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Promote more string literal in subtractions. In the ANSI type coercion rule, we already promoted
```
string - timestamp => cast(string as timestamp) - timestamp
```
This PR is to promote the following string literals:
```
string - date => cast(string as date) - date
date - string => date - cast(date as string)
timestamp - string => timestamp
```
It is very straightforward to cast the string literal as the data type of the other side in the subtraction.
2. Merge the string promotion logic from the rule `StringLiteralCoercion`:
```
date_sub(date, string) => date_sub(date, cast(string as int))
date_add(date, string) => date_add(date, cast(string as int))
```
### Why are the changes needed?
1. Promote the string literal in the subtraction as the data type of the other side. This is straightforward and consistent with PostgreSQL
2. Certerize all the string literal promotion in the ANSI type coercion rule
### Does this PR introduce _any_ user-facing change?
No, the new ANSI type coercion rules are not released yet.
### How was this patch tested?
Existing UT
Closes#33724 from gengliangwang/datetimeTypeCoercion.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Implement a new rule for the date-time operations in the ANSI type coercion system:
1. Date will be converted to Timestamp when it is in the subtraction with Timestmap.
2. Promote string literals in date_add/date_sub/time_add
### Why are the changes needed?
Currently the type coercion rule `DateTimeOperations` doesn't match the design of the ANSI type coercion system:
1. For date_add/date_sub, if the input is timestamp type, Spark should not convert it into date type since date type is narrower than the timestamp type.
2. For date_add/date_sub/time_add, string value can be implicit cast to date/timestamp only when it is literal.
Thus, we need to have a new rule for the date-time operations in the ANSI type coercion system.
### Does this PR introduce _any_ user-facing change?
No, the ANSI type coercion rules are not releaesd.
### How was this patch tested?
New UT
Closes#33666 from gengliangwang/datetimeOp.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes to prohibit update mode in streaming aggregation with session window.
UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the implementation of iterator to support outputs of update mode.
This PR also cleans up test code via deduplicating.
### Why are the changes needed?
The semantic of "update" mode for session window based streaming aggregation is quite unclear.
For normal streaming aggregation, Spark will provide the outputs which can be "upsert"ed based on the grouping key. This is based on the fact grouping key won't be changed.
This doesn't hold true for session window based streaming aggregation, as session range is changing.
If end users leverage their knowledge about streaming aggregation, they will consider the key as grouping key + session (since they'll specify these things in groupBy), and it's high likely possible that existing row is not updated (overwritten) and ended up with having different rows.
If end users consider the key as grouping key, there's a small chance for end users to upsert the session correctly, though only the last updated session will be stored so it won't work with event time processing which there could be multiple active sessions.
### Does this PR introduce _any_ user-facing change?
No, as we haven't released this feature.
### How was this patch tested?
Updated tests.
Closes#33689 from HeartSaVioR/SPARK-36463.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`.
```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>>
-- !query output
{"t":null}
```
```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```
We should make `from_json` throws exception too.
This PR fix the discussion below
https://github.com/apache/spark/pull/33640#discussion_r682862523
### Why are the changes needed?
Make the behavior of `from_json` more reasonable.
### Does this PR introduce _any_ user-facing change?
'Yes'.
from_json throwing Exception when we set spark.sql.timestampType=TIMESTAMP_NTZ.
### How was this patch tested?
Tests updated.
Closes#33684 from beliefer/SPARK-36429-new.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support TypeCoercion of ANSI intervals with different fields
### Why are the changes needed?
Support TypeCoercion of ANSI intervals with different fields
### Does this PR introduce _any_ user-facing change?
After this pr user can
- use comparison function with different fields of DayTimeIntervalType/YearMonthIntervalType such as `INTERVAL '1' YEAR` > `INTERVAL '11' MONTH`
- support different field of ansi interval type in collection function such as `array(INTERVAL '1' YEAR, INTERVAL '11' MONTH)`
- support different field of ansi interval type in `coalesce` etc..
### How was this patch tested?
Added UT
Closes#33661 from AngersZhuuuu/SPARK-SPARK-36431.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Retain `spark.sql.catalog.*` confs when resolving view.
### Why are the changes needed?
Currently, if a view in default catalog ref a table in another catalog (e.g. jdbc), `org.apache.spark.sql.AnalysisException: Table or view not found: cat.t` will be thrown on accessing the view if the catalog has not been loaded yet.
### Does this PR introduce _any_ user-facing change?
Yes, bug fix.
### How was this patch tested?
Add UT.
Closes#33692 from pan3793/SPARK-36466.
Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it.
### Why are the changes needed?
To check the duplicate columns during analysis.
### Does this PR introduce _any_ user-facing change?
Yes, now the above will command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```
### How was this patch tested?
Added new unit tests
Closes#33676 from imback82/replace_cols_duplicates.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- This PR revisits https://github.com/apache/spark/pull/22309, and [SPARK-20384](https://issues.apache.org/jira/browse/SPARK-20384) solving the original problem, but additionally will prevent backward-compat break on schema of top-level `AnyVal` value class.
- Why previous break? We currently support top-level value classes just as any other case class; field of the underlying type is present in schema. This means any dataframe SQL filtering on this expects the field name to be present. The previous PR changes this schema and would result in breaking current usage. See test `"schema for case class that is a value class"`. This PR keeps the schema.
- We actually currently support collection of value classes prior to this change, but not case class of nested value class. This means the schema of these classes shouldn't change to prevent breaking too.
- However, what we can change, without breaking, is schema of nested value class, which will fails due to the compile problem, and thus its schema now isn't actually valid. After the change, the schema of this nested value class is now flattened
- With this PR, there's flattening only for nested value class (new), but not for top-level and collection classes (existing behavior)
- This PR revisits https://github.com/apache/spark/pull/27153 by handling tuple `Tuple2[AnyVal, AnyVal]` which is a constructor ("nested class") but is a generic type, so it should not be flattened behaving similarly to `Seq[AnyVal]`
### Why are the changes needed?
- Currently, nested value class isn't supported. This is because when the generated code treats `anyVal` class in its unwrapped form, but we encode the type to be the wrapped case class. This results in compile of generated code
For example,
For a given `AnyVal` wrapper and its root-level class container
```
case class IntWrapper(i: Int) extends AnyVal
case class ComplexValueClassContainer(c: IntWrapper)
```
The problematic part of generated code:
```
private InternalRow If_1(InternalRow i) {
boolean isNull_42 = i.isNullAt(0);
// 1) ******** The root-level case class we care
org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer value_46 = isNull_42 ?
null : ((org.apache.spark.sql.catalyst.encoders.ComplexValueClassContainer) i.get(0, null));
if (isNull_42) {
throw new NullPointerException(((java.lang.String) references[5] /* errMsg */ ));
}
boolean isNull_39 = true;
// 2) ******** We specify its member to be unwrapped case class extending `AnyVal`
org.apache.spark.sql.catalyst.encoders.IntWrapper value_43 = null;
if (!false) {
isNull_39 = false;
if (!isNull_39) {
// 3) ******** ERROR: `c()` compiled however is of type `int` and thus we see error
value_43 = value_46.c();
}
}
```
We get this errror: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException:
File 'generated.java', Line 159, Column 1: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 159, Column 1: Assignment conversion not possible from type "int" to type "org.apache.spark.sql.catalyst.encoders.IntWrapper"
```
From [doc](https://docs.scala-lang.org/overviews/core/value-classes.html) on value class: , Given: `class Wrapper(val underlying: Int) extends AnyVal`,
1) "The type at compile time is `Wrapper`, but at runtime, the representation is an `Int`". This implies that when our struct has a field of value class, the generated code should support the underlying type during runtime execution.
2) `Wrapper` "must be instantiated... when a value class is used as a type argument". This implies that `scala.Tuple[Wrapper, ...], Seq[Wrapper], Map[String, Wrapper], Option[Wrapper]` will still contain Wrapper as-is in during runtime instead of `Int`.
### Does this PR introduce _any_ user-facing change?
- Yes, this will allow support for the nested value class.
### How was this patch tested?
- Added unit tests to illustrate
- raw schema
- projection
- round-trip encode/decode
Closes#33205 from mickjermsurawong-stripe/SPARK-20384-2.
Lead-authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Co-authored-by: Emil Ejbyfeldt <eejbyfeldt@liveintent.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR fixes a performance regression introduced in https://github.com/apache/spark/pull/33172
Before #33172 , the target size is adaptively calculated based on the default parallelism of the spark cluster. Sometimes it's very small and #33172 sets a min partition size to fix perf issues. Sometimes the calculated size is reasonable, such as dozens of MBs.
After #33172 , we no longer calculate the target size adaptively, and by default always coalesce the partitions into 1 MB. This can cause perf regression if the adaptively calculated size is reasonable.
This PR brings back the code that adaptively calculate the target size based on the default parallelism of the spark cluster.
### Why are the changes needed?
fix perf regression
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#33655 from cloud-fan/minor.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
* override the maxRows method in `LogicalQueryStage`
* add rule `EliminateLimits` in `AQEOptimizer`
### Why are the changes needed?
In Ad-hoc scenario, we always add limit for the query if user have no special limit value, but not all limit is nesessary.
With the power of AQE, we can eliminate limits using running statistics.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
add test
Closes#33651 from ulysses-you/SPARK-36424.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark should check result plan's output schema name
### Why are the changes needed?
In current code, some optimizer rule may change plan's output schema, since in the code we always use semantic equal to check output, but it may change the plan's output schema.
For example, for SchemaPruning, if we have a plan
```
Project[a, B]
|--Scan[A, b, c]
```
the origin output schema is `a, B`, after SchemaPruning. it become
```
Project[A, b]
|--Scan[A, b]
```
It change the plan's schema. when we use CTAS, the schema is same as query plan's output.
Then since we change the schema, it not consistent with origin SQL. So we need to check final result plan's schema with origin plan's schema
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existed UT
Closes#33583 from AngersZhuuuu/SPARK-36352.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>