### What changes were proposed in this pull request?
Fix [comment](https://github.com/apache/spark/pull/32488#issuecomment-879315179)
This PR fix rule `UnwrapCastInBinaryComparison` bug. Rule UnwrapCastInBinaryComparison should skip In expression when in.list contains an expression that is not literal.
- In
Before this pr, the following example will throw an exception.
```scala
withTable("tbl") {
sql("CREATE TABLE tbl (d decimal(33, 27)) USING PARQUET")
sql("SELECT d FROM tbl WHERE d NOT IN (d + 1)")
}
```
- InSet
As the analyzer guarantee that all the elements in the `inSet.hset` are literal, so this is not an issue for `InSet`.
fbf53dee37/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (L264-L279)
### Does this PR introduce _any_ user-facing change?
No, only bug fix.
### How was this patch tested?
New test.
Closes#33335 from cfmcgrady/SPARK-36130.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 103d16e868)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`LOCALTIMESTAMP()` is a datetime value function from ANSI SQL.
The syntax show below:
```
<datetime value function> ::=
<current date value function>
| <current time value function>
| <current timestamp value function>
| <current local time value function>
| <current local timestamp value function>
<current date value function> ::=
CURRENT_DATE
<current time value function> ::=
CURRENT_TIME [ <left paren> <time precision> <right paren> ]
<current local time value function> ::=
LOCALTIME [ <left paren> <time precision> <right paren> ]
<current timestamp value function> ::=
CURRENT_TIMESTAMP [ <left paren> <timestamp precision> <right paren> ]
<current local timestamp value function> ::=
LOCALTIMESTAMP [ <left paren> <timestamp precision> <right paren> ]
```
`LOCALTIMESTAMP()` returns the current timestamp at the start of query evaluation as TIMESTAMP WITH OUT TIME ZONE. This is similar to `CURRENT_TIMESTAMP()`.
Note we need to update the optimization rule `ComputeCurrentTime` so that Spark returns the same result in a single query if the function is called multiple times.
### Why are the changes needed?
`CURRENT_TIMESTAMP()` returns the current timestamp at the start of query evaluation.
`LOCALTIMESTAMP()` returns the current timestamp without time zone at the start of query evaluation.
The `LOCALTIMESTAMP` function is an ANSI SQL.
The `LOCALTIMESTAMP` function is very useful.
### Does this PR introduce _any_ user-facing change?
'Yes'. Support new function `LOCALTIMESTAMP()`.
### How was this patch tested?
New tests.
Closes#33258 from beliefer/SPARK-36037.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b4f7758944)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make sure all physical plans of TPCDS queries are valid (satisfy the partitioning requirement).
### Why are the changes needed?
improve test coverage
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes#33248 from cloud-fan/aqe2.
Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 583173b7cc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds an INVALID_FIELD_NAME error class for the errors in `StructType.findNestedField`. It also cleans up the code there and adds UT for this method.
### Why are the changes needed?
follow the new error message framework
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#33282 from cloud-fan/error.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove the private method `checkIntervalStringDataType()` from `IntervalUtils` since it hasn't been used anymore after https://github.com/apache/spark/pull/33242.
### Why are the changes needed?
To improve code maintenance.
### Does this PR introduce _any_ user-facing change?
No. The method is private, and it existing in code base for short time.
### How was this patch tested?
By existing GAs/tests.
Closes#33321 from MaxGekk/SPARK-35735-remove-unused-method.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 1ba3982d16)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR allow the parser to parse unit list interval literals like `'3' day '10' hours '3' seconds` or `'8' years '3' months` as `YearMonthIntervalType` or `DayTimeIntervalType`.
### Why are the changes needed?
For ANSI compliance.
### Does this PR introduce _any_ user-facing change?
Yes. I noted the following things in the `sql-migration-guide.md`.
* Unit list interval literals are parsed as `YearMonthIntervaType` or `DayTimeIntervalType` instead of `CalendarIntervalType`.
* `WEEK`, `MILLISECONS`, `MICROSECOND` and `NANOSECOND` are not valid units for unit list interval literals.
* Units of year-month and day-time cannot be mixed like `1 YEAR 2 MINUTES`.
### How was this patch tested?
New tests and modified tests.
Closes#32949 from sarutak/day-time-multi-units.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e92ef825a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add new SQL function `to_timestamp_ltz`
syntax:
```
to_timestamp_ltz(timestamp_str_column[, fmt])
to_timestamp_ltz(timestamp_column)
to_timestamp_ltz(date_column)
```
### Why are the changes needed?
As the result of to_timestamp become consistent with the SQL configuration spark.sql.timestmapType and there is already a SQL function to_timestmap_ntz, we need new function to_timestamp_ltz to construct timestamp with local time zone values.
### Does this PR introduce _any_ user-facing change?
Yes, a new function for constructing timestamp with local time zone values
### How was this patch tested?
Unit test
Closes#33318 from gengliangwang/to_timestamp_ltz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 01ddaf3918)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR modifies `DecorrelateInnerQuery` to handle the COUNT bug for lateral subqueries. Similar to SPARK-15370, rewriting lateral subqueries as joins can change the semantics of the subquery and lead to incorrect answers.
However we can't reuse the existing code to handle the count bug for correlated scalar subqueries because it assumes the subquery to have a specific shape (either with Filter + Aggregate or Aggregate as the root node). Instead, this PR proposes a more generic way to handle the COUNT bug. If an Aggregate is subject to the COUNT bug, we insert a left outer domain join between the outer query and the aggregate with a `alwaysTrue` marker and rewrite the final result conditioning on the marker. For example:
```sql
-- t1: [(0, 1), (1, 2)]
-- t2: [(0, 2), (0, 3)]
select * from t1 left outer join lateral (select count(*) from t2 where t2.c1 = t1.c1)
```
Without count bug handling, the query plan is
```
Project [c1#44, c2#45, count(1)#53L]
+- Join LeftOuter, (c1#48 = c1#44)
:- LocalRelation [c1#44, c2#45]
+- Aggregate [c1#48], [count(1) AS count(1)#53L, c1#48]
+- LocalRelation [c1#48]
```
and the answer is wrong:
```
+---+---+--------+
|c1 |c2 |count(1)|
+---+---+--------+
|0 |1 |2 |
|1 |2 |null |
+---+---+--------+
```
With the count bug handling:
```
Project [c1#1, c2#2, count(1)#10L]
+- Join LeftOuter, (c1#34 <=> c1#1)
:- LocalRelation [c1#1, c2#2]
+- Project [if (isnull(alwaysTrue#32)) 0 else count(1)#33L AS count(1)#10L, c1#34]
+- Join LeftOuter, (c1#5 = c1#34)
:- Aggregate [c1#1], [c1#1 AS c1#34]
: +- LocalRelation [c1#1]
+- Aggregate [c1#5], [count(1) AS count(1)#33L, c1#5, true AS alwaysTrue#32]
+- LocalRelation [c1#5]
```
and we have the correct answer:
```
+---+---+--------+
|c1 |c2 |count(1)|
+---+---+--------+
|0 |1 |2 |
|1 |2 |0 |
+---+---+--------+
```
### Why are the changes needed?
Fix a correctness bug with lateral join rewrite.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added SQL query tests. The results are consistent with Postgres' results.
Closes#33070 from allisonwang-db/spark-35551-lateral-count-bug.
Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4f760f2b1f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR reverts https://github.com/apache/spark/pull/32455 and its followup https://github.com/apache/spark/pull/32536 , because the new janino version has a bug that is not fixed yet: https://github.com/janino-compiler/janino/pull/148
### Why are the changes needed?
avoid regressions
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#33302 from cloud-fan/revert.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae6199af44)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support new functions make_timestamp_ntz and make_timestamp_ltz
Syntax:
* `make_timestamp_ntz(year, month, day, hour, min, sec)`: Create local date-time from year, month, day, hour, min, sec fields
* `make_timestamp_ltz(year, month, day, hour, min, sec[, timezone])`: Create current timestamp with local time zone from year, month, day, hour, min, sec and timezone fields
### Why are the changes needed?
As the result of `make_timestamp` become consistent with the SQL configuration `spark.sql.timestmapType`, we need these two new functions to construct timestamp literals. They align to the functions [`make_timestamp` and `make_timestamptz`](https://www.postgresql.org/docs/9.4/functions-datetime.html) in PostgreSQL
### Does this PR introduce _any_ user-facing change?
Yes, two new datetime functions: make_timestamp_ntz and make_timestamp_ltz.
### How was this patch tested?
End-to-end tests.
Closes#33299 from gengliangwang/make_timestamp_ntz_ltz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 92bf83ed0a)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR group exception messages in sql/core/src/main/scala/org/apache/spark/sql/execution/command
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce any user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32951 from dgd-contributor/SPARK-33603_grouping_execution/command.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit d03f71657e)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
The SQL function TO_TIMESTAMP should return different results based on the default timestamp type:
* when "spark.sql.timestampType" is TIMESTAMP_NTZ, return TimestampNTZType literal
* when "spark.sql.timestampType" is TIMESTAMP_LTZ, return TimestampType literal
This PR also refactor the class GetTimestamp and GetTimestampNTZ to reduce duplicated code.
### Why are the changes needed?
As "spark.sql.timestampType" sets the default timestamp type, the to_timestamp function should behave consistently with it.
### Does this PR introduce _any_ user-facing change?
Yes, when the value of "spark.sql.timestampType" is TIMESTAMP_NTZ, the result type of `TO_TIMESTAMP` is of TIMESTAMP_NTZ type.
### How was this patch tested?
Unit test
Closes#33280 from gengliangwang/to_timestamp.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 32720dd3e1)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The functions `unix_timestamp`/`to_unix_timestamp` should be able to accept input of `TimestampNTZType`.
### Why are the changes needed?
The functions `unix_timestamp`/`to_unix_timestamp` should be able to accept input of `TimestampNTZType`.
### Does this PR introduce _any_ user-facing change?
'Yes'.
### How was this patch tested?
New tests.
Closes#33278 from beliefer/SPARK-36044.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 8738682f6a)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The SQL function MAKE_TIMESTAMP should return different results based on the default timestamp type:
* when "spark.sql.timestampType" is TIMESTAMP_NTZ, return TimestampNTZType literal
* when "spark.sql.timestampType" is TIMESTAMP_LTZ, return TimestampType literal
### Why are the changes needed?
As "spark.sql.timestampType" sets the default timestamp type, the make_timestamp function should behave consistently with it.
### Does this PR introduce _any_ user-facing change?
Yes, when the value of "spark.sql.timestampType" is TIMESTAMP_NTZ, the result type of `MAKE_TIMESTAMP` is of TIMESTAMP_NTZ type.
### How was this patch tested?
Unit test
Closes#33290 from gengliangwang/mkTS.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 17ddcc9e82)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Remove IntervalUnit
### Why are the changes needed?
Clean code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not need
Closes#33265 from AngersZhuuuu/SPARK-36049.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit fef7e1703c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support group by TimestampNTZ type column
### Why are the changes needed?
It's a basic SQL operation.
### Does this PR introduce _any_ user-facing change?
No, the new timestmap type is not released yet.
### How was this patch tested?
Unit test
Closes#33268 from gengliangwang/agg.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 382b66e267)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Currently the TimestampNTZ literals shows only long value instead of timestamp string in its SQL string and toString result.
Before changes (with default timestamp type as TIMESTAMP_NTZ)
```
– !query
select timestamp '2019-01-01\t'
– !query schema
struct<1546300800000000:timestamp_ntz>
```
After changes:
```
– !query
select timestamp '2019-01-01\t'
– !query schema
struct<TIMESTAMP_NTZ '2019-01-01 00:00:00':timestamp_ntz>
```
### Why are the changes needed?
Make the schema of TimestampNTZ literals readable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#33269 from gengliangwang/ntzLiteralString.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ee945e99cc)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
When exec the command `SHOW CREATE TABLE`, we should not lost the info null flag if the table column that
is specified `NOT NULL`
### Why are the changes needed?
[SPARK-36012](https://issues.apache.org/jira/browse/SPARK-36012)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add UT test for V1 and existed UT for V2
Closes#33219 from Peng-Lei/SPARK-36012.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e071721a51)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Refactors the base Throwable trait `SparkError.scala` (introduced in SPARK-34920) an interface `SparkThrowable.java`.
### Why are the changes needed?
- Renaming `SparkError` to `SparkThrowable` better reflect sthat this is the base interface for both `Exception` and `Error`
- Migrating to Java maximizes its extensibility
### Does this PR introduce _any_ user-facing change?
Yes; the base trait has been renamed and the accessor methods have changed (eg. `sqlState` -> `getSqlState()`).
### How was this patch tested?
Unit tests.
Closes#33164 from karenfeng/SPARK-35958.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 71c086eb87)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
With more thought, all DT/YM function use field byte to keep consistence is better
### Why are the changes needed?
Keep code consistence
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Not need
Closes#33252 from AngersZhuuuu/SPARK-36021-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 89aa16b4a8)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue about `extract`.
`Extract` should process only existing fields of interval types. For example:
```
spark-sql> SELECT EXTRACT(MONTH FROM INTERVAL '2021-11' YEAR TO MONTH);
11
spark-sql> SELECT EXTRACT(MONTH FROM INTERVAL '2021' YEAR);
0
```
The last command should fail as the month field doesn't present in INTERVAL YEAR.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33247 from sarutak/fix-extract-interval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 39002cb995)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
For case
```
spark-sql> select interval '123456:12' minute to second;
Error in query:
requirement failed: Interval string must match day-time format of '^(?<sign>[+|-])?(?<minute>\d{1,2}):(?<second>(\d{1,2})(\.(\d{1,9}))?)$': 123456:12, set spark.sql.legacy.fromDayTimeString.enabled to true to restore the behavior before Spark 3.0.(line 1, pos 16)
== SQL ==
select interval '123456:12' minute to second
----------------^^^
```
we should support hour/minute/second when for more than 2 digits when parse interval literal string
### Why are the changes needed?
Keep consistence
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#33231 from AngersZhuuuu/SPARK-36021.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit ea3333a200)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The method `WindowSpecDefinition.isValidFrameType` doesn't consider `TimestampNTZType`. We should support it as for `TimestampType`.
### Why are the changes needed?
Support `TimestampNTZType` in the Window spec definition.
### Does this PR introduce _any_ user-facing change?
'Yes'. This PR allows users use `TimestampNTZType` as the sort spec in window spec definition.
### How was this patch tested?
New tests.
Closes#33246 from beliefer/SPARK-36015.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 62ff2add94)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The current `ApproxCountDistinctForInterval`s supports `TimestampType`, but not supports timestamp without time zone yet.
This PR will add the function.
### Why are the changes needed?
`ApproxCountDistinctForInterval` need supports `TimestampNTZType`.
### Does this PR introduce _any_ user-facing change?
'Yes'. `ApproxCountDistinctForInterval` accepts `TimestampNTZType`.
### How was this patch tested?
New tests.
Closes#33243 from beliefer/SPARK-36016.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit be382a6285)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The current `ApproximatePercentile` supports `TimestampType`, but not supports timestamp without time zone yet.
This PR will add the function.
### Why are the changes needed?
`ApproximatePercentile` need supports `TimestampNTZType`.
### Does this PR introduce _any_ user-facing change?
'Yes'. `ApproximatePercentile` accepts `TimestampNTZType`.
### How was this patch tested?
New tests.
Closes#33241 from beliefer/SPARK-36017.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit cc4463e818)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
When cast `10:10` to interval minute to second, it can be catch by hour to minute regex, here to fix this.
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#33242 from AngersZhuuuu/SPARK-35735-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 3953754f36)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Make it recursive remove sort if the maximum number of rows less than or equal to 1. For example:
```sql
select a from (select a from values(0, 1) t(a, b) order by a) order by a
```
### Why are the changes needed?
Fix Once strategy's idempotence is broken for batch Eliminate Sorts.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#33240 from wangyum/SPARK-35906-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ddc5cb9051)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a config `spark.sql.join.forceApplyShuffledHashJoin` to force applying shuffled hash join
during the join selection.
### Why are the changes needed?
In the `SQLQueryTestSuite`, we want to cover 3 kinds of join (BHJ, SHJ, SMJ) in join.sql. But even
if the `spark.sql.join.preferSortMergeJoin` is set to `false`, shuffled hash join is still not guaranteed.
Thus, we need another config to force the selection.
### Does this PR introduce _any_ user-facing change?
No, only for testing
### How was this patch tested?
newly added tests
Verified all queries in join.sql will use `ShuffledHashJoin` when the config set to `true`
Closes#33182 from linhongliu-db/SPARK-35984-hash-join-config.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7566db6033)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Some of the test cases in `DateExpressionsSuite` are quite slow:
- `Hour`: 24s
- `Minute`: 26s
- `Day / DayOfMonth`: 8s
- `Year`: 4s
Each test case has a large loop. We should improve them.
### Why are the changes needed?
Save test running time
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Verified the run times on local:
- `Hour`: 2s
- `Minute`: 3.2
- `Day / DayOfMonth`:0.5s
- `Year`: 2s
Total reduced time: 54.3s
Closes#33229 from gengliangwang/improveTest.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit d5d1222686)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Refactor code about parse string to DT/YM intervals.
### Why are the changes needed?
Extracting the common code about parse string to DT/YM should improve code maintenance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existed UT.
Closes#33217 from AngersZhuuuu/SPARK-35735-35768.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 26d1bb16bc)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue that `from_csv/to_csv` doesn't handle day-time intervals properly.
`from_csv` throws exception if day-time interval types are given.
```
spark-sql> select from_csv("interval '1 2:3:4' day to second", "a interval day to second");
21/07/03 04:39:13 ERROR SparkSQLDriver: Failed in [select from_csv("interval '1 2:3:4' day to second", "a interval day to second")]
java.lang.Exception: Unsupported type: interval day to second
at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:775)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:224)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```
Also, `to_csv` doesn't handle day-time interval types properly though any exception is thrown.
The result of `to_csv` for day-time interval types is not ANSI interval compliant form.
```
spark-sql> select to_csv(named_struct("a", interval '1 2:3:4' day to second));
93784000000
```
The result above should be `INTERVAL '1 02:03:04' DAY TO SECOND`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33226 from sarutak/csv-dtinterval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit def8bc5c96)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes two issues. One is that `to_json` doesn't support `map` types where value types are `day-time` interval types like:
```
spark-sql> select to_json(map('a', interval '1 2:3:4' day to second));
21/07/06 14:53:58 ERROR SparkSQLDriver: Failed in [select to_json(map('a', interval '1 2:3:4' day to second))]
java.lang.RuntimeException: Failed to convert value 93784000000 (class of class java.lang.Long) with the type of DayTimeIntervalType(0,3) to JSON.
```
The other issue is that even if the issue of `to_json` is resolved, `from_json` doesn't support to convert `day-time` interval string to JSON. So the result of following query will be `null`.
```
spark-sql> select from_json(to_json(map('a', interval '1 2:3:4' day to second)), 'a interval day to second');
{"a":null}
```
### Why are the changes needed?
There should be no reason why day-time intervals cannot used as map value types.
`CalendarIntervalTypes` can do it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33225 from sarutak/json-dtinterval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit c8ff613c3c)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Ideally, in SQL query, nested columns should result to GetStructField with non-None name. But there are places that can create GetStructField with None name, such as UnresolvedStar.expand, Dataset encoder stuff, etc.
the current `nestedFieldToAlias` cannot catch it up and will cause job failed.
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT,
Closes#33183 from AngersZhuuuu/SPARK-35972.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 87282f04bf)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Support new keyword `TIMESTAMP_LTZ`, which can be used for:
- timestamp with local time zone data type in DDL
- timestamp with local time zone data type in Cast clause.
- timestamp with local time zone data type literal
### Why are the changes needed?
Users can use `TIMESTAMP_LTZ` in DDL/Cast/Literals for the timestamp with local time zone type directly. The new keyword is independent of the SQL configuration `spark.sql.timestampType`.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#33224 from gengliangwang/TIMESTAMP_LTZ.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit b0b9643cd7)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/33113, to do some code cleanup:
1. `UnresolvedFieldPosition` doesn't need to include the field name. We can get it through "context" (`AlterTableAlterColumn.column.name`).
2. Run `ResolveAlterTableCommands` in the main resolution batch, so that the column/field resolution is also unified between v1 and v2 commands (same error message).
3. Fail immediately in `ResolveAlterTableCommands` if we can't resolve the field, instead of waiting until `CheckAnalysis`. We don't expect other rules to resolve fields in ALTER TABLE commands, so failing immediately is simpler and we can remove duplicated code in `CheckAnalysis`.
### Why are the changes needed?
code simplification.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#33213 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8b46e26fc6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support new keyword TIMESTAMP_NTZ, which can be used for:
- timestamp without time zone data type in DDL
- timestamp without time zone data type in Cast clause.
- timestamp without time zone data type literal
### Why are the changes needed?
Users can use `TIMESTAMP_NTZ` in DDL/Cast/Literals for the timestamp without time zone type directly.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#33221 from gengliangwang/timstamp_ntz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 5f44acff3d)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
For the timestamp literal, it should have the following behavior.
1. When `spark.sql.timestampType` is TIMESTAMP_NTZ: if there is no time zone part, return timestamp without time zone literal; otherwise, return timestamp with local time zone literal
2. When `spark.sql.timestampType` is TIMESTAMP_LTZ: return timestamp with local time zone literal
### Why are the changes needed?
When the default timestamp type is TIMESTAMP_NTZ, the result of type literal should return TIMESTAMP_NTZ when there is no time zone part in the string.
From setion 5.3 "literal" of ANSI SQL standard 2011:
```
27) The declared type of a <timestamp literal> that does not specify <time zone interval> is TIMESTAMP(P) WITHOUT TIME ZONE, where P is the number of digits in <seconds fraction>, if specified, and 0 (zero) otherwise. The declared type of a <timestamp literal> that specifies <time zone interval> is TIMESTAMP(P) WITH TIME ZONE, where P is the number of digits in <seconds fraction>, if specified, and 0 (zero) otherwise.
```
Since we don't have "timestamp with time zone", we use timestamp with local time zone instead.
### Does this PR introduce _any_ user-facing change?
No, the new timestmap type and the default timestamp configuration is not released yet.
### How was this patch tested?
Unit test
Closes#33215 from gengliangwang/tsLiteral.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 2fffec7de8)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fix the incorrect comment for `TimestampNTZType`.
### Why are the changes needed?
Fix the incorrect comment
### Does this PR introduce _any_ user-facing change?
'No'.
### How was this patch tested?
No need.
Closes#33218 from beliefer/SPARK-35664-followup.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit c605ba2d46)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that `from_csv/to_csv` doesn't handle year-month intervals properly.
`from_csv` throws exception if year-month interval types are given.
```
spark-sql> select from_csv("interval '1-2' year to month", "a interval year to month");
21/07/03 04:32:24 ERROR SparkSQLDriver: Failed in [select from_csv("interval '1-2' year to month", "a interval year to month")]
java.lang.Exception: Unsupported type: interval year to month
at org.apache.spark.sql.errors.QueryExecutionErrors$.unsupportedTypeError(QueryExecutionErrors.scala:775)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.makeConverter(UnivocityParser.scala:224)
at org.apache.spark.sql.catalyst.csv.UnivocityParser.$anonfun$valueConverters$1(UnivocityParser.scala:134)
```
Also, `to_csv` doesn't handle year-month interval types properly though any exception is thrown.
The result of `to_csv` for year-month interval types is not ANSI interval compliant form.
```
spark-sql> select to_csv(named_struct("a", interval '1-2' year to month));
14
```
The result above should be `INTERVAL '1-2' YEAR TO MONTH`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33210 from sarutak/csv-yminterval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit f4237aff7e)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Current AQE has cost evaluator to decide whether to use new plan after replanning. The current used evaluator is `SimpleCostEvaluator` to make decision based on number of shuffle in the query plan. This is not perfect cost evaluator, and different production environments might want to use different custom evaluators. E.g., sometimes we might want to still do skew join even though it might introduce extra shuffle (trade off resource for better latency), sometimes we might want to take sort into consideration for cost as well. Take our own setting as an example, we are using a custom remote shuffle service (Cosco), and the cost model is more complicated. So We want to make the cost evaluator to be pluggable, and developers can implement their own `CostEvaluator` subclass and plug in dynamically based on configuration.
The approach is to introduce a new config to allow define sub-class name of `CostEvaluator` - `spark.sql.adaptive.customCostEvaluatorClass`. And add `CostEvaluator.instantiate` to instantiate the cost evaluator class in `AdaptiveSparkPlanExec.costEvaluator`.
### Why are the changes needed?
Make AQE cost evaluation more flexible.
### Does this PR introduce _any_ user-facing change?
No but an internal config is introduced - `spark.sql.adaptive.customCostEvaluatorClass` to allow custom implementation of `CostEvaluator`.
### How was this patch tested?
Added unit test in `AdaptiveQueryExecSuite.scala`.
Closes#32944 from c21/aqe-cost.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 044dddf288)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes two issues. One is that `to_json` doesn't support `map` types where value types are `year-month` interval types like:
```
spark-sql> select to_json(map('a', interval '1-2' year to month));
21/07/02 11:38:15 ERROR SparkSQLDriver: Failed in [select to_json(map('a', interval '1-2' year to month))]
java.lang.RuntimeException: Failed to convert value 14 (class of class java.lang.Integer) with the type of YearMonthIntervalType(0,1) to JSON.
```
The other issue is that even if the issue of `to_json` is resolved, `from_json` doesn't support to convert `year-month` interval string to JSON. So the result of following query will be `null`.
```
spark-sql> select from_json(to_json(map('a', interval '1-2' year to month)), 'a interval year to month');
{"a":null}
```
### Why are the changes needed?
There should be no reason why year-month intervals cannot used as map value types.
`CalendarIntervalTypes` can do it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33181 from sarutak/map-json-yminterval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 6474226852)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Make the ANSI flag part of expressions `Sum` and `Average`'s parameter list, instead of fetching it from the sessional SQLConf.
### Why are the changes needed?
For Views, it is important to show consistent results even the ANSI configuration is different in the running session. This is why many expressions like 'Add'/'Divide' making the ANSI flag part of its case class parameter list.
We should make it consistent for the expressions `Sum` and `Average`
### Does this PR introduce _any_ user-facing change?
Yes, the `Sum` and `Average` inside a View always behaves the same, independent of the ANSI model SQL configuration in the current session.
### How was this patch tested?
Existing UT
Closes#33186 from gengliangwang/sumAndAvg.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 51103cdcdd)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR uses 2 ideas to make `EquivalentExpressions` more efficient:
1. do not keep all the equivalent expressions, we only need a count
2. track the "height" of common subexpressions, to quickly do child-parent sort, and filter out non-child expressions in `addCommonExprs`
This PR also fixes several small bugs (exposed by the refactoring), please see PR comments.
### Why are the changes needed?
code cleanup and small perf improvement
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#33142 from cloud-fan/codegen.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit e6ce220690)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions.
However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless.
This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.
### Why are the changes needed?
AQE is default on now, we should make the perf better in the default case.
### Does this PR introduce _any_ user-facing change?
yes, a new config.
### How was this patch tested?
new tests
Closes#33172 from cloud-fan/aqe2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0c9c8ff569)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fixes decimal overflow issues for decimal average in ANSI mode, so that overflows throw an exception rather than returning null.
### Why are the changes needed?
Query:
```
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._
scala> spark.conf.set("spark.sql.ansi.enabled", true)
scala> val df = Seq(
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 1),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2),
| (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(mean("decNum"))
df2: org.apache.spark.sql.DataFrame = [avg(decNum): decimal(38,22)]
scala> df2.show(40,false)
```
Before:
```
+-----------+
|avg(decNum)|
+-----------+
|null |
+-----------+
```
After:
```
21/07/01 19:48:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 24)
java.lang.ArithmeticException: Overflow in sum of decimals.
at org.apache.spark.sql.errors.QueryExecutionErrors$.overflowInSumOfDecimalError(QueryExecutionErrors.scala:162)
at org.apache.spark.sql.errors.QueryExecutionErrors.overflowInSumOfDecimalError(QueryExecutionErrors.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:349)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:499)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:502)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#33177 from karenfeng/SPARK-35955.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR aims to add support for specifying a user defined initial state for arbitrary structured streaming stateful processing using [flat]MapGroupsWithState operator.
### Why are the changes needed?
Users can load previous state of their stateful processing as an initial state instead of redoing the entire processing once again.
### Does this PR introduce _any_ user-facing change?
Yes this PR introduces new API
```
def mapGroupsWithState[S: Encoder, U: Encoder](
timeoutConf: GroupStateTimeout,
initialState: KeyValueGroupedDataset[K, S])(
func: (K, Iterator[V], GroupState[S]) => U): Dataset[U]
def flatMapGroupsWithState[S: Encoder, U: Encoder](
outputMode: OutputMode,
timeoutConf: GroupStateTimeout,
initialState: KeyValueGroupedDataset[K, S])(
func: (K, Iterator[V], GroupState[S]) => Iterator[U])
```
### How was this patch tested?
Through unit tests in FlatMapGroupsWithStateSuite
Closes#33093 from rahulsmahadev/flatMapGroupsWithState.
Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR implemented the proposal per [design doc](https://docs.google.com/document/d/1RfFn2e9o_1uHJ8jFGsSakp-BZMizX1uRrJSybMe2a6M) for SPARK-35779.
### Why are the changes needed?
Spark supports dynamic partition filtering that enables reusing parts of the query to skip unnecessary partitions in the larger table during joins. This optimization has proven to be beneficial for star-schema queries which are common in the industry. Unfortunately, dynamic pruning is currently limited to partition pruning during joins and is only supported for built-in v1 sources. As more and more Spark users migrate to Data Source V2, it is important to generalize dynamic filtering and expose it to all v2 connectors.
Please, see the design doc for more information on this effort.
### Does this PR introduce _any_ user-facing change?
Yes, this PR adds a new optional mix-in interface for `Scan` in Data Source V2.
### How was this patch tested?
This PR comes with tests.
Closes#32921 from aokolnychyi/dynamic-filtering-wip.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Add a new configuration `spark.sql.timestampType`, which configures the default timestamp type of Spark SQL, including SQL DDL and Cast clause. Setting the configuration as `TIMESTAMP_NTZ` will use `TIMESTAMP WITHOUT TIME ZONE` as the default type while putting it as `TIMESTAMP_LTZ` will use `TIMESTAMP WITH LOCAL TIME ZONE`.
The default value of the new configuration is TIMESTAMP_LTZ, which is consistent with previous Spark releases.
### Why are the changes needed?
A new configuration for switching the default timestamp type as timestamp without time zone.
### Does this PR introduce _any_ user-facing change?
No, it's a new feature.
### How was this patch tested?
Unit test
Closes#33176 from gengliangwang/newTsTypeConf.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
unionByName does not supports struct having same col names but different sequence
```
val df1 = Seq((1, Struct1(1, 2))).toDF("a", "b")
val df2 = Seq((1, Struct2(1, 2))).toDF("a", "b")
val unionDF = df1.unionByName(df2)
```
it gives the exception
`org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c2:int,c1:int> <> struct<c1:int,c2:int> at the second column of the second table; 'Union false, false :- LocalRelation [_1#38, _2#39] +- LocalRelation _1#45, _2#46`
In this case the col names are same so this unionByName should have the support to check within in the Struct if col names are same it should not throw this exception and works.
after fix we are getting the result
```
val unionDF = df1.unionByName(df2)
scala> unionDF.show
+---+------+
| a| b|
+---+------+
| 1|{1, 2}|
| 1|{2, 1}|
+---+------+
```
### Why are the changes needed?
As per unionByName functionality based on name, does the union. In the case of struct this scenario was missing where all the columns names are same but sequence is different, so added this functionality.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added the unit test and also done the testing through spark shell
Closes#32972 from SaurabhChawla100/SPARK-35756.
Authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename the type name string of TimestampNTZType from "timestamp without time zone" to "timestamp_ntz".
### Why are the changes needed?
This is to make the column header shorter and simpler.
Snowflake and Flink uses similar approach:
https://docs.snowflake.com/en/sql-reference/data-types-datetime.htmlhttps://ci.apache.org/projects/flink/flink-docs-master/docs/dev/table/concepts/timezone/
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit tests
Closes#33173 from gengliangwang/reviseTypeName.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
As described in #32831, Spark has compatible issues when querying a view created by an
older version. The root cause is that Spark changed the auto-generated alias name. To avoid
this in the future, we could ask the user to specify explicit column names when creating
a view.
### Why are the changes needed?
Avoid compatible issue when querying a view
### Does this PR introduce _any_ user-facing change?
Yes. User will get error when running query below after this change
```
CREATE OR REPLACE VIEW v AS SELECT CAST(t.a AS INT), to_date(t.b, 'yyyyMMdd') FROM t
```
### How was this patch tested?
not yet
Closes#32832 from linhongliu-db/SPARK-35686-no-auto-alias.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
If the user creates a view in 2.4 and reads it in 3.1/3.2, there will be an incompatible schema issue.
So this PR adds a view ddl in the error message to prompt the user recreating the view to fix the
incompatible issue.
For example:
```sql
-- create view in 2.4
CREATE TABLE IF NOT EXISTS t USING parquet AS SELECT '1' as a, '20210420' as b"
CREATE OR REPLACE VIEW v AS SELECT CAST(t.a AS INT), to_date(t.b, 'yyyyMMdd') FROM t
-- select view in master
SELECT * FROM v
```
Then we will get below error:
```
cannot resolve '`to_date(spark_catalog.default.t.b, 'yyyyMMdd')`' given input columns: [a, to_date(b, yyyyMMdd)];
```
### Why are the changes needed?
Improve the error message
### Does this PR introduce _any_ user-facing change?
Yes, the error message will change
### How was this patch tested?
newly added test case
Closes#32831 from linhongliu-db/SPARK-35685-view-compatible.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR supports resolving star expressions in subqueries using outer query plans.
### Why are the changes needed?
Currently, Spark can only resolve star expressions using the inner query plan when resolving subqueries. Instead, it should also be able to resolve star expressions using the outer query plans.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests
Closes#32787 from allisonwang-db/spark-35618-resolve-star-in-subquery.
Lead-authored-by: allisonwang-db <allison.wang@databricks.com>
Co-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename TimestampWithoutTZType to TimestampNTZType
### Why are the changes needed?
The time name of `TimestampWithoutTZType` is verbose. Rename it as `TimestampNTZType` so that
1. it is easier to read and type.
2. As we have the function to_timestamp_ntz, this makes the names consistent.
3. We will introduce a new SQL configuration `spark.sql.timestampType` for the default timestamp type. The configuration values can be "TIMESTMAP_NTZ" or "TIMESTMAP_LTZ" for simplicity.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Run `git grep -i WithoutTZ` and there is no result.
And Ci tests.
Closes#33167 from gengliangwang/rename.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR group all exception messages in `sql/core/src/main/scala/org/apache/spark/sql`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32958 from beliefer/SPARK-35065.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
* Add a new rule `ExpandShufflePartitions` in AQE `queryStageOptimizerRules`
* Add a new config `spark.sql.adaptive.optimizeSkewsInRebalancePartitions.enabled` to decide if should enable the new rule
The new rule `OptimizeSkewInRebalancePartitions` only handle two shuffle origin `REBALANCE_PARTITIONS_BY_NONE` and `REBALANCE_PARTITIONS_BY_COL` for data skew issue. And re-use the exists config `ADVISORY_PARTITION_SIZE_IN_BYTES` to decide what partition size should be.
### Why are the changes needed?
Currently, we don't support expand partition dynamically in AQE which is not friendly for some data skew job.
Let's say if we have a simple query:
```
SELECT /*+ REBALANCE(col) */ * FROM table
```
The column of `col` is skewed, then some shuffle partitions would handle too much data than others.
If we haven't inroduced extra shuffle, we can optimize this case by expanding partitions in AQE.
### Does this PR introduce _any_ user-facing change?
Yes, a new config
### How was this patch tested?
Add test
Closes#32883 from ulysses-you/expand-partition.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support extracting date fields from timestamp without time zone, which includes:
- year
- month
- day
- year of week
- week
- day of week
- quarter
- day of month
- day of year
### Why are the changes needed?
Support basic operations for the new timestamp type.
### Does this PR introduce _any_ user-facing change?
No, the timestamp without time zone type is not released yet.
### How was this patch tested?
Unit tests
Closes#33156 from gengliangwang/dateField.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Support take into account day-time interval field in cast.
### Why are the changes needed?
To conform to the SQL standard.
### Does this PR introduce _any_ user-facing change?
An user can use `cast(str, DayTimeInterval(DAY, HOUR))`, for instance.
### How was this patch tested?
Added UT.
Closes#32943 from AngersZhuuuu/SPARK-35735.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support extracting hour/minute/second fields from timestamp without time zone values. In details, the following syntaxes are supported:
- extract [hour | minute | second] from timestampWithoutTZ
- date_part('[hour | minute | second]', timestampWithoutTZ)
- hour(timestampWithoutTZ)
- minute(timestampWithoutTZ)
- second(timestampWithoutTZ)
### Why are the changes needed?
Support basic operations for the new timestamp type.
### Does this PR introduce _any_ user-facing change?
No, the timestamp without time zone type is not release yet.
### How was this patch tested?
Unit test
Closes#33136 from gengliangwang/field.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Unifies exceptions thrown from Spark under a single base trait `SparkError`, which unifies:
- Error classes
- Parametrized error messages
- SQLSTATE, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Add-error-IDs-td31126.html.
### Why are the changes needed?
- Adding error classes creates a consistent label for exceptions, even as error messages change
- Creating a single, centralized source-of-truth for parametrized error messages improves auditing for error message quality
- Adding SQLSTATE helps ODBC/JDBC users receive standardized error codes
### Does this PR introduce _any_ user-facing change?
Yes, changes ODBC experience by:
- Adding error classes to error messages
- Adding SQLSTATE to TStatus
### How was this patch tested?
Unit tests, as well as local tests with PyODBC.
Closes#32850 from karenfeng/SPARK-34920.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a new ANSI type coercion rule: when getting a date field from a Timestamp column, cast the column as Date type.
This is Spark's current hack to make the implementation simple. In the default type coercion rules, the implicit cast rule does the work. However, The ANSI implicit cast rule doesn't allow converting Timestamp type as Date type, so we need to have this additional rule to make sure the date field extraction from Timestamp columns works.
### Why are the changes needed?
Fix a bug.
### Does this PR introduce _any_ user-facing change?
No, the new type coercion rules are not released yet.
### How was this patch tested?
Unit test
Closes#33138 from gengliangwang/fixGetDateField.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This patch refactors the evaluation of subexpressions.
There are two changes:
1. Clean up subexpression code after evaluation to avoid duplicate evaluation.
2. Evaluate all children subexpressions when evaluating a subexpression.
### Why are the changes needed?
Currently `subexpressionEliminationForWholeStageCodegen` return the gen-ed code of subexpressions. The caller simply puts the code into its code block. We need more flexible evaluation here. For example, for Filter operator's subexpression evaluation, we may need to evaluate particular subexpression for one predicate. Current approach cannot satisfy the requirement.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32980 from viirya/subexpr-eval.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR removes order by if the maximum number of rows less than or equal to 1. For example:
```scala
spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost")
```
Before this pr:
```
== Optimized Logical Plan ==
Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
+- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
```
After this pr:
```
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
```
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#33100 from wangyum/SPARK-35906.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Replace the type collection `AllTimestampTypes` with the new data type `AnyTimestampType`
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33115#discussion_r659866760, it is more convenient to have a new data type "AnyTimestampType" instead of using type collection `AllTimestampTypes`:
1. simplify the pattern match
2. In the default type coercion rules, when implicit casting a type to a TypeCollection type, Spark chooses the first convertible data type as the result. If we are going to make the default timestamp type configurable, having AnyTimestampType is better
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes#33129 from gengliangwang/allTimestampTypes.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Support the following operations:
- TimestampWithoutTZ - Date
- Date - TimestampWithoutTZ
- TimestampWithoutTZ - Timestamp
- Timestamp - TimestampWithoutTZ
- TimestampWithoutTZ - TimestampWithoutTZ
For subtraction between `TimestampWithoutTZ` and `Timestamp`, the `Timestamp` column is cast as TimestampWithoutTZType.
### Why are the changes needed?
Support basic subtraction among Date/Timestamp/TimestampWithoutTZ.
### Does this PR introduce _any_ user-facing change?
No, the timestamp without time zone type is not release yet.
### How was this patch tested?
Unit tests
Closes#33115 from gengliangwang/subtractTimestampWithoutTz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR addresses post-review comments on PR #33096:
- removes `private[sql]` modifier
- removes the option to pass a resolver to simplify the API
### Why are the changes needed?
These changes are needed to simply the utility API.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33120 from aokolnychyi/spark-35899-follow-up.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that field names of structs generated by `arrays_zip` function could be unexpectedly re-written by analyzer/optimizer.
Here is an example.
```
val df = sc.parallelize(Seq((Array(1, 2), Array(3, 4)))).toDF("a1", "b1").selectExpr("arrays_zip(a1, b1) as zipped")
df.printSchema
root
|-- zipped: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- a1: integer (nullable = true) // OK. a1 is expected name
| | |-- b1: integer (nullable = true) // OK. b1 is expected name
df.explain
== Physical Plan ==
*(1) Project [arrays_zip(_1#3, _2#4) AS zipped#12] // Not OK. field names are re-written as _1 and _2 respectively
df.write.parquet("/tmp/test.parquet")
val df2 = spark.read.parquet("/tmp/test.parquet")
df2.printSchema
root
|-- zipped: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true) // Not OK. a1 is expected but got _1
| | |-- _2: integer (nullable = true) // Not OK. b1 is expected but got _2
```
This issue happens when aliases are eliminated by `AliasHelper.replaceAliasButKeepName` or `AliasHelper.trimNonTopLevelAliases` called via analyzer/optimizer
b89cd8d75a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (L883)b89cd8d75a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L3759)
I investigated functions which can be affected this issue but I found only `arrays_zip` so far.
To fix this issue, this PR changes the definition of `ArraysZip` to retain field names to avoid being re-written by analyzer/optimizer.
### Why are the changes needed?
This is apparently a bug.
### Does this PR introduce _any_ user-facing change?
No. After this change, the field names are no longer re-written but it should be expected behavior for users.
### How was this patch tested?
New tests.
Closes#33106 from sarutak/arrays-zip-retain-names.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to migrate the following `ALTER TABLE ... CHANGE COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).
### Does this PR introduce _any_ user-facing change?
After this PR, the above `ALTER TABLE ... CHANGE COLUMN` commands will have a consistent resolution behavior.
### How was this patch tested?
Updated existing tests.
Closes#33113 from imback82/alter_change_column.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Implement V2 execution node `ShowCreateTableExec` similar to V1 `ShowCreateTableCommand`
2. No support `SHOW CREATE TABLE XXX AS SERDE`
### Why are the changes needed?
[SPARK-33898](https://issues.apache.org/jira/browse/SPARK-33898)
### Does this PR introduce _any_ user-facing change?
Yes. Support the user to execute `SHOW CREATE TABLE` command in V2 table
### How was this patch tested?
Add two UT tests
1. ./dev/scalastyle
2. run test DataSourceV2SQLSuite
Closes#32931 from Peng-Lei/SPARK-33898.
Lead-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
[SPARK-35728](https://issues.apache.org/jira/browse/SPARK-35728): Add test case to check multiply/divide of day-time
intervals of any fields by numeric
[SPARK-35778](https://issues.apache.org/jira/browse/SPARK-35778): Add test case to check multiply/divide of year-month intervals of any fields by numeric
### Why are the changes needed?
Improve test coverage
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add ut tests
Lead-authored-by: Lei Peng <peng.8leigmail.com>
Co-authored-by: AngersZhuuuu <angers.zhugmail.com>
Closes#33080 from Peng-Lei/SPARK-35728-35778.
Lead-authored-by: PengLei <peng.8lei@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR group exception messages in sql/catalyst/src/main/scala/org/apache/spark/sql (except catalyst)
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce any user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32916 from dgd-contributor/SPARK-35064_catalyst_group_error.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch fixes `PromotePrecision` where it overwrites `genCode` where subexpression elimination should happen.
### Why are the changes needed?
`PromotePrecision` overwrites `genCode` where subexpression elimination should happen. So if it is most top expression of a subexpression, it is never replaced.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added test.
Closes#33103 from viirya/fix-precision.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
1. Make `RebalancePartitions` extend `RepartitionOperation`.
2. Make `CollapseRepartition` support `RebalancePartitions`.
### Why are the changes needed?
`CollapseRepartition` can optimize `RebalancePartitions` if possible.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#33099 from wangyum/SPARK-35904.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Support the following operation:
- TimestampWithoutTZ - Year-Month interval
The following operation is actually supported in https://github.com/apache/spark/pull/33076/. This PR is to add end-to-end tests for them:
- TimestampWithoutTZ - Calendar interval
- TimestampWithoutTZ - Daytime interval
### Why are the changes needed?
Support subtracting all 3 interval types from a timestamp without time zone
### Does this PR introduce _any_ user-facing change?
No, the timestamp without time zone type is not release yet.
### How was this patch tested?
Unit tests
Closes#33086 from gengliangwang/subtract.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds a utility to convert public connector expressions to Catalyst expressions.
Notable differences:
- Switched to `QueryCompilationErrors` from an explicit `AnalysisException`.
- Decoupled the resolving logic for v2 references into separate methods to use in other places.
### Why are the changes needed?
These changes are needed as more and more places require this logic and it is better to implement it in a single place.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33096 from aokolnychyi/spark-35899.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Supprot the following operations:
- TimestampWithoutTZ + Calendar interval
- TimestampWithoutTZ + Year-Month interval
- TimestampWithoutTZ + Daytime interval
### Why are the changes needed?
Support basic '+' operator for timestamp without time zone type.
### Does this PR introduce _any_ user-facing change?
No, the timestamp without time zone type is not release yet.
### How was this patch tested?
Unit tests
Closes#33076 from gengliangwang/addForNewTS.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes to migrate the following `ALTER TABLE ... RENAME COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).
### Does this PR introduce _any_ user-facing change?
After this PR, the above `ALTER TABLE ... RENAME COLUMN` commands will have a consistent resolution behavior.
### How was this patch tested?
Updated existing tests.
Closes#33066 from imback82/alter_rename.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR change the behavior of unit-to-unit interval syntax to prohibit the case that the same units are specified for FROM and TO.
### Why are the changes needed?
For ANSI compliance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#33057 from sarutak/prohibit-unit-pattern.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR changes the unionByName with null filling logic to append new nested struct fields from the right side of the union to the schema versus sorting fields alphabetically. It removes the need to use UpdateField expressions, and just directly projects new nested structs from each side of the union with the correct schema. This changes the union'd schema from being alphabetically sorted previously to now "left dominant", where the fields from the left side of the union are included and then the missing ones from the right are added in the same order found originally.
### Why are the changes needed?
Certain nested structs would cause unionByName with null filling to error out due to part of the logic for rewriting the expression tree to sort the structs.
### Does this PR introduce _any_ user-facing change?
Yes, nested struct fields will be in a different order after unionByName with null filling than before, though shouldn't cause much effective difference.
### How was this patch tested?
Updated existing tests based on the new StructField ordering and added a new test for the case that was broken originally.
Closes#33040 from Kimahriman/union-by-name-struct-order.
Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to migrate the following `ALTER TABLE ... DROP COLUMNS` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).
### Does this PR introduce _any_ user-facing change?
After this PR, the above `ALTER TABLE ... DROP COLUMNS` commands will have a consistent resolution behavior.
### How was this patch tested?
Updated existing tests.
Closes#32854 from imback82/alter_alternative.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Current Literal.create(data, dataType) for Period to YearMonthIntervalType and Duration to DayTimeIntervalType is not correct.
if data type is Period/Duration, it will create converter of default YearMonthIntervalType/DayTimeIntervalType, then the result is not correct, this pr fix this bug.
### Why are the changes needed?
Fix bug when use Literal.create()
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#33056 from AngersZhuuuu/SPARK-35871.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Revert 8a1995f936
### Why are the changes needed?
The merged test doesn't check different interval fields, actually. Need to apply this https://github.com/apache/spark/pull/33056 first of all.
### Does this PR introduce _any_ user-facing change?
No. This is tests.
### How was this patch tested?
By existing GAs.
Closes#33060 from MaxGekk/revert-Peng-Lei-SPARK-35728.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Revert 3904c0edba
### Why are the changes needed?
The merged test doesn't check different interval fields, actually. Need to apply this https://github.com/apache/spark/pull/33056 first of all.
### Does this PR introduce _any_ user-facing change?
No. This is tests.
### How was this patch tested?
By existing GAs.
Closes#33059 from MaxGekk/revert-Peng-Lei-SPARK-35778.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Check multiply/divide of year-month intervals of any fields by numeric.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Expanded existed test cases.
Closes#33051 from Peng-Lei/SPARK-35778.
Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
1. The testcase is just cover the DayTimeIntervalType() */ numeric
2. Add testcase for following intervals */ numeric:
INTERVAL DAY
INTERVAL DAY TO HOUR
INTERVAL DAY TO MINUTE
INTERVAL HOUR
INTERVAL HOUR TO MINUTE
INTERVAL HOUR TO SECOND
INTERVAL MINUTE
INTERVAL MINUTE TO SECOND
INTERVAL SECOND
### Why are the changes needed?
Add testcase coverage.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existed testcase
Closes#33014 from Peng-Lei/SPARK-35728.
Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
* Add a new repartition operator `RebalanceRepartition`.
* Support a new hint `REBALANCE`
After this patch, user can run this query:
```sql
SELECT /*+ REBALANCE(c) */ * FROM t
```
### Why are the changes needed?
Add a new hint to distingush if we can optimize it safely.
This new hint can let AQE optimize with `CustomShuffleReaderExec` safely. Currently, AQE can only coalesce shuffle partitions but can not expand shuffle partitions due to the semantics of output partitioning.
Let's say we have a query:
```sql
SELECT /*+ REPARTITION(col) */ * FROM t
```
AQE can not expand the shuffle partitions even if `col` is skewed because expanding shuffle partitions will break the hashed output paritioning of `RepartitionByExpression`. But if the query is use`REPARTITION_BY_AQE`, AQE can optimize it without considering the semantics of output partitioning.
### Does this PR introduce _any_ user-facing change?
Yes, a new hint.
### How was this patch tested?
Add test.
Closes#32932 from ulysses-you/SPARK-35786.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… sum of the digits is greater than 38
### What changes were proposed in this pull request?
Since Spark 3.1.1, NULL is returned when casting a string with many decimal places to a decimal type. If the sum of the digits before and after the decimal point is less than 39, a value is returned. From 39 digits, however, NULL is returned.
This worked until Spark 3.0.X.
Code to reproduce:
A string with 2 decimal places in front of the decimal point and 37 decimal places after the decimal point returns null
```
val data = Seq(
"28.9259999999999983799625624669715762138",
"28.925999999999998379962562466971576213",
"2.9259999999999983799625624669715762138"
)
val df = data.toDF("num")
df.withColumn("numConverted", col("num").cast("decimal(38, 5)")).show()
```
before this pull request, the result is
+----------------------+---------------+
| num |numConverted|
+----------------------+---------------+
|28.92599999999999...| null|
|28.92599999999999...| 28.92600|
|2.925999999999998...| 2.92600|
+----------------------+---------------+
the correct result should be
+----------------------+---------------+
| num |numConverted|
+----------------------+---------------+
|28.92599999999999...| 28.92600|
|28.92599999999999...| 28.92600|
|2.925999999999998...| 2.92600|
+----------------------+---------------+
The problem occur since https://issues.apache.org/jira/browse/SPARK-32706, it because the fast fail is checking precision length, which should only check the whole number part length of the input value, not the precision length
### Why are the changes needed?
correctness
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
test added
Closes#33011 from dgd-contributor/SPARK-35841_castStringToDecimalTypeError.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Simplify the condition code which is introduced by [SPARK-35282](https://issues.apache.org/jira/browse/SPARK-35282).
### Why are the changes needed?
Reduce the code size and make code more readable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass CI
Closes#33046 from ulysses-you/simplify-shj.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support take into account year-month interval field in cast
##### Rule cast to target YearMonthIntervalType
| string | demo | strict target type | months |
|---|---|---|---|
| [+\|-]y-m | 1-1 | YearMonthIntervalType(YEAR. MONTH) | 13 |
| [+\|-]y| 1 | YearMonthIntervalType(YEAR. YEAR) | 12 |
| [+\|-]m | 1 | YearMonthIntervalType(MONTH. MONTH) | 1 |
| INTERVAL [+\|-]'[+\|-]y-m' YEAR TO MONTH | interval '1-1' year to month | YearMonthIntervalType(YEAR. MONTH) | 13 |
| INTERVAL [+\|-]'[+\|-]m' MONTH | interval '1' month | YearMonthIntervalType(MONTH. MONTH) | 1 |
| INTERVAL [+\|-]'[+\|-]y' YEAR | interval '1' year | YearMonthIntervalType(YEAR.YEAR) | 12 |
### Why are the changes needed?
Support take into account year-month interval field in cast
### Does this PR introduce _any_ user-facing change?
user can use `cast(str, YearMonthInterval(YEAR, YEAR))` etc
### How was this patch tested?
Added UT
Closes#32940 from AngersZhuuuu/SPARK-35768.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
We use `DateAdd` to impl `DateType` `+`/`-` `INTERVAL DAY`
### Why are the changes needed?
To improve the impl of `DateType` `+`/`-` `INTERVAL DAY`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add ut test
Closes#33033 from Peng-Lei/SPARK-35852.
Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Extend the `TransposeWindow` rule to transpose `Window` nodes, that have `Project` between them.
### Why are the changes needed?
The analyzer will turn a `dataset.withColumn("colName", expressionWithWindowFunction)` method call to a `Project - Window - Project` chain in the logical plan. When this method is called multiple times in a row, then the projects can block the `Window` nodes from being transposed by the current `TransposeWindow` rule.
TPCDS q47 and q57 are also improved by this.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#31980 from tanelk/SPARK-34807_transpose_window.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
Make the ANSI flag part of expression `Cast`'s parameter list, instead of fetching it from the sessional SQLConf.
### Why are the changes needed?
For Views, it is important to show consistent results even the ANSI configuration is different in the running session. This is why many expressions like 'Add'/'Divide' making the ANSI flag part of its case class parameter list.
We should make it consistent for the expression `Cast`
### Does this PR introduce _any_ user-facing change?
Yes, the `Cast` inside a View always behaves the same, independent of the ANSI model SQL configuration in the current session.
### How was this patch tested?
Existing UT
Closes#33027 from gengliangwang/ansiFlagInCast.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Support UpCast between different field of YearMonthIntervalType/DayTimeIntervalType
### Why are the changes needed?
Since in our encoder we handle Period/Duration as default YearMonthIntervalType/DayTimeIntervalType, when we use udf to handle this type, it will upcast all type of YearMonthIntervalType/DayTimeIntervalType to default YearMonthIntervalType/DayTimeIntervalType, so we need to support this.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added Ut
Closes#33035 from AngersZhuuuu/SPARK-35860.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR unifies reuse map data structures in non-AQE and AQE rules to a simple `Map[<canonicalized plan>, <plan>]` based on the discussion here: https://github.com/apache/spark/pull/28885#discussion_r655073897
### Why are the changes needed?
The proposed `Map[<canonicalized plan>, <plan>]` is simpler than the currently used `Map[<schema>, ArrayBuffer[<plan>]]` in `ReuseMap`/`ReuseExchangeAndSubquery` (non-AQE) and consistent with the `ReuseAdaptiveSubquery` (AQE) subquery reuse rule.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#33021 from peter-toth/SPARK-35855-unify-reuse-map-data-structures.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current OuterReference resolution is a bit weird: when the outer plan has more than one child, it resolves OuterReference from the output of each child, one by one, left to right.
This is incorrect in the case of join, as the column name can be ambiguous if both left and right sides output this column.
This PR fixes this bug by resolving OuterReference with `outerPlan.resolveChildren`, instead of something like `outerPlan.children.foreach(_.resolve(...))`
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
The problem only occurs in join, and join condition doesn't support correlated subquery yet. So this PR only improves the error message. Before this PR, people see
```
java.lang.UnsupportedOperationException
Cannot generate code for expression: outer(t1a#291)
```
### How was this patch tested?
a new test
Closes#33004 from cloud-fan/outer-ref.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that `IntervalUtils.toDayTimeIntervalString` doesn't consider the case that a day-time interval type is casted as another day-time interval type.
if data of `interval day to second` is casted as `interval hour to second`, the value of the day is multiplied by 24 and added to the value of hour. For example, `INTERVAL '1 2' DAY TO HOUR` will be `INTERVAL '26' HOUR` if it's casted.
If this behavior is intended, it should be stringified as `INTERVAL '26' HOUR` but currently, it will be `INTERVAL '2' HOUR`
### Why are the changes needed?
t's a bug if the behavior of cast is intended.
### Does this PR introduce _any_ user-facing change?
No, because this feature is not released yet.
### How was this patch tested?
Modified the tests added in SPARK-35734 (#32891)
Closes#33031 from sarutak/fix-toDayTimeIntervalString.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
There are a few test cases that are supposed to be in CastSuiteBase instead of CastSuite:
- SPARK-35112: Cast string to day-time interval
- SPARK-35111: Cast string to year-month interval
- SPARK-35820: Support cast DayTimeIntervalType in different fields
- SPARK-35819: Support cast YearMonthIntervalType in different fields
This PR is to move them to CastSuiteBase. Also, it adds comments for the scope of CastSuiteBase/CastSuite/AnsiCastSuiteBase.
### Why are the changes needed?
Increase test coverage so that we can test the casting under ANSI mode.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes#33022 from gengliangwang/moveTest.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
When SQL function `to_timestamp_ntz` has invalid format pattern input, throw a runtime exception with hints for the valid patterns, instead of throwing an upgrade exception with suggestions to use legacy formatters.
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/32995/files#r655148980, there is an error message saying
"You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyy-MM-dd GGGGG' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0"
This is not true for function to_timestamp_ntz, which only uses the Iso8601TimestampFormatter and added since Spark 3.2. We should improve it.
### Does this PR introduce _any_ user-facing change?
No, the new SQL function is not released yet.
### How was this patch tested?
Unit test
Closes#33019 from gengliangwang/improveError.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
What changes were proposed in this pull request?
1. Change the return value type from DayTimeIntervalType(DAY, SECOND) to DayTimeIntervalType(DAY, DAY) of SubtractDates.
Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-35727
Does this PR introduce any user-facing change?
no
How was this patch tested?
existed ut test
Closes#32999 from Peng-Lei/SPARK-35727.
Lead-authored-by: Lei Peng <peng.8lei@gmail.com>
Co-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: Peng-Lei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Implement new SQL function: `to_timestamp_ntz`.
The syntax is similar to the built-in function `to_timestamp`:
```
to_timestamp_ntz ( <date_expr> )
to_timestamp_ntz ( <timestamp_expr> )
to_timestamp_ntz ( <string_expr> [ , <format> ] )
```
The naming is from snowflake: https://docs.snowflake.com/en/sql-reference/functions/to_timestamp.html
### Why are the changes needed?
Adds a new SQL function to create a literal/column of timestamp without time zone.
It's convenient for both end-users and developers.
### Does this PR introduce _any_ user-facing change?
Yes, a new SQL function `to_timestamp_ntz`.
### How was this patch tested?
Unit tests
Closes#32995 from gengliangwang/toTimestampNtz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Extend the `CollapseWindow` rule to collapse `Window` nodes, that have `Project` between them.
### Why are the changes needed?
The analyzer will turn a `dataset.withColumn("colName", expressionWithWindowFunction)` method call to a `Project - Window - Project` chain in the logical plan. When this method is called multiple times in a row, then the projects can block the `Window` nodes from being collapsed by the current `CollapseWindow` rule.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#31677 from tanelk/SPARK-34565_collapse_windows.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to add 2 new methods that accept one field and produce either `YearMonthIntervalType` or `DayTimeIntervalType`.
### Why are the changes needed?
To improve code maintenance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By existing test suites.
Closes#32997 from MaxGekk/ansi-interval-types-single-field.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support Cast between different field DayTimeIntervalType
### Why are the changes needed?
Make user convenient to get different field DayTimeIntervalType
### Does this PR introduce _any_ user-facing change?
User can call cast DayTimeIntervalType(DAY, SECOND) to DayTimeIntervalType(DAY, MINUTE) etc
### How was this patch tested?
Added UT
Closes#32975 from AngersZhuuuu/SPARK-35820.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes error message shown when changing a column type to year-month/day-time interval type is attempted.
### Why are the changes needed?
It's for consistent behavior.
Updating column types to interval types are prohibited for V2 source tables.
So, if we attempt to update the type of a column to the conventional interval type, an error message like `Error in query: Cannot update <table> field <column> to interval type;`.
But, for year-month/day-time interval types, another error message like `Error in query: Cannot update <table> field <column>:<type> cannot be cast to interval year;`.
You can reproduce with the following procedure.
```
$ bin/spark-sql
spark-sql> SET spark.sql.catalog.mycatalog=<a catalog implementation class>;
spark-sql> CREATE TABLE mycatalog.t1(c1 int) USING <V2 datasource implementation class>;
spark-sql> ALTER TABLE mycatalog.t1 ALTER COLUMN c1 TYPE interval year to month;
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified an existing test.
Closes#32978 from sarutak/err-msg-interval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue that `IntervalUtils.toYearMonthIntervalString` doesn't consider the case that year-month interval type is casted as month interval type.
If a year-month interval data is casted as month interval, the value of the year is multiplied by `12` and added to the value of month. For example, `INTERVAL '1-2' YEAR TO MONTH` will be `INTERVAL '14' MONTH` if it's casted.
If this behavior is intended, it's stringified to be `'INTERVAL 14' MONTH` but currently, it will be `INTERVAL '2' MONTH`
### Why are the changes needed?
It's a bug if the behavior of cast is intended.
### Does this PR introduce _any_ user-facing change?
No, because this feature is not released yet.
### How was this patch tested?
Modified the tests added in SPARK-35771 (#32924).
Closes#32982 from sarutak/fix-toYearMonthIntervalString.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support Cast between different field YearMonthIntervalType
### Why are the changes needed?
Make user convenient to get different field YearMonthIntervalType
### Does this PR introduce _any_ user-facing change?
User can call cast YearMonthIntervalType(YEAR, MONTH) to YearMonthIntervalType(YEAR, YEAR) etc
### How was this patch tested?
Added UT
Closes#32974 from AngersZhuuuu/SPARK-35819.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support truncate java.time.Duration by fields of day-time interval type.
### Why are the changes needed?
To respect fields of the target day-time interval types.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32950 from AngersZhuuuu/SPARK-35726.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This patch proposes to add an internal config for ignoring metadata of `FileStreamSink` when reading the output path.
### Why are the changes needed?
`FileStreamSink` produces a metadata directory which logs output files per micro-batch. When we read from the output path, Spark will look at the metadata and ignore other files not in the log.
Normally it works well. But for some use-cases, we may need to ignore the metadata when reading the output path. For example, when we change the streaming query and must to run it with new checkpoint directory, we cannot use previous metadata. If we create a new metadata too, when we read the output path later in Spark, Spark only reads the files listed in the new metadata. The files written before we use new checkpoint and metadata are ignored by Spark.
Although seems we can output to different output directory every time, but it is bad idea as we will produce many directories unnecessarily.
We need a config for ignoring the metadata of `FileStreamSink` when reading the output path.
### Does this PR introduce _any_ user-facing change?
Added a config for ignoring metadata of FileStreamSink when reading the output.
### How was this patch tested?
Unit tests.
Closes#32702 from viirya/ignore-metadata.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR improves `Distinct` statistics estimation by rewrite it to `Aggregate`.
### Why are the changes needed?
1. The current implementation will lack column statistics.
2. Some rules before the `ReplaceDistinctWithAggregate` may use it. For example: https://github.com/apache/spark/pull/31113/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR1808
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#32291 from wangyum/SPARK-35185.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
Change `AQEPropagateEmptyRelation` from `transformUp` to `transformUpWithPruning
### Why are the changes needed?
To avoid unnecessary iteration during AQE optimizer.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass CI.
Closes#32742 from ulysses-you/aqe-transformUpWithPruning.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Support truncate java.time.Period by fields of year-month interval type
### Why are the changes needed?
To follow the SQL standard and respect the field restriction of the target year-month type.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32945 from AngersZhuuuu/SPARK-35769.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Extend the Cast expression and support StringType in casting to TimestampWithoutTZType.
Closes#32898
### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32936 from gengliangwang/castStringToTswtz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up for SPARK-34382. It refines the lateral join syntax to only allow the LATERAL keyword to be in front of subqueries, instead of all `relationPriamry`. For example, `SELECT * FROM t1, LATERAL t2` should not be allowed.
### Why are the changes needed?
To be consistent with Postgres.
### Does this PR introduce _any_ user-facing change?
Yes. After this PR, the LATERAL keyword can only be in front of subqueries.
```scala
sql("SELECT * FROM t1, LATERAL t2")
org.apache.spark.sql.catalyst.parser.ParseException:
LATERAL can only be used with subquery(line 1, pos 26)
== SQL ==
select * from t1, lateral t2
--------------------------^^^
```
### How was this patch tested?
New unit tests.
Closes#32937 from allisonwang-db/spark-35789-lateral-join-parser.
Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`RelationConversions` is actually an optimization rule while it's executed in the analysis phase.
For view, it's designed to only capture semantic configs, so we should ignore the optimization
configs that will be used in the analysis phase.
This PR also fixes the issue that view resolution will always use the default value for uncaptured config
### Why are the changes needed?
Bugfix
### Does this PR introduce _any_ user-facing change?
Yes, after this PR view resolution will respect the values set in the current session for the below configs
```
"spark.sql.hive.convertMetastoreParquet"
"spark.sql.hive.convertMetastoreOrc"
"spark.sql.hive.convertInsertingPartitionedTable"
"spark.sql.hive.convertMetastoreCtas"
```
### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveSQLViewSuite"
```
Closes#32941 from linhongliu-db/SPARK-35792-ignore-convert-configs.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support Parse DayTimeIntervalType from JSON
### Why are the changes needed?
this will allow to store day-second intervals as table columns into Hive external catalog.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32930 from AngersZhuuuu/SPARK-35732.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Providing a new function make_dt_interval to construct DayTimeIntervalType value
### Why are the changes needed?
As the JIRA described, we should provide a function to construct DayTimeIntervalType value
### Does this PR introduce _any_ user-facing change?
Yes, a new make_dt_interval function provided
### How was this patch tested?
Updated UTs, manual testing
Closes#32601 from copperybean/work.
Authored-by: copperybean <copperybean.zhang@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Parse YearMonthIntervalType from JSON.
### Why are the changes needed?
This will allow to store year-month intervals as table columns into Hive external catalog.
### Does this PR introduce _any_ user-facing change?
People can store year-month interval types as json string.
### How was this patch tested?
Added UT.
Closes#32929 from AngersZhuuuu/SPARK-35770.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Currently, `ResolveAggregateFunctions` is a complicated rule that recursively calls the entire analyzer to resolve aggregate functions in parent nodes of aggregate. It's kind of necessary as we need to do many things to identify the aggregate function and push it down to the aggregate node: resolve columns as if they are in the aggregate node, resolve functions, apply type coercion, etc. However, this is overly complicated and it's hard to fully understand how the resolution is done there. It also leads to hacks such as the [char/varchar hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2396-L2401), [subquery hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2274-L2277), [grouping function hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2465-L2467), etc.
This PR simplifies the `ResolveAggregateFunctions` rule and clarifies the resolution logic. To resolve aggregate functions/grouping columns in HAVING, ORDER BY and `df.where`, we expand the aggregate node below to output these required aggregate functions/grouping columns. In details, when resolving an expression from the parent of an aggregate node:
1. try to resolve columns with `agg.child` and wrap the result with `TempResolvedColumn`.
2. try to resolve subqueries with `agg.child`
3. if the expression is not resolved, return it and wait for other rules to resolve it, such as resolve functions, type coercions, etc.
4. if the expression is resolved, we transform it and push aggregate functions/grouping columns into the aggregate node below.
4.1 the expression may already present in `agg.aggregateExpressions`, we can simply replace the expression with attr ref.
4.2 if a `TempResolvedColumn` is neither inside an aggregate function, or wrap a grouping column, turn it back to an `UnresolvedAttribute`
5. after the main resolution batch, remove all `TempResolvedColumn` and turn them back to `UnresolvedAttribute`.
### Why are the changes needed?
Code cleanup
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing test
Closes#32470 from cloud-fan/agg2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to format year-month interval to strings using the start and end fields of `YearMonthIntervalType`.
### Why are the changes needed?
Currently, they are ignored, and any `YearMonthIntervalType` is formatted as `INTERVAL YEAR TO MONTH`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#32924 from sarutak/year-month-interval-format.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR extends the parser rules to be able to parse the following types:
* INTERVAL YEAR
* INTERVAL YEAR TO MONTH
* INTERVAL MONTH
### Why are the changes needed?
For ANSI compliance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New assertion.
Closes#32922 from sarutak/parse-any-year-month.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR improves `Repartition` and `RepartitionByExpr` statistics estimation using child statistics.
### Why are the changes needed?
The current implementation will missing column stat. For example:
```sql
CREATE TABLE t1 USING parquet AS SELECT id % 10 AS key FROM range(100);
ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS;
set spark.sql.cbo.enabled=true;
EXPLAIN COST SELECT key FROM (SELECT key FROM t1 DISTRIBUTE BY key) t GROUP BY key;
```
Before this PR:
```
== Optimized Logical Plan ==
Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=1600.0 B)
+- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100)
+- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100)
```
After this PR:
```
== Optimized Logical Plan ==
Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=160.0 B, rowCount=10)
+- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100)
+- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#32309 from wangyum/SPARK-35203.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/31964
We should only quote the column name when nested column predicate pushdown is enabled, otherwise the data source side may not have the logic to parse the quoted column name and fail. This is not a problem before #31964 , as we don't quote the column name if there is no dot in the name. But #31964 changed it.
### Why are the changes needed?
fix a query failure
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new test
Closes#32807 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Extend `YearMonthIntervalType` to support interval fields. Valid interval field values:
- 0 (YEAR)
- 1 (MONTH)
After the changes, the following year-month interval types are supported:
1. `YearMonthIntervalType(0, 0)` or `YearMonthIntervalType(YEAR, YEAR)`
2. `YearMonthIntervalType(0, 1)` or `YearMonthIntervalType(YEAR, MONTH)`. **It is the default one**.
3. `YearMonthIntervalType(1, 1)` or `YearMonthIntervalType(MONTH, MONTH)`
Closes#32825
### Why are the changes needed?
In the current implementation, Spark supports only `interval year to month` but the SQL standard allows to specify the start and end fields. The changes will allow to follow ANSI SQL standard more precisely.
### Does this PR introduce _any_ user-facing change?
Yes but `YearMonthIntervalType` has not been released yet.
### How was this patch tested?
By existing test suites.
Closes#32909 from MaxGekk/add-fields-to-YearMonthIntervalType.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Add a new function to support construct YearMonthIntervalType from integral fields
### Why are the changes needed?
Add a new function to support construct YearMonthIntervalType from integral fields
### Does this PR introduce _any_ user-facing change?
Yea user can use `make_ym_interval` to construct TearMonthIntervalType from years/months integral fields
### How was this patch tested?
Added UT
Closes#32645 from AngersZhuuuu/SPARK-35129.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Currently, the file CastSuite.scala becomes big: 2000 lines, 2 base classes, 4 test suites.
In my previous work of Timestamp without time zone, I planned to put new test cases in CastSuiteBase, but they were accidentally added in AnsiCastSuiteBase.
This PR is to break the file down into 3 files. It also moves the test cases about timestamp without time zone to the right base class.
### Why are the changes needed?
Make development and review easier.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes#32918 from gengliangwang/refactorCastSuite.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Extended `RemoveRedundantAggregates` to remove deduplicating aggregations before aggregations that ignore duplicates.
### Why are the changes needed?
Performance imporovement.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Extending existing UT
Closes#32904 from tanelk/SPARK-33122_followup2_distinct_agg.
Authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR group exception messages in `sql/core/src/main/scala/org/apache/spark/sql/execution/streaming`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32880 from beliefer/SPARK-35056.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to override the typeName() method in TimestampWithoutTZType, and assign it a name according to the ANSI SQL standard
![image](https://user-images.githubusercontent.com/1097932/122013859-2cf50680-cdf1-11eb-9fcd-0ec1b59fb5c0.png)
### Why are the changes needed?
To improve Spark SQL user experience, and have readable types in error messages.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32915 from gengliangwang/typename.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Currently, there are some expressions that overwrite `semanticEquals`, which makes it not symmetrical. Ideally, expressions should overwrite `canonicalized` instead of `semanticEquals`.
This PR marks `semanticEquals` as final, and implement `canonicalized` for the few expressions that overwrote `semanticEquals` before.
### Why are the changes needed?
To avoid subtle bugs (I haven't found a real bug yet).
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
a new test
Closes#32885 from cloud-fan/attr.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup PR for SPARK-35736(#32893) and SPARK-35737(#32892).
This PR moves a common logic to `object DayTimeIntervalType`.
That logic is like `val strToFieldIndex = DayTimeIntervalType.dayTimeFields.map(i => DayTimeIntervalType.fieldToString(i) -> (i).toMap`, a `Map` which maps each time unit to the corresponding day-time field index.
### Why are the changes needed?
That logic appeared in the change in SPARK-35736 and SPARK-35737 so it can be a common logic and it's better to avoid the similar logic scattered.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32905 from sarutak/followup-SPARK-35736-35737.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes `StreamingJoinHelper` to be able to handle day-time interval.
### Why are the changes needed?
In the current master, `StreamingJoinHelper.getStateValueWatermark` can't handle conditions which contain day-time interval literals.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New assertions added to `StreamingJoinHlelperSuite`.
Closes#32896 from sarutak/streamingjoinhelper-daytime.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR add a feature which parse day-time interval literals to tightest type.
### Why are the changes needed?
To comply with the ANSI behavior.
For example, `INTERVAL '10 20:30' DAY TO MINUTE` should be parsed as `DayTimeIntervalType(DAY, MINUTE)` but not as `DayTimeIntervalType(DAY, SECOND)`.
### Does this PR introduce _any_ user-facing change?
No because `DayTimeIntervalType` will be introduced in `3.2.0`.
### How was this patch tested?
New tests.
Closes#32892 from sarutak/tight-daytime-interval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adda a feature which allow the parser parse any day-time interval types in SQL.
### Why are the changes needed?
To comply with ANSI standard, we additionally need to support the following types.
* INTERVAL DAY
* INTERVAL DAY TO HOUR
* INTERVAL DAY TO MINUTE
* INTERVAL HOUR
* INTERVAL HOUR TO MINUTE
* INTERVAL HOUR TO SECOND
* INTERVAL MINUTE
* INTERVAL MINUTE TO SECOND
* INTERVAL SECOND
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#32893 from sarutak/parse-any-day-time.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
1. Extend the Cast expression and support TimestampType in casting to TimestampWithoutTZType.
2. There was a mistake in casting TimestampWithoutTZType as TimestampType in https://github.com/apache/spark/pull/32864. The target value should be `sourceValue - timeZoneOffset` instead of being the same value.
### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32878 from gengliangwang/timestampToTimestampWithoutTZ.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Using copy-on-write for `SQLConf.sqlConfEntries` and `SQLConf.staticConfKeys` to reduce contention in concurrent workloads.
### Why are the changes needed?
The global locks used to protect `SQLConf.sqlConfEntries` map and the `SQLConf.staticConfKeys` set can cause significant contention on the `SQLConf` instance in a concurrent setting.
Using copy-on-write versions should reduce the contention given that modifications to the configs are relatively rare.
Closes#32865 from haiyangsun-db/SPARK-35701.
Authored-by: Haiyang Sun <haiyang.sun@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR add a feature which formats day-time interval to strings using the start and end fields of `DayTimeIntervalType`.
### Why are the changes needed?
Currently, they are ignored, and any `DayTimeIntervalType` is formatted as `INTERVAL DAY TO SECOND.`
### Does this PR introduce _any_ user-facing change?
Yes. The format of day-time intervals is determined the start and end fields.
### How was this patch tested?
New test.
Closes#32891 from sarutak/interval-format.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Extend DayTimeIntervalType to support interval fields. Valid interval field values:
- 0 (DAY)
- 1 (HOUR)
- 2 (MINUTE)
- 3 (SECOND)
After the changes, the following day-time interval types are supported:
1. `DayTimeIntervalType(0, 0)` or `DayTimeIntervalType(DAY, DAY)`
2. `DayTimeIntervalType(0, 1)` or `DayTimeIntervalType(DAY, HOUR)`
3. `DayTimeIntervalType(0, 2)` or `DayTimeIntervalType(DAY, MINUTE)`
4. `DayTimeIntervalType(0, 3)` or `DayTimeIntervalType(DAY, SECOND)`. **It is the default one**. The second fraction precision is microseconds.
5. `DayTimeIntervalType(1, 1)` or `DayTimeIntervalType(HOUR, HOUR)`
6. `DayTimeIntervalType(1, 2)` or `DayTimeIntervalType(HOUR, MINUTE)`
7. `DayTimeIntervalType(1, 3)` or `DayTimeIntervalType(HOUR, SECOND)`
8. `DayTimeIntervalType(2, 2)` or `DayTimeIntervalType(MINUTE, MINUTE)`
9. `DayTimeIntervalType(2, 3)` or `DayTimeIntervalType(MINUTE, SECOND)`
10. `DayTimeIntervalType(3, 3)` or `DayTimeIntervalType(SECOND, SECOND)`
### Why are the changes needed?
In the current implementation, Spark supports only `interval day to second` but the SQL standard allows to specify the start and end fields. The changes will allow to follow ANSI SQL standard more precisely.
### Does this PR introduce _any_ user-facing change?
Yes but `DayTimeIntervalType` has not been released yet.
### How was this patch tested?
By existing test suites.
Closes#32849 from MaxGekk/day-time-interval-type-units.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The STRUCT type syntax is defined like this:
STRUCT(fieldNmae: fileType [NOT NULL][COMMENT stringLiteral][,.....])
So the field list is nearly the same as a column list
if we could make ':' optional it would be so much cleaner an less proprietary
### Why are the changes needed?
ease of use
### Does this PR introduce _any_ user-facing change?
Yes, you can use Struct type list is nearly the same as a column list
### How was this patch tested?
unit tests
Closes#32858 from jerqi/master.
Authored-by: RoryQi <1242949407@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of #32586. We introduced `ExpressionContainmentOrdering` to sort common expressions according to their parent-child relations. For unrelated expressions, previously the ordering returns -1 which is not correct and can possibly lead to transitivity issue.
### Why are the changes needed?
To fix the possible transitivity issue of `ExpressionContainmentOrdering`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#32870 from viirya/SPARK-35439-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Extend the Cast expression and support DateType in casting to TimestampWithoutTZType.
### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32873 from gengliangwang/dateToTswtz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Extend the Cast expression and support TimestampWithoutTZType in casting to DateType.
### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32869 from gengliangwang/castToDate.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Extend the Cast expression and support TimestampWithoutTZType in casting to TimestampType.
### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32864 from gengliangwang/castToTimestamp.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
#31637 removed the usage of `CheckAnalysis.checkAlterTablePartition` but didn't remove the function.
### Why are the changes needed?
To removed an unused function.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32855 from imback82/SPARK-34524-followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `UnresolvedHint.resolved = child.resolved` instead `UnresolvedHint.resolved = false`, then the plan contains `UnresolvedHint` child can be optimized by rule in batch `Resolution`.
For instance, before this pr, the following plan can't be optimized by `ResolveReferences`.
```
!'Project [*]
+- SubqueryAlias __auto_generated_subquery_name
+- UnresolvedHint use_hash
+- Project [42 AS 42#10]
+- OneRowRelation
```
### Why are the changes needed?
fix hint in subquery bug
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#32841 from cfmcgrady/SPARK-35673.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### Why are the changes needed?
With Long.minValue cast to an instant, secs will be floored in function microsToInstant and cause overflow when multiply with Micros_per_second
```
def microsToInstant(micros: Long): Instant = {
val secs = Math.floorDiv(micros, MICROS_PER_SECOND)
// Unfolded Math.floorMod(us, MICROS_PER_SECOND) to reuse the result of
// the above calculation of `secs` via `floorDiv`.
val mos = micros - secs * MICROS_PER_SECOND <- it will overflow here
Instant.ofEpochSecond(secs, mos * NANOS_PER_MICROS)
}
```
But the overflow is acceptable because it won't produce any change to the result
However, when convert the instant back to micro value, it will raise Overflow Error
```
def instantToMicros(instant: Instant): Long = {
val us = Math.multiplyExact(instant.getEpochSecond, MICROS_PER_SECOND) <- It overflow here
val result = Math.addExact(us, NANOSECONDS.toMicros(instant.getNano))
result
}
```
Code to reproduce this error
```
instantToMicros(microToInstant(Long.MinValue))
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test added
Closes#32839 from dgd-contributor/SPARK-35679_instantToMicro.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Add an `internalRegisterFunction` for the built-in function registry. So that
we can skip the unnecessary function normalization.
### Why are the changes needed?
small refactor
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing ut
Closes#32842 from linhongliu-db/function-refactor.
Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR changes an occurrence of `Seq` to `collections.Seq` in `NestedColumnAliasing`.
### Why are the changes needed?
In the current master, `NestedColumnAliasing` doesn't work with Scala 2.13 and the relevant tests fail.
The following are examples.
* `NestedColumnAliasingSuite`
* Subclasses of `SchemaPruningSuite`
* `ColumnPruningSuite`
```
NestedColumnAliasingSuite:
[info] - Pushing a single nested field projection *** FAILED *** (14 milliseconds)
[info] scala.MatchError: (none#211451,ArrayBuffer(name#211451.middle)) (of class scala.Tuple2)
[info] at org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing$.$anonfun$getAttributeToExtractValues$5(NestedColumnAliasing.scala:258)
[info] at scala.collection.StrictOptimizedMapOps.flatMap(StrictOptimizedMapOps.scala:31)
[info] at scala.collection.StrictOptimizedMapOps.flatMap$(StrictOptimizedMapOps.scala:30)
[info] at scala.collection.immutable.HashMap.flatMap(HashMap.scala:39)
[info] at org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing$.getAttributeToExtractValues(NestedColumnAliasing.scala:258)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran tests mentioned above and all passed with Scala 2.13.
Closes#32848 from sarutak/followup-SPARK-35194-2.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>