Commit graph

5652 commits

Author SHA1 Message Date
Max Gekk 0c364e607d [SPARK-36590][SQL] Convert special timestamp_ntz values in the session time zone
In the PR, I propose to use the session time zone (see the SQL config `spark.sql.session.timeZone`) instead of the JVM default time zone while converting special timestamp_ntz strings such as "today", "tomorrow" and so on.

The current implementation is based on the system time zone, which contradicts other functions/classes that use the session time zone. For example, Spark doesn't respect the user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```

Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832

spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone	America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```
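
Other special strings mentioned above are resolved against the session time zone in the same way; a minimal hedged sketch (outputs omitted since they depend on the current date):
```sql
-- assumes the session time zone set above; results depend on the current date
SELECT timestamp_ntz'today', timestamp_ntz'tomorrow';
```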

By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```

Closes #33838 from MaxGekk/fix-ts_ntz-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 159ff9fd14)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 10:11:37 +08:00
Gengliang Wang 464841224c [SPARK-36585][SQL][DOCS] Support setting "since" version in FunctionRegistry
### What changes were proposed in this pull request?

Spark 3.2.0 includes two new functions `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation, the since versions of both functions are `1.0.0` since they are based on the expression `RLike`:

- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like

This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`

### Why are the changes needed?

Correct the SQL doc
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually
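
As a lighter-weight check (my own assumption, not part of this PR's test plan), the since version also shows up in the extended function description:
```sql
-- after this change, the Since field should read 3.2.0 instead of 1.0.0
DESCRIBE FUNCTION EXTENDED regexp;
DESCRIBE FUNCTION EXTENDED regexp_like;
```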

Closes #33834 from gengliangwang/allowSQLFunVersion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 18143fb426)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-25 22:32:49 +08:00
Max Gekk a4c5140242 [SPARK-36567][SQL] Support foldable special datetime strings by CAST
### What changes were proposed in this pull request?
In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.

### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance, in Spark 3.1:
```sql
select ts_col > 'now';
```
but the query now fails, and users have to use a typed timestamp literal:
```sql
select ts_col > timestamp'now';
```

### Does this PR introduce _any_ user-facing change?
No. The previous release, 3.1, already supported this feature until it was removed by https://github.com/apache/spark/pull/32714.

### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```

Closes #33816 from MaxGekk/foldable-datetime-special-values.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit df0ec56723)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-25 14:09:13 +08:00
Hyukjin Kwon 26ae9e93da [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage the optimizations of the SQL engine and avoid an unnecessary shuffle to create the default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests were added. Also, this PR runs all unit tests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 93cec49212)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:03:00 +09:00
Gengliang Wang a313082d67 [SPARK-35535][SQL][FOLLOWUP] Move LocalScan to Catalyst package
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32678. It moves `LocalScan` from SQL core package to Catalyst package.

### Why are the changes needed?

There are two packages for `org.apache.spark.sql.connector`:
- SQL Core: https://github.com/apache/spark/tree/master/sql/core/src/main/java/org/apache/spark/sql/connector
- Catalyst: https://github.com/apache/spark/tree/master/sql/catalyst/src/main/java/org/apache/spark/sql/connector

As `LocalScan` doesn't depend on the classes of SQL Core, we should move it to catalyst.
### Does this PR introduce _any_ user-facing change?

No, the trait is not released yet.

### How was this patch tested?

Existing UT.

Closes #33826 from gengliangwang/moveLocalScan.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 5b4c216478)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:24:06 -07:00
Huaxin Gao e48de7884d [SPARK-34952][SQL][FOLLOWUP] Move aggregates to a separate package
### What changes were proposed in this pull request?
Add `aggregate` package under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions` and move all the aggregates (e.g. `Count`, `Max`, `Min`, etc.) there.

### Why are the changes needed?
Right now these aggregates are under `sql/catalyst/src/main/java/org/apache/spark/sql/connector/expressions`. It looks OK now, but we plan to add a new `filter` package under `expressions` for all the DSV2 filters. It would look strange if filters had their own package but aggregates didn't.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33815 from huaxingao/agg_package.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit cd2342691d)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-23 15:31:35 -07:00
Gengliang Wang 69be513c5e Preparing development version 3.2.1-SNAPSHOT 2021-08-20 12:40:47 +00:00
Gengliang Wang 6bb3523d8e Preparing Spark release v3.2.0-rc1 2021-08-20 12:40:40 +00:00
Gengliang Wang fafdc1482b Revert "Preparing Spark release v3.2.0-rc1"
This reverts commit 8e58fafb05.
2021-08-20 20:07:02 +08:00
Gengliang Wang c829ed53ff Revert "Preparing development version 3.2.1-SNAPSHOT"
This reverts commit 4f1d21571d.
2021-08-20 20:07:01 +08:00
Gengliang Wang 4f1d21571d Preparing development version 3.2.1-SNAPSHOT 2021-08-19 14:08:32 +00:00
Gengliang Wang 8e58fafb05 Preparing Spark release v3.2.0-rc1 2021-08-19 14:08:26 +00:00
gengjiaan 7041c0f9f0 [SPARK-36428][TESTS][FOLLOWUP] Revert mistake change to DateExpressionsSuite
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/33775 committed the debug code by mistake.
This PR reverts the test path change.

### Why are the changes needed?
Revert the debug code.

### Does this PR introduce _any_ user-facing change?
'No'.
It just adjusts a test.

### How was this patch tested?
Revert the non-ANSI test path.

Closes #33787 from beliefer/SPARK-36428-followup2.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 462aa7cd3c)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 21:33:39 +08:00
Shixiong Zhu 54cca7f82e [SPARK-36519][SS] Store RocksDB format version in the checkpoint for streaming queries
### What changes were proposed in this pull request?

RocksDB provides backward compatibility but it doesn't always provide forward compatibility. It's better to store the RocksDB format version in the checkpoint so that it would give us more information to provide the rollback guarantee when we upgrade the RocksDB version that may introduce incompatible change in a new Spark version.

A typical case is when a user upgrades their query to a new Spark version, and this new Spark version has a new RocksDB version which may use a new format. But the user hits some bug and decides to roll back. In the old Spark version, the old RocksDB version cannot read the new format.

In order to handle this case, we will write the RocksDB format version to the checkpoint. When restarting from a checkpoint, we will force RocksDB to use the format version stored in the checkpoint. This will ensure the user can roll back their Spark version if needed.

We also provide a config `spark.sql.streaming.stateStore.rocksdb.formatVersion` for users who don't need to rollback their Spark versions to overwrite the format version specified in the checkpoint.
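
A minimal sketch of overriding the format version, assuming the conf is set like any other SQL conf; the value here is illustrative only:
```sql
-- illustrative value; only needed when you do NOT plan to roll back the Spark version
SET spark.sql.streaming.stateStore.rocksdb.formatVersion=5;
```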

### Why are the changes needed?

Provide the Spark version rollback guarantee for streaming queries when a new RocksDB introduces an incompatible format change.

### Does this PR introduce _any_ user-facing change?

No. RocksDB state store is a new feature in Spark 3.2, which has not yet released.

### How was this patch tested?

The new unit tests.

Closes #33749 from zsxwing/SPARK-36519.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit ea4919801a)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-19 00:23:52 -07:00
gengjiaan 3d69d0d003 [SPARK-36428][SQL][FOLLOWUP] Simplify the implementation of make_timestamp
### What changes were proposed in this pull request?
The implementation in https://github.com/apache/spark/pull/33665 made `make_timestamp` accept integer type as the seconds parameter.
This PR lets `make_timestamp` accept `decimal(16,6)` type as the seconds parameter; casting integer to `decimal(16,6)` is safe, so we can simplify the code.
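
A hedged illustration of the two call shapes this covers (timestamp values are arbitrary):
```sql
-- integer seconds are safely cast to decimal(16,6)
SELECT make_timestamp(2021, 8, 18, 22, 57, 27);
-- decimal seconds with a fractional part keep working
SELECT make_timestamp(2021, 8, 18, 22, 57, 27.123456);
```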

### Why are the changes needed?
Simplify `make_timestamp`.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
New tests.

Closes #33775 from beliefer/SPARK-36428-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 707eefa3c7)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-18 22:57:27 +08:00
Gengliang Wang 70635b4b26 Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"
### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)

### Why are the changes needed?

It turns out that many users are using the group by alias feature. Spark has its own precedence rule when alias names conflict with column names in the GROUP BY clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.

As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 8bfb4f1e72)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-17 20:24:09 +08:00
Max Gekk 07c6976f79 [SPARK-36524][SQL] Common class for ANSI interval types
### What changes were proposed in this pull request?
Add a new type `AnsiIntervalType` to `AbstractDataType.scala`, and have `YearMonthIntervalType` and `DayTimeIntervalType` extend it.

### Why are the changes needed?
To improve code maintenance. The change allows replacing checks of both `YearMonthIntervalType` and `DayTimeIntervalType` with a single check of `AnsiIntervalType`, for instance:
```scala
    case _: YearMonthIntervalType | _: DayTimeIntervalType => false
```
by
```scala
    case _: AnsiIntervalType => false
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By existing test suites.

Closes #33753 from MaxGekk/ansi-interval-type-trait.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 82a31508af)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-17 12:28:07 +03:00
Gengliang Wang 41e5144b53 [SPARK-36521][SQL] Disallow comparison between Interval and String
### What changes were proposed in this pull request?

Disallow comparison between Interval and String in the default type coercion rules.

### Why are the changes needed?

If a binary comparison contains interval type and string type, we can't decide which
interval type the string should be promoted as. There are many possible interval
types, such as year interval, month interval, day interval, hour interval, etc.
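
A minimal sketch of the now-rejected pattern (illustrative; the literal values are arbitrary):
```sql
-- expected to fail analysis after this change: which interval type should '1' become?
SELECT INTERVAL '1' DAY > '1';
```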

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33750 from gengliangwang/disallowCom.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 26d6b952dc)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-16 22:41:25 +03:00
Liang-Chi Hsieh 3aa933b162 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session window is currently a static value. To support more complex usage, it is better to support a dynamic gap duration, which determines the gap by looking at the current data. For example, in our use case, we may have a different gap depending on a certain column in the input rows.
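
A hedged SQL sketch of a per-row gap; the table and column names (`events`, `eventTime`, `eventType`) are hypothetical:
```sql
SELECT session_window(eventTime,
         CASE WHEN eventType = 'click' THEN '5 minutes' ELSE '10 minutes' END) AS session,
       count(*) AS cnt
FROM events
GROUP BY session_window(eventTime,
         CASE WHEN eventType = 'click' THEN '5 minutes' ELSE '10 minutes' END);
```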

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:16 +09:00
Gengliang Wang c898a940e2 [SPARK-36508][SQL] ANSI type coercion: disallow binary operations between Interval and String literal
### What changes were proposed in this pull request?

If a binary operation contains interval type and string literal, we can't decide which interval type the string literal should be promoted as. There are many possible interval types, such as year interval, month interval, day interval, hour interval, etc.
The related binary operations for Interval include:
- Add
- Subtract
- Comparisons

Note that `Interval Multiply/Divide StringLiteral` is still valid, as these are not binary operators in that sense (the left and right are not of the same type). This PR also adds tests for them.
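
A hedged sketch contrasting the rejected and the still-valid shapes (assuming ANSI type coercion is enabled):
```sql
-- expected to fail analysis: the string's interval type is ambiguous
SELECT INTERVAL '1' DAY + '1';
-- still valid: multiply/divide is not a same-type binary operation, so '3' is treated as a number
SELECT INTERVAL '2' YEAR * '3';
```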

### Why are the changes needed?

Avoid ambiguously implicit casting string literals to interval types.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion is not released yet.

### How was this patch tested?

New tests.

Closes #33737 from gengliangwang/disallowStringAndInterval.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit ecdea91602)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-14 10:50:43 +08:00
gengjiaan eb840578f7 [SPARK-36428][SQL] the seconds parameter of make_timestamp should accept integer type
### What changes were proposed in this pull request?
With ANSI mode, `SELECT make_timestamp(1, 1, 1, 1, 1, 1)` fails because the 'seconds' parameter needs to be of type DECIMAL(8,6), and INT can't be implicitly cast to DECIMAL(8,6) under ANSI mode.

```
org.apache.spark.sql.AnalysisException
cannot resolve 'make_timestamp(1, 1, 1, 1, 1, 1)' due to data type mismatch: argument 6 requires decimal(8,6) type, however, '1' is of int type.; line 1 pos 7
```

We should update the function `make_timestamp` to allow an integer-typed 'seconds' parameter.

### Why are the changes needed?
Make `make_timestamp` accept an integer as the 'seconds' parameter.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`make_timestamp` now accepts an integer as the 'seconds' parameter.

### How was this patch tested?
New tests.

Closes #33665 from beliefer/SPARK-36428.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7d82336734)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 13:13:15 +08:00
Maryann Xue ca91292cf0 [SPARK-36447][SQL] Avoid inlining non-deterministic With-CTEs
This PR fixes an existing correctness issue where a non-deterministic With-CTE can be executed multiple times producing different results, by deferring the inline of With-CTE to after the analysis stage. This fix also provides the future opportunity of performance improvement by executing deterministic With-CTEs only once in some circumstances.
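
A minimal sketch of the correctness hazard, assuming the default (non-legacy) CTE mode; the query itself is illustrative:
```sql
WITH r AS (SELECT rand() AS x)
SELECT * FROM r AS r1 JOIN r AS r2 ON r1.x = r2.x;
-- if `r` is inlined per reference, rand() runs twice and the join may match nothing;
-- after this change the non-deterministic CTE is materialized once, so both references agree
```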

The major changes include:
1. Added new With-CTE logical nodes: `CTERelationDef`, `CTERelationRef`, `WithCTE`. Each `CTERelationDef` has a unique ID and the mapping between CTE def and CTE ref is based on IDs rather than names. `WithCTE` is a resolved version of `With`, only that: 1) `WithCTE` is a multi-children logical node so that most logical rules can automatically apply to CTE defs; 2) In the main query and each subquery, there can only be at most one `WithCTE`, which means nested With-CTEs are combined.
2. Changed `CTESubstitution` rule so that if NOT in legacy mode, CTE defs will not be inlined immediately, but rather transformed into a `CTERelationRef` per reference.
3. Added new With-CTE rules: 1) `ResolveWithCTE` - to update `CTERelationRef`s with resolved output from corresponding `CTERelationDef`s; 2) `InlineCTE` - to inline deterministic CTEs or non-deterministic CTEs with only ONE reference; 3) `UpdateCTERelationStats` - to update stats for `CTERelationRef`s that are not inlined.
4. Added a CTE physical planning strategy to plan `CTERelationRef`s as an independent shuffle with round-robin partitioning so that such CTEs will only be materialized once and different references will later be a shuffle reuse.

A current limitation is that With-CTEs mixed with SQL commands or DMLs will still go through the old inline code path because of our non-standard language specs and not-unified command/DML interfaces.

This is a correctness issue. Non-deterministic CTEs should produce the same output regardless of how many times they are referenced/used in the query, while under the current implementation there is no such guarantee, which can lead to incorrect query results.

No.

Added UTs.
Regenerated golden files for TPCDS plan stability tests. There is NO change to the `simplified.txt` files, the only differences are expression IDs.

Closes #33671 from maryannxue/spark-36447.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 29b1e394c6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 11:48:46 +08:00
Gengliang Wang f7017694e8 [SPARK-36497][SQL] Support Interval add/subtract NULL
### What changes were proposed in this pull request?

Currently, `null + interval` will become `cast(cast(null as timestamp) + interval) as null`. This is an unexpected behavior and the result should not be of null type.
This weird behavior applies to `null - interval`, `interval + null`, `interval - null` as well.
To change it, I propose to cast the null to the same data type as the other operand in the add/subtract:
```
null + interval => cast(null as interval) + interval
null - interval => cast(null as interval) - interval
interval + null=> interval + cast(null as interval)
interval - null => interval - cast(null as interval)
```
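
An illustrative sketch of the intended typing (no outputs shown; the NULL result should carry the interval type, not null type):
```sql
SELECT NULL + INTERVAL '1' DAY;
SELECT INTERVAL '1' YEAR - NULL;
```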

### Why are the changes needed?

Change the confusing behavior of `Interval +/- NULL` and `NULL +/- Interval`

### Does this PR introduce _any_ user-facing change?

No, the new interval type is not released yet.

### How was this patch tested?

Existing UT

Closes #33727 from gengliangwang/intervalTypeCoercion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit d4466d55ca)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-13 11:10:45 +08:00
Gengliang Wang 3785738838 [SPARK-36445][SQL][FOLLOWUP] ANSI type coercion: revisit promoting string literals in datetime expressions
### What changes were proposed in this pull request?

1. Promote more string literals in subtractions. In the ANSI type coercion rule, we already promoted
```
string - timestamp => cast(string as timestamp) - timestamp
```
This PR is to promote the following string literals:
```
string - date => cast(string as date) - date
date - string => date - cast(string as date)
timestamp - string => timestamp - cast(string as timestamp)
```
It is very straightforward to cast the string literal as the data type of the other side in the subtraction.

2. Merge the string promotion logic from the rule `StringLiteralCoercion`:
```
date_sub(date, string) => date_sub(date, cast(string as int))
date_add(date, string) => date_add(date, cast(string as int))
```
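
A hedged sketch of the promotions above, assuming the ANSI type coercion rules are enabled; the dates and literals are arbitrary:
```sql
-- string promoted to date on the right-hand side of the subtraction
SELECT date'2021-08-13' - '2021-08-01';
-- string promoted to int for date_add/date_sub
SELECT date_add(date'2021-08-13', '7');
```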

### Why are the changes needed?

1. Promote the string literal in the subtraction to the data type of the other side. This is straightforward and consistent with PostgreSQL.
2. Centralize all the string literal promotion in the ANSI type coercion rule.

### Does this PR introduce _any_ user-facing change?

No, the new ANSI type coercion rules are not released yet.

### How was this patch tested?

Existing UT

Closes #33724 from gengliangwang/datetimeTypeCoercion.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 48e333af54)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-13 01:02:54 +08:00
Gengliang Wang 293a6cb1ab [SPARK-36445][SQL] ANSI type coercion rule for date time operations
### What changes were proposed in this pull request?

Implement a new rule for the date-time operations in the ANSI type coercion system:
1. Date will be converted to Timestamp when it is in a subtraction with a Timestamp.
2. Promote string literals in date_add/date_sub/time_add

### Why are the changes needed?

Currently the type coercion rule `DateTimeOperations` doesn't match the design of the ANSI type coercion system:
1. For date_add/date_sub, if the input is timestamp type, Spark should not convert it into date type since date type is narrower than the timestamp type.
2. For date_add/date_sub/time_add, a string value can be implicitly cast to date/timestamp only when it is a literal.

Thus, we need to have a new rule for the date-time operations in the ANSI type coercion system.

### Does this PR introduce _any_ user-facing change?

No, the ANSI type coercion rules are not released yet.

### How was this patch tested?

New UT

Closes #33666 from gengliangwang/datetimeOp.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 3029e62a82)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-11 11:56:00 +08:00
Jungtaek Lim 161908c10d [SPARK-36463][SS] Prohibit update mode in streaming aggregation with session window
### What changes were proposed in this pull request?

This PR proposes to prohibit update mode in streaming aggregation with session window.

UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the implementation of iterator to support outputs of update mode.

This PR also cleans up test code via deduplicating.

### Why are the changes needed?

The semantic of "update" mode for session window based streaming aggregation is quite unclear.

For normal streaming aggregation, Spark will provide the outputs which can be "upsert"ed based on the grouping key. This is based on the fact that the grouping key won't change.

This doesn't hold true for session window based streaming aggregation, as session range is changing.

If end users leverage their knowledge about streaming aggregation, they will consider the key as grouping key + session (since they'll specify these things in groupBy), and it's highly likely that the existing row is not updated (overwritten) and they end up with different rows.

If end users consider the key as the grouping key, there's a small chance for end users to upsert the session correctly, though only the last updated session will be stored, so it won't work with event-time processing, where there could be multiple active sessions.

### Does this PR introduce _any_ user-facing change?

No, as we haven't released this feature.

### How was this patch tested?

Updated tests.

Closes #33689 from HeartSaVioR/SPARK-36463.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit ed60aaa9f1)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-11 10:46:03 +09:00
gengjiaan 6018d44280 [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported
### What changes were proposed in this pull request?
Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`.
```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>>
-- !query output
{"t":null}
```

```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```

We should make `from_json` throw an exception too.
This PR addresses the discussion below:
https://github.com/apache/spark/pull/33640#discussion_r682862523

### Why are the changes needed?
Make the behavior of `from_json` more reasonable.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`from_json` now throws an exception when we set spark.sql.timestampType=TIMESTAMP_NTZ.

### How was this patch tested?
Tests updated.

Closes #33684 from beliefer/SPARK-36429-new.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 186815be1c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 22:52:39 +08:00
Angerszhuuuu fb6f3792af [SPARK-36431][SQL] Support TypeCoercion of ANSI intervals with different fields
### What changes were proposed in this pull request?
 Support TypeCoercion of ANSI intervals with different fields

### Why are the changes needed?
 Support TypeCoercion of ANSI intervals with different fields

### Does this PR introduce _any_ user-facing change?
After this PR users can (see the sketch after this list):
 - use comparison functions with different fields of DayTimeIntervalType/YearMonthIntervalType, such as `INTERVAL '1' YEAR` > `INTERVAL '11' MONTH`
 - use different fields of ANSI interval types in collection functions such as `array(INTERVAL '1' YEAR, INTERVAL '11' MONTH)`
 - use different fields of ANSI interval types in `coalesce`, etc.
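
A sketch of the examples listed above, taken directly from this description:
```sql
SELECT INTERVAL '1' YEAR > INTERVAL '11' MONTH;
SELECT array(INTERVAL '1' YEAR, INTERVAL '11' MONTH);
SELECT coalesce(INTERVAL '1' YEAR, INTERVAL '11' MONTH);
```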

### How was this patch tested?
Added UT

Closes #33661 from AngersZhuuuu/SPARK-SPARK-36431.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 89d8a4eacf)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-10 14:22:47 +03:00
Cheng Pan 45acd00dd6 [SPARK-36466][SQL] Table in unloaded catalog referenced by view should load correctly
### What changes were proposed in this pull request?

Retain `spark.sql.catalog.*` confs when resolving view.

### Why are the changes needed?

Currently, if a view in the default catalog references a table in another catalog (e.g. JDBC), `org.apache.spark.sql.AnalysisException: Table or view not found: cat.t` will be thrown on accessing the view if the catalog has not been loaded yet.
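
A hypothetical sketch of the failure mode; the catalog, table, and view names are made up:
```sql
-- session 1: catalog `cat` is configured via spark.sql.catalog.cat and a view captures a reference to it
CREATE VIEW v AS SELECT * FROM cat.t;
-- session 2: spark.sql.catalog.cat is not configured; before this fix, resolving the view threw
--   "Table or view not found: cat.t". With this change the spark.sql.catalog.* confs captured
--   with the view are retained during view resolution.
SELECT * FROM v;
```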

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

Add UT.

Closes #33692 from pan3793/SPARK-36466.

Authored-by: Cheng Pan <chengpan@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 7f56b73cad)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 17:31:36 +08:00
Terry Kim 882ef6dd73 [SPARK-36449][SQL] v2 ALTER TABLE REPLACE COLUMNS should check duplicates for the user specified columns
### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE REPLACE COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql(s"ALTER TABLE $t REPLACE COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it.

### Why are the changes needed?

To check the duplicate columns during analysis.

### Does this PR introduce _any_ user-facing change?

Yes, now the above command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33676 from imback82/replace_cols_duplicates.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e1a5d94117)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-10 13:20:50 +08:00
Wenchen Fan 5bddafe3e0 [SPARK-36430][SQL] Adaptively calculate the target size when coalescing shuffle partitions in AQE
### What changes were proposed in this pull request?

This PR fixes a performance regression introduced in https://github.com/apache/spark/pull/33172

Before #33172, the target size was adaptively calculated based on the default parallelism of the Spark cluster. Sometimes it's very small, and #33172 sets a min partition size to fix perf issues. Sometimes the calculated size is reasonable, such as dozens of MBs.

After #33172 , we no longer calculate the target size adaptively, and by default always coalesce the partitions into 1 MB. This can cause perf regression if the adaptively calculated size is reasonable.

This PR brings back the code that adaptively calculate the target size based on the default parallelism of the spark cluster.

### Why are the changes needed?

fix perf regression

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33655 from cloud-fan/minor.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9a539d5846)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 17:26:11 +08:00
Angerszhuuuu a5ecf2a490 [SPARK-36352][SQL] Spark should check result plan's output schema name
### What changes were proposed in this pull request?
Spark should check result plan's output schema name

### Why are the changes needed?
In the current code, some optimizer rules may change a plan's output schema, since we always use semantic equality to check the output, which can still allow the plan's output schema to change.
For example, for SchemaPruning, if we have a plan
```
Project[a, B]
|--Scan[A, b, c]
```
the original output schema is `a, B`; after SchemaPruning it becomes
```
Project[A, b]
|--Scan[A, b]
```
It changes the plan's schema. When we use CTAS, the schema is the same as the query plan's output. Since we change the schema, it is not consistent with the original SQL. So we need to check the final result plan's schema against the original plan's schema.
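
A hypothetical CTAS sketch of the user-visible impact (table and column names are made up):
```sql
-- the created table's schema should keep the query's output names `a, B`,
-- not the pruned scan's `A, b`
CREATE TABLE t2 AS SELECT a, B FROM t1;
```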

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existed UT

Closes #33583 from AngersZhuuuu/SPARK-36352.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e051a540a1)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-09 16:48:10 +08:00
Kousuke Saruta 586eb5d4c6 Revert "[SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported"
### What changes were proposed in this pull request?

This PR reverts the change in SPARK-36429 (#33654).
See [conversation](https://github.com/apache/spark/pull/33654#issuecomment-894160037).

### Why are the changes needed?

To recover CIs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33670 from sarutak/revert-SPARK-36429.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit e17612d0bf)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-06 20:56:40 +09:00
gaoyajun02 33e4ce562a [SPARK-36339][SQL] References to grouping that not part of aggregation should be replaced
### What changes were proposed in this pull request?

Currently, references to grouping sets are reported as errors after aggregated expressions, e.g.
```
SELECT count(name) c, name
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);
```
Error in query: expression 'people.`name`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;

### Why are the changes needed?

The map anonymous function in the `constructAggregateExprs` function does not use underscores, which causes references to grouping expressions that are not part of the aggregation not to be replaced; this PR fixes that.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33574 from gaoyajun02/SPARK-36339.

Lead-authored-by: gaoyajun02 <gaoyajun02@gmail.com>
Co-authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 888f8f03c8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 16:35:01 +08:00
gengjiaan be19270880 [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported
### What changes were proposed in this pull request?
Currently, when `set spark.sql.timestampType=TIMESTAMP_NTZ`, the behavior is different between `from_json` and `from_csv`.
```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>>
-- !query output
{"t":null}
```

```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```

We should make `from_json` throw an exception too.
This PR addresses the discussion below:
https://github.com/apache/spark/pull/33640#discussion_r682862523

### Why are the changes needed?
Make the behavior of `from_json` more reasonable.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`from_json` now throws an exception when we set spark.sql.timestampType=TIMESTAMP_NTZ.

### How was this patch tested?
Tests updated.

Closes #33654 from beliefer/SPARK-36429.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 14:01:13 +08:00
Kent Yao 0bb88c99f7 [SPARK-36421][SQL][DOCS] Use ConfigEntry.key to fix docs and set command results
### What changes were proposed in this pull request?

This PR fixes the issue that a `ConfigEntry` is introduced into the doc field directly without calling `.key`, which causes malformed documents on the web site and in the result of `SET -v`:

1. https://spark.apache.org/docs/3.1.2/configuration.html#static-sql-configuration - spark.sql.hive.metastore.jars

2. set -v
![image](https://user-images.githubusercontent.com/8326978/128292412-85100f95-24fd-4b40-a14f-d31a256dab7d.png)

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

No, but it contains a doc fix.
### How was this patch tested?

new tests

Closes #33647 from yaooqinn/SPARK-36421.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c7fa3c9090)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-06 11:01:54 +09:00
Angerszhuuuu bf4edb5f5a [SPARK-36353][SQL] RemoveNoopOperators should keep output schema
### What changes were proposed in this pull request?
 RemoveNoopOperators should keep output schema

### Why are the changes needed?
Expand function

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #33587 from AngersZhuuuu/SPARK-36355.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 02810eecbf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-05 20:43:48 +08:00
PengLei 8c42232638 [SPARK-36381][SQL] Add case sensitive and case insensitive compare for checking column name exist when alter table
### What changes were proposed in this pull request?
Add the Resolver to `checkColumnNotExists` so that column name existence is checked with case-sensitive or case-insensitive comparison, depending on the configuration.

### Why are the changes needed?
Right now the resolver is `_ == _` in `findNestedField`, which is called by `checkColumnNotExists`.
This PR passes `alter.conf.resolver` to it.
[SPARK-36381](https://issues.apache.org/jira/browse/SPARK-36381)
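A hedged sketch of the behavior this enables; the table name is hypothetical and the outcome depends on `spark.sql.caseSensitive`:
```sql
SET spark.sql.caseSensitive=false;
-- with a case-insensitive resolver, adding `ID` when `id` already exists should now be rejected
ALTER TABLE t ADD COLUMNS (ID int);
```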
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

Closes #33618 from Peng-Lei/sensitive-cloumn-name.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 87d49cbcb1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-04 10:04:25 +09:00
Wenchen Fan 7c586842d7 [SPARK-36380][SQL] Simplify the logical plan names for ALTER TABLE ... COLUMN
### What changes were proposed in this pull request?

This is a followup of the recent work such as https://github.com/apache/spark/pull/33200

For `ALTER TABLE` commands, the logical plans do not have the common `AlterTable` prefix in the name and just use names like `SetTableLocation`. This PR proposes to follow the same naming rule in `ALTER TABLE ... COLUMN` commands.

This PR also moves these AlterTable commands to an individual file and gives them a base trait.

### Why are the changes needed?

name simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #33609 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 7cb9c1c241)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-03 10:43:15 +03:00
Hyukjin Kwon 9eec11b956 [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
This PR proposes to fail properly so the JSON parser can proceed and parse the input with the permissive mode.
Previously, we passed `null`s as they are, the root `InternalRow`s became `null`s, and it caused the query to fail even with permissive mode on.
Now, we fail explicitly if `null` is passed when the input array contains `null`.

Note that this is consistent with non-array JSON input:

**Permissive mode:**

```scala
spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
```

**Failfast mode**:

```scala
spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

To make the permissive mode proceed and parse without throwing an exception.

**Permissive mode:**

```scala
spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

NOTE that this behaviour is consistent when JSON object is malformed:

```scala
spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
```

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

Since we're parsing _one_ JSON array, related records all fail together.

**Failfast mode:**

```scala
spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

Manually tested, and a unit test was added.

Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0bbcbc6508)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-02 10:02:09 -07:00
Angerszhuuuu ea559adc2e [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
### What changes were proposed in this pull request?
Without this patch, the added UT fails as below:
```
[info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
[info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
[info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
[info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
```

When `CollapseProject` replaces aliases in a project, it should use the original column name.
### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33576 from AngersZhuuuu/SPARK-36086.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f3173956cb)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-03 00:08:30 +08:00
Linhong Liu e26cb968bd [SPARK-36224][SQL] Use Void as the type name of NullType
### What changes were proposed in this pull request?
Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users; we have to choose a proper name.
2. The "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL
3. Changing to "void" can enable the round trip of `toDDL`/`fromDDL` for NullType. (i.e. make `from_json(col, schema.toDDL)`) work

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". for example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```
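
A small SQL-side illustration (my own sketch, not from the PR): `typeof` prints the catalog type name, so it should now report "void" for NULL:
```sql
SELECT typeof(null);  -- expected: void
```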

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2f700773c2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 23:20:11 +08:00
Terry Kim 87ae397897 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns
### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE ADD COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql("ALTER TABLE $t ADD COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it. For v1 command, the duplication is checked before invoking the catalog.

### Why are the changes needed?

To check the duplicate columns during analysis and be consistent with v1 command.

### Does this PR introduce _any_ user-facing change?

Yes, now the above command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33600 from imback82/alter_add_duplicate_columns.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3b713e7f61)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 17:55:07 +08:00
Wenchen Fan f6bb75b0bc [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33352 , to simplify the JDBC aggregate pushdown:
1. We should get the schema of the aggregate query by asking the JDBC server, instead of calculating it by ourselves. This can simplify the code a lot, and is also more robust: the data type of SUM may vary in different databases, and it's fragile to assume they are always the same as Spark's.
2. because of 1, now we can remove the `dataType` property from the public `Sum` expression.

This PR also contains some small improvements:
1. Spark should deduplicate the aggregate expressions before pushing them down.
2. Improve the `toString` of public aggregate expressions to make them more SQL-like.

### Why are the changes needed?

code and API simplification

### Does this PR introduce _any_ user-facing change?

this API is not released yet.

### How was this patch tested?

existing tests

Closes #33579 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 387a251a68)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-30 00:26:41 -07:00
Kousuke Saruta d247a6cd1a [SPARK-36323][SQL] Support ANSI interval literals for TimeWindow
### What changes were proposed in this pull request?

This PR proposes to support ANSI interval literals for `TimeWindow`.

### Why are the changes needed?

Watermark already supports ANSI interval literals, so it would be great to support them for `TimeWindow` as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33551 from sarutak/window-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit db18866742)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-29 08:52:10 +03:00
Linhong Liu fa521c1506 [SPARK-36286][SQL] Block some invalid datetime string
### What changes were proposed in this pull request?
In PR #32959, we found some weird datetime strings that can be parsed. ([details](https://github.com/apache/spark/pull/32959#discussion_r665015489))
This PR blocks the invalid datetime string.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
Yes, the strings below will have different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp); -- Before: 2021-07-07 00:00:00, After: NULL
```

### How was this patch tested?
some new test cases

Closes #33490 from linhongliu-db/SPARK-35780-block-invalid-format.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ed0e351f05)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 09:17:01 +08:00
dgd-contributor c5b0cb2d94 [SPARK-36229][SQL] conv() inconsistently handles invalid strings with more than 64 invalid characters and return wrong value on overflow
### What changes were proposed in this pull request?
1/ conv() has inconsistent behavior where the returned value differs above the 64-character threshold.

```
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---------------------------+
|conv(repeat(?, 64), 10, 16)|
+---------------------------+
|                          0|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show // which should be 0
+---------------------------+
|conv(repeat(?, 65), 10, 16)|
+---------------------------+
|           FFFFFFFFFFFFFFFF|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show // which should be 0
+----------------------------+
|conv(repeat(?, 65), 10, -16)|
+----------------------------+
|                          -1|
+----------------------------+

scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
+----------------------------+
|conv(repeat(?, 64), 10, -16)|
+----------------------------+
|                           0|
+----------------------------+
```

2/ conv() should return a result equal to the max unsigned long value in base toBase when there is overflow

```
scala> spark.sql("select conv('aaaaaaa0aaaaaaa0a', 16, 10)").show // which should be 18446744073709551615

+-------------------------------+
|conv(aaaaaaa0aaaaaaa0a, 16, 10)|
+-------------------------------+
|           12297828695278266890|
+-------------------------------+
```

### Why are the changes needed?
Bug fix. This pull request aims to make the conv function behave similarly to the conv function in the MySQL database.
### Does this PR introduce _any_ user-facing change?
Change in the result of the conv() function.
### How was this patch tested?
add test

Closes #33459 from dgd-contributor/SPARK-36229_convInconsistencyBehaviorWithMoreThan64Characters.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e1c50ff779)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 00:19:19 +08:00
Terry Kim cd6b303d0f [SPARK-36006][SQL] Migrate ALTER TABLE ... ADD/REPLACE COLUMNS commands to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate the following `ALTER TABLE ... ADD/REPLACE COLUMNS` commands to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, the above `ALTER TABLE ... ADD/REPLACE COLUMNS` commands will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests.

Closes #33200 from imback82/alter_add_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 809b88a162)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:00:44 +08:00
allisonwang-db 993ffafc3e [SPARK-36275][SQL] ResolveAggregateFunctions should works with nested fields
### What changes were proposed in this pull request?
This PR fixes an issue in `ResolveAggregateFunctions` where non-aggregated nested fields in ORDER BY and HAVING are not resolved correctly. This is because nested fields are resolved as aliases that fail to be semantically equal to any grouping/aggregate expressions.
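
A hypothetical sketch of the now-resolvable pattern, where `s` is a struct column and `t` a table (both made up):
```sql
SELECT s.a, count(*)
FROM t
GROUP BY s.a
HAVING s.a > 0
ORDER BY s.a;
```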

### Why are the changes needed?
To fix an analyzer issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33498 from allisonwang-db/spark-36275-resolve-agg-func.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 23a6ffa5dc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:35:35 +08:00
allisonwang-db aea36aa977 [SPARK-36028][SQL][3.2] Allow Project to host outer references in scalar subqueries
This PR cherry picks https://github.com/apache/spark/pull/33235 to branch-3.2 to fix test failures introduced by https://github.com/apache/spark/pull/33284.

### What changes were proposed in this pull request?
This PR allows the `Project` node to host outer references in scalar subqueries when `decorrelateInnerQuery` is enabled. It is already supported by the new decorrelation framework and the `RewriteCorrelatedScalarSubquery` rule.

Note currently by default all correlated subqueries will be decorrelated, which is not necessarily the most optimal approach. Consider `SELECT (SELECT c1) FROM t`. This should be optimized as `SELECT c1 FROM t` instead of rewriting it as a left outer join. This will be done in a separate PR to optimize correlated scalar/lateral subqueries with OneRowRelation.

### Why are the changes needed?
To allow more types of correlated scalar subqueries.

### Does this PR introduce _any_ user-facing change?
Yes. This PR allows outer query column references in the SELECT clause of a correlated scalar subquery. For example:
```sql
SELECT (SELECT c1) FROM t;
```
Before this change:
```
org.apache.spark.sql.AnalysisException: Expressions referencing the outer query are not supported
outside of WHERE/HAVING clauses
```

After this change:
```
+------------------+
|scalarsubquery(c1)|
+------------------+
|0                 |
|1                 |
+------------------+
```

### How was this patch tested?
Added unit tests and SQL tests.

(cherry picked from commit ca348e50a4)
Signed-off-by: allisonwang-db <allison.wang@databricks.com>

Closes #33527 from allisonwang-db/spark-36028-3.2.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 12:54:15 +08:00