### What changes were proposed in this pull request?
Unifies exceptions thrown from Spark under a single base trait `SparkError`, which brings together:
- Error classes
- Parametrized error messages
- SQLSTATE, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Add-error-IDs-td31126.html.
### Why are the changes needed?
- Adding error classes creates a consistent label for exceptions, even as error messages change
- Creating a single, centralized source-of-truth for parametrized error messages improves auditing for error message quality
- Adding SQLSTATE helps ODBC/JDBC users receive standardized error codes
### Does this PR introduce _any_ user-facing change?
Yes, changes ODBC experience by:
- Adding error classes to error messages
- Adding SQLSTATE to TStatus
### How was this patch tested?
Unit tests, as well as local tests with PyODBC.
Closes#32850 from karenfeng/SPARK-34920.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a new ANSI type coercion rule: when getting a date field from a Timestamp column, cast the column as Date type.
This is Spark's current hack to keep the implementation simple. In the default type coercion rules, the implicit cast rule does the work. However, the ANSI implicit cast rule doesn't allow converting Timestamp type to Date type, so we need this additional rule to make sure date field extraction from Timestamp columns works.
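For illustration, a minimal spark-shell sketch of the extraction this rule enables under ANSI mode (the literal value is made up):
```scala
// spark.sql.ansi.enabled turns on ANSI type coercion rules.
spark.conf.set("spark.sql.ansi.enabled", "true")
// year() expects a date; with this rule the Timestamp input is first cast to Date,
// so the date field extraction still resolves under ANSI type coercion.
spark.sql("SELECT year(timestamp '2021-06-29 12:00:00') AS y").show()
```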
### Why are the changes needed?
Fix a bug.
### Does this PR introduce _any_ user-facing change?
No, the new type coercion rules are not released yet.
### How was this patch tested?
Unit test
Closes#33138 from gengliangwang/fixGetDateField.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This patch refactors the evaluation of subexpressions.
There are two changes:
1. Clean up subexpression code after evaluation to avoid duplicate evaluation.
2. Evaluate all children subexpressions when evaluating a subexpression.
### Why are the changes needed?
Currently `subexpressionEliminationForWholeStageCodegen` returns the generated code of subexpressions. The caller simply puts the code into its code block. We need more flexible evaluation here. For example, for the Filter operator's subexpression evaluation, we may need to evaluate a particular subexpression for one predicate. The current approach cannot satisfy that requirement.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32980 from viirya/subexpr-eval.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR removes ORDER BY if the maximum number of rows is less than or equal to 1. For example:
```scala
spark.sql("select count(*) from range(1, 10, 2, 2) order by 1 limit 10").explain("cost")
```
Before this pr:
```
== Optimized Logical Plan ==
Sort [count(1)#2L ASC NULLS FIRST], true, Statistics(sizeInBytes=16.0 B)
+- Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
```
After this pr:
```
== Optimized Logical Plan ==
Aggregate [count(1) AS count(1)#2L], Statistics(sizeInBytes=16.0 B, rowCount=1)
+- Project, Statistics(sizeInBytes=20.0 B)
+- Range (1, 10, step=2, splits=Some(2)), Statistics(sizeInBytes=40.0 B, rowCount=5)
```
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#33100 from wangyum/SPARK-35906.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Replace the type collection `AllTimestampTypes` with the new data type `AnyTimestampType`
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33115#discussion_r659866760, it is more convenient to have a new data type "AnyTimestampType" instead of using type collection `AllTimestampTypes`:
1. simplify the pattern match
2. In the default type coercion rules, when implicit casting a type to a TypeCollection type, Spark chooses the first convertible data type as the result. If we are going to make the default timestamp type configurable, having AnyTimestampType is better
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes#33129 from gengliangwang/allTimestampTypes.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Support the following operations:
- TimestampWithoutTZ - Date
- Date - TimestampWithoutTZ
- TimestampWithoutTZ - Timestamp
- Timestamp - TimestampWithoutTZ
- TimestampWithoutTZ - TimestampWithoutTZ
For subtraction between `TimestampWithoutTZ` and `Timestamp`, the `Timestamp` column is cast as TimestampWithoutTZType.
### Why are the changes needed?
Support basic subtraction among Date/Timestamp/TimestampWithoutTZ.
### Does this PR introduce _any_ user-facing change?
No, the timestamp without time zone type is not released yet.
### How was this patch tested?
Unit tests
Closes#33115 from gengliangwang/subtractTimestampWithoutTz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR addresses post-review comments on PR #33096:
- removes `private[sql]` modifier
- removes the option to pass a resolver to simplify the API
### Why are the changes needed?
These changes are needed to simplify the utility API.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33120 from aokolnychyi/spark-35899-follow-up.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that field names of structs generated by `arrays_zip` function could be unexpectedly re-written by analyzer/optimizer.
Here is an example.
```
val df = sc.parallelize(Seq((Array(1, 2), Array(3, 4)))).toDF("a1", "b1").selectExpr("arrays_zip(a1, b1) as zipped")
df.printSchema
root
|-- zipped: array (nullable = true)
| |-- element: struct (containsNull = false)
| | |-- a1: integer (nullable = true) // OK. a1 is expected name
| | |-- b1: integer (nullable = true) // OK. b1 is expected name
df.explain
== Physical Plan ==
*(1) Project [arrays_zip(_1#3, _2#4) AS zipped#12] // Not OK. field names are re-written as _1 and _2 respectively
df.write.parquet("/tmp/test.parquet")
val df2 = spark.read.parquet("/tmp/test.parquet")
df2.printSchema
root
|-- zipped: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _1: integer (nullable = true) // Not OK. a1 is expected but got _1
| | |-- _2: integer (nullable = true) // Not OK. b1 is expected but got _2
```
This issue happens when aliases are eliminated by `AliasHelper.replaceAliasButKeepName` or `AliasHelper.trimNonTopLevelAliases` called via analyzer/optimizer
b89cd8d75a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (L883)
b89cd8d75a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L3759)
I investigated functions which can be affected by this issue, but I have only found `arrays_zip` so far.
To fix this issue, this PR changes the definition of `ArraysZip` to retain field names to avoid being re-written by analyzer/optimizer.
### Why are the changes needed?
This is apparently a bug.
### Does this PR introduce _any_ user-facing change?
No. After this change, the field names are no longer re-written, which should be the behavior users expect.
### How was this patch tested?
New tests.
Closes#33106 from sarutak/arrays-zip-retain-names.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to migrate the following `ALTER TABLE ... CHANGE COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).
### Does this PR introduce _any_ user-facing change?
After this PR, the above `ALTER TABLE ... CHANGE COLUMN` commands will have a consistent resolution behavior.
### How was this patch tested?
Updated existing tests.
Closes#33113 from imback82/alter_change_column.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Implement the V2 execution node `ShowCreateTableExec`, similar to the V1 `ShowCreateTableCommand`
2. `SHOW CREATE TABLE XXX AS SERDE` is not supported
### Why are the changes needed?
[SPARK-33898](https://issues.apache.org/jira/browse/SPARK-33898)
### Does this PR introduce _any_ user-facing change?
Yes. Users can execute the `SHOW CREATE TABLE` command on V2 tables.
### How was this patch tested?
Added two UT tests and verified with:
1. ./dev/scalastyle
2. Running DataSourceV2SQLSuite
Closes#32931 from Peng-Lei/SPARK-33898.
Lead-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
[SPARK-35728](https://issues.apache.org/jira/browse/SPARK-35728): Add test case to check multiply/divide of day-time intervals of any fields by numeric
[SPARK-35778](https://issues.apache.org/jira/browse/SPARK-35778): Add test case to check multiply/divide of year-month intervals of any fields by numeric
### Why are the changes needed?
Improve test coverage
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add ut tests
Lead-authored-by: Lei Peng <peng.8lei@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Closes#33080 from Peng-Lei/SPARK-35728-35778.
Lead-authored-by: PengLei <peng.8lei@gmail.com>
Co-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR groups exception messages in sql/catalyst/src/main/scala/org/apache/spark/sql (except catalyst).
### Why are the changes needed?
It will largely help with standardization of error messages and their maintenance.
### Does this PR introduce any user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32916 from dgd-contributor/SPARK-35064_catalyst_group_error.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch fixes `PromotePrecision`, which overrides `genCode` where subexpression elimination should happen.
### Why are the changes needed?
`PromotePrecision` overrides `genCode` where subexpression elimination should happen. So if it is the top-most expression of a subexpression, it is never replaced.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added test.
Closes#33103 from viirya/fix-precision.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
1. Make `RebalancePartitions` extend `RepartitionOperation`.
2. Make `CollapseRepartition` support `RebalancePartitions`.
### Why are the changes needed?
`CollapseRepartition` can optimize `RebalancePartitions` if possible.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#33099 from wangyum/SPARK-35904.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Support the following operation:
- TimestampWithoutTZ - Year-Month interval
The following operations are already supported as of https://github.com/apache/spark/pull/33076/; this PR adds end-to-end tests for them:
- TimestampWithoutTZ - Calendar interval
- TimestampWithoutTZ - Daytime interval
### Why are the changes needed?
Support subtracting all 3 interval types from a timestamp without time zone
### Does this PR introduce _any_ user-facing change?
No, the timestamp without time zone type is not released yet.
### How was this patch tested?
Unit tests
Closes#33086 from gengliangwang/subtract.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds a utility to convert public connector expressions to Catalyst expressions.
Notable differences:
- Switched to `QueryCompilationErrors` from an explicit `AnalysisException`.
- Decoupled the resolving logic for v2 references into separate methods to use in other places.
### Why are the changes needed?
These changes are needed as more and more places require this logic and it is better to implement it in a single place.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33096 from aokolnychyi/spark-35899.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Support the following operations:
- TimestampWithoutTZ + Calendar interval
- TimestampWithoutTZ + Year-Month interval
- TimestampWithoutTZ + Daytime interval
### Why are the changes needed?
Support basic '+' operator for timestamp without time zone type.
### Does this PR introduce _any_ user-facing change?
No, the timestamp without time zone type is not released yet.
### How was this patch tested?
Unit tests
Closes#33076 from gengliangwang/addForNewTS.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes to migrate the following `ALTER TABLE ... RENAME COLUMN` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).
### Does this PR introduce _any_ user-facing change?
After this PR, the above `ALTER TABLE ... RENAME COLUMN` commands will have a consistent resolution behavior.
### How was this patch tested?
Updated existing tests.
Closes#33066 from imback82/alter_rename.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR changes the behavior of the unit-to-unit interval syntax to prohibit the case that the same unit is specified for FROM and TO.
### Why are the changes needed?
For ANSI compliance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#33057 from sarutak/prohibit-unit-pattern.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR changes the unionByName with null filling logic to append new nested struct fields from the right side of the union to the schema versus sorting fields alphabetically. It removes the need to use UpdateField expressions, and just directly projects new nested structs from each side of the union with the correct schema. This changes the union'd schema from being alphabetically sorted previously to now "left dominant", where the fields from the left side of the union are included and then the missing ones from the right are added in the same order found originally.
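For illustration, a minimal spark-shell sketch of the null-filling union this change reorders (column and field names are made up):
```scala
import spark.implicits._
import org.apache.spark.sql.functions.{lit, struct}

val left  = Seq(1).toDF("id").withColumn("s", struct(lit(1).as("a"), lit(2).as("b")))
val right = Seq(2).toDF("id").withColumn("s", struct(lit(3).as("b"), lit(4).as("c")))

// Missing nested fields are null-filled; after this change the result keeps the
// left side's field order (a, b) and appends the right-only field (c) at the end,
// instead of sorting all struct fields alphabetically.
left.unionByName(right, allowMissingColumns = true).printSchema()
```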
### Why are the changes needed?
Certain nested structs would cause unionByName with null filling to error out due to part of the logic for rewriting the expression tree to sort the structs.
### Does this PR introduce _any_ user-facing change?
Yes, nested struct fields will be in a different order after unionByName with null filling than before, though shouldn't cause much effective difference.
### How was this patch tested?
Updated existing tests based on the new StructField ordering and added a new test for the case that was broken originally.
Closes#33040 from Kimahriman/union-by-name-struct-order.
Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to migrate the following `ALTER TABLE ... DROP COLUMNS` command to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
### Why are the changes needed?
This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).
### Does this PR introduce _any_ user-facing change?
After this PR, the above `ALTER TABLE ... DROP COLUMNS` commands will have a consistent resolution behavior.
### How was this patch tested?
Updated existing tests.
Closes#32854 from imback82/alter_alternative.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, Literal.create(data, dataType) for Period to YearMonthIntervalType and Duration to DayTimeIntervalType is not correct.
If the data type is Period/Duration, it creates the converter of the default YearMonthIntervalType/DayTimeIntervalType, so the result is not correct. This PR fixes this bug.
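A hedged sketch of the affected call (internal Catalyst API; the interval values are illustrative only):
```scala
import java.time.{Duration, Period}
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{DayTimeIntervalType, YearMonthIntervalType}

// Before the fix, the converter for the default YearMonthIntervalType()/DayTimeIntervalType()
// was used even when a narrower field combination was requested as the target type.
val ym = Literal.create(Period.ofYears(1), YearMonthIntervalType(YearMonthIntervalType.YEAR))
val dt = Literal.create(Duration.ofHours(26), DayTimeIntervalType(DayTimeIntervalType.HOUR))
```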
### Why are the changes needed?
Fix a bug when using Literal.create().
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#33056 from AngersZhuuuu/SPARK-35871.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Revert 8a1995f936
### Why are the changes needed?
The merged test doesn't actually check different interval fields. We need to apply https://github.com/apache/spark/pull/33056 first.
### Does this PR introduce _any_ user-facing change?
No. This is tests.
### How was this patch tested?
By existing GAs.
Closes#33060 from MaxGekk/revert-Peng-Lei-SPARK-35728.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Revert 3904c0edba
### Why are the changes needed?
The merged test doesn't actually check different interval fields. We need to apply https://github.com/apache/spark/pull/33056 first.
### Does this PR introduce _any_ user-facing change?
No. This is tests.
### How was this patch tested?
By existing GAs.
Closes#33059 from MaxGekk/revert-Peng-Lei-SPARK-35778.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Check multiply/divide of year-month intervals of any fields by numeric.
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Expanded existing test cases.
Closes#33051 from Peng-Lei/SPARK-35778.
Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
1. The existing test case only covers DayTimeIntervalType() multiplied/divided by numeric.
2. Add test cases for the following intervals multiplied/divided by numeric:
INTERVAL DAY
INTERVAL DAY TO HOUR
INTERVAL DAY TO MINUTE
INTERVAL HOUR
INTERVAL HOUR TO MINUTE
INTERVAL HOUR TO SECOND
INTERVAL MINUTE
INTERVAL MINUTE TO SECOND
INTERVAL SECOND
### Why are the changes needed?
Improve test case coverage.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Existing test cases.
Closes#33014 from Peng-Lei/SPARK-35728.
Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
* Add a new repartition operator `RebalanceRepartition`.
* Support a new hint `REBALANCE`
After this patch, users can run this query:
```sql
SELECT /*+ REBALANCE(c) */ * FROM t
```
### Why are the changes needed?
Add a new hint to distinguish if we can optimize it safely.
This new hint can let AQE optimize with `CustomShuffleReaderExec` safely. Currently, AQE can only coalesce shuffle partitions but can not expand shuffle partitions due to the semantics of output partitioning.
Let's say we have a query:
```sql
SELECT /*+ REPARTITION(col) */ * FROM t
```
AQE can not expand the shuffle partitions even if `col` is skewed because expanding shuffle partitions would break the hashed output partitioning of `RepartitionByExpression`. But if the query uses `REPARTITION_BY_AQE`, AQE can optimize it without considering the semantics of output partitioning.
### Does this PR introduce _any_ user-facing change?
Yes, a new hint.
### How was this patch tested?
Add test.
Closes#32932 from ulysses-you/SPARK-35786.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
… sum of the digits is greater than 38
### What changes were proposed in this pull request?
Since Spark 3.1.1, NULL is returned when casting a string with many decimal places to a decimal type. If the sum of the digits before and after the decimal point is less than 39, a value is returned. From 39 digits, however, NULL is returned.
This worked until Spark 3.0.X.
Code to reproduce:
A string with 2 digits in front of the decimal point and 37 digits after the decimal point returns null:
```
val data = Seq(
"28.9259999999999983799625624669715762138",
"28.925999999999998379962562466971576213",
"2.9259999999999983799625624669715762138"
)
val df = data.toDF("num")
df.withColumn("numConverted", col("num").cast("decimal(38, 5)")).show()
```
before this pull request, the result is
```
+--------------------+------------+
|                 num|numConverted|
+--------------------+------------+
|28.92599999999999...|        null|
|28.92599999999999...|    28.92600|
|2.925999999999998...|     2.92600|
+--------------------+------------+
```
the correct result should be
```
+--------------------+------------+
|                 num|numConverted|
+--------------------+------------+
|28.92599999999999...|    28.92600|
|28.92599999999999...|    28.92600|
|2.925999999999998...|     2.92600|
+--------------------+------------+
```
The problem has occurred since https://issues.apache.org/jira/browse/SPARK-32706, because the fast-fail check compares against the precision length, while it should only check the length of the whole-number part of the input value, not the precision.
### Why are the changes needed?
correctness
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
test added
Closes#33011 from dgd-contributor/SPARK-35841_castStringToDecimalTypeError.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Simplify the condition code which is introduced by [SPARK-35282](https://issues.apache.org/jira/browse/SPARK-35282).
### Why are the changes needed?
Reduce the code size and make code more readable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass CI
Closes#33046 from ulysses-you/simplify-shj.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support taking the year-month interval fields into account in cast.
##### Rule cast to target YearMonthIntervalType
| string | demo | strict target type | months |
|---|---|---|---|
| [+\|-]y-m | 1-1 | YearMonthIntervalType(YEAR, MONTH) | 13 |
| [+\|-]y| 1 | YearMonthIntervalType(YEAR, YEAR) | 12 |
| [+\|-]m | 1 | YearMonthIntervalType(MONTH, MONTH) | 1 |
| INTERVAL [+\|-]'[+\|-]y-m' YEAR TO MONTH | interval '1-1' year to month | YearMonthIntervalType(YEAR, MONTH) | 13 |
| INTERVAL [+\|-]'[+\|-]m' MONTH | interval '1' month | YearMonthIntervalType(MONTH, MONTH) | 1 |
| INTERVAL [+\|-]'[+\|-]y' YEAR | interval '1' year | YearMonthIntervalType(YEAR, YEAR) | 12 |
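A hedged spark-shell sketch of the cast rules in the table above (assuming the ANSI interval type syntax added to the parser; values illustrative):
```scala
// Cast a year-month string to a specific field combination; the "months" column in
// the table above shows the expected internal value.
spark.sql("SELECT CAST('1-1' AS INTERVAL YEAR TO MONTH)").show()
spark.sql("SELECT CAST('1' AS INTERVAL YEAR)").show()
```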
### Why are the changes needed?
Support taking the year-month interval fields into account in cast.
### Does this PR introduce _any_ user-facing change?
Users can use `cast(str, YearMonthInterval(YEAR, YEAR))`, etc.
### How was this patch tested?
Added UT
Closes#32940 from AngersZhuuuu/SPARK-35768.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `DateAdd` to implement `DateType` `+`/`-` `INTERVAL DAY`.
### Why are the changes needed?
To improve the implementation of `DateType` `+`/`-` `INTERVAL DAY`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests.
Closes#33033 from Peng-Lei/SPARK-35852.
Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Extend the `TransposeWindow` rule to transpose `Window` nodes that have a `Project` between them.
### Why are the changes needed?
The analyzer will turn a `dataset.withColumn("colName", expressionWithWindowFunction)` method call to a `Project - Window - Project` chain in the logical plan. When this method is called multiple times in a row, then the projects can block the `Window` nodes from being transposed by the current `TransposeWindow` rule.
TPCDS q47 and q57 are also improved by this.
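For illustration, a minimal spark-shell sketch of the pattern that now benefits (column names are made up):
```scala
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val df = Seq((1, 1, 10), (1, 2, 20), (2, 1, 30)).toDF("a", "b", "v")
// Each withColumn produces a Project - Window - Project chain; with the extended rule
// the two Window nodes can still be transposed even though a Project sits between them.
df.withColumn("sumByAB", sum($"v").over(Window.partitionBy($"a", $"b")))
  .withColumn("sumByA", sum($"v").over(Window.partitionBy($"a")))
  .explain(true)
```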
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#31980 from tanelk/SPARK-34807_transpose_window.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
Make the ANSI flag part of expression `Cast`'s parameter list, instead of fetching it from the sessional SQLConf.
### Why are the changes needed?
For views, it is important to show consistent results even when the ANSI configuration is different in the running session. This is why many expressions like `Add`/`Divide` make the ANSI flag part of their case class parameter lists.
We should make the expression `Cast` consistent with them.
### Does this PR introduce _any_ user-facing change?
Yes, the `Cast` inside a view always behaves the same, independent of the ANSI mode SQL configuration in the current session.
### How was this patch tested?
Existing UT
Closes#33027 from gengliangwang/ansiFlagInCast.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Support UpCast between different fields of YearMonthIntervalType/DayTimeIntervalType.
### Why are the changes needed?
Since our encoders handle Period/Duration as the default YearMonthIntervalType/DayTimeIntervalType, when we use a UDF to handle these types, columns of any field combination of YearMonthIntervalType/DayTimeIntervalType are up-cast to the default one, so we need to support this.
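A hedged spark-shell sketch of the UDF path involved (assuming the 3.2 encoders for java.time.Period; names and values are illustrative):
```scala
import java.time.Period
import spark.implicits._
import org.apache.spark.sql.functions.{col, udf}

// The Scala encoder maps Period to the default YearMonthIntervalType(YEAR, MONTH),
// so an input column with a narrower field combination must be up-cast to the
// default type before it can feed the UDF, which is what this change allows.
val addYear = udf((p: Period) => p.plusYears(1))
Seq(Period.ofMonths(14)).toDF("p").select(addYear(col("p"))).show()
```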
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added Ut
Closes#33035 from AngersZhuuuu/SPARK-35860.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR unifies reuse map data structures in non-AQE and AQE rules to a simple `Map[<canonicalized plan>, <plan>]` based on the discussion here: https://github.com/apache/spark/pull/28885#discussion_r655073897
### Why are the changes needed?
The proposed `Map[<canonicalized plan>, <plan>]` is simpler than the currently used `Map[<schema>, ArrayBuffer[<plan>]]` in `ReuseMap`/`ReuseExchangeAndSubquery` (non-AQE) and consistent with the `ReuseAdaptiveSubquery` (AQE) subquery reuse rule.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#33021 from peter-toth/SPARK-35855-unify-reuse-map-data-structures.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current OuterReference resolution is a bit weird: when the outer plan has more than one child, it resolves OuterReference from the output of each child, one by one, left to right.
This is incorrect in the case of join, as the column name can be ambiguous if both left and right sides output this column.
This PR fixes this bug by resolving OuterReference with `outerPlan.resolveChildren`, instead of something like `outerPlan.children.foreach(_.resolve(...))`
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
The problem only occurs in join, and join condition doesn't support correlated subquery yet. So this PR only improves the error message. Before this PR, people see
```
java.lang.UnsupportedOperationException
Cannot generate code for expression: outer(t1a#291)
```
### How was this patch tested?
a new test
Closes#33004 from cloud-fan/outer-ref.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that `IntervalUtils.toDayTimeIntervalString` doesn't consider the case that a day-time interval type is casted as another day-time interval type.
If data of `interval day to second` is casted as `interval hour to second`, the value of the day is multiplied by 24 and added to the value of the hour. For example, `INTERVAL '1 2' DAY TO HOUR` will be `INTERVAL '26' HOUR` if it's casted.
If this behavior is intended, it should be stringified as `INTERVAL '26' HOUR`, but currently it will be `INTERVAL '2' HOUR`.
### Why are the changes needed?
It's a bug if the behavior of cast is intended.
### Does this PR introduce _any_ user-facing change?
No, because this feature is not released yet.
### How was this patch tested?
Modified the tests added in SPARK-35734 (#32891)
Closes#33031 from sarutak/fix-toDayTimeIntervalString.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
There are a few test cases that are supposed to be in CastSuiteBase instead of CastSuite:
- SPARK-35112: Cast string to day-time interval
- SPARK-35111: Cast string to year-month interval
- SPARK-35820: Support cast DayTimeIntervalType in different fields
- SPARK-35819: Support cast YearMonthIntervalType in different fields
This PR is to move them to CastSuiteBase. Also, it adds comments for the scope of CastSuiteBase/CastSuite/AnsiCastSuiteBase.
### Why are the changes needed?
Increase test coverage so that we can test the casting under ANSI mode.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes#33022 from gengliangwang/moveTest.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
When SQL function `to_timestamp_ntz` has invalid format pattern input, throw a runtime exception with hints for the valid patterns, instead of throwing an upgrade exception with suggestions to use legacy formatters.
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/32995/files#r655148980, there is an error message saying
"You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyy-MM-dd GGGGG' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0"
This is not true for the function to_timestamp_ntz, which only uses the Iso8601TimestampFormatter and was added in Spark 3.2. We should improve it.
### Does this PR introduce _any_ user-facing change?
No, the new SQL function is not released yet.
### How was this patch tested?
Unit test
Closes#33019 from gengliangwang/improveError.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
1. Change the return value type of SubtractDates from DayTimeIntervalType(DAY, SECOND) to DayTimeIntervalType(DAY, DAY).
### Why are the changes needed?
https://issues.apache.org/jira/browse/SPARK-35727
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes#32999 from Peng-Lei/SPARK-35727.
Lead-authored-by: Lei Peng <peng.8lei@gmail.com>
Co-authored-by: PengLei <18066542445@189.cn>
Co-authored-by: Peng-Lei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Implement new SQL function: `to_timestamp_ntz`.
The syntax is similar to the built-in function `to_timestamp`:
```
to_timestamp_ntz ( <date_expr> )
to_timestamp_ntz ( <timestamp_expr> )
to_timestamp_ntz ( <string_expr> [ , <format> ] )
```
The naming is from snowflake: https://docs.snowflake.com/en/sql-reference/functions/to_timestamp.html
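A minimal spark-shell sketch of the new function (input strings and the format pattern are made up):
```scala
// Parses into a timestamp without time zone; the optional format argument follows
// the same datetime pattern rules as to_timestamp.
spark.sql("SELECT to_timestamp_ntz('2021-06-25 10:11:12') AS ts").show()
spark.sql("SELECT to_timestamp_ntz('25/06/2021 10:11:12', 'dd/MM/yyyy HH:mm:ss') AS ts").show()
```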
### Why are the changes needed?
Adds a new SQL function to create a literal/column of timestamp without time zone.
It's convenient for both end-users and developers.
### Does this PR introduce _any_ user-facing change?
Yes, a new SQL function `to_timestamp_ntz`.
### How was this patch tested?
Unit tests
Closes#32995 from gengliangwang/toTimestampNtz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Extend the `CollapseWindow` rule to collapse `Window` nodes that have a `Project` between them.
### Why are the changes needed?
The analyzer will turn a `dataset.withColumn("colName", expressionWithWindowFunction)` method call to a `Project - Window - Project` chain in the logical plan. When this method is called multiple times in a row, then the projects can block the `Window` nodes from being collapsed by the current `CollapseWindow` rule.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#31677 from tanelk/SPARK-34565_collapse_windows.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to add 2 new methods that accept one field and produce either `YearMonthIntervalType` or `DayTimeIntervalType`.
### Why are the changes needed?
To improve code maintenance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By existing test suites.
Closes#32997 from MaxGekk/ansi-interval-types-single-field.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support Cast between different fields of DayTimeIntervalType.
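A hedged spark-shell sketch (assuming the ANSI day-time interval literal and type syntax; values illustrative):
```scala
// Casting to a narrower field combination drops the trailing fields,
// e.g. DAY TO SECOND narrowed to DAY TO MINUTE.
spark.sql("SELECT CAST(INTERVAL '1 02:03:04' DAY TO SECOND AS INTERVAL DAY TO MINUTE)").show()
```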
### Why are the changes needed?
Make it convenient for users to get DayTimeIntervalType with different fields.
### Does this PR introduce _any_ user-facing change?
Users can cast DayTimeIntervalType(DAY, SECOND) to DayTimeIntervalType(DAY, MINUTE), etc.
### How was this patch tested?
Added UT
Closes#32975 from AngersZhuuuu/SPARK-35820.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes the error message shown when changing a column type to a year-month/day-time interval type is attempted.
### Why are the changes needed?
It's for consistent behavior.
Updating column types to interval types is prohibited for V2 source tables.
So, if we attempt to update the type of a column to the conventional interval type, an error message like `Error in query: Cannot update <table> field <column> to interval type;` is shown.
But for year-month/day-time interval types, another error message like `Error in query: Cannot update <table> field <column>:<type> cannot be cast to interval year;` is shown.
You can reproduce with the following procedure.
```
$ bin/spark-sql
spark-sql> SET spark.sql.catalog.mycatalog=<a catalog implementation class>;
spark-sql> CREATE TABLE mycatalog.t1(c1 int) USING <V2 datasource implementation class>;
spark-sql> ALTER TABLE mycatalog.t1 ALTER COLUMN c1 TYPE interval year to month;
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified an existing test.
Closes#32978 from sarutak/err-msg-interval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue that `IntervalUtils.toYearMonthIntervalString` doesn't consider the case that year-month interval type is casted as month interval type.
If a year-month interval data is casted as month interval, the value of the year is multiplied by `12` and added to the value of month. For example, `INTERVAL '1-2' YEAR TO MONTH` will be `INTERVAL '14' MONTH` if it's casted.
If this behavior is intended, it should be stringified as `INTERVAL '14' MONTH`, but currently it will be `INTERVAL '2' MONTH`.
### Why are the changes needed?
It's a bug if the behavior of cast is intended.
### Does this PR introduce _any_ user-facing change?
No, because this feature is not released yet.
### How was this patch tested?
Modified the tests added in SPARK-35771 (#32924).
Closes#32982 from sarutak/fix-toYearMonthIntervalString.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support Cast between different fields of YearMonthIntervalType.
### Why are the changes needed?
Make it convenient for users to get YearMonthIntervalType with different fields.
### Does this PR introduce _any_ user-facing change?
Users can cast YearMonthIntervalType(YEAR, MONTH) to YearMonthIntervalType(YEAR, YEAR), etc.
### How was this patch tested?
Added UT
Closes#32974 from AngersZhuuuu/SPARK-35819.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Support truncate java.time.Duration by fields of day-time interval type.
### Why are the changes needed?
To respect fields of the target day-time interval types.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32950 from AngersZhuuuu/SPARK-35726.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This patch proposes to add an internal config for ignoring metadata of `FileStreamSink` when reading the output path.
### Why are the changes needed?
`FileStreamSink` produces a metadata directory which logs output files per micro-batch. When we read from the output path, Spark will look at the metadata and ignore other files not in the log.
Normally it works well. But for some use cases, we may need to ignore the metadata when reading the output path. For example, when we change the streaming query and must run it with a new checkpoint directory, we cannot use the previous metadata. If we create new metadata too, then when we read the output path later in Spark, Spark only reads the files listed in the new metadata, and the files written before we switched to the new checkpoint and metadata are ignored.
Although it seems we could write to a different output directory every time, that is a bad idea as we would produce many directories unnecessarily.
We need a config for ignoring the metadata of `FileStreamSink` when reading the output path.
### Does this PR introduce _any_ user-facing change?
Added a config for ignoring metadata of FileStreamSink when reading the output.
### How was this patch tested?
Unit tests.
Closes#32702 from viirya/ignore-metadata.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR improves `Distinct` statistics estimation by rewriting it to `Aggregate`.
### Why are the changes needed?
1. The current implementation will lack column statistics.
2. Some rules before the `ReplaceDistinctWithAggregate` may use it. For example: https://github.com/apache/spark/pull/31113/files#diff-11264d807efa58054cca2d220aae8fba644ee0f0f2a4722c46d52828394846efR1808
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#32291 from wangyum/SPARK-35185.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
Change `AQEPropagateEmptyRelation` from `transformUp` to `transformUpWithPruning`.
### Why are the changes needed?
To avoid unnecessary iteration during AQE optimizer.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass CI.
Closes#32742 from ulysses-you/aqe-transformUpWithPruning.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Support truncate java.time.Period by fields of year-month interval type
### Why are the changes needed?
To follow the SQL standard and respect the field restriction of the target year-month type.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32945 from AngersZhuuuu/SPARK-35769.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Extend the Cast expression and support StringType in casting to TimestampWithoutTZType.
Closes#32898
### Why are the changes needed?
To conform the ANSI SQL standard which requires to support such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32936 from gengliangwang/castStringToTswtz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up for SPARK-34382. It refines the lateral join syntax to only allow the LATERAL keyword to be in front of subqueries, instead of all `relationPrimary`. For example, `SELECT * FROM t1, LATERAL t2` should not be allowed.
### Why are the changes needed?
To be consistent with Postgres.
### Does this PR introduce _any_ user-facing change?
Yes. After this PR, the LATERAL keyword can only be in front of subqueries.
```scala
sql("SELECT * FROM t1, LATERAL t2")
org.apache.spark.sql.catalyst.parser.ParseException:
LATERAL can only be used with subquery(line 1, pos 26)
== SQL ==
select * from t1, lateral t2
--------------------------^^^
```
### How was this patch tested?
New unit tests.
Closes#32937 from allisonwang-db/spark-35789-lateral-join-parser.
Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`RelationConversions` is actually an optimization rule, while it's executed in the analysis phase.
Views are designed to only capture semantic configs, so we should ignore the optimization configs that will be used in the analysis phase.
This PR also fixes the issue that view resolution always uses the default value for uncaptured configs.
### Why are the changes needed?
Bugfix
### Does this PR introduce _any_ user-facing change?
Yes, after this PR view resolution will respect the values set in the current session for the below configs
```
"spark.sql.hive.convertMetastoreParquet"
"spark.sql.hive.convertMetastoreOrc"
"spark.sql.hive.convertInsertingPartitionedTable"
"spark.sql.hive.convertMetastoreCtas"
```
### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveSQLViewSuite"
```
Closes#32941 from linhongliu-db/SPARK-35792-ignore-convert-configs.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support Parse DayTimeIntervalType from JSON
### Why are the changes needed?
This will allow storing day-time intervals as table columns in the Hive external catalog.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32930 from AngersZhuuuu/SPARK-35732.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Provide a new function make_dt_interval to construct DayTimeIntervalType values.
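A minimal spark-shell sketch of the new function (assuming the arguments are days, hours, minutes, and seconds; values illustrative):
```scala
// Builds a day-time interval value from its components.
spark.sql("SELECT make_dt_interval(1, 12, 30, 1.5) AS dt").show(truncate = false)
```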
### Why are the changes needed?
As the JIRA describes, we should provide a function to construct DayTimeIntervalType values.
### Does this PR introduce _any_ user-facing change?
Yes, a new make_dt_interval function is provided.
### How was this patch tested?
Updated UTs, manual testing
Closes#32601 from copperybean/work.
Authored-by: copperybean <copperybean.zhang@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Parse YearMonthIntervalType from JSON.
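A hedged sketch of the round-trip this enables (internal DataType JSON API; the field combination is illustrative):
```scala
import org.apache.spark.sql.types.{DataType, YearMonthIntervalType}

val ym = YearMonthIntervalType(YearMonthIntervalType.YEAR, YearMonthIntervalType.MONTH)
// Serialize the type to its JSON string and parse it back; before this change
// DataType.fromJson could not handle the year-month interval type name.
assert(DataType.fromJson(ym.json) == ym)
```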
### Why are the changes needed?
This will allow storing year-month intervals as table columns in the Hive external catalog.
### Does this PR introduce _any_ user-facing change?
People can store year-month interval types as JSON strings.
### How was this patch tested?
Added UT.
Closes#32929 from AngersZhuuuu/SPARK-35770.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Currently, `ResolveAggregateFunctions` is a complicated rule that recursively calls the entire analyzer to resolve aggregate functions in parent nodes of aggregate. It's kind of necessary as we need to do many things to identify the aggregate function and push it down to the aggregate node: resolve columns as if they are in the aggregate node, resolve functions, apply type coercion, etc. However, this is overly complicated and it's hard to fully understand how the resolution is done there. It also leads to hacks such as the [char/varchar hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2396-L2401), [subquery hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2274-L2277), [grouping function hack](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala#L2465-L2467), etc.
This PR simplifies the `ResolveAggregateFunctions` rule and clarifies the resolution logic. To resolve aggregate functions/grouping columns in HAVING, ORDER BY and `df.where`, we expand the aggregate node below to output these required aggregate functions/grouping columns. In details, when resolving an expression from the parent of an aggregate node:
1. try to resolve columns with `agg.child` and wrap the result with `TempResolvedColumn`.
2. try to resolve subqueries with `agg.child`
3. if the expression is not resolved, return it and wait for other rules to resolve it, such as resolve functions, type coercions, etc.
4. if the expression is resolved, we transform it and push aggregate functions/grouping columns into the aggregate node below.
4.1 The expression may already be present in `agg.aggregateExpressions`; then we can simply replace the expression with an attribute reference.
4.2 If a `TempResolvedColumn` is neither inside an aggregate function nor wraps a grouping column, turn it back into an `UnresolvedAttribute`.
5. after the main resolution batch, remove all `TempResolvedColumn` and turn them back to `UnresolvedAttribute`.
### Why are the changes needed?
Code cleanup
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing test
Closes#32470 from cloud-fan/agg2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to format year-month interval to strings using the start and end fields of `YearMonthIntervalType`.
### Why are the changes needed?
Currently, they are ignored, and any `YearMonthIntervalType` is formatted as `INTERVAL YEAR TO MONTH`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#32924 from sarutak/year-month-interval-format.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR extends the parser rules to be able to parse the following types:
* INTERVAL YEAR
* INTERVAL YEAR TO MONTH
* INTERVAL MONTH
### Why are the changes needed?
For ANSI compliance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New assertion.
Closes#32922 from sarutak/parse-any-year-month.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR improves `Repartition` and `RepartitionByExpr` statistics estimation using child statistics.
### Why are the changes needed?
The current implementation will missing column stat. For example:
```sql
CREATE TABLE t1 USING parquet AS SELECT id % 10 AS key FROM range(100);
ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS;
set spark.sql.cbo.enabled=true;
EXPLAIN COST SELECT key FROM (SELECT key FROM t1 DISTRIBUTE BY key) t GROUP BY key;
```
Before this PR:
```
== Optimized Logical Plan ==
Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=1600.0 B)
+- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100)
+- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100)
```
After this PR:
```
== Optimized Logical Plan ==
Aggregate [key#2950L], [key#2950L], Statistics(sizeInBytes=160.0 B, rowCount=10)
+- RepartitionByExpression [key#2950L], Statistics(sizeInBytes=1600.0 B, rowCount=100)
+- Relation default.t1[key#2950L] parquet, Statistics(sizeInBytes=1600.0 B, rowCount=100)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#32309 from wangyum/SPARK-35203.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/31964
We should only quote the column name when nested column predicate pushdown is enabled, otherwise the data source side may not have the logic to parse the quoted column name and will fail. This was not a problem before #31964, as we didn't quote the column name if there was no dot in it. But #31964 changed that.
### Why are the changes needed?
fix a query failure
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new test
Closes#32807 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Extend `YearMonthIntervalType` to support interval fields. Valid interval field values:
- 0 (YEAR)
- 1 (MONTH)
After the changes, the following year-month interval types are supported:
1. `YearMonthIntervalType(0, 0)` or `YearMonthIntervalType(YEAR, YEAR)`
2. `YearMonthIntervalType(0, 1)` or `YearMonthIntervalType(YEAR, MONTH)`. **It is the default one**.
3. `YearMonthIntervalType(1, 1)` or `YearMonthIntervalType(MONTH, MONTH)`
Closes#32825
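For illustration, a minimal sketch constructing the three supported forms (assuming the `YEAR`/`MONTH` field constants on the companion object):
```scala
import org.apache.spark.sql.types.YearMonthIntervalType
import org.apache.spark.sql.types.YearMonthIntervalType.{MONTH, YEAR}

val yearOnly  = YearMonthIntervalType(YEAR, YEAR)    // INTERVAL YEAR
val default   = YearMonthIntervalType(YEAR, MONTH)   // INTERVAL YEAR TO MONTH (the default)
val monthOnly = YearMonthIntervalType(MONTH, MONTH)  // INTERVAL MONTH
```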
### Why are the changes needed?
In the current implementation, Spark supports only `interval year to month` but the SQL standard allows to specify the start and end fields. The changes will allow to follow ANSI SQL standard more precisely.
### Does this PR introduce _any_ user-facing change?
Yes but `YearMonthIntervalType` has not been released yet.
### How was this patch tested?
By existing test suites.
Closes#32909 from MaxGekk/add-fields-to-YearMonthIntervalType.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Add a new function to support constructing YearMonthIntervalType from integral fields.
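A minimal spark-shell sketch of the new function (assuming the arguments are years and months; values illustrative):
```scala
// Builds a year-month interval value from its components.
spark.sql("SELECT make_ym_interval(1, 2) AS ym").show(truncate = false)
```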
### Why are the changes needed?
Add a new function to support constructing YearMonthIntervalType from integral fields.
### Does this PR introduce _any_ user-facing change?
Yes, users can use `make_ym_interval` to construct YearMonthIntervalType from years/months integral fields.
### How was this patch tested?
Added UT
Closes#32645 from AngersZhuuuu/SPARK-35129.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Currently, the file CastSuite.scala becomes big: 2000 lines, 2 base classes, 4 test suites.
In my previous work of Timestamp without time zone, I planned to put new test cases in CastSuiteBase, but they were accidentally added in AnsiCastSuiteBase.
This PR is to break the file down into 3 files. It also moves the test cases about timestamp without time zone to the right base class.
### Why are the changes needed?
Make development and review easier.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests
Closes#32918 from gengliangwang/refactorCastSuite.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Extended `RemoveRedundantAggregates` to remove deduplicating aggregations before aggregations that ignore duplicates.
### Why are the changes needed?
Performance improvement.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Extending existing UT
Closes#32904 from tanelk/SPARK-33122_followup2_distinct_agg.
Authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/core/src/main/scala/org/apache/spark/sql/execution/streaming`.
### Why are the changes needed?
It will largely help with standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32880 from beliefer/SPARK-35056.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to override the typeName() method in TimestampWithoutTZType, and assign it a name according to the ANSI SQL standard
![image](https://user-images.githubusercontent.com/1097932/122013859-2cf50680-cdf1-11eb-9fcd-0ec1b59fb5c0.png)
### Why are the changes needed?
To improve Spark SQL user experience, and have readable types in error messages.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32915 from gengliangwang/typename.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Currently, there are some expressions that override `semanticEquals`, which makes it not symmetrical. Ideally, expressions should override `canonicalized` instead of `semanticEquals`.
This PR marks `semanticEquals` as final and implements `canonicalized` for the few expressions that overrode `semanticEquals` before.
### Why are the changes needed?
To avoid subtle bugs (I haven't found a real bug yet).
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
a new test
Closes#32885 from cloud-fan/attr.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup PR for SPARK-35736(#32893) and SPARK-35737(#32892).
This PR moves a common logic to `object DayTimeIntervalType`.
That logic is like `val strToFieldIndex = DayTimeIntervalType.dayTimeFields.map(i => DayTimeIntervalType.fieldToString(i) -> i).toMap`, a `Map` which maps each time unit name to the corresponding day-time field index.
### Why are the changes needed?
The same logic appeared in the changes for SPARK-35736 and SPARK-35737, so it is better to extract it as common logic rather than leave similar code scattered.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32905 from sarutak/followup-SPARK-35736-35737.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes `StreamingJoinHelper` to be able to handle day-time interval.
### Why are the changes needed?
In the current master, `StreamingJoinHelper.getStateValueWatermark` can't handle conditions which contain day-time interval literals.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New assertions added to `StreamingJoinHelperSuite`.
Closes#32896 from sarutak/streamingjoinhelper-daytime.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds a feature which parses day-time interval literals to the tightest type.
### Why are the changes needed?
To comply with the ANSI behavior.
For example, `INTERVAL '10 20:30' DAY TO MINUTE` should be parsed as `DayTimeIntervalType(DAY, MINUTE)` but not as `DayTimeIntervalType(DAY, SECOND)`.
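A quick way to observe the parsed type (a sketch assuming the built-in `typeof` function):
```scala
// Before this change the literal below would be typed as interval day to second;
// after it, typeof is expected to report the tightest type, interval day to minute.
spark.sql("SELECT typeof(INTERVAL '10 20:30' DAY TO MINUTE)").show(truncate = false)
```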
### Does this PR introduce _any_ user-facing change?
No because `DayTimeIntervalType` will be introduced in `3.2.0`.
### How was this patch tested?
New tests.
Closes#32892 from sarutak/tight-daytime-interval.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds a feature which allows the parser to parse any day-time interval type in SQL.
### Why are the changes needed?
To comply with ANSI standard, we additionally need to support the following types.
* INTERVAL DAY
* INTERVAL DAY TO HOUR
* INTERVAL DAY TO MINUTE
* INTERVAL HOUR
* INTERVAL HOUR TO MINUTE
* INTERVAL HOUR TO SECOND
* INTERVAL MINUTE
* INTERVAL MINUTE TO SECOND
* INTERVAL SECOND
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#32893 from sarutak/parse-any-day-time.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
1. Extend the Cast expression and support TimestampType in casting to TimestampWithoutTZType.
2. There was a mistake in casting TimestampWithoutTZType as TimestampType in https://github.com/apache/spark/pull/32864. The target value should be `sourceValue - timeZoneOffset` instead of being the same value.
### Why are the changes needed?
To conform to the ANSI SQL standard, which requires support for such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32878 from gengliangwang/timestampToTimestampWithoutTZ.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Using copy-on-write for `SQLConf.sqlConfEntries` and `SQLConf.staticConfKeys` to reduce contention in concurrent workloads.
### Why are the changes needed?
The global locks used to protect `SQLConf.sqlConfEntries` map and the `SQLConf.staticConfKeys` set can cause significant contention on the `SQLConf` instance in a concurrent setting.
Using copy-on-write versions should reduce the contention given that modifications to the configs are relatively rare.
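As an illustrative sketch of the copy-on-write idea (hypothetical names, not the actual `SQLConf` internals): writers clone the immutable map and swap it in atomically, so readers never take a lock.
```scala
import java.util.concurrent.atomic.AtomicReference

object ConfRegistry {
  // The whole map is replaced on write; readers see a consistent immutable snapshot.
  private val entries = new AtomicReference(Map.empty[String, String])

  def register(key: String, doc: String): Unit = {
    var done = false
    while (!done) {
      val current = entries.get()
      done = entries.compareAndSet(current, current + (key -> doc))
    }
  }

  // Readers simply dereference the current snapshot.
  def lookup(key: String): Option[String] = entries.get().get(key)
}
```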
Closes#32865 from haiyangsun-db/SPARK-35701.
Authored-by: Haiyang Sun <haiyang.sun@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR adds a feature which formats day-time intervals to strings using the start and end fields of `DayTimeIntervalType`.
### Why are the changes needed?
Currently, they are ignored, and any `DayTimeIntervalType` is formatted as `INTERVAL DAY TO SECOND`.
### Does this PR introduce _any_ user-facing change?
Yes. The format of day-time intervals is determined by the start and end fields.
### How was this patch tested?
New test.
Closes#32891 from sarutak/interval-format.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Extend DayTimeIntervalType to support interval fields. Valid interval field values:
- 0 (DAY)
- 1 (HOUR)
- 2 (MINUTE)
- 3 (SECOND)
After the changes, the following day-time interval types are supported (a construction sketch follows the list):
1. `DayTimeIntervalType(0, 0)` or `DayTimeIntervalType(DAY, DAY)`
2. `DayTimeIntervalType(0, 1)` or `DayTimeIntervalType(DAY, HOUR)`
3. `DayTimeIntervalType(0, 2)` or `DayTimeIntervalType(DAY, MINUTE)`
4. `DayTimeIntervalType(0, 3)` or `DayTimeIntervalType(DAY, SECOND)`. **It is the default one**. The second fraction precision is microseconds.
5. `DayTimeIntervalType(1, 1)` or `DayTimeIntervalType(HOUR, HOUR)`
6. `DayTimeIntervalType(1, 2)` or `DayTimeIntervalType(HOUR, MINUTE)`
7. `DayTimeIntervalType(1, 3)` or `DayTimeIntervalType(HOUR, SECOND)`
8. `DayTimeIntervalType(2, 2)` or `DayTimeIntervalType(MINUTE, MINUTE)`
9. `DayTimeIntervalType(2, 3)` or `DayTimeIntervalType(MINUTE, SECOND)`
10. `DayTimeIntervalType(3, 3)` or `DayTimeIntervalType(SECOND, SECOND)`
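A small construction sketch for the types above, using the field constants from the `DayTimeIntervalType` companion object:
```scala
import org.apache.spark.sql.types.DayTimeIntervalType
import org.apache.spark.sql.types.DayTimeIntervalType.{DAY, MINUTE, SECOND}

// Pick a start and end field to get one of the types listed above.
val dayToMinute = DayTimeIntervalType(DAY, MINUTE)    // same as DayTimeIntervalType(0, 2)
val secondOnly  = DayTimeIntervalType(SECOND, SECOND) // same as DayTimeIntervalType(3, 3)
```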
### Why are the changes needed?
In the current implementation, Spark supports only `interval day to second`, but the SQL standard allows specifying the start and end fields. The changes allow Spark to follow the ANSI SQL standard more precisely.
### Does this PR introduce _any_ user-facing change?
Yes but `DayTimeIntervalType` has not been released yet.
### How was this patch tested?
By existing test suites.
Closes#32849 from MaxGekk/day-time-interval-type-units.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
The STRUCT type syntax is defined like this:
STRUCT(fieldName: fieldType [NOT NULL] [COMMENT stringLiteral] [, ...])
So the field list is nearly the same as a column list.
If we could make ':' optional, it would be much cleaner and less proprietary.
### Why are the changes needed?
ease of use
### Does this PR introduce _any_ user-facing change?
Yes, the STRUCT field list can now be written almost like a column list, since ':' is optional.
### How was this patch tested?
unit tests
Closes#32858 from jerqi/master.
Authored-by: RoryQi <1242949407@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of #32586. We introduced `ExpressionContainmentOrdering` to sort common expressions according to their parent-child relations. For unrelated expressions, the ordering previously returned -1, which is not correct and can possibly lead to a transitivity issue.
### Why are the changes needed?
To fix the possible transitivity issue of `ExpressionContainmentOrdering`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#32870 from viirya/SPARK-35439-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Extend the Cast expression and support DateType in casting to TimestampWithoutTZType.
### Why are the changes needed?
To conform to the ANSI SQL standard, which requires support for such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32873 from gengliangwang/dateToTswtz.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Extend the Cast expression and support TimestampWithoutTZType in casting to DateType.
### Why are the changes needed?
To conform to the ANSI SQL standard, which requires support for such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32869 from gengliangwang/castToDate.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Extend the Cast expression and support TimestampWithoutTZType in casting to TimestampType.
### Why are the changes needed?
To conform to the ANSI SQL standard, which requires support for such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32864 from gengliangwang/castToTimestamp.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
#31637 removed the usage of `CheckAnalysis.checkAlterTablePartition` but didn't remove the function.
### Why are the changes needed?
To remove an unused function.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32855 from imback82/SPARK-34524-followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Use `UnresolvedHint.resolved = child.resolved` instead of `UnresolvedHint.resolved = false`, so that a plan containing an `UnresolvedHint` child can be optimized by rules in the `Resolution` batch.
For instance, before this PR, the following plan can't be optimized by `ResolveReferences`.
```
!'Project [*]
+- SubqueryAlias __auto_generated_subquery_name
+- UnresolvedHint use_hash
+- Project [42 AS 42#10]
+- OneRowRelation
```
### Why are the changes needed?
Fix a bug with hints in subqueries.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#32841 from cfmcgrady/SPARK-35673.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### Why are the changes needed?
With Long.MinValue cast to an instant, secs will be floored in the function microsToInstant and cause an overflow when multiplied with MICROS_PER_SECOND:
```
def microsToInstant(micros: Long): Instant = {
  val secs = Math.floorDiv(micros, MICROS_PER_SECOND)
  // Unfolded Math.floorMod(us, MICROS_PER_SECOND) to reuse the result of
  // the above calculation of `secs` via `floorDiv`.
  val mos = micros - secs * MICROS_PER_SECOND // <- it will overflow here
  Instant.ofEpochSecond(secs, mos * NANOS_PER_MICROS)
}
```
But the overflow is acceptable because it won't produce any change to the result.
However, when converting the instant back to a micros value, it will raise an overflow error:
```
def instantToMicros(instant: Instant): Long = {
  val us = Math.multiplyExact(instant.getEpochSecond, MICROS_PER_SECOND) // <- it overflows here
  val result = Math.addExact(us, NANOSECONDS.toMicros(instant.getNano))
  result
}
```
Code to reproduce this error
```
instantToMicros(microsToInstant(Long.MinValue))
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Test added
Closes#32839 from dgd-contributor/SPARK-35679_instantToMicro.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Add an `internalRegisterFunction` for the built-in function registry, so that we can skip unnecessary function normalization.
### Why are the changes needed?
small refactor
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing ut
Closes#32842 from linhongliu-db/function-refactor.
Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR changes an occurrence of `Seq` to `collections.Seq` in `NestedColumnAliasing`.
### Why are the changes needed?
In the current master, `NestedColumnAliasing` doesn't work with Scala 2.13 and the relevant tests fail.
The following are examples.
* `NestedColumnAliasingSuite`
* Subclasses of `SchemaPruningSuite`
* `ColumnPruningSuite`
```
NestedColumnAliasingSuite:
[info] - Pushing a single nested field projection *** FAILED *** (14 milliseconds)
[info] scala.MatchError: (none#211451,ArrayBuffer(name#211451.middle)) (of class scala.Tuple2)
[info] at org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing$.$anonfun$getAttributeToExtractValues$5(NestedColumnAliasing.scala:258)
[info] at scala.collection.StrictOptimizedMapOps.flatMap(StrictOptimizedMapOps.scala:31)
[info] at scala.collection.StrictOptimizedMapOps.flatMap$(StrictOptimizedMapOps.scala:30)
[info] at scala.collection.immutable.HashMap.flatMap(HashMap.scala:39)
[info] at org.apache.spark.sql.catalyst.optimizer.NestedColumnAliasing$.getAttributeToExtractValues(NestedColumnAliasing.scala:258)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran tests mentioned above and all passed with Scala 2.13.
Closes#32848 from sarutak/followup-SPARK-35194-2.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes the examples of `rand` and `randn`.
### Why are the changes needed?
SPARK-23643 (#20793) fixed an issue related to the seed, which changed the results of `rand` and `randn`.
Now the results of `SELECT rand(0)` and `SELECT randn(null)` are `0.7604953758285915` and `1.6034991609278433` respectively, and they should be deterministic because the number of partitions is always 1 (the leaf node is `OneRowRelation`).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Built the doc and confirmed it.
![rand-doc](https://user-images.githubusercontent.com/4736016/121359059-145a9b80-c96e-11eb-84c2-2f2b313614f3.png)
Closes#32844 from sarutak/rand-example.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Extend the Cast expression and support TimestampWithoutTZType in casting to StringType.
### Why are the changes needed?
To conform to the ANSI SQL standard, which requires support for such casting.
### Does this PR introduce _any_ user-facing change?
No, the new timestamp type is not released yet.
### How was this patch tested?
Unit test
Closes#32846 from gengliangwang/tswtzToString.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR adds support for lateral subqueries. A lateral subquery is a subquery preceded by the `LATERAL` keyword in the FROM clause of a query that can reference columns in the preceding FROM items. For example:
```sql
SELECT * FROM t1, LATERAL (SELECT * FROM t2 WHERE t1.a = t2.c)
```
A new subquery expression`LateralSubquery` is used to represent a lateral subquery. It is similar to `ScalarSubquery` but can return multiple rows and columns. A new logical unary node `LateralJoin` is used to represent a lateral join.
Here is the analyzed plan for the above query:
```scala
Project [a, b, c, d]
+- LateralJoin lateral-subquery [a], Inner
: +- Project [c, d]
: +- Filter (outer(a) = c)
: +- Relation [c, d]
+- Relation [a, b]
```
Similar to a correlated subquery, a lateral subquery can be viewed as a dependent (nested loop) join where the evaluation of the right subtree depends on the current value of the left subtree. The same technique to decorrelate a subquery is used to decorrelate a lateral join:
```scala
Project [a, b, c, d]
+- LateralJoin lateral-subquery [a && a = c], Inner // pull up correlated predicates as join conditions
: +- Project [c, d]
: +- Relation [c, d]
+- Relation [a, b]
```
Then the lateral join can be rewritten into a normal join:
```scala
Join Inner (a = c)
:- Relation [a, b]
+- Relation [c, d]
```
#### Follow-ups:
1. Similar to rewriting correlated scalar subqueries, rewriting lateral joins is also subject to the COUNT bug (See SPARK-15370 for more details). This is **not** handled in the current PR as it requires a sizeable amount of refactoring. It will be addressed in a subsequent PR (SPARK-35551).
2. Currently Spark does not use outer query references to resolve star expressions in subqueries. This is not lateral subquery specific and can be handled in a separate PR (SPARK-35618)
### Why are the changes needed?
To support an ANSI SQL feature.
### Does this PR introduce _any_ user-facing change?
Yes. It allows users to use lateral subqueries in the FROM clause of a query.
### How was this patch tested?
- Parser test: `PlanParserSuite.scala`
- Analyzer test: `ResolveSubquerySuite.scala`
- Optimizer test: `PullupCorrelatedPredicatesSuite.scala`
- SQL test: `join-lateral.sql`, `postgreSQL/join.sql`
Closes#32303 from allisonwang-db/spark-34382-lateral.
Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add `TimestampWithoutTZType` to `DataTypeTestUtils.ordered`/`atomicTypes`, and implement values generation of those types in `LiteralGenerator`/`RandomDataGenerator`. In this way, the types will be tested automatically in:
1. ArithmeticExpressionSuite:
- "function least"
- "function greatest"
2. PredicateSuite
- "BinaryComparison consistency check"
- "AND, OR, EqualTo, EqualNullSafe consistency check"
3. ConditionalExpressionSuite
- "if"
4. RandomDataGeneratorSuite
- "Basic types"
5. CastSuite
- "null cast"
- "up-cast"
- "SPARK-27671: cast from nested null type in struct"
6. OrderingSuite
- "GenerateOrdering with TimestampWithoutTZType"
7. PredicateSuite
- "IN with different types"
8. UnsafeRowSuite
- "calling get(ordinal, datatype) on null columns"
9. SortSuite
- "sorting on TimestampWithoutTZType ..."
### Why are the changes needed?
To improve test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suites.
Closes#32843 from gengliangwang/atomicTest.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Handle type coercion when resolving V2 function. In particular:
- prior to evaluating function arguments, insert a cast whenever the argument type doesn't match the expected input type.
- use `BoundFunction.inputTypes()` to look up the magic method for scalar functions
### Why are the changes needed?
For V2 functions, the actual argument types do not necessarily match the declared input types, and Spark should handle type coercion whenever it is needed.
### Does this PR introduce _any_ user-facing change?
Yes. Now V2 function resolution should be able to handle type coercion properly.
### How was this patch tested?
Added a few new tests.
Closes#32764 from sunchao/SPARK-35390.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/client`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32763 from beliefer/SPARK-35058.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to extend Spark SQL API to accept `java.time.LocalDateTime` as an external type of recently added new Catalyst type - `TimestampWithoutTZ`. The Java class `java.time.LocalDateTime` has a similar semantic to ANSI SQL timestamp without timezone type, and it is the most suitable to be an external type for `TimestampWithoutTZType`. In more details:
* Added `TimestampWithoutTZConverter` which converts java.time.LocalDateTime instances to/from internal representation of the Catalyst type `TimestampWithoutTZType` (to Long type). The `TimestampWithoutTZConverter` object uses new methods of DateTimeUtils:
* localDateTimeToMicros() converts the input datetime to the total number of microseconds (the internal Long representation).
* microsToLocalDateTime() converts the internal microseconds value back to a java.time.LocalDateTime.
* Support new type `TimestampWithoutTZType` in RowEncoder via the methods createDeserializerForLocalDateTime() and createSerializerForLocalDateTime().
* Extended the Literal API to construct literals from `java.time.LocalDateTime` instances.
### Why are the changes needed?
To allow users to parallelize `java.time.LocalDateTime` collections and construct timestamp without time zone columns, and also to collect such columns back to the driver side.
### Does this PR introduce _any_ user-facing change?
The PR extends existing functionality. So, users can parallelize instances of the java.time.LocalDateTime class and collect them back.
```
scala> val ds = Seq(java.time.LocalDateTime.parse("1970-01-01T00:00:00")).toDS
ds: org.apache.spark.sql.Dataset[java.time.LocalDateTime] = [value: timestampwithouttz]
scala> ds.collect()
res0: Array[java.time.LocalDateTime] = Array(1970-01-01T00:00)
```
### How was this patch tested?
New unit tests
Closes#32814 from gengliangwang/LocalDateTime.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Fix Scala doc for removed parameters for `InvokeLike.invoke`.
### Why are the changes needed?
#32532 forgot to update the Scala doc after removing 2 parameters for `InvokeLike.invoke`. This fixes it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#32827 from sunchao/SPARK-35384-followup.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch introduces a new option to specify the minimum number of offsets to read per trigger, i.e. minOffsetsPerTrigger, along with maxTriggerDelay to avoid an infinite wait for the trigger.
This new option will allow skipping trigger/batch when the number of records available in Kafka is low. This is a very useful feature in cases where we have a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day.
The 'maxTriggerDelay' option helps avoid an indefinite delay in scheduling a trigger: the trigger will fire irrespective of the number of available records once maxTriggerDelay has elapsed since the last trigger. It is an optional parameter with a default value of 15 minutes, and it only applies if minOffsetsPerTrigger is set.
The minOffsetsPerTrigger option is of course optional, but once specified it takes precedence over maxOffsetsPerTrigger, which is honored only after minOffsetsPerTrigger is satisfied. A usage sketch follows.
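A usage sketch (the option names are the ones introduced by this patch; the exact value formats, e.g. the duration string for maxTriggerDelay, are assumptions):
```scala
val stream = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .option("minOffsetsPerTrigger", "100000") // skip the trigger until at least this many offsets are available
  .option("maxTriggerDelay", "15m")         // ...but never wait longer than this since the last trigger
  .option("maxOffsetsPerTrigger", "500000") // still caps a batch once minOffsetsPerTrigger is satisfied
  .load()
```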
### Why are the changes needed?
There are many scenarios where there is a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day. Tuning such jobs is difficult: decreasing the trigger processing time increases the number of batches (and hence cluster resource usage) and adds to small file issues, while increasing the trigger processing time adds consumer lag. This patch tries to address this issue.
### How was this patch tested?
This patch was tested by adding test cases as well as manually on a cluster where the job was running for a full one day with a data burst happening once a day.
Here is a picture of the data burst and the resulting consumer lag:
<img width="1198" alt="Screenshot 2021-04-29 at 11 39 35 PM" src="https://user-images.githubusercontent.com/1044003/116997587-9b2ab180-acfa-11eb-91fd-524802ce3316.png">
This is how the job behaved at burst time running every 4.5 mins (which is the specified trigger time):
<img width="1154" alt="Burst Time" src="https://user-images.githubusercontent.com/1044003/116997919-12f8dc00-acfb-11eb-9b0a-98387fc67560.png">
This is the job behavior during non-burst time, where it skips 2 to 3 triggers and runs once every 9 to 13.5 minutes:
<img width="1154" alt="Non Burst Time" src="https://user-images.githubusercontent.com/1044003/116998244-8b5f9d00-acfb-11eb-8340-33d47149ef81.png">
Here are some more stats from the two runs, i.e. one normal run and the other with minOffsetsPerTrigger set:
| Run | Data Size | Number of Batch Runs | Number of Files |
| ------------- | ------------- |------------- |------------- |
| Normal Run | 54.2 GB | 320 | 21968 |
| Run with minOffsetsperTrigger | 54.2 GB | 120 | 12104 |
Closes#32653 from satishgopalani/SPARK-35312.
Authored-by: Satish Gopalani <satish.gopalani@pubmatic.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Clean up unreachable code.
### Why are the changes needed?
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32791 from pan3793/cleanup.
Authored-by: Cheng Pan <379377944@qq.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Extend Catalyst's type system by a new type that conforms to the SQL standard (see SQL:2016, section 4.6.2): TimestampWithoutTZType represents the timestamp without time zone type
### Why are the changes needed?
Spark SQL today supports the TIMESTAMP data type. However the semantics provided actually match TIMESTAMP WITH LOCAL TIMEZONE as defined by Oracle. Timestamps embedded in a SQL query or passed through JDBC are presumed to be in session local timezone and cast to UTC before being processed.
These are desirable semantics in many cases, such as when dealing with calendars.
In many (more) other cases, such as when dealing with log files it is desirable that the provided timestamps not be altered.
SQL users expect that they can model either behavior and do so by using TIMESTAMP WITHOUT TIME ZONE for time zone insensitive data and TIMESTAMP WITH LOCAL TIME ZONE for time zone sensitive data.
Most traditional RDBMS map TIMESTAMP to TIMESTAMP WITHOUT TIME ZONE and will be surprised to see TIMESTAMP WITH LOCAL TIME ZONE, a feature that does not exist in the standard.
In this new feature, we will introduce TIMESTAMP WITH LOCAL TIMEZONE to describe the existing timestamp type and add TIMESTAMP WITHOUT TIME ZONE for standard semantic.
Using these two types will provide clarity.
This is a starting PR. See more details in https://issues.apache.org/jira/browse/SPARK-35662
### Does this PR introduce _any_ user-facing change?
Yes, a new data type for Timestamp without time zone type. It is still in development.
### How was this patch tested?
Unit test
Closes#32802 from gengliangwang/TimestampNTZType.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It's a long-standing bug that we forgot to resolve `UnresolvedAlias` in `CollectMetrics`. It's a bit hard to trigger this bug before 3.2, as most likely people won't create `UnresolvedAlias` when calling `Dataset.observe`. However, things have changed after https://github.com/apache/spark/pull/30974
This PR proposes to handle `CollectMetrics` in the rule `ResolveAliases`.
### Why are the changes needed?
bug fix
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
updated test
Closes#32803 from cloud-fan/minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Optimizes the retrieval of approximate quantiles for an array of percentiles.
* Adds an overload for QuantileSummaries.query that accepts an array of percentiles and optimizes the computation to do a single pass over the sketch and avoid redundant computation.
* Modifies the ApproximatePercentiles operator to call into the new method.
All formatting changes are the result of running ./dev/scalafmt
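A usage sketch of the pattern that benefits (the data here is made up): a single `percentile_approx` call with an array of percentiles now does one pass over the sketch instead of one per percentile.
```scala
spark.range(0, 1000)
  .selectExpr("percentile_approx(id, array(0.25, 0.5, 0.75), 10000) AS quartiles")
  .show(truncate = false)
```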
### Why are the changes needed?
The existing implementation makes repeated calls per input percentile, resulting in redundant computation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests for the new method.
Closes#32700 from alkispoly-db/spark_35558_approx_quants_array.
Authored-by: Alkis Polyzotis <alkis.polyzotis@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
A followup for 345d35ed1a. In this PR, we support CURRENT_USER without trailing parentheses in default mode. In ANSI mode, CURRENT_USER can only be used without trailing parentheses, because it is a reserved keyword that cannot be used as a function name.
### Why are the changes needed?
1. make it the same as current_date/current_timestamp
2. better ANSI compliance
### Does this PR introduce _any_ user-facing change?
no, just a followup
### How was this patch tested?
new tests
Closes#32770 from yaooqinn/SPARK-21957-F.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a database existence check in `SessionCatalog`.
### Why are the changes needed?
Currently, executing `drop database test` when the database does not exist throws an unfriendly error message:
```
Error in query: org.apache.hadoop.hive.metastore.api.NoSuchObjectException: test
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.metastore.api.NoSuchObjectException: test
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
at org.apache.spark.sql.hive.HiveExternalCatalog.dropDatabase(HiveExternalCatalog.scala:200)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.dropDatabase(ExternalCatalogWithListener.scala:53)
at org.apache.spark.sql.catalyst.catalog.SessionCatalog.dropDatabase(SessionCatalog.scala:273)
at org.apache.spark.sql.execution.command.DropDatabaseCommand.run(ddl.scala:111)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:75)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:73)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:84)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:228)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3707)
```
### Does this PR introduce _any_ user-facing change?
Yes, a cleaner error message.
### How was this patch tested?
Add test.
Closes#32768 from ulysses-you/SPARK-35629.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Sets `references` for `NamedLambdaVariable` and `LambdaFunction`.
| Expression | NamedLambdaVariable | LambdaFunction |
| --- | --- | --- |
| References before | None | All function references |
| References after | self.toAttribute | Function references minus arguments' references |
In `NestedColumnAliasing`, this means that `ExtractValue(ExtractValue(attr, lv: NamedLambdaVariable), ...)` now references both `attr` and `lv`, rather than just `attr`. As a result, it will not be included in the nested column references.
### Why are the changes needed?
Before, the lambda key was referenced outside of the lambda function.
#### Example 1
Before:
```
Project [transform(keys#0, lambdafunction(_extract_v1#0, lambda key#0, false)) AS a#0]
+- 'Join Cross
:- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0]
: +- LocalRelation <empty>, [kvs#0]
+- LocalRelation <empty>, [keys#0]
```
After:
```
Project [transform(keys#418, lambdafunction(kvs#417[lambda key#420].v1, lambda key#420, false)) AS a#419]
+- Join Cross
:- LocalRelation <empty>, [kvs#417]
+- LocalRelation <empty>, [keys#418]
```
#### Example 2
Before:
```
Project [transform(keys#0, lambdafunction(kvs#0[lambda key#0].v1, lambda key#0, false)) AS a#0]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [keys#0, _extract_v1#0 AS _extract_v1#0]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [kvs#0[lambda key#0].v1 AS _extract_v1#0, keys#0]
+- LocalRelation <empty>, [kvs#0, keys#0]
```
After:
```
Project [transform(keys#428, lambdafunction(kvs#427[lambda key#430].v1, lambda key#430, false)) AS a#429]
+- GlobalLimit 5
+- LocalLimit 5
+- Project [keys#428, kvs#427]
+- GlobalLimit 5
+- LocalLimit 5
+- LocalRelation <empty>, [kvs#427, keys#428]
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Scala unit tests for the examples above
Closes#32773 from karenfeng/SPARK-35636.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds In/InSet predicate support for `UnwrapCastInBinaryComparison`.
The current implementation doesn't push down filters for `In`/`InSet` expressions which contain a `Cast`.
For instance:
```scala
spark.range(50).selectExpr("cast(id as int) as id").write.mode("overwrite").parquet("/tmp/parquet/t1")
spark.read.parquet("/tmp/parquet/t1").where("id in (1L, 2L, 4L)").explain
```
before this pr:
```
== Physical Plan ==
*(1) Filter cast(id#5 as bigint) IN (1,2,4)
+- *(1) ColumnarToRow
+- FileScan parquet [id#5] Batched: true, DataFilters: [cast(id#5 as bigint) IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:int>
```
after this pr:
```
== Physical Plan ==
*(1) Filter id#95 IN (1,2,4)
+- *(1) ColumnarToRow
+- FileScan parquet [id#95] Batched: true, DataFilters: [id#95 IN (1,2,4)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/tmp/parquet/t1], PartitionFilters: [], PushedFilters: [In(id, [1,2,4])], ReadSchema: struct<id:int>
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#32488 from cfmcgrady/SPARK-35316.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR removes the `canPlanAsBroadcastHashJoin` check in `EliminateOuterJoin`.
### Why are the changes needed?
We can always remove an outer join if it only has DISTINCT on the streamed side.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#32744 from wangyum/SPARK-34808-2.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/hive/src/main/scala/org/apache/spark/sql/hive/execution`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32694 from beliefer/SPARK-35059.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, we do not have a suitable definition of the `user` concept in Spark. We only have an application-wide `sparkUser`, but do not support identifying or retrieving the user information from a session in STS or a runtime query execution.
`current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant.
This PR adds `current_user()` as a SQL function, without ambiguity (see the usage sketch after the list):
1. For a normal single-threaded Spark application, clearly the `sparkUser` is always equivalent to `current_user()` .
2. For a multi-threaded Spark application, e.g. Spark thrift server, we use a `ThreadLocal` variable to store the client-side user(after authenticated) before running the query and retrieve it in the parser.
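A minimal usage sketch: in a plain single-user session the result matches `sparkUser`, while in the Thrift server it reflects the authenticated client user.
```scala
spark.sql("SELECT current_user()").show()
```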
### Why are the changes needed?
`current_user()` is very popular and supported by plenty of other modern or old school databases, and also ANSI compliant.
### Does this PR introduce _any_ user-facing change?
yes, added `current_user()` as a SQL function
### How was this patch tested?
new tests in thrift server and sql/catalyst
Closes#32718 from yaooqinn/SPARK-21957.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
A test case of AdaptiveQueryExecSuite has become flaky since there are too many debug logs in the RootLogger:
https://github.com/Yikun/spark/runs/2715222392?check_suite_focus=true
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/139125/testReport/
To fix it, I suggest supporting multiple loggers in the testing method withLogAppender, so that the LogAppender gets clean target log output.
### Why are the changes needed?
Fix a flaky test case.
Also, reduce unnecessary memory cost in tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes#32725 from gengliangwang/fixFlakyLogAppender.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to support special datetime values introduced by #25708 and by #25716 only in typed literals, and don't recognize them in parsing strings to dates/timestamps. The following string values are supported only in typed timestamp literals:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)`
- `today [zoneId]` - midnight today.
- `yesterday [zoneId]` - midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time.
For example:
```sql
spark-sql> SELECT timestamp 'tomorrow';
2019-09-07 00:00:00
```
Similarly, the following special date values are supported only in typed date literals:
- `epoch [zoneId]` - `1970-01-01`
- `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`.
- `yesterday [zoneId]` - the current date -1
- `tomorrow [zoneId]` - the current date + 1
- `now` - the date of running the current query. It has the same notion as `today`.
For example:
```sql
spark-sql> SELECT date 'tomorrow' - date 'yesterday';
2
```
### Why are the changes needed?
In the current implementation, Spark supports the special date/timestamp values in any input string cast to dates/timestamps, which leads to the following problems:
- If executors have different system time, the result is inconsistent, and random. Column values depend on where the conversions were performed.
- The special values play the role of distributed non-deterministic functions though users might think of the values as constants.
### Does this PR introduce _any_ user-facing change?
Yes but the probability should be small.
### How was this patch tested?
By running existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z date.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z timestamp.sql"
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```
Closes#32714 from MaxGekk/remove-datetime-special-values.
Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds a unit test to show a bug in the latest janino version which fails to compile valid Java code. Unfortunately, I can't share the exact query that can trigger this bug (includes some custom expressions), but this pattern is not very uncommon and I believe can be triggered by some real queries.
A follow-up is needed before the 3.2 release, to either fix this bug in janino, or revert the janino version upgrade, or work around it in Spark.
### Why are the changes needed?
make it easy for people to debug janino, as I'm not a janino expert.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes#32716 from cloud-fan/janino.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, the results of following SQL queries are not redacted:
```
SET [KEY];
SET;
```
For example:
```
scala> spark.sql("set javax.jdo.option.ConnectionPassword=123456").show()
+--------------------+------+
| key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+
scala> spark.sql("set javax.jdo.option.ConnectionPassword").show()
+--------------------+------+
| key| value|
+--------------------+------+
|javax.jdo.option....|123456|
+--------------------+------+
scala> spark.sql("set").show()
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|javax.jdo.option....| 123456|
```
We should hide the sensitive information and redact the query output.
### Why are the changes needed?
Security.
### Does this PR introduce _any_ user-facing change?
Yes, the sensitive information in the output of Set commands are redacted
### How was this patch tested?
Unit test
Closes#32712 from gengliangwang/redactSet.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Handle currying Products while serializing a TreeNode to JSON. While processing [Product](https://github.com/apache/spark/blob/v3.1.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/TreeNode.scala#L820), we may get an assert error for cases like a currying Product because of the mismatch in size between field names and field values.
Fall back to using reflection to get all the values for constructor parameters when we meet such cases.
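A hypothetical example of such a currying Product (not taken from the actual failing query): the second parameter list is not part of the `Product`, so the constructor's field names and `productIterator`'s field values have different sizes.
```scala
case class CurriedNode(name: String)(val weight: Int)

val node = CurriedNode("a")(1)
// Only `name` is exposed via Product; `weight` has to be read via reflection.
assert(node.productArity == 1)
```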
### Why are the changes needed?
Avoid throwing error while serializing TreeNode to JSON, try to output as much information as possible.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT case added.
Closes#32713 from ivoson/SPARK-35411-followup.
Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a new rule that removes an outer join if it only has DISTINCT on the streamed side. For example:
```scala
spark.range(200L).selectExpr("id AS a").createTempView("t1")
spark.range(300L).selectExpr("id AS b").createTempView("t2")
spark.sql("SELECT DISTINCT a FROM t1 LEFT JOIN t2 ON a = b").explain(true)
```
Before this pr:
```
== Optimized Logical Plan ==
Aggregate [a#2L], [a#2L]
+- Project [a#2L]
+- Join LeftOuter, (a#2L = b#6L)
:- Project [id#0L AS a#2L]
: +- Range (0, 200, step=1, splits=Some(2))
+- Project [id#4L AS b#6L]
+- Range (0, 300, step=1, splits=Some(2))
```
After this pr:
```
== Optimized Logical Plan ==
Aggregate [a#2L], [a#2L]
+- Project [id#0L AS a#2L]
+- Range (0, 200, step=1, splits=Some(2))
```
### Why are the changes needed?
Improve query performance. [DB2](https://www.ibm.com/docs/en/db2-for-zos/11?topic=manipulation-how-db2-simplifies-join-operations) supports this feature:
![image](https://user-images.githubusercontent.com/5399861/119594277-0d7c4680-be0e-11eb-8bd4-366d8c4639f0.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#31908 from wangyum/SPARK-34808.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
This PR refactors `SubqueryExpression` class. It removes the children field from SubqueryExpression's constructor and adds `outerAttrs` and `joinCond`.
### Why are the changes needed?
Currently, the children field of a subquery expression is used to store both collected outer references in the subquery plan and join conditions after correlated predicates are pulled up.
For example:
`SELECT (SELECT max(c1) FROM t1 WHERE t1.c1 = t2.c1) FROM t2`
During the analysis phase, outer references in the subquery are stored in the children field: `scalar-subquery [t2.c1]`, but after the optimizer rule `PullupCorrelatedPredicates`, the children field will be used to store the join conditions, which contain both the inner and the outer references: `scalar-subquery [t1.c1 = t2.c1]`. This is why the references of SubqueryExpression exclude the inner plan's output:
29ed1a2de4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala (L68-L69)
This can be confusing and error-prone. The references for a subquery expression should always be defined as outer attribute references.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32687 from allisonwang-db/refactor-subquery-expr.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- EXCHANGE
- IN_SUBQUERY_EXEC
- UPDATE_FIELDS
Migrated `transformAllExpressions` call sites to use `transformAllExpressionsWithPruning`
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Perf diff:
| Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline |
| --- | --- | --- | --- |
| OptimizeUpdateFields | 54646396 | 27444424 | 0.5 |
| ReplaceUpdateFieldsExpression | 24694303 | 2087517 | 0.08 |
Closes#32643 from sigmod/all_expressions.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
This PR fixes a build error with Scala 2.13 on GA.
#32301 seems to bring this error.
### Why are the changes needed?
To recover CI.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA
Closes#32696 from sarutak/followup-SPARK-35194.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Refactors `NestedColumnAliasing` and `GeneratorNestedColumnAliasing` for readability.
### Why are the changes needed?
Improves readability for future maintenance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32301 from karenfeng/refactor-nested-column-aliasing.
Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark's conv function comes from MySQL, and it's better to follow MySQL's behavior. MySQL returns the max unsigned long if the input string is too big, and Spark should follow it.
However, Spark currently behaves differently in two cases:
* MySQL allows leading spaces but Spark does not.
* If the input string is way too long, Spark fails with ArrayIndexOutOfBoundsException.
This patch makes conv follow MySQL's behavior in those two cases (see the sketch below):
* conv allows leading spaces
* conv returns the max unsigned long when the input string is way too long
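Expected behavior after the change (a sketch; the result strings are assumed to follow MySQL's conv semantics):
```scala
// Leading spaces are now ignored; expected result: 4
spark.sql("SELECT conv(' 100', 2, 10)").show()

// An overly long input no longer throws; expected result: the max unsigned
// long rendered in the target base, i.e. FFFFFFFFFFFFFFFF
val tooLong = "9" * 100
spark.sql(s"SELECT conv('$tooLong', 10, 16)").show()
```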
### Why are the changes needed?
Fixing conv to match the behavior of its (almost) only reference among other DBMSs, MySQL.
### Does this PR introduce _any_ user-facing change?
Yes, as pointed out above
### How was this patch tested?
Add test
Closes#32684 from dgd-contributor/SPARK-33428.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver`.
### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32646 from beliefer/SPARK-35057.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The main code change is:
* Change rule `DemoteBroadcastHashJoin` to `DynamicJoinSelection` and add shuffle hash join selection code.
* Specify a join strategy hint `SHUFFLE_HASH` if AQE thinks a join can be converted to SHJ.
* Skip `preferSortMerge` config check in AQE side if a join can be converted to SHJ.
### Why are the changes needed?
Use AQE runtime statistics to decide if we can use shuffled hash join instead of sort merge join. Currently, the formula for shuffled hash join selection does not work due to the dynamic shuffle partition number.
Add a new config spark.sql.adaptive.shuffledHashJoinLocalMapThreshold to decide if a join can be converted to shuffled hash join safely.
### Does this PR introduce _any_ user-facing change?
Yes, add a new config.
### How was this patch tested?
Add test.
Closes#32550 from ulysses-you/SPARK-35282-2.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add the function type, such as "scala_udf", "python_udf", "java_udf", "hive", "built-in" to the `ExpressionInfo` for UDF.
### Why are the changes needed?
Make the `ExpressionInfo` of UDF more meaningful
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing and newly added UT
Closes#32587 from linhongliu-db/udf-language.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
* remove `EliminateUnnecessaryJoin`, using `AQEPropagateEmptyRelation` instead.
* eliminate join, aggregate, limit, repartition, sort, and generate nodes where beneficial.
### Why are the changes needed?
Make the `EliminateUnnecessaryJoin` optimization available in more cases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes#32602 from ulysses-you/SPARK-35455.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Addressed the dongjoon-hyun comments on the previous PR #30018.
Extended the `RemoveRedundantAggregates` rule to remove redundant aggregations in even more queries. For example in
```
dataset
.dropDuplicates()
.groupBy('a)
.agg(max('b))
```
the `dropDuplicates` is not needed, because the result of `max` does not depend on duplicate values.
### Why are the changes needed?
Improve performance.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#31914 from tanelk/SPARK-33122_redundant_aggs_followup.
Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Use `SpecificInternalRow` instead of `GenericInternalRow` to avoid boxing / unboxing cost.
### Why are the changes needed?
Since it doesn't know the input row schema, `GenericInternalRow` potentially needs to apply boxing for input arguments. It's better to use `SpecificInternalRow` instead, since we know the input data types.
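An illustrative sketch with the internal Catalyst API (shown only to make the boxing point concrete): since the row knows its slot types up front, the primitive setters and getters avoid boxing values into `Any`.
```scala
import org.apache.spark.sql.catalyst.expressions.SpecificInternalRow
import org.apache.spark.sql.types.{IntegerType, LongType}

val row = new SpecificInternalRow(Seq(IntegerType, LongType))
row.setInt(0, 42)    // stored in a primitive Int slot, no java.lang.Integer allocated
row.setLong(1, 123L)
val total = row.getInt(0) + row.getLong(1)
```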
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32647 from sunchao/specific-input-row.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes a bug with subexpression elimination for CaseWhen statements. https://github.com/apache/spark/pull/30245 added support for creating subexpressions that are present in all branches of conditional statements. However, for a statement to be in "all branches" of a CaseWhen statement, it must also be in the elseValue.
### Why are the changes needed?
Fix a bug where a subexpression can be created and run for branches of a conditional that don't pass. This can cause issues especially with a UDF in a branch that gets executed assuming the condition is true.
### Does this PR introduce _any_ user-facing change?
Yes, fixes a potential bug where a UDF could be eagerly executed even though it might expect to have already passed some form of validation. For example:
```
val col = when($"id" < 0, myUdf($"id"))
spark.range(1).select(when(col > 0, col)).show()
```
`myUdf($"id")` is considered a subexpression and eagerly evaluated, because it is pulled out as a common expression from both executions of the when clause, but if `id >= 0` it should never actually be run.
### How was this patch tested?
Updated existing test with new case.
Closes#32595 from Kimahriman/bug-case-subexpr-elimination.
Authored-by: Adam Binford <adamq43@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This patch fixes a bug when dealing with common expressions in conditional expressions such as `CaseWhen` during subexpression elimination.
For example, previously we found common expressions among the conditions of `CaseWhen`, but their child expressions were also counted in. We should not count these child expressions as common expressions.
### Why are the changes needed?
If the redundant child expressions are counted as common expressions too, they will be redundantly evaluated and miss the subexpression elimination opportunity.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes#32559 from viirya/SPARK-35410.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to avoid wrapping the constant literals for `percentage` and `accuracy` in `percentile_approx` with if-else. They are expected to be literals (or foldable expressions).
Pivot works by a two-phase aggregation, and it works by manipulating the input to `null` for non-matched values (pivot column and value).
Note that pivot supports an optimized version without such logic for some types (basically non-nested types), so the issue fixed by this PR only affects complex types.
```scala
val df = Seq(
("a", -1.0), ("a", 5.5), ("a", 2.5), ("b", 3.0), ("b", 5.2)).toDF("type", "value")
.groupBy().pivot("type", Seq("a", "b")).agg(
percentile_approx(col("value"), array(lit(0.5)), lit(10000)))
df.show()
```
**Before:**
```
org.apache.spark.sql.AnalysisException: cannot resolve 'percentile_approx((IF((type <=> CAST('a' AS STRING)), value, CAST(NULL AS DOUBLE))), (IF((type <=> CAST('a' AS STRING)), array(0.5D), NULL)), (IF((type <=> CAST('a' AS STRING)), 10000, CAST(NULL AS INT))))' due to data type mismatch: The accuracy or percentage provided must be a constant literal;
'Aggregate [percentile_approx(if ((type#7 <=> cast(a as string))) value#8 else cast(null as double), if ((type#7 <=> cast(a as string))) array(0.5) else cast(null as array<double>), if ((type#7 <=> cast(a as string))) 10000 else cast(null as int), 0, 0) AS a#16, percentile_approx(if ((type#7 <=> cast(b as string))) value#8 else cast(null as double), if ((type#7 <=> cast(b as string))) array(0.5) else cast(null as array<double>), if ((type#7 <=> cast(b as string))) 10000 else cast(null as int), 0, 0) AS b#18]
+- Project [_1#2 AS type#7, _2#3 AS value#8]
+- LocalRelation [_1#2, _2#3]
```
**After:**
```
+-----+-----+
| a| b|
+-----+-----+
|[2.5]|[3.0]|
+-----+-----+
```
### Why are the changes needed?
To make percentile_approx work with pivot as expected
### Does this PR introduce _any_ user-facing change?
Yes. It threw an exception but now it returns a correct result as shown above.
### How was this patch tested?
Manually tested and unit test was added.
Closes#32619 from HyukjinKwon/SPARK-35480.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch sorts equivalent expressions based on their child-parent relation.
### Why are the changes needed?
`EquivalentExpressions` maintains a map of equivalent expressions. It is a `HashMap` now, so the insertion order is not guaranteed to be preserved later. Subexpression elimination relies on retrieving subexpressions from the map. If there are child-parent relationships among the subexpressions, we want the child expressions to come before the parent expressions, so we can replace child expressions within parent expressions with subexpression evaluation.
For example, suppose we have two different expressions: `add = Add(Literal(1), Literal(2))` and `Add(Literal(3), add)`.
Case 1: child subexpr comes first.
```scala
addExprTree(add)
addExprTree(Add(Literal(3), add))
addExprTree(Add(Literal(3), add))
```
Case 2: parent subexpr comes first. For this case, we need to sort equivalent expressions.
```
addExprTree(Add(Literal(3), add)) => We add `Add(Literal(3), add)` into the map first, then add `add` into the map
addExprTree(add)
addExprTree(Add(Literal(3), add))
```
Since we are going to sort the equivalent expressions anyway, we don't need a `LinkedHashMap`; we can just sort.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes#32586 from viirya/use-listhashmap.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
Fix scala compile error.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass GA
Closes#32617 from ulysses-you/scala2-13.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst`.
### Why are the changes needed?
It will largely help with standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32478 from beliefer/SPARK-35063.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to format strings correctly for `PushedFilters`. For example, `explain()` for the query below prints the pushed filter for `v in (array('a'), null)` as `PushedFilters: [In(v, [WrappedArray(a),null])]`;
```
scala> sql("create table t (v array<string>) using parquet")
scala> sql("select * from t where v in (array('a'), null)").explain()
== Physical Plan ==
*(1) Filter v#4 IN ([a],null)
+- FileScan parquet default.t[v#4] Batched: false, DataFilters: [v#4 IN ([a],null)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t], PartitionFilters: [], PushedFilters: [In(v, [WrappedArray(a),null])], ReadSchema: struct<v:array<string>>
```
This PR makes `explain()` print it as `PushedFilters: [In(v, [[a]])]`;
```
scala> sql("select * from t where v in (array('a'), null)").explain()
== Physical Plan ==
*(1) Filter v#4 IN ([a],null)
+- FileScan parquet default.t[v#4] Batched: false, DataFilters: [v#4 IN ([a],null)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-3.1.1-bin-hadoop2.7/spark-warehouse/t], PartitionFilters: [], PushedFilters: [In(v, [[a],null])], ReadSchema: struct<v:array<string>>
```
NOTE: This PR includes a bugfix caused by #32577 (See the cloud-fan comment: https://github.com/apache/spark/pull/32577/files#r636108150).
### Why are the changes needed?
To improve explain strings.
### Does this PR introduce _any_ user-facing change?
Yes, this PR improves the explain strings for pushed-down filters.
### How was this patch tested?
Added tests in `SQLQueryTestSuite`.
Closes#32615 from maropu/ExplainPartitionFilters.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR reduces the execution time of `DeduplicateRelations` by:
1) use `Set` instead of `Seq` to check duplicate relations
2) avoid plan output traverse and attribute rewrites when there are no changes in the children plan
### Why are the changes needed?
Rule `DeduplicateRelations` is slow.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Run `TPCDSQuerySuite` and checked the run time of `DeduplicateRelations`. The time has been reduced by 77.9% after this PR.
Closes#32590 from Ngone51/improve-dedup.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
CTAS with a LOCATION clause acts as an insert overwrite, which can cause problems when there are subdirectories within the location directory.
This has caused some users to accidentally wipe out directories with very important data. We should not allow CTAS with LOCATION to point at a non-empty directory.
### Why are the changes needed?
Hive already handled this scenario: HIVE-11319
Steps to reproduce:
```scala
sql("""create external table `demo_CTAS`( `comment` string) PARTITIONED BY (`col1` string, `col2` string) STORED AS parquet location '/tmp/u1/demo_CTAS'""")
sql("""INSERT OVERWRITE TABLE demo_CTAS partition (col1='1',col2='1') VALUES ('abc')""")
sql("select* from demo_CTAS").show
sql("""create table ctas1 location '/tmp/u2/ctas1' as select * from demo_CTAS""")
sql("select* from ctas1").show
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```
Before the fix: Both create table operations will succeed. But values in table ctas1 will be replaced by ctas2 accidentally.
After the fix: `create table ctas2...` will throw `AnalysisException`:
```
org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.
```
### Does this PR introduce _any_ user-facing change?
Yes, if the location directory is not empty, CTAS with location will throw AnalysisException
```
sql("""create table ctas2 location '/tmp/u2' as select * from demo_CTAS""")
```
```
org.apache.spark.sql.AnalysisException: CREATE-TABLE-AS-SELECT cannot create table with location to a non-empty directory /tmp/u2 . To allow overwriting the existing non-empty directory, set 'spark.sql.legacy.allowNonEmptyLocationInCTAS' to true.
```
`CREATE TABLE AS SELECT` with a non-empty `LOCATION` will throw `AnalysisException`. To restore the behavior before Spark 3.2, set `spark.sql.legacy.allowNonEmptyLocationInCTAS` to `true` (the default is `false`).
Updated SQL migration guide.
### How was this patch tested?
Test case added in SQLQuerySuite.scala
Closes#32411 from vinodkc/br_fixCTAS_nonempty_dir.
Authored-by: Vinod KC <vinod.kc.in@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Updating column stats for Union operator stats estimation
### Why are the changes needed?
This is a followup PR to update the null count also in the Union stats operator estimation. https://github.com/apache/spark/pull/30334
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Updated UTs, manual testing
Closes#32494 from shahidki31/shahid/updateNullCountForUnion.
Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Update histogram statistics for RANGE operator stats estimation
### Why are the changes needed?
If histogram optimization is enabled, these statistics can be used in various cost-based optimizations.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs. Manual test.
Closes#32498 from shahidki31/shahid/histogram.
Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. In HadoopMapReduceCommitProtocol, create parent directory before renaming custom partition path staging files
2. In InMemoryCatalog and HiveExternalCatalog, create new partition directory before renaming old partition path
3. Check return value of FileSystem#rename, if false, throw exception to avoid silent data loss cause by rename failure
4. Change DebugFilesystem#rename behavior to make it match HDFS's behavior (return false without rename when dst parent directory not exist)
### Why are the changes needed?
Depending on the FileSystem#rename implementation, when the destination directory does not exist, the file system may
1. return false without renaming file nor throwing exception (e.g. HDFS), or
2. create destination directory, rename files, and return true (e.g. LocalFileSystem)
In the first case above, renames in HadoopMapReduceCommitProtocol for custom partition path will fail silently if the destination partition path does not exist. Failed renames can happen when
1. dynamicPartitionOverwrite == true, the custom partition path directories are deleted by the job before the rename; or
2. the custom partition path directories do not exist before the job; or
3. something else goes wrong when the file system handles `rename`
The renames in InMemoryCatalog and HiveExternalCatalog for partition renaming have a similar issue.
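A minimal sketch of the defensive rename pattern described above, using the standard Hadoop `FileSystem` API (the helper name is illustrative, not the actual Spark code):
```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Create the destination's parent first (HDFS rename returns false if it is missing),
// then treat a false return from rename as an error instead of losing data silently.
def renameOrFail(fs: FileSystem, src: Path, dst: Path): Unit = {
  val parent = dst.getParent
  if (parent != null && !fs.exists(parent)) {
    fs.mkdirs(parent)
  }
  if (!fs.rename(src, dst)) {
    throw new IOException(s"Failed to rename $src to $dst")
  }
}
```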
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modified DebugFilesystem#rename, and added new unit tests.
Without the fix in src code, five InsertSuite tests and one AlterTableRenamePartitionSuite test failed:
InsertSuite.SPARK-20236: dynamic partition overwrite with custom partition path (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 == == Spark Answer - 0 ==
struct<> struct<>
![2,1,1]
```
InsertSuite.SPARK-35106: insert overwrite with custom partition path
```
== Results ==
!== Correct Answer - 1 == == Spark Answer - 0 ==
struct<> struct<>
![2,1,1]
```
InsertSuite.SPARK-35106: dynamic partition overwrite with custom partition path
```
== Results ==
!== Correct Answer - 2 == == Spark Answer - 1 ==
!struct<> struct<i:int,part1:int,part2:int>
[1,1,1] [1,1,1]
![1,1,2]
```
InsertSuite.SPARK-35106: Throw exception when rename custom partition paths returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
```
InsertSuite.SPARK-35106: Throw exception when rename dynamic partition paths returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
```
AlterTableRenamePartitionSuite.ALTER TABLE .. RENAME PARTITION V1: multi part partition (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 == == Spark Answer - 0 ==
struct<> struct<>
![3,123,3]
```
Closes#32530 from YuzhouSun/SPARK-35106.
Authored-by: Yuzhou Sun <yuzhosun@amazon.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
SPARK-35253 upgraded janino from 3.0.16 to 3.1.4. In this version, `ClassBodyEvaluator` provides a `getBytecodes` method that returns the mapping from `ClassFile#getThisClassName` to `ClassFile#toByteArray` directly, so we no longer need to obtain this data via the reflection API.
The main purpose of this PR is to simplify how `bytecodes` are obtained from `ClassBodyEvaluator` in the `CodeGenerator#updateAndGetCompilationStats` method.
### Why are the changes needed?
Code simplification.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test:
1. Define a code fragment to be tested, for example:
```
val codeBody = s"""
public java.lang.Object generate(Object[] references) {
return new TestMetricCode(references);
}
class TestMetricCode {
public TestMetricCode(Object[] references) {
}
public long sumOfSquares(long left, long right) {
return left * left + right * right;
}
}
"""
```
2. Create a `ClassBodyEvaluator` and `cook` the `codeBody` as above, the process of creating `ClassBodyEvaluator` can extract from `CodeGenerator#doCompile` method.
3. Get `bytecodes` using the `ClassBodyEvaluator#getBytecodes` API (after this PR) and the reflection API (before this PR) respectively, then assert that they are the same. If the `bytecodes` are unchanged, we can be sure that the metrics state will not change. Example test code follows:
```
import scala.collection.JavaConverters._
val bytecodesFromApi = evaluator.getBytecodes.asScala
val bytecodesFromReflectionApi = {
val scField = classOf[ClassBodyEvaluator].getDeclaredField("sc")
scField.setAccessible(true)
val compiler = scField.get(evaluator).asInstanceOf[SimpleCompiler]
val loader = compiler.getClassLoader.asInstanceOf[ByteArrayClassLoader]
val classesField = loader.getClass.getDeclaredField("classes")
classesField.setAccessible(true)
classesField.get(loader).asInstanceOf[java.util.Map[String, Array[Byte]]].asScala
}
assert(bytecodesFromApi == bytecodesFromReflectionApi)
```
Closes#32536 from LuciferYang/SPARK-35253-FOLLOWUP.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Write out Seqs of product objects that contain TreeNodes, to avoid the cases described in https://issues.apache.org/jira/browse/SPARK-35411 where essential information is ignored and just written out as null values. This information is necessary to understand the query plans.
### Why are the changes needed?
Information like cteRelations in the With node and branches in the CaseWhen expression is necessary to understand the query plans; it should be written out to the resulting JSON string.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT case added.
Closes#32557 from ivoson/plan-json-fix.
Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
To pass the TPCDS-related plan stability tests in scala-2.13, this PR proposes to fix two things below;
- (1) Sorts elements in the predicate `InSet` and the source filter `In` for printing their nodes.
- (2) Formats nested collection elements (`Seq`, `Array`, and `Set`) recursively in `TreeNode.argString`.
As for (1), v2.12/v2.13 seem to print `Set` elements in a different order, so we need to sort them explicitly. As for (2), the `Seq` implementation differs between v2.12/v2.13, so we need to format nested `Seq` elements correctly to hide the name of the implementation (see the example below);
```
(74) Expand [codegen id : 20]
Input [5]: [sales#41, RETURNS#42, profit#43, channel#44, id#45]
-Arguments: [ArrayBuffer(sales#41, returns#42, ... <-- scala-2.12
+Arguments: [Vector(sales#41, returns#42, ... <-- scala-2.13
+Arguments: [[(sales#41, returns#42, ... <-- the proposed fix to hide the name of its implementation
```
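A small, self-contained sketch of the recursive formatting idea for (2); this is not the actual `TreeNode.argString` code, just an illustration of hiding the collection implementation name (the sorting of `Set` elements mirrors point (1)):
```scala
// Recursively format Seq/Set/Array so the output never leaks "Vector", "ArrayBuffer", etc.
def formatArg(arg: Any): String = arg match {
  case seq: scala.collection.Seq[_] => seq.map(formatArg).mkString("[", ", ", "]")
  case set: scala.collection.Set[_] => set.map(formatArg).toSeq.sorted.mkString("{", ", ", "}")
  case arr: Array[_]                => arr.map(formatArg).mkString("[", ", ", "]")
  case null                         => "null"
  case other                        => other.toString
}

formatArg(Vector(1, 2, Seq(3, 4)))  // "[1, 2, [3, 4]]" on both Scala 2.12 and 2.13
```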
### Why are the changes needed?
To pass the tests in Scala v2.13.
### Does this PR introduce _any_ user-facing change?
Yes, this fix changes query explain strings.
### How was this patch tested?
Manually checked.
Closes#32577 from maropu/FixTPCDSTestIssueInScala213.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When creating `Invoke` and `StaticInvoke` for `ScalarFunction`'s magic method, set `propagateNull` to false.
### Why are the changes needed?
When `propagateNull` is true (the default), `Invoke` and `StaticInvoke` return null if any of the arguments is null. For scalar functions this is incorrect; we should leave that logic to the function implementation instead.
### Does this PR introduce _any_ user-facing change?
Yes. Now null arguments shall be properly handled with magic method.
### How was this patch tested?
Added new tests.
Closes#32553 from sunchao/SPARK-35389.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/30309 added a configuration (disabled by default) that simplifies the error messages from Python UDFs by removing the internal stack trace from Python workers:
```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```
**Before**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../python/pyspark/sql/utils.py", line 127, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
**After**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../python/pyspark/sql/utils.py", line 127, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
Note that the traceback (`return f(*args, **kwargs)`) is almost always the same - I would say more than 99% of the time. For the remaining 1% of cases, we can guide developers to enable this configuration for further debugging.
In Databricks, it has been enabled for around 6 months, and I have had zero negative feedback on it.
### Why are the changes needed?
To show simplified exception messages to end users.
### Does this PR introduce _any_ user-facing change?
Yes, it will hide the internal Python worker traceback.
### How was this patch tested?
Existing test cases should cover.
Closes#32569 from HyukjinKwon/SPARK-35419.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is used to fix this bug:
```
set spark.sql.legacy.charVarcharAsString=true;
create table chartb01(a char(3));
insert into chartb01 select 'aaaaa';
```
Here we expect the data of table chartb01 to be 'aaa', but the insert fails.
### Why are the changes needed?
Improve backward compatibility
```
spark-sql>
> create table tchar01(col char(2)) using parquet;
Time taken: 0.767 seconds
spark-sql>
> insert into tchar01 select 'aaa';
ERROR | Executor task launch worker for task 0.0 in stage 0.0 (TID 0) | Aborting task | org.apache.spark.util.Utils.logError(Logging.scala:94)
java.lang.RuntimeException: Exceeds char/varchar type length limitation: 2
at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.trimTrailingSpaces(CharVarcharCodegenUtils.java:31)
at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.charTypeWriteSideCheck(CharVarcharCodegenUtils.java:44)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:279)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1500)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:288)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:212)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1466)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
### Does this PR introduce _any_ user-facing change?
No (the legacy config is false by default).
### How was this patch tested?
Added unit tests.
Closes#32501 from fhygh/master.
Authored-by: fhygh <283452027@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark doesn't support aggregate functions with mixed outer and local references. This PR applies this check earlier to fail with a clear error message instead of some weird ones, and simplifies the related code in `SubExprUtils.getOuterReferences`. This PR also refines the error message a bit.
### Why are the changes needed?
better error message
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
updated tests
Closes#32503 from cloud-fan/try.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Introduction: this PR is a part of SPARK-10816 (`EventTime based sessionization (session window)`). Please refer to #31937 for the overall view of the code change. (Note that the code diff could diverge a bit.)
### What changes were proposed in this pull request?
This PR introduces UpdatingSessionsIterator, which analyzes neighboring elements and adjusts session information on them.
UpdatingSessionsIterator calculates and updates the session window for each element in the given iterator, which makes elements in the same session window have the same session spec. Downstream operators can apply aggregation to finally merge the elements bound to the same session window.
UpdatingSessionsIterator works on the precondition that the given iterator is sorted by "group keys + start time of session window", and the iterator retains the characteristics of that sort.
UpdatingSessionsIterator copies the elements so it can safely update each element, and buffers elements that are bound to the same session window. Due to these overheads, MergingSessionsIterator, which will be introduced via SPARK-34889, should be used whenever possible.
This PR also introduces UpdatingSessionsExec, the physical node that leverages UpdatingSessionsIterator to sort the input rows and update session information on them.
### Why are the changes needed?
This is one of the required parts for implementing SPARK-10816.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test suite added.
Closes#31986 from HeartSaVioR/SPARK-34888-SPARK-10816-PR-31570-part-1.
Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Move hash map lookup operation out of `InvokeLike.invoke` since it doesn't depend on the input.
### Why are the changes needed?
We shouldn't need to look up the hash map for every input row evaluated by `InvokeLike.invoke` since it doesn't depend on input. This could speed up the performance a bit.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32532 from sunchao/SPARK-35384-follow-up.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add a common function `getWorkspaceFilePath` (which prefixes paths with the Spark home) to `SparkFunctionSuite`, and apply the function to the places it was extracted from.
### Why are the changes needed?
Spark SQL has test suites that read resources when running tests. The way of getting the resource path is common across different suites, so we can extract it into a function to ease code maintenance.
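A hedged sketch of the helper's shape (the actual signature and lookup in the suite base class may differ):
```scala
import java.nio.file.{Path, Paths}

// Resolve a path against the Spark source root so every suite finds test files the same way.
def getWorkspaceFilePath(first: String, more: String*): Path = {
  val sparkHome = sys.props.get("spark.test.home")
    .orElse(sys.env.get("SPARK_HOME"))
    .getOrElse(".")
  Paths.get(sparkHome, (first +: more): _*)
}

getWorkspaceFilePath("sql", "core", "src", "test", "resources")
```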
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass existing tests.
Closes#32315 from Ngone51/extract-common-file-path.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Generally, we would expect that x = y => hash(x) = hash(y). However, +0 and -0 hash to different values for floating point types.
```
scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as double))").show
+-------------------------+--------------------------+
|hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
+-------------------------+--------------------------+
| -1670924195| -853646085|
+-------------------------+--------------------------+
scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as double)").show
+--------------------------------------------+
|(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
+--------------------------------------------+
| true|
+--------------------------------------------+
```
Here is an extract from IEEE 754:
> The two zeros are distinguishable arithmetically only by either division-by-zero (producing appropriately signed infinities) or else by the CopySign function recommended by IEEE 754/854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases
From this, I deduce that the hash function must produce the same result for 0 and -0.
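A small, self-contained sketch of the normalization idea, with illustrative names rather than Spark's internal hash code:
```scala
import java.lang.Double.doubleToRawLongBits

// Equal values must hash equally, but the raw bit patterns of +0.0 and -0.0 differ,
// so a bit-based hash breaks the contract. Normalizing -0.0 to 0.0 before hashing fixes it.
val posZero = 0.0d
val negZero = -0.0d
assert(posZero == negZero)                                            // equal per IEEE 754
assert(doubleToRawLongBits(posZero) != doubleToRawLongBits(negZero))  // but different bits

def normalize(d: Double): Double = if (d == 0.0d) 0.0d else d         // maps -0.0 to 0.0
assert(doubleToRawLongBits(normalize(negZero)) == doubleToRawLongBits(normalize(posZero)))
```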
### Why are the changes needed?
It is a correctness issue
### Does this PR introduce _any_ user-facing change?
This change only affects the hash function applied to the -0 value for float and double types.
### How was this patch tested?
Unit testing and manual testing
Closes#32496 from planga82/feature/spark35207_hashnegativezero.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch replaces `sys.err` usages with explicit exception types.
### Why are the changes needed?
Motivated by the previous comment https://github.com/apache/spark/pull/32519#discussion_r630787080, it sounds better to replace `sys.err` usages with explicit exception types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32535 from viirya/replace-sys-err.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/core/src/main/scala/org/apache/spark/sql/streaming`.
### Why are the changes needed?
It will largely help with standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32464 from beliefer/SPARK-35062.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a new config to make cache plan disable configs configurable.
### Why are the changes needed?
The configs disabled when caching a plan are meant to avoid performance regressions, but not all queries run slower than before with AQE or bucketed scans enabled. It's useful to add a new config so that users can decide which configs should be disabled when caching a plan.
### Does this PR introduce _any_ user-facing change?
Yes, a new config.
### How was this patch tested?
Add test.
Closes#32482 from ulysses-you/SPARK-35332.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add New SQL functions:
* TRY_ADD
* TRY_DIVIDE
These expressions are identical to the following expressions under ANSI mode, except that they return null if an error occurs:
* ADD
* DIVIDE
Note: it is easy to add other expressions like `TRY_SUBTRACT`/`TRY_MULTIPLY` but let's control the number of these new expressions and just add `TRY_ADD` and `TRY_DIVIDE` for now.
### Why are the changes needed?
1. Users can manage to finish queries without interruptions in ANSI mode.
2. Users can get NULLs instead of unreasonable results if overflow occurs when ANSI mode is off.
For example, the behavior of the following SQL operations is unreasonable:
```
2147483647 + 2 => -2147483647
```
With the new safe version SQL functions:
```
TRY_ADD(2147483647, 2) => null
```
Note: **We should only add new expressions to important operators, instead of adding new safe expressions for all the expressions that can throw errors.**
### Does this PR introduce _any_ user-facing change?
Yes, new SQL functions: TRY_ADD/TRY_DIVIDE
### How was this patch tested?
Unit test
Closes#32292 from gengliangwang/try_add.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
In Spark, we have an extension in the MERGE syntax: INSERT/UPDATE *. This is not from ANSI standard or any other mainstream databases, so we need to define the behaviors by our own.
The behavior today is very weird: assume the source table has `n1` columns, target table has `n2` columns. We generate the assignments by taking the first `min(n1, n2)` columns from source & target tables and pairing them by ordinal.
This PR proposes a more reasonable behavior: take all the columns from target table as keys, and find the corresponding columns from source table by name as values.
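A hedged illustration of the proposed by-name semantics; the table names are hypothetical, and MERGE only works against data sources that support it:
```scala
// Hypothetical target/source tables registered with a MERGE-capable data source.
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET *
  WHEN NOT MATCHED THEN INSERT *
""")
// With the proposed semantics, UPDATE SET * / INSERT * take every column of the target
// table as the key and look up the source column with the same name, rather than pairing
// the first min(n1, n2) columns by position.
```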
### Why are the changes needed?
Fix MERGE INSERT/UPDATE * to be more user-friendly and make schema evolution easier.
### Does this PR introduce _any_ user-facing change?
Yes, but MERGE is only supported by very few data sources.
### How was this patch tested?
new tests
Closes#32192 from cloud-fan/merge.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Change `map` in `InvokeLike.invoke` to a while loop to improve performance, following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).
### Why are the changes needed?
`InvokeLike.invoke`, which is used in non-codegen path for `Invoke` and `StaticInvoke`, currently uses `map` to evaluate arguments:
```scala
val args = arguments.map(e => e.eval(input).asInstanceOf[Object])
if (needNullCheck && args.exists(_ == null)) {
// return null if one of arguments is null
null
} else {
...
```
which is pretty expensive if the method itself is trivial. We can change it to a plain while loop.
<img width="871" alt="Screen Shot 2021-05-12 at 12 19 59 AM" src="https://user-images.githubusercontent.com/506679/118055719-7f985a00-b33d-11eb-943b-cf85eab35f44.png">
Benchmark results show this can improve as much as 3x from `V2FunctionBenchmark`:
Before
```
OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------
native_long_add 36506 36656 251 13.7 73.0 1.0X
java_long_add_default 47151 47540 370 10.6 94.3 0.8X
java_long_add_magic 178691 182457 1327 2.8 357.4 0.2X
java_long_add_static_magic 177151 178258 1151 2.8 354.3 0.2X
```
After
```
OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.4.0-1046-azure
Intel(R) Xeon(R) CPU E5-2673 v3 2.40GHz
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
--------------------------------------------------------------------------------------------------------------------------------------------------------------
native_long_add 29897 30342 568 16.7 59.8 1.0X
java_long_add_default 40628 41075 664 12.3 81.3 0.7X
java_long_add_magic 54553 54755 182 9.2 109.1 0.5X
java_long_add_static_magic 55410 55532 127 9.0 110.8 0.5X
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32527 from sunchao/SPARK-35384.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Switch to plain `while` loop following Spark [style guide](https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex).
### Why are the changes needed?
A `while` loop may yield better performance compared to `foreach`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes#32522 from sunchao/SPARK-35361-follow-up.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
A simple follow-up of #32474 to throw exception instead of sys.error.
### Why are the changes needed?
Throwing an exception only fails the query, instead of using `sys.error`.
### Does this PR introduce _any_ user-facing change?
Yes, if `Invoke` or `StaticInvoke` cannot find the method, instead of original `sys.error` now we only throw an exception.
### How was this patch tested?
Existing tests.
Closes#32519 from viirya/SPARK-35347-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to bump up the janino version from 3.0.16 to v3.1.4.
The major changes of this upgrade are as follows:
- Fixed issue #131: Janino 3.1.2 is 10x slower than 3.0.11: The Compiler's IClassLoader was initialized way too eagerly, thus lots of classes were loaded from the class path, which is very slow.
- Improved the encoding of stack map frames according to JVMS11 4.7.4: Previously, only "full_frame"s were generated.
- Fixed issue #107: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
- Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions).
For all the changes, please see the change log: http://janino-compiler.github.io/janino/changelog.html
NOTE1: I've checked that there is no obvious performance regression. For all the data, see a link: https://docs.google.com/spreadsheets/d/1srxT9CioGQg1fLKM3Uo8z1sTzgCsMj4pg6JzpdcG6VU/edit?usp=sharing
NOTE2: We upgraded janino to 3.1.2 (#27860) once before, but the commit was reverted in #29495 because of a correctness issue. Recently, #32374 checked whether Spark could land on v3.1.3, but a new bug was found there. These known issues have been fixed in v3.1.4 by the following PRs:
- janino-compiler/janino#145
- janino-compiler/janino#146
### Why are the changes needed?
janino v3.0.X is no longer maintained.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA passed.
Closes#32455 from maropu/janino_v3.1.4.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
According to the discussion at https://github.com/apache/spark/pull/25854#discussion_r629451135
### Why are the changes needed?
Clean code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existed UT
Closes#32499 from AngersZhuuuu/SPARK-29145-fix.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In `ApplyFunctionExpression`, move `zipWithIndex` out of the loop for each input row.
### Why are the changes needed?
When the `ScalarFunction` is trivial, `zipWithIndex` could incur significant costs, as shown below:
<img width="899" alt="Screen Shot 2021-05-11 at 10 03 42 AM" src="https://user-images.githubusercontent.com/506679/117866421-fb19de80-b24b-11eb-8c94-d5e8c8b1eda9.png">
By removing it out of the loop, I'm seeing sometimes 2x speedup from `V2FunctionBenchmark`. For instance:
Before:
```
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
native_long_add 32437 32896 434 15.4 64.9 1.0X
java_long_add_default 85675 97045 NaN 5.8 171.3 0.4X
```
After:
```
scalar function (long + long) -> long, result_nullable = false codegen = false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
native_long_add 30182 30387 279 16.6 60.4 1.0X
java_long_add_default 42862 43009 209 11.7 85.7 0.7X
```
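A self-contained sketch of the hoisting idea with illustrative names (not the actual `ApplyFunctionExpression` code): compute `zipWithIndex` once and reuse it across rows instead of rebuilding it per call:
```scala
// Precompute the (child, ordinal) pairs once rather than calling zipWithIndex per row.
val children: Seq[Long => Long] = Seq(_ + 1L, _ * 2L)

// Before: allocates a new zipped collection for every input row
def evalPerRow(row: Long): Seq[Long] =
  children.zipWithIndex.map { case (f, i) => f(row + i) }

// After: the zipped pairs are computed once and shared across all rows
lazy val zipped: Seq[(Long => Long, Int)] = children.zipWithIndex
def evalHoisted(row: Long): Seq[Long] =
  zipped.map { case (f, i) => f(row + i) }
```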
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#32507 from sunchao/SPARK-35361.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- BOOL_AGG
- COUNT_IF
- CURRENT_LIKE
- RUNTIME_REPLACEABLE
Added tree traversal pruning to the following rules:
- ReplaceExpressions
- RewriteNonCorrelatedExists
- ComputeCurrentTime
- GetCurrentDatabaseAndCatalog
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
Performance improvement (org.apache.spark.sql.TPCDSQuerySuite):
| Rule name | Total Time (baseline) | Total Time (experiment) | experiment/baseline |
| --- | --- | --- | --- |
| ReplaceExpressions | 27546369 | 19753804 | 0.72 |
| RewriteNonCorrelatedExists | 17304883 | 2086194 | 0.12 |
| ComputeCurrentTime | 35751301 | 19984477 | 0.56 |
| GetCurrentDatabaseAndCatalog | 37230787 | 18874013 | 0.51 |
### How was this patch tested?
Existing tests.
Closes#32461 from sigmod/finish_analysis.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
The Sequence expression outputs a confusing error message.
This PR will fix the issue.
### Why are the changes needed?
Improve the error message for Sequence expression
### Does this PR introduce _any_ user-facing change?
Yes. this PR updates the error message of Sequence expression.
### How was this patch tested?
Tests updated.
Closes#32492 from beliefer/SPARK-35088-followup.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change the definition of `findTightestCommonType` from
```
def findTightestCommonType(t1: DataType, t2: DataType): Option[DataType]
```
to
```
val findTightestCommonType: (DataType, DataType) => Option[DataType]
```
### Why are the changes needed?
For backward compatibility.
When running a MongoDB connector (built with Spark 3.1.1) against the latest master, there is an error like this:
```
java.lang.NoSuchMethodError: org.apache.spark.sql.catalyst.analysis.TypeCoercion$.findTightestCommonType()Lscala/Function2
```
from https://github.com/mongodb/mongo-spark/blob/master/src/main/scala/com/mongodb/spark/sql/MongoInferSchema.scala#L150
In the previous release, the function was
```
static public scala.Function2<org.apache.spark.sql.types.DataType, org.apache.spark.sql.types.DataType, scala.Option<org.apache.spark.sql.types.DataType>> findTightestCommonType ()
```
After https://github.com/apache/spark/pull/31349, the function becomes:
```
static public scala.Option<org.apache.spark.sql.types.DataType> findTightestCommonType (org.apache.spark.sql.types.DataType t1, org.apache.spark.sql.types.DataType t2)
```
This PR is to reduce the unnecessary API change.
### Does this PR introduce _any_ user-facing change?
Yes, the definition of `TypeCoercion.findTightestCommonType` is consistent with previous release again.
### How was this patch tested?
Existing unit tests
Closes#32493 from gengliangwang/typecoercion.
Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
RepairTableCommand respects `spark.sql.addPartitionInBatch.size` too
### Why are the changes needed?
Make the batch size RepairTableCommand uses when adding partitions configurable.
### Does this PR introduce _any_ user-facing change?
Users can use `spark.sql.addPartitionInBatch.size` to change the batch size when repairing a table.
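A usage sketch in a `spark-shell` session; the table name is hypothetical, the config is the one named above:
```scala
// Larger batches mean fewer metastore round trips when many partitions are discovered.
spark.conf.set("spark.sql.addPartitionInBatch.size", "500")
spark.sql("MSCK REPAIR TABLE events")
```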
### How was this patch tested?
Not needed.
Closes#32489 from AngersZhuuuu/SPARK-35360.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Rename pattern strings and regexps of year-month and day-time intervals.
### Why are the changes needed?
To improve code maintainability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By existing test suites.
Closes#32444 from AngersZhuuuu/SPARK-35111-followup.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
When `numSlices` is available, `logical.Range` should compute an exact `maxRowsPerPartition`.
### Why are the changes needed?
`maxRowsPerPartition` is used in the optimizer; we should provide an exact value when possible.
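A sketch of the bound, assuming the usual ceiling-division split of the element count across `numSlices` partitions (the formula is my assumption, not quoted from the PR):
```scala
// Ceiling division: the largest partition holds at most this many rows.
def maxRowsPerPartition(numElements: Long, numSlices: Int): Long =
  (numElements + numSlices - 1) / numSlices

maxRowsPerPartition(10L, 3)  // 4
```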
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#32350 from zhengruifeng/range_maxRowsPerPartition.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This patch proposes to use `MethodUtils` for looking up methods `Invoke` and `StaticInvoke` expressions.
### Why are the changes needed?
Currently we write our own method-lookup logic in the `Invoke` and `StaticInvoke` expressions. It is tricky to cover all the cases, and there is already an existing utility package for this purpose. We should reuse it.
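A small sketch of the lookup difference using Apache Commons Lang's `MethodUtils`; `java.lang.Math` is just a stand-in target class for illustration:
```scala
import org.apache.commons.lang3.reflect.MethodUtils

// Class#getMethod needs the exact parameter classes, so boxed argument classes miss
// Math.max(long, long); MethodUtils also accepts assignable and boxed/unboxed types.
val m = MethodUtils.getMatchingAccessibleMethod(
  classOf[java.lang.Math], "max", classOf[java.lang.Long], classOf[java.lang.Long])
assert(m != null)  // resolves max(long, long) despite the boxed argument classes

// The exact-match lookup would throw NoSuchMethodException for the same inputs:
// classOf[java.lang.Math].getMethod("max", classOf[java.lang.Long], classOf[java.lang.Long])
```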
### Does this PR introduce _any_ user-facing change?
No, internal change only.
### How was this patch tested?
Existing tests.
Closes#32474 from viirya/invoke-util.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR makes the below case work well.
```sql
select a b from values(1) t(a) distribute by a;
```
```
== Parsed Logical Plan ==
'RepartitionByExpression ['a]
+- 'Project ['a AS b#42]
+- 'SubqueryAlias t
+- 'UnresolvedInlineTable [a], [List(1)]
== Analyzed Logical Plan ==
org.apache.spark.sql.AnalysisException: cannot resolve 'a' given input columns: [b]; line 1 pos 62;
'RepartitionByExpression ['a]
+- Project [a#48 AS b#42]
+- SubqueryAlias t
+- LocalRelation [a#48]
```
### Why are the changes needed?
bugfix
### Does this PR introduce _any_ user-facing change?
yes, the original attributes can be used in `distribute by` / `cluster by` and hints like `/*+ REPARTITION(3, c) */`
### How was this patch tested?
new tests
Closes#32465 from yaooqinn/SPARK-35331.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Retain column metadata during the process of nested column pruning, when constructing `StructField`.
To test the above change, this also added the logic of column projection in `InMemoryTable`. Without the fix `DSV2CharVarcharDDLTestSuite` will fail.
### Why are the changes needed?
The column metadata is used in a few places such as re-constructing CHAR/VARCHAR information such as in [SPARK-33901](https://issues.apache.org/jira/browse/SPARK-33901). Therefore, we should retain the info during nested column pruning.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32354 from sunchao/SPARK-35232.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This allows a `ScalarFunction` implemented in Java to optionally declare the magic method `invoke` as static, which can be used if the UDF is stateless. Compared to a non-static method, it can potentially give better performance due to the elimination of dynamic dispatch, etc.
Also added a benchmark to measure performance of: the default `produceResult`, non-static magic method and static magic method.
### Why are the changes needed?
For UDFs that are stateless (e.g., no need to maintain intermediate state between each function call), it's better to allow users to implement the UDF function as static method which could potentially give better performance.
### Does this PR introduce _any_ user-facing change?
Yes. Spark users can now have the choice to define static magic method for `ScalarFunction` when it is written in Java and when the UDF is stateless.
### How was this patch tested?
Added new UT.
Closes#32407 from sunchao/SPARK-35261.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch proposes to make StaticInvoke able to find method with given method name even the parameter types do not exactly match to argument classes.
### Why are the changes needed?
Unlike `Invoke`, `StaticInvoke` only tries to get the method with the exact argument classes. If the called method's parameter types do not exactly match the argument classes, `StaticInvoke` cannot find the method.
`StaticInvoke` should be able to find the method in such cases too.
### Does this PR introduce _any_ user-facing change?
Yes. `StaticInvoke` can find a method even if the argument classes do not match exactly.
### How was this patch tested?
Unit test.
Closes#32413 from viirya/static-invoke.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog`.
### Why are the changes needed?
It will largely help with standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32377 from beliefer/SPARK-35021.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- APPEND_COLUMNS
- DESERIALIZE_TO_OBJECT
- LAMBDA_VARIABLE
- MAP_OBJECTS
- SERIALIZE_FROM_OBJECT
- PROJECT
- TYPED_FILTER
Added tree traversal pruning to the following rules dealing with objects:
- EliminateSerialization
- CombineTypedFilters
- EliminateMapObjects
- ObjectSerializerPruning
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
### How was this patch tested?
Existing tests.
Closes#32451 from sigmod/object.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
If `targetObject` is not nullable, we don't need the object null check in `Invoke`.
### Why are the changes needed?
small perf improvement
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#32466 from cloud-fan/invoke.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR groups exception messages in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util`.
### Why are the changes needed?
It will largely help with standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.
Closes#32367 from beliefer/SPARK-35020.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow up to https://github.com/apache/spark/pull/32032#discussion_r620928086. Basically, `children`/`innerChildren` should be mutually exclusive for `AlterViewAsCommand` and `CreateViewCommand`, which extend `AnalysisOnlyCommand`. Otherwise, there could be an issue in the `EXPLAIN` command. Currently, this is not an issue, because these commands will be analyzed (children will always be empty) when the `EXPLAIN` command is run.
### Why are the changes needed?
To be future-proof where these commands are directly used.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new tests.
Closes#32447 from imback82/SPARK-34701-followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Hide internal view properties from the DESCRIBE TABLE command, because those properties are generated by Spark and should be transparent to the end user.
### Why are the changes needed?
Avoid internal properties confusing the users.
### Does this PR introduce _any_ user-facing change?
Yes
Before this change, the user will see below output for `describe formatted test_view`
```
....
Table Properties [view.catalogAndNamespace.numParts=2, view.catalogAndNamespace.part.0=spark_catalog, view.catalogAndNamespace.part.1=default, view.query.out.col.0=c, view.query.out.col.1=v, view.query.out.numCols=2, view.referredTempFunctionsNames=[], view.referredTempViewNames=[]]
...
```
After this change, the internal properties will be hidden for `describe formatted test_view`
```
...
Table Properties []
...
```
### How was this patch tested?
existing UT
Closes#32441 from linhongliu-db/hide-properties.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pull request proposes a new API for streaming sources to signal that they can report metrics, and adds a use case: the Kafka micro-batch stream reports how many offsets the current offset falls behind the latest.
A public interface is added.
`metrics`: returns the metrics reported by the streaming source with given offset.
### Why are the changes needed?
The new API can expose any custom metrics for the "current" offset of streaming sources. Different from #31398, this PR makes metrics available to users through the progress report, not through the Spark UI. A use case is that people want to know how far the current offset falls behind the latest offset.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test for Kafka micro batch source v2 are added to test the Kafka use case.
Closes#31944 from yijiacui-db/SPARK-34297.
Authored-by: Yijia Cui <yijia.cui@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.
This is the rework of #31887. Closes#31887.
### Why are the changes needed?
This moves away from the current hard-coded variable names, which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable).
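A minimal sketch of the naming scheme with illustrative names (not the actual `LambdaVariable` code): a process-wide counter yields unique names, so nested higher-order functions cannot shadow each other:
```scala
import java.util.concurrent.atomic.AtomicLong

// A process-wide counter yields a unique id per variable, so an inner transform can no
// longer reuse (and shadow) the name handed out for an outer transform's variable.
object FreshNameGenerator {
  private val counter = new AtomicLong(0L)
  def freshName(hint: String): String = s"${hint}_${counter.getAndIncrement()}"
}

FreshNameGenerator.freshName("x")  // e.g. "x_0"
FreshNameGenerator.freshName("x")  // "x_1", distinct even for the same hint
```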
For this query:
```
val df = Seq(
(Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")
df.select(
f.flatten(
f.transform(
$"numbers",
(number: Column) => { f.transform(
$"letters",
(letter: Column) => { f.struct(
number.as("number"),
letter.as("letter")
) }
) }
)
).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added the new test in `DataFrameFunctionsSuite`.
Closes#32424 from maropu/pr31887.
Lead-authored-by: dsolow <dsolow@sayari.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: dmsolow <dsolow@sayarianalytics.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Added the following TreePattern enums:
- CREATE_NAMED_STRUCT
- EXTRACT_VALUE
- JSON_TO_STRUCT
- OUTER_REFERENCE
- AGGREGATE
- LOCAL_RELATION
- EXCEPT
- LIMIT
- WINDOW
Used them in the following rules:
- DecorrelateInnerQuery
- LimitPushDownThroughWindow
- OptimizeCsvJsonExprs
- PropagateEmptyRelation
- PullOutGroupingExpressions
- PushLeftSemiLeftAntiThroughJoin
- ReplaceExceptWithFilter
- RewriteDistinctAggregates
- SimplifyConditionalsInPredicate
- UnwrapCastInBinaryComparison
### Why are the changes needed?
Reduce the number of tree traversals and hence improve the query compilation latency.
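As a rough illustration of how pattern bits avoid unnecessary traversals (a sketch against Spark's `TreePatternBits` API; details may differ between versions):
```scala
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.trees.TreePattern.AGGREGATE

// Skip the rewrite entirely when the plan's precomputed pattern bits show that
// no Aggregate node exists anywhere in the tree.
def applyIfRelevant(plan: LogicalPlan)(rewrite: LogicalPlan => LogicalPlan): LogicalPlan =
  if (plan.containsPattern(AGGREGATE)) rewrite(plan) else plan
```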
### How was this patch tested?
Existing tests.
Closes#32421 from sigmod/opt.
Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
### What changes were proposed in this pull request?
In `StaticInvoke`, when result is nullable, don't box the return value if its type is primitive.
### Why are the changes needed?
It is unnecessary to apply boxing when the method return value is of primitive type, and boxing can hurt performance significantly when the method itself is cheap. This check is already done in `Invoke` but not in `StaticInvoke`.
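For illustration, this is the underlying (value, isNull) idiom that avoids boxing for nullable primitive results (a simplified sketch, not the actual generated code):
```scala
// Instead of returning a boxed java.lang.Integer for a nullable result, track a
// primitive value together with a separate null flag, as generated code does.
final class NullableIntResult {
  var isNull: Boolean = true
  var value: Int = 0
}

def evalLength(s: String, out: NullableIntResult): Unit =
  if (s == null) out.isNull = true
  else { out.isNull = false; out.value = s.length }
```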
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a UT.
Closes#32416 from sunchao/SPARK-35281.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Extend Spark SQL parser to support parsing of:
- `INTERVAL YEAR TO MONTH` to `YearMonthIntervalType`
- `INTERVAL DAY TO SECOND` to `DayTimeIntervalType`
2. Assign new names to the ANSI interval types according to the SQL standard so that the Spark SQL parser can parse the names back. Override the `typeName()` of `YearMonthIntervalType`/`DayTimeIntervalType`.
### Why are the changes needed?
To be able to use new ANSI interval types in SQL. The SQL standard requires the types to be defined according to the rules:
```
<interval type> ::= INTERVAL <interval qualifier>
<interval qualifier> ::= <start field> TO <end field> | <single datetime field>
<start field> ::= <non-second primary datetime field> [ <left paren> <interval leading field precision> <right paren> ]
<end field> ::= <non-second primary datetime field> | SECOND [ <left paren> <interval fractional seconds precision> <right paren> ]
<primary datetime field> ::= <non-second primary datetime field> | SECOND
<non-second primary datetime field> ::= YEAR | MONTH | DAY | HOUR | MINUTE
<interval fractional seconds precision> ::= <unsigned integer>
<interval leading field precision> ::= <unsigned integer>
```
Currently, Spark SQL supports only `YEAR TO MONTH` and `DAY TO SECOND` as `<interval qualifier>`.
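A hedged example of what the new type names enable once the parser accepts them (assumes an active SparkSession named `spark` and a build with ANSI interval types):
```scala
// The parser now accepts the standard type names INTERVAL YEAR TO MONTH and
// INTERVAL DAY TO SECOND, so they can be used wherever a data type is expected.
spark.sql(
  "SELECT CAST(NULL AS INTERVAL YEAR TO MONTH) AS ym, CAST(NULL AS INTERVAL DAY TO SECOND) AS ds"
).printSchema()
```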
### Does this PR introduce _any_ user-facing change?
Should not, since the types have not been released yet.
### How was this patch tested?
By running the affected tests such as:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "test:testOnly *ExpressionTypeCheckingSuite"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z windowFrameCoercion.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql"
```
Closes#32409 from MaxGekk/parse-ansi-interval-types.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support casting a string to a day-time interval.
### Why are the changes needed?
Users can cast a day-time interval string to `DayTimeIntervalType`.
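A hedged usage sketch (the exact set of accepted string formats may differ from what is shown; assumes an active SparkSession named `spark`):
```scala
// Cast a day-time interval string directly to DayTimeIntervalType.
spark.sql("SELECT CAST('1 02:03:04.001' AS INTERVAL DAY TO SECOND) AS dt").show(truncate = false)
```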
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT
Closes#32271 from AngersZhuuuu/SPARK-35112.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR adds a new rule `PullOutGroupingExpressions` to pull out complex grouping expressions to a `Project` node under an `Aggregate`. These expressions are then referenced in both grouping expressions and aggregate expressions without aggregate functions to ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions.
### Why are the changes needed?
If the aggregate expressions (without aggregate functions) in an `Aggregate` node are complex, the `Optimizer` can optimize grouping expressions out of them, making the aggregate expressions invalid.
Here is a simple example:
```
SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL
```
In this case the `BooleanSimplification` rule does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L] Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
+- Project [value#219 AS id#222] +- Project [value#219 AS id#222]
+- LocalRelation [value#219] +- LocalRelation [value#219]
```
where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.
Before this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
+- LocalRelation [value#219]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```
After this PR:
```
== Optimized Logical Plan ==
Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (id IS NULL))#230, count(1) AS c#228L]
+- Project [isnull(value#219) AS _groupingexpression#233]
+- LocalRelation [value#219]
```
and the query works.
### Does this PR introduce _any_ user-facing change?
Yes, the query works.
### How was this patch tested?
Added new UT.
Closes#32396 from peter-toth/SPARK-34581-keep-grouping-expressions-2.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch fixes `Invoke` expression when the target object has more than one method with the given method name.
### Why are the changes needed?
`Invoke` looks up the method on the target object by the given method name. If there is more than one method with that name, it is currently non-deterministic which one will be used. We should also match on the number of parameters when looking up the method.
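Conceptually, the fix corresponds to resolving by name and arity rather than by name alone (a simplified sketch, not Spark's actual lookup code):
```scala
import java.lang.reflect.Method

// Pick the overload whose parameter count matches the number of arguments.
def findMethod(cls: Class[_], name: String, numArgs: Int): Option[Method] =
  cls.getMethods.find(m => m.getName == name && m.getParameterCount == numArgs)
```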
### Does this PR introduce _any_ user-facing change?
Yes, fixed a bug when using `Invoke` on an object that has more than one method with the given method name.
### How was this patch tested?
Unit test.
Closes#32404 from viirya/verify-invoke-param-len.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR makes `CombineFilters` support non-deterministic expressions. For example:
```scala
spark.sql("CREATE TABLE t1(id INT, dt STRING) using parquet PARTITIONED BY (dt)")
spark.sql("CREATE VIEW v1 AS SELECT * FROM t1 WHERE dt NOT IN ('2020-01-01', '2021-01-01')")
spark.sql("SELECT * FROM v1 WHERE dt = '2021-05-01' AND rand() <= 0.01").explain()
```
Before this pr:
```
== Physical Plan ==
*(1) Filter (isnotnull(dt#1) AND ((dt#1 = 2021-05-01) AND (rand(-6723800298719475098) <= 0.01)))
+- *(1) ColumnarToRow
+- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [NOT dt#1 IN (2020-01-01,2021-01-01)], PushedFilters: [], ReadSchema: struct<id:int>
```
After this pr:
```
== Physical Plan ==
*(1) Filter (rand(-2400509328955813273) <= 0.01)
+- *(1) ColumnarToRow
+- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#1), NOT dt#1 IN (2020-01-01,2021-01-01), (dt#1 = 2021-05-01)], PushedFilters: [], ReadSchema: struct<id:int>
```
### Why are the changes needed?
Improve query performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#32405 from wangyum/SPARK-35273.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
~~This PR aims to add a new AQE optimizer rule `DynamicJoinSelection`. Like other AQE partition number configs, this rule adds a new broadcast threshold config `spark.sql.adaptive.autoBroadcastJoinThreshold`.~~
This PR aims to add a flag in `Statistics` to distinguish AQE stats from normal stats, so that some SQL configs can be isolated between AQE and normal planning.
### Why are the changes needed?
The main idea is to isolate the join configs between the normal planner and the AQE planner, which share the same code path.
In practice, static stats are not reliable enough to decide whether a broadcast hash join can be built. In our experience it is very common for Spark to throw a broadcast timeout or a driver-side OOM when executing a somewhat large plan. Also, a broadcast join is not reversible: if the join is converted to a broadcast hash join up front, AQE cannot optimize it again. So it makes sense to let the AQE side decide whether to broadcast, using a different SQL config.
### Does this PR introduce _any_ user-facing change?
Yes, a new config `spark.sql.adaptive.autoBroadcastJoinThreshold` is added.
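A hedged usage sketch of the isolation this enables (config names as given in this PR; the values are illustrative):
```scala
// Keep the static planner conservative, but let AQE decide broadcasts from
// runtime statistics using its own threshold.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "50m")
```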
### How was this patch tested?
Add new test.
Closes#32391 from ulysses-you/SPARK-35264.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support casting strings to year-month intervals.
The supported formats are shown below:
```
ANSI_STYLE, like
INTERVAL -'-10-1' YEAR TO MONTH
HIVE_STYLE like
10-1 or -10-1
Rules from the SQL standard about ANSI_STYLE:
<interval literal> ::=
INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
<quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
[ <sign> ] { <year-month literal> | <day-time literal> }
<year-month literal> ::=
<years value> [ <minus sign> <months value> ]
| <months value>
<years value> ::=
<datetime value>
<months value> ::=
<datetime value>
<datetime value> ::=
<unsigned integer>
<unsigned integer> ::= <digit>...
```
### Why are the changes needed?
Support Cast string to year-month interval
### Does this PR introduce _any_ user-facing change?
Users can cast a year-month interval string to `YearMonthIntervalType`.
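A hedged usage sketch (exact accepted formats may differ; assumes an active SparkSession named `spark`):
```scala
// Cast a HIVE_STYLE year-month interval string to YearMonthIntervalType.
spark.sql("SELECT CAST('10-1' AS INTERVAL YEAR TO MONTH) AS ym").show(truncate = false)
```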
### How was this patch tested?
Added UT
Closes#32266 from AngersZhuuuu/SPARK-SPARK-35111.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to enable the JSON datasources to write non-ascii characters as codepoints.
To enable/disable this feature, I introduce a new option `writeNonAsciiCharacterAsCodePoint` for JSON datasources.
### Why are the changes needed?
The JSON specification allows code points as literals, but Spark SQL's JSON datasources don't provide a way to write them.
It would be useful to write non-ASCII characters as code points, which is a platform-neutral representation.
### Does this PR introduce _any_ user-facing change?
Yes. Users can write non-ASCII characters as code points with JSON datasources.
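A hedged usage sketch (the option name is taken from this PR's description; the data and output path are illustrative):
```scala
import spark.implicits._  // assumes an active SparkSession named `spark`

// Write non-ASCII characters as \uXXXX code points instead of raw characters.
Seq(("あ", 1)).toDF("char", "id")
  .write
  .option("writeNonAsciiCharacterAsCodePoint", "true")
  .json("/tmp/json-codepoint-output")
```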
### How was this patch tested?
New test.
Closes#32147 from sarutak/json-unicode-write.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>