Commit graph

11674 commits

Kousuke Saruta 586eb5d4c6 Revert "[SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported"
### What changes were proposed in this pull request?

This PR reverts the change in SPARK-36429 (#33654).
See [conversation](https://github.com/apache/spark/pull/33654#issuecomment-894160037).

### Why are the changes needed?

To recover CIs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #33670 from sarutak/revert-SPARK-36429.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit e17612d0bf)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-06 20:56:40 +09:00
gaoyajun02 33e4ce562a [SPARK-36339][SQL] References to grouping that are not part of aggregation should be replaced
### What changes were proposed in this pull request?

Currently, references to grouping columns that appear after aggregate expressions are reported as errors, e.g.
```
SELECT count(name) c, name
FROM VALUES ('Alice'), ('Bob') people(name)
GROUP BY name GROUPING SETS(name);
```
Error in query: expression 'people.`name`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
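For reference, a hedged sketch of the expected behavior once the fix is in place (row order in the output is illustrative):

```scala
// Hedged sketch: after the fix, the non-aggregated reference to `name` resolves against
// the grouping columns instead of failing analysis (row order shown is illustrative).
spark.sql("""
  SELECT count(name) c, name
  FROM VALUES ('Alice'), ('Bob') people(name)
  GROUP BY name GROUPING SETS(name)
""").show()
// +---+-----+
// |  c| name|
// +---+-----+
// |  1|Alice|
// |  1|  Bob|
// +---+-----+
```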

### Why are the changes needed?

Fix the `map` anonymous function in the `constructAggregateExprs` function, which does not use underscores, so that references to grouping columns that are not part of the aggregation are replaced correctly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33574 from gaoyajun02/SPARK-36339.

Lead-authored-by: gaoyajun02 <gaoyajun02@gmail.com>
Co-authored-by: gaoyajun02 <gaoyajun02@meituan.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 888f8f03c8)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 16:35:01 +08:00
Kousuke Saruta f3761bdb76 [SPARK-36429][SQL][FOLLOWUP] Update a golden file to comply with the change in SPARK-36429
### What changes were proposed in this pull request?

This PR updates a golden file to comply with the change in SPARK-36429 (#33654).

### Why are the changes needed?

To recover GA failure.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA itself.

Closes #33663 from sarutak/followup-SPARK-36429.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 63c7d1847d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 15:21:29 +08:00
gengjiaan be19270880 [SPARK-36429][SQL] JacksonParser should throw exception when data type unsupported
### What changes were proposed in this pull request?
Currently, when `spark.sql.timestampType=TIMESTAMP_NTZ` is set, the behavior differs between `from_json` and `from_csv`.
```
-- !query
select from_json('{"t":"26/October/2015"}', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<from_json({"t":"26/October/2015"}):struct<t:timestamp_ntz>>
-- !query output
{"t":null}
```

```
-- !query
select from_csv('26/October/2015', 't Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'))
-- !query schema
struct<>
-- !query output
java.lang.Exception
Unsupported type: timestamp_ntz
```

We should make `from_json` throw an exception too.
This PR addresses the discussion below:
https://github.com/apache/spark/pull/33640#discussion_r682862523
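To illustrate the intent (a hedged sketch, not the actual `JacksonParser` code; the converter shape and error message are illustrative):

```scala
// Hedged sketch of the intended behavior: when building a converter for a field, an
// unsupported data type should raise an explicit error instead of silently yielding null.
// This mirrors the idea only; it is not Spark's internal converter API.
import org.apache.spark.sql.types._

def makeConverterSketch(dataType: DataType): String => Any = dataType match {
  case StringType  => (s: String) => s
  case IntegerType => (s: String) => s.toInt
  // ... other supported types ...
  case other =>
    throw new UnsupportedOperationException(s"Unsupported type: ${other.typeName}")
}
```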

### Why are the changes needed?
Make the behavior of `from_json` more reasonable.

### Does this PR introduce _any_ user-facing change?
Yes.
`from_json` now throws an exception when `spark.sql.timestampType=TIMESTAMP_NTZ` is set.

### How was this patch tested?
Tests updated.

Closes #33654 from beliefer/SPARK-36429.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-06 14:01:13 +08:00
Wenchen Fan f719e9c200 [SPARK-36409][SQL][TESTS] Splitting test cases from datetime.sql
### What changes were proposed in this pull request?

Currently `datetime.sql` contains a lot of tests and will be run 3 times: default mode, ANSI mode, ntz mode. It wastes test time, and large test files are also hard to read.

This PR proposes to split it into smaller ones:
1. `date.sql`, which contains date literals, functions and operations. It will be run twice, with default and ANSI mode.
2. `timestamp.sql`, which contains timestamp (no ltz or ntz suffix) literals, functions and operations. It will be run 4 times: default mode + ANSI off, default mode + ANSI on, ntz mode + ANSI off, ntz mode + ANSI on.
3. `datetime_special.sql`, which creates datetime values whose year is outside of [0, 9999]. This is a separate file because JDBC doesn't support such values and needs to ignore this test file. It will be run 4 times as well.
4. `timestamp_ltz.sql`, which contains timestamp_ltz literals and constructors. It will be run twice, with default and ntz mode, to make sure its result doesn't change with the timestamp mode. Note that operations with ltz are tested by `timestamp.sql`.
5. `timestamp_ntz.sql`, which contains timestamp_ntz literals and constructors. It will be run twice, with default and ntz mode, to make sure its result doesn't change with the timestamp mode. Note that operations with ntz are tested by `timestamp.sql`.

### Why are the changes needed?

reduce test run time.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33640 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-06 12:55:31 +08:00
Kent Yao 0bb88c99f7 [SPARK-36421][SQL][DOCS] Use ConfigEntry.key to fix docs and set command results
### What changes were proposed in this pull request?

This PR fixes the issue that a `ConfigEntry` is interpolated into the doc field directly without calling `.key`, which causes malformed documents on the web site and in the result of `SET -v`.

1. https://spark.apache.org/docs/3.1.2/configuration.html#static-sql-configuration - spark.sql.hive.metastore.jars

2. set -v
![image](https://user-images.githubusercontent.com/8326978/128292412-85100f95-24fd-4b40-a14f-d31a256dab7d.png)
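For illustration, a hedged sketch of the root-cause pattern (a simplified stand-in type, not Spark's actual `ConfigEntry` class):

```scala
// Hedged sketch: interpolating the entry object itself puts its toString into the doc,
// while `.key` yields the intended config name. ConfigEntrySketch is a stand-in type.
final case class ConfigEntrySketch(key: String, doc: String)

val metastoreJars = ConfigEntrySketch("spark.sql.hive.metastore.jars", "...")

val badDoc  = s"See also ${metastoreJars}"      // malformed: whole case-class toString
val goodDoc = s"See also ${metastoreJars.key}"  // intended: "See also spark.sql.hive.metastore.jars"
```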

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

No, but it contains a doc fix.
### How was this patch tested?

new tests

Closes #33647 from yaooqinn/SPARK-36421.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c7fa3c9090)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-06 11:01:54 +09:00
Kent Yao 1785ead733 [SPARK-36414][SQL] Disable timeout for BroadcastQueryStageExec in AQE
### What changes were proposed in this pull request?

This reverts SPARK-31475, as there are always more concurrent jobs running in AQE mode, especially when running multiple queries at the same time. Currently, the broadcast timeout is not measured for the BroadcastQueryStageExec alone; it also includes the time spent waiting to be scheduled. If all the resources are currently occupied by materializing other stages, it times out without ever getting a chance to actually run.

 

![image](https://user-images.githubusercontent.com/8326978/128169612-4c96c8f6-6f8e-48ed-8eaf-450f87982c3b.png)

 

The default value is 300s, and it's hard to adjust the timeout for AQE mode. Usually, you need an extremely large number for real-world cases. As you can see in the example above, the timeout we used was 1800s, and obviously it still needed roughly 3x more.
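For reference, a hedged usage sketch of the workaround used today (the value is illustrative; after this change the timeout simply no longer applies to `BroadcastQueryStageExec` in AQE):

```scala
// Hedged sketch: before this change, users had to bump the broadcast timeout to a very
// large value for AQE workloads with many concurrent stages (1800 is illustrative).
spark.conf.set("spark.sql.broadcastTimeout", "1800")
```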

 

### Why are the changes needed?

AQE is the default now; we can make it more stable with this PR.

### Does this PR introduce _any_ user-facing change?

Yes, the broadcast timeout is no longer used for AQE.

### How was this patch tested?

modified test

Closes #33636 from yaooqinn/SPARK-36414.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0c94e47aec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-05 21:15:48 +08:00
Angerszhuuuu bf4edb5f5a [SPARK-36353][SQL] RemoveNoopOperators should keep output schema
### What changes were proposed in this pull request?
 RemoveNoopOperators should keep output schema
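A hedged illustration of the kind of issue this guards against (not taken from the PR; the query is hypothetical): a projection that only re-aliases a column is a no-op for the data, but removing it would change the output schema's column names.

```scala
// Hedged illustration: the select below is a "noop" projection data-wise, but its output
// schema differs from the child's (column renamed id -> ID). RemoveNoopOperators must not
// drop the renamed schema when it removes such operators.
import org.apache.spark.sql.functions.col

val df = spark.range(3).select(col("id").as("ID"))
df.printSchema()
// root
//  |-- ID: long (nullable = false)
```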

### Why are the changes needed?
Expand function

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #33587 from AngersZhuuuu/SPARK-36355.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 02810eecbf)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-05 20:43:48 +08:00
PengLei 8c42232638 [SPARK-36381][SQL] Add case sensitive and case insensitive compare for checking column name exist when alter table
### What changes were proposed in this pull request?
Add the resolver to `checkColumnNotExists` so that the check for whether a column name already exists respects case sensitivity.

### Why are the changes needed?
Currently the resolver is `_ == _` in `findNestedField`, which is called by `checkColumnNotExists`.
This PR passes `alter.conf.resolver` to it instead.
[SPARK-36381](https://issues.apache.org/jira/browse/SPARK-36381)
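A hedged sketch of what the case-aware comparison looks like (a simplified stand-in for the analysis resolver, not the PR's code):

```scala
// Hedged sketch: the resolver is a (String, String) => Boolean comparison that is exact
// when spark.sql.caseSensitive is true and case-insensitive otherwise.
def resolver(caseSensitive: Boolean): (String, String) => Boolean = {
  if (caseSensitive) {
    (a, b) => a == b
  } else {
    (a, b) => a.equalsIgnoreCase(b)
  }
}

resolver(caseSensitive = false)("Data", "data")  // true
resolver(caseSensitive = true)("Data", "data")   // false
```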
### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests.

Closes #33618 from Peng-Lei/sensitive-cloumn-name.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 87d49cbcb1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-04 10:04:25 +09:00
Max Gekk bd33408b4b [SPARK-36349][SQL] Disallow ANSI intervals in file-based datasources
### What changes were proposed in this pull request?
In the PR, I propose to ban `YearMonthIntervalType` and `DayTimeIntervalType` at the analysis phase while creating a table using a built-in file-based datasource or writing a dataset to such a datasource. In particular, add the following case:
```scala
case _: DayTimeIntervalType | _: YearMonthIntervalType => false
```
to all methods that override either:
- V2 `FileTable.supportsDataType()`
- V1 `FileFormat.supportDataType()`
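For illustration, a hedged sketch of such an override (the object name is illustrative; the real implementations live in the V1/V2 file formats listed above):

```scala
// Hedged sketch of a supportsDataType-style check that rejects ANSI interval types at
// analysis time while recursing into nested types (illustrative, not a specific format).
import org.apache.spark.sql.types._

object ExampleFormatSupport {
  def supportsDataType(dataType: DataType): Boolean = dataType match {
    case _: DayTimeIntervalType | _: YearMonthIntervalType => false
    case _: AtomicType => true
    case st: StructType => st.forall(f => supportsDataType(f.dataType))
    case ArrayType(elementType, _) => supportsDataType(elementType)
    case MapType(keyType, valueType, _) =>
      supportsDataType(keyType) && supportsDataType(valueType)
    case _ => false
  }
}
```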

### Why are the changes needed?
To improve user experience with Spark SQL, and output a proper error message at the analysis phase.

### Does this PR introduce _any_ user-facing change?
Yes, but ANSI interval types haven't been released yet. So, for users, this is new behavior.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 "test:testOnly *HiveOrcSourceSuite"
```

Closes #33580 from MaxGekk/interval-ban-in-ds.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 67cbc93263)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-03 20:30:33 +03:00
Kousuke Saruta cc75618e87 [SPARK-35815][SQL][FOLLOWUP] Add test considering the case spark.sql.legacy.interval.enabled is true
### What changes were proposed in this pull request?

This PR adds a test considering the case where `spark.sql.legacy.interval.enabled` is `true` for SPARK-35815.

### Why are the changes needed?

SPARK-35815 (#33456) changed `Dataset.withWatermark` to accept ANSI interval literals as `delayThreshold`, but I noticed the change didn't work with `spark.sql.legacy.interval.enabled=true`.
We couldn't detect this issue because there was no test that considered the legacy interval type at that time.
In SPARK-36323 (#33551), this issue was resolved, but it's better to add a test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33606 from sarutak/test-watermark-with-legacy-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 92cdb17d1a)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-03 13:48:53 +03:00
Wenchen Fan 8d817dcf30 [SPARK-36315][SQL] Only skip AQEShuffleReadRule in the final stage if it breaks the distribution requirement
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30494

This PR proposes a new way to optimize the final query stage in AQE. We first collect the effective user-specified repartition (semantic-wise, user-specified repartition is only effective if it's the root node or under a few simple nodes), and get the required distribution for the final plan. When we optimize the final query stage, we skip certain `AQEShuffleReadRule` if it breaks the required distribution.

### Why are the changes needed?

The current solution for optimizing the final query stage is pretty hacky and overkill. As an example, the newly added rule `OptimizeSkewInRebalancePartitions` can hardly apply as it's very common that the query plan has shuffles with origin `ENSURE_REQUIREMENTS`, which is not supported by `OptimizeSkewInRebalancePartitions`.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

updated tests

Closes #33541 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit dd80457ffb)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-03 18:29:07 +08:00
Wenchen Fan 7c586842d7 [SPARK-36380][SQL] Simplify the logical plan names for ALTER TABLE ... COLUMN
### What changes were proposed in this pull request?

This a followup of the recent work such as https://github.com/apache/spark/pull/33200

For `ALTER TABLE` commands, the logical plans do not have the common `AlterTable` prefix in the name and just use names like `SetTableLocation`. This PR proposes to follow the same naming rule in `ALTER TABLE ... COLUMN` commands.

This PR also moves these AlterTable commands to an individual file and gives them a base trait.

### Why are the changes needed?

name simplification

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #33609 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 7cb9c1c241)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-03 10:43:15 +03:00
Hyukjin Kwon 9eec11b956 [SPARK-36379][SQL] Null at root level of a JSON array should not fail w/ permissive mode
This PR proposes to fail properly so the JSON parser can proceed and parse the input with the permissive mode.
Previously, we passed `null`s through as is, the root `InternalRow`s became `null`s, and that caused the query to fail even with permissive mode on.
Now, we fail explicitly if `null` is passed when the input array contains `null`.

Note that this is consistent with non-array JSON input:

**Permissive mode:**

```scala
spark.read.json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
res0: Array[org.apache.spark.sql.Row] = Array([str], [null])
```

**Failfast mode**:

```scala
spark.read.option("mode", "failfast").json(Seq("""{"a": "str"}""", """null""").toDS).collect()
```
```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

To make the permissive mode proceed and parse without throwing an exception.

**Permissive mode:**

```scala
spark.read.json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

NOTE that this behaviour is consistent when JSON object is malformed:

```scala
spark.read.schema("a int").json(Seq("""[{"a": 123}, {123123}, {"a": 123}]""").toDS).collect()
```

```
res0: Array[org.apache.spark.sql.Row] = Array([null])
```

Since we're parsing _one_ JSON array, related records all fail together.

**Failfast mode:**

```scala
spark.read.option("mode", "failfast").json(Seq("""[{"a": "str"}, null]""").toDS).collect()
```

Before:

```
java.lang.NullPointerException
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:759)
```

After:

```
org.apache.spark.SparkException: Malformed records are detected in record parsing. Parse Mode: FAILFAST. To process malformed records as null result, try setting the option 'mode' as 'PERMISSIVE'.
	at org.apache.spark.sql.catalyst.util.FailureSafeParser.parse(FailureSafeParser.scala:70)
	at org.apache.spark.sql.DataFrameReader.$anonfun$json$7(DataFrameReader.scala:540)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
```

Manually tested, and a unit test was added.

Closes #33608 from HyukjinKwon/SPARK-36379.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 0bbcbc6508)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-02 10:02:09 -07:00
Angerszhuuuu ea559adc2e [SPARK-36086][SQL] CollapseProject project replace alias should use origin column name
### What changes were proposed in this pull request?
Without this patch, the added UT fails as below:
```
[info] - SHOW TABLES V2: SPARK-36086: CollapseProject project replace alias should use origin column name *** FAILED *** (4 seconds, 935 milliseconds)
[info]   java.lang.RuntimeException: After applying rule org.apache.spark.sql.catalyst.optimizer.CollapseProject in batch Operator Optimization before Inferring Filters, the structural integrity of the plan is broken.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.structuralIntegrityIsBrokenAfterApplyingRuleError(QueryExecutionErrors.scala:1217)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:229)
[info]   at scala.collection.LinearSeqOptimized.foldLeft(LinearSeqOptimized.scala:126)
[info]   at scala.collection.LinearSeqOptimized.foldLeft$(LinearSeqOptimized.scala:122)
[info]   at scala.collection.immutable.List.foldLeft(List.scala:91)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:208)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
[info]   at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
[info]   at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
```

When `CollapseProject` replaces aliases in a project, it should use the origin column name.
### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33576 from AngersZhuuuu/SPARK-36086.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f3173956cb)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-03 00:08:30 +08:00
Linhong Liu e26cb968bd [SPARK-36224][SQL] Use Void as the type name of NullType
### What changes were proposed in this pull request?
Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users, we have to choose a proper name
2. The "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL
3. Changing to "void" enables the round trip of `toDDL`/`fromDDL` for NullType (i.e., it makes `from_json(col, schema.toDDL)` work).
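A hedged sketch of the round trip that this naming enables (the rendered DDL string shown is approximate):

```scala
// Hedged sketch: with "void" as the type name, a schema containing a null-type column can
// be rendered to DDL and parsed back (exact DDL rendering is approximate).
import org.apache.spark.sql.types.StructType

val df = spark.sql("SELECT null AS a, 1 AS b")
val ddl = df.schema.toDDL              // roughly: "a VOID, b INT"
val parsed = StructType.fromDDL(ddl)   // parses back into a schema with a void column
```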

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". For example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2f700773c2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 23:20:11 +08:00
Terry Kim 87ae397897 [SPARK-36372][SQL] v2 ALTER TABLE ADD COLUMNS should check duplicates for the user specified columns
### What changes were proposed in this pull request?

Currently, v2 ALTER TABLE ADD COLUMNS does not check duplicates for the user specified columns. For example,
```
spark.sql(s"CREATE TABLE $t (id int) USING $v2Format")
spark.sql("ALTER TABLE $t ADD COLUMNS (data string, data string)")
```
doesn't fail the analysis, and it's up to the catalog implementation to handle it. For v1 command, the duplication is checked before invoking the catalog.

### Why are the changes needed?

To check the duplicate columns during analysis and be consistent with v1 command.

### Does this PR introduce _any_ user-facing change?

Yes, now the above command will print out the following:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the user specified columns: `data`
```

### How was this patch tested?

Added new unit tests

Closes #33600 from imback82/alter_add_duplicate_columns.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3b713e7f61)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 17:55:07 +08:00
Andy Grove f9f5656491 [SPARK-35881][SQL] Add support for columnar execution of final query stage in AdaptiveSparkPlanExec
### What changes were proposed in this pull request?

Changes in this PR:

- `AdaptiveSparkPlanExec` has new methods `finalPlanSupportsColumnar` and `doExecuteColumnar` to support adaptive queries where the final query stage produces columnar data.
- `SessionState` now has a new set of injectable rules named `finalQueryStagePrepRules` that can be applied to the final query stage.
- `AdaptiveSparkPlanExec` can now safely be wrapped by either `RowToColumnarExec` or `ColumnarToRowExec`.

A Spark plugin can use the new rules to remove the root `ColumnarToRowExec` transition that is inserted by previous rules and at execution time can call `finalPlanSupportsColumnar` to see if the final query stage is columnar. If the plan is columnar then the plugin can safely call `doExecuteColumnar`. The adaptive plan can be wrapped in either `RowToColumnarExec` or `ColumnarToRowExec` to force a particular output format. There are fast paths in both of these operators to avoid any redundant transitions.
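A hedged sketch of how a plugin might force a particular output format by wrapping the adaptive plan (the wrapping operators are existing physical operators; whether a plugin may call `finalPlanSupportsColumnar`/`doExecuteColumnar` directly is taken from the description above, not verified here):

```scala
// Hedged sketch: wrap the adaptive plan to force row-based or columnar output; the fast
// paths mentioned above avoid redundant transitions when the format already matches.
import org.apache.spark.sql.execution.{ColumnarToRowExec, RowToColumnarExec, SparkPlan}

def forceRowOutput(adaptivePlan: SparkPlan): SparkPlan = ColumnarToRowExec(adaptivePlan)

def forceColumnarOutput(adaptivePlan: SparkPlan): SparkPlan = RowToColumnarExec(adaptivePlan)
```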

### Why are the changes needed?

Without this change it is necessary to use reflection to get the final physical plan, to determine whether it is columnar, and to execute it as a columnar plan. `AdaptiveSparkPlanExec` only provides public methods for row-based execution.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I have manually tested this patch with the RAPIDS Accelerator for Apache Spark.

Closes #33140 from andygrove/support-columnar-adaptive.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
(cherry picked from commit 0f538402fb)
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-07-30 15:38:52 -05:00
Hyukjin Kwon fee87f13d1 [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side
### What changes were proposed in this pull request?

This PR proposes to implement `distributed-sequence` index in Scala side.

### Why are the changes needed?

- Avoid unnecessary (de)serialization
- Keep the nullability in the input DataFrame when `distributed-sequence` is enabled. During the serialization, all fields are being nullable for now (see https://github.com/apache/spark/pull/32775#discussion_r645882104)

### Does this PR introduce _any_ user-facing change?

No to end users since pandas API on Spark is not released yet.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(1).spark.print_schema()
```

Before:

```
root
 |-- id: long (nullable = true)
```

After:

```
root
 |-- id: long (nullable = false)
```

### How was this patch tested?

Manually tested, and existing tests should cover them.

Closes #33570 from HyukjinKwon/SPARK-36338.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c6140d4d0a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 22:29:31 +09:00
Wenchen Fan f6bb75b0bc [SPARK-34952][SQL][FOLLOWUP] Simplify JDBC aggregate pushdown
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33352 , to simplify the JDBC aggregate pushdown:
1. We should get the schema of the aggregate query by asking the JDBC server, instead of calculating it by ourselves. This can simplify the code a lot, and is also more robust: the data type of SUM may vary in different databases, it's fragile to assume they are always the same as Spark.
2. because of 1, now we can remove the `dataType` property from the public `Sum` expression.

This PR also contains some small improvements:
1. Spark should deduplicate the aggregate expressions before pushing them down.
2. Improve the `toString` of public aggregate expressions to make them more SQL-like.

### Why are the changes needed?

code and API simplification

### Does this PR introduce _any_ user-facing change?

this API is not released yet.

### How was this patch tested?

existing tests

Closes #33579 from cloud-fan/dsv2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 387a251a68)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-30 00:26:41 -07:00
Angerszhuuuu a96e9e197e [SPARK-34399][SQL][3.2] Add commit duration to SQL tab's graph node
### What changes were proposed in this pull request?
Since we have added a log about commit time, I think this is useful, and we can let users see it directly in the SQL tab's UI.

![image](https://user-images.githubusercontent.com/46485123/126647754-dc3ba83a-5391-427c-8a67-e6af46e82290.png)

### Why are the changes needed?
Let users directly know the commit duration.

### Does this PR introduce _any_ user-facing change?
Users can see the file commit duration in the SQL tab's SQL plan graph.

### How was this patch tested?
Manually tested.

Closes #33553 from AngersZhuuuu/SPARK-34399-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-30 12:30:20 +08:00
Chao Sun 8c203272de [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
### What changes were proposed in this pull request?

Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did a few refactoring to enable this.

### Why are the changes needed?

Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in package `org.apache.spark.sql.hive.execution` which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same place where the rule `PruneFileSourcePartitions` is at.

### Does this PR introduce _any_ user-facing change?

No, it's just test refactoring.

### How was this patch tested?

Using existing tests:
```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```
and
```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```

Closes #33564 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 0ece865ea4)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-29 17:18:33 -07:00
Yuanjian Li e8462a584c [SPARK-36347][SS] Upgrade the RocksDB version to 6.20.3
### What changes were proposed in this pull request?
As discussed in https://github.com/apache/spark/pull/32928/files#r654049392, after confirming the compatibility, we can use a newer RocksDB version for the state store implementation.

### Why are the changes needed?
For further ARM support and to leverage the bug fixes in the newer version.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33578 from xuanyuanking/SPARK-36347.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4cd5fa96d8)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-07-29 11:09:10 -07:00
Kousuke Saruta d247a6cd1a [SPARK-36323][SQL] Support ANSI interval literals for TimeWindow
### What changes were proposed in this pull request?

This PR proposes to support ANSI interval literals for `TimeWindow`.
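A hedged example of the supported usage (table/column names are hypothetical, and the exact accepted literal forms are an assumption based on this description):

```scala
// Hedged sketch: with this change, an ANSI interval literal can be used as the window
// duration for the window() function (previously string durations such as '10 minutes'
// were used).
spark.sql("""
  SELECT window(event_time, INTERVAL '10' MINUTE) AS w, count(*) AS cnt
  FROM events
  GROUP BY window(event_time, INTERVAL '10' MINUTE)
""")
```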

### Why are the changes needed?

Watermark already supports ANSI interval literals, so it's great to support them for `TimeWindow` too.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33551 from sarutak/window-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit db18866742)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-29 08:52:10 +03:00
Cheng Su 6d188cbb08 [SPARK-36272][SQL][TEST] Change shuffled hash join metrics test to check relative value of build size
### What changes were proposed in this pull request?

This is a follow up of https://github.com/apache/spark/pull/33447, where the unit test was disabled due to a failure after the memory setting changed. I found the root cause: after https://github.com/apache/spark/pull/33447, the Spark memory page byte size in the unit test changed from `67108864` to `33554432` [1]. So the shuffled hash join build size also changed accordingly due to the [memory page byte size change](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L457). Previously the unit test checked the exact value of the build size, so it no longer works. Here we change the unit test to verify the relative value of the build size instead, and it should work.

[1]: I printed out the memory page byte size explicitly in unit test - `org.apache.spark.SparkException: chengsu pageSizeBytes: 33554432!` in https://github.com/c21/spark/runs/3186680616?check_suite_focus=true .

### Why are the changes needed?

Make previously disabled unit test work.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Changed unit test itself.

Closes #33494 from c21/test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6a8dd3229a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-29 11:14:48 +09:00
Linhong Liu fa521c1506 [SPARK-36286][SQL] Block some invalid datetime string
### What changes were proposed in this pull request?
In PR #32959, we found some weird datetime strings that can be parsed. ([details](https://github.com/apache/spark/pull/32959#discussion_r665015489))
This PR blocks such invalid datetime strings.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
Yes, the strings below will have different results when cast to datetime.
```sql
select cast('12::' as timestamp); -- Before: 2021-07-07 12:00:00, After: NULL
select cast('T' as timestamp); -- Before: 2021-07-07 00:00:00, After: NULL
```

### How was this patch tested?
some new test cases

Closes #33490 from linhongliu-db/SPARK-35780-block-invalid-format.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ed0e351f05)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 09:17:01 +08:00
Venki Korukanti c236101d4c [SPARK-36236][SS] Additional metrics for RocksDB based state store implementation
### What changes were proposed in this pull request?

Proposes adding new metrics to `customMetrics` under the `stateOperators` in the `StreamingQueryProgress` event. These metrics help provide better visibility into the RocksDB based state store in streaming jobs. For full details of the metrics, refer to https://issues.apache.org/jira/browse/SPARK-36236.

### Why are the changes needed?

The current metrics available for the RocksDB state store do not provide observability into many operations, such as how much time is spent by RocksDB in compaction and what the cache hit ratio is. These metrics help compare performance differences in state store operations between slow and fast microbatches.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unittests

Closes #33455 from vkorukanti/rocksdb-metrics.

Authored-by: Venki Korukanti <venki.korukanti@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit eb4d1c0332)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-28 12:51:56 -07:00
dgd-contributor c5b0cb2d94 [SPARK-36229][SQL] conv() inconsistently handles invalid strings with more than 64 invalid characters and return wrong value on overflow
### What changes were proposed in this pull request?
1/ conv() has an inconsistency in behavior where the returned value is different above the 64-character threshold.

```
scala> spark.sql("select conv(repeat('?', 64), 10, 16)").show
+---------------------------+
|conv(repeat(?, 64), 10, 16)|
+---------------------------+
|                          0|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, 16)").show // which should be 0
+---------------------------+
|conv(repeat(?, 65), 10, 16)|
+---------------------------+
|           FFFFFFFFFFFFFFFF|
+---------------------------+

scala> spark.sql("select conv(repeat('?', 65), 10, -16)").show // which should be 0
+----------------------------+
|conv(repeat(?, 65), 10, -16)|
+----------------------------+
|                          -1|
+----------------------------+

scala> spark.sql("select conv(repeat('?', 64), 10, -16)").show
+----------------------------+
|conv(repeat(?, 64), 10, -16)|
+----------------------------+
|                           0|
+----------------------------+
```

2/ conv() should return a result equal to the max unsigned long value in base `toBase` when there is an overflow.

```
scala> spark.sql("select conv('aaaaaaa0aaaaaaa0a', 16, 10)").show // which should be 18446744073709551615

+-------------------------------+
|conv(aaaaaaa0aaaaaaa0a, 16, 10)|
+-------------------------------+
|           12297828695278266890|
+-------------------------------+
```

### Why are the changes needed?
Bug fix. This pull request aims to make the conv function behave similarly to the conv function in the MySQL database.
### Does this PR introduce _any_ user-facing change?
change in result of conv() function
### How was this patch tested?
add test

Closes #33459 from dgd-contributor/SPARK-36229_convInconsistencyBehaviorWithMoreThan64Characters.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e1c50ff779)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-29 00:19:19 +08:00
Angerszhuuuu b58170b192 [SPARK-36312][SQL][FOLLOWUP] Add back ParquetSchemaConverter.checkFieldNames
### What changes were proposed in this pull request?
Add back ParquetSchemaConverter.checkFieldNames()

### Why are the changes needed?
Fix code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #33552 from AngersZhuuuu/SPARK-36312-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f086c17b8e)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:38:36 +08:00
Angerszhuuuu 2f4f7936fd [SPARK-33865][SPARK-36202][SQL] When HiveDDL, we need check avro schema too
### What changes were proposed in this pull request?
Unify the schema check code of FileFormat and also check Avro schema field names in CREATE TABLE DDL.

### Why are the changes needed?
Refactor code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #33441 from AngersZhuuuu/SPARK-36202.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 86f44578e5)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:04:37 +08:00
Terry Kim cd6b303d0f [SPARK-36006][SQL] Migrate ALTER TABLE ... ADD/REPLACE COLUMNS commands to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate the following `ALTER TABLE ... ADD/REPLACE COLUMNS` commands to use `UnresolvedTable` as a `child` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900).

### Does this PR introduce _any_ user-facing change?

After this PR, the above `ALTER TABLE ... ADD/REPLACE COLUMNS` commands will have a consistent resolution behavior.

### How was this patch tested?

Updated existing tests.

Closes #33200 from imback82/alter_add_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 809b88a162)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 14:00:44 +08:00
Angerszhuuuu 3c441135bb [SPARK-36312][SQL] ParquetWriterSupport.setSchema should check inner field
### What changes were proposed in this pull request?
The last PR only added the inner field check for Hive DDL; this PR adds the check for the Parquet data source write API.

### Why are the changes needed?
Fail earlier, at analysis time, instead of at write time.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added Ut

Without this change, the UT failed as below:
```
[info] - SPARK-36312: ParquetWriteSupport should check inner field *** FAILED *** (8 seconds, 29 milliseconds)
[info]   Expected exception org.apache.spark.sql.AnalysisException to be thrown, but org.apache.spark.SparkException was thrown (HiveDDLSuite.scala:3035)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
[info]   at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471)
[info]   at org.scalatest.funsuite.AnyFunSuite.newAssertionFailedException(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Assertions.intercept(Assertions.scala:756)
[info]   at org.scalatest.Assertions.intercept$(Assertions.scala:746)
[info]   at org.scalatest.funsuite.AnyFunSuite.intercept(AnyFunSuite.scala:1563)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396(HiveDDLSuite.scala:3035)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$396$adapted(HiveDDLSuite.scala:3034)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath(SQLHelper.scala:69)
[info]   at org.apache.spark.sql.catalyst.plans.SQLHelper.withTempPath$(SQLHelper.scala:66)
[info]   at org.apache.spark.sql.QueryTest.withTempPath(QueryTest.scala:34)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$395(HiveDDLSuite.scala:3034)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withView(SQLTestUtils.scala:316)
[info]   at org.apache.spark.sql.test.SQLTestUtilsBase.withView$(SQLTestUtils.scala:314)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.withView(HiveDDLSuite.scala:396)
[info]   at org.apache.spark.sql.hive.execution.HiveDDLSuite.$anonfun$new$394(HiveDDLSuite.scala:3032)
[info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
[info]   at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
[info]   at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
[info]   at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:22)
[info]   at org.scalatest.Transformer.apply(Transformer.scala:20)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
[info]   at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
[info]   at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
[info]   at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
[info]   at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
[info]   at scala.collection.immutable.List.foreach(List.scala:431)
[info]   at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
[info]   at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
[info]   at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:269)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:268)
[info]   at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1563)
[info]   at org.scalatest.Suite.run(Suite.scala:1112)
[info]   at org.scalatest.Suite.run$(Suite.scala:1094)
[info]   at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1563)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:273)
[info]   at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:272)
[info]   at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:62)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:62)
[info]   at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
[info]   at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
[info]   at sbt.ForkMain$Run.lambda$runTest$1(ForkMain.java:413)
[info]   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
[info]   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
[info]   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
[info]   at java.lang.Thread.run(Thread.java:748)
[info]   Cause: org.apache.spark.SparkException: Job aborted.
[info]   at org.apache.spark.sql.errors.QueryExecutionErrors$.jobAbortedError(QueryExecutionErrors.scala:496)
[info]   at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:251)
[info]   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:186)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:113)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:111)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.executeCollect(commands.scala:125)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.$anonfun$applyOrElse$1(QueryExecution.scala:97)
[info]   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
[info]   at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
[info]   at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
[info]   at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
[info]   at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:97)
[info]   at org.apache.spark.sql.execution.QueryExecution$$anonfun$eagerlyExecuteCommands$1.applyOrElse(QueryExecution.scala:93)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDownWithPruning$1(TreeNode.scala:481)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:82)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDownWithPruning(TreeNode.scala:481)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.org$apache$spark$sql$catalyst$plans$logical$AnalysisHelper$$super$transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning(AnalysisHelper.scala:267)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDownWithPruning$(AnalysisHelper.scala:263)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDownWithPruning(LogicalPlan.scala:30)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:457)
[info]   at org.apache.spark.sql.execution.QueryExecution.eagerlyExecuteCommands(QueryExecution.scala:93)
[info]   at org.apache.spark.sql.execution.QueryExecution.commandExecuted$lzycompute(QueryExecution.scala:80)
[info]   at org.apache.spark.sql.execution.QueryExecution.commandExecuted(QueryExecution.scala:78)
[info]   at org.apache.spark.sql.execution.QueryExecution.assertCommandExecuted(QueryExecution.scala:115)
[info]   at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:848)
[info]   at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:382)
[info]   at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:355)
[info]   at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:239)
[info]   at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:781)
[in
```

Closes #33531 from AngersZhuuuu/SPARK-36312.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 59e0c25376)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:52:40 +08:00
Eugene Koifman c59e54fe0e [SPARK-35639][SQL] Add metrics about coalesced partitions to AQEShuffleRead in AQE
### What changes were proposed in this pull request?

AQEShuffleReadExec already reports "number of skewed partitions" and "number of skewed partition splits".
It would be useful to also report "number of coalesced partitions" and for ShuffleExchange to report "number of partitions".
This way it's clear what happened on the map side and on the reduce side.

![Metrics](https://user-images.githubusercontent.com/4297661/126729820-cf01b3fa-7bc4-44a5-8098-91689766a68a.png)

### Why are the changes needed?

Improves usability

### Does this PR introduce _any_ user-facing change?

Yes, it now provides more information about `AQEShuffleReadExec` operator behavior in the metrics system.

### How was this patch tested?

Existing tests

Closes #32776 from ekoifman/PRISM-91635-customshufflereader-sql-metrics.

Authored-by: Eugene Koifman <eugene.koifman@workday.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 41a16ebf11)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:50:04 +08:00
allisonwang-db 993ffafc3e [SPARK-36275][SQL] ResolveAggregateFunctions should works with nested fields
### What changes were proposed in this pull request?
This PR fixes an issue in `ResolveAggregateFunctions` where non-aggregated nested fields in ORDER BY and HAVING are not resolved correctly. This is because nested fields are resolved as aliases that fail to be semantically equal to any grouping/aggregate expressions.
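A hedged illustration of the affected query shape (table and nested column names are hypothetical):

```scala
// Hedged sketch: a non-aggregated nested field referenced in HAVING/ORDER BY, which
// previously failed to resolve against the grouping expressions.
spark.sql("""
  SELECT count(*) AS cnt
  FROM persons
  GROUP BY name.first
  HAVING name.first = 'Alice'
  ORDER BY name.first
""")
```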

### Why are the changes needed?
To fix an analyzer issue.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33498 from allisonwang-db/spark-36275-resolve-agg-func.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 23a6ffa5dc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 13:35:35 +08:00
allisonwang-db aea36aa977 [SPARK-36028][SQL][3.2] Allow Project to host outer references in scalar subqueries
This PR cherry picks https://github.com/apache/spark/pull/33235 to branch-3.2 to fix test failures introduced by https://github.com/apache/spark/pull/33284.

### What changes were proposed in this pull request?
This PR allows the `Project` node to host outer references in scalar subqueries when `decorrelateInnerQuery` is enabled. It is already supported by the new decorrelation framework and the `RewriteCorrelatedScalarSubquery` rule.

Note currently by default all correlated subqueries will be decorrelated, which is not necessarily the most optimal approach. Consider `SELECT (SELECT c1) FROM t`. This should be optimized as `SELECT c1 FROM t` instead of rewriting it as a left outer join. This will be done in a separate PR to optimize correlated scalar/lateral subqueries with OneRowRelation.

### Why are the changes needed?
To allow more types of correlated scalar subqueries.

### Does this PR introduce _any_ user-facing change?
Yes. This PR allows outer query column references in the SELECT clause of a correlated scalar subquery. For example:
```sql
SELECT (SELECT c1) FROM t;
```
Before this change:
```
org.apache.spark.sql.AnalysisException: Expressions referencing the outer query are not supported
outside of WHERE/HAVING clauses
```

After this change:
```
+------------------+
|scalarsubquery(c1)|
+------------------+
|0                 |
|1                 |
+------------------+
```

### How was this patch tested?
Added unit tests and SQL tests.

(cherry picked from commit ca348e50a4)
Signed-off-by: allisonwang-db <allison.wang@databricks.com>

Closes #33527 from allisonwang-db/spark-36028-3.2.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 12:54:15 +08:00
Huaxin Gao 33ef52e2c0 [SPARK-34952][SQL][FOLLOW-UP] DSv2 aggregate push down follow-up
### What changes were proposed in this pull request?
Update the Java doc and the JDBC data source doc, and address the follow-up comments.

### Why are the changes needed?
update doc and address follow up comments

### Does this PR introduce _any_ user-facing change?
Yes, add the new JDBC option `pushDownAggregate` in JDBC data source doc.

### How was this patch tested?
manually checked

Closes #33526 from huaxingao/aggPD_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c8dd97d456)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-28 12:52:58 +08:00
Liang-Chi Hsieh dcd37f9639 Revert "[SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package"
This reverts commit 634f96dde4.

Closes #33533 from viirya/revert-SPARK-36136.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 22ac98dcbf)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 19:11:51 +09:00
Linhong Liu 91b9de3d80 [SPARK-36241][SQL] Support creating tables with null column
### What changes were proposed in this pull request?
Previously we blocked creating tables with null-type columns to follow the Hive behavior in PR #28833.
In this PR, I propose to restore the previous behavior to support null-type columns in a table.

### Why are the changes needed?
For a complex query, it's possible to generate a column with null type. If this happens to the input query of
CTAS, the query will fail because Spark doesn't allow creating a table with a null type. From the user's perspective,
it's hard to figure out why the null-type column is produced in a complicated query and how to fix it, so removing
this constraint is more friendly to users.

### Does this PR introduce _any_ user-facing change?
Yes, this reverts the previous behavior change in #28833. For example, the command below will succeed after this PR:
```sql
CREATE TABLE t (col_1 void, col_2 int)
```

### How was this patch tested?
newly added and existing test cases

Closes #33488 from linhongliu-db/SPARK-36241-support-void-column.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e7e14dc0d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-27 17:32:16 +08:00
Wenchen Fan 14328e043d [SPARK-36247][SQL] Check string length for char/varchar and apply type coercion in UPDATE/MERGE command
### What changes were proposed in this pull request?

We added the char/varchar support in 3.1, but the string length check is only applied to INSERT, not UPDATE/MERGE. This PR fixes it. This PR also adds the missing type coercion for UPDATE/MERGE.

### Why are the changes needed?

complete the char/varchar support and make UPDATE/MERGE easier to use by doing type coercion.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

New UT. No built-in source supports UPDATE/MERGE, so an end-to-end test is not applicable here.

Closes #33468 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 068f8d434a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-27 13:57:26 +08:00
Chao Sun ae7b32a9e8 [SPARK-36136][SQL][TESTS] Refactor PruneFileSourcePartitionsSuite etc to a different package
### What changes were proposed in this pull request?

Move both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` to the package `org.apache.spark.sql.execution.datasources`. Did a few refactoring to enable this.

### Why are the changes needed?

Currently both `PruneFileSourcePartitionsSuite` and `PrunePartitionSuiteBase` are in package `org.apache.spark.sql.hive.execution` which doesn't look correct as these tests are not specific to Hive. Therefore, it's better to move them into `org.apache.spark.sql.execution.datasources`, the same place where the rule `PruneFileSourcePartitions` is at.

### Does this PR introduce _any_ user-facing change?

No, it's just test refactoring.

### How was this patch tested?

Using existing tests:
```
build/sbt "sql/testOnly *PruneFileSourcePartitionsSuite"
```
and
```
build/sbt "hive/testOnly *PruneHiveTablePartitionsSuite"
```

Closes #33350 from sunchao/SPARK-36136-partitions-suite.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 634f96dde4)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-26 13:04:06 -07:00
Hyukjin Kwon a77c9d6d17 [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
### What changes were proposed in this pull request?

This PR proposes to rename:

- Rename `*Reader`/`*reader` to `*Read`/`*read` for rules and execution plan (user-facing doc/config name remain untouched)
  - `*ShuffleReaderExec` ->`*ShuffleReadExec`
  - `isLocalReader` -> `isLocalRead`
  - ...
- Rename `CustomShuffle*` prefix to `AQEShuffle*`
- Rename `OptimizeLocalShuffleReader` rule to `OptimizeShuffleWithLocalRead`

### Why are the changes needed?

There are multiple problems in the current naming:

- `CustomShuffle*` -> `AQEShuffle*`
    it sounds like it is a pluggable API. However, this is actually only used by AQE.
- `OptimizeLocalShuffleReader` -> `OptimizeShuffleWithLocalRead`
    it is the name of a rule, but it can be misread as a reader, which is counterintuitive
- `*ReaderExec` -> `*ReadExec`
    "Reader execution" reads a bit odd. It would be better as "read execution" (like `ScanExec`, `ProjectExec` and `FilterExec`). I can't find a reason to name it after something that performs an action. See also the generated plans:

    Before:

    ```
    ...
    * HashAggregate (12)
       +- CustomShuffleReader (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ...
    ```

    After:

    ```
    ...
    * HashAggregate (12)
       +- AQEShuffleRead (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ..
    ```

### Does this PR introduce _any_ user-facing change?

No, internal refactoring.

### How was this patch tested?

Existing unittests should cover the changes.

Closes #33429 from HyukjinKwon/SPARK-36217.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6e3d404cec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 22:42:16 +08:00
Angerszhuuuu 07c7a6f739 [SPARK-34402][SQL] Group exception about data format schema
### What changes were proposed in this pull request?
Group the exceptions about data format schemas of the different formats (ORC/Parquet).

### Why are the changes needed?
To group related exceptions together.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #33296 from AngersZhuuuu/SPARK-34402.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit a63802f2c6)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 19:18:56 +08:00
Cheng Su f42cc10512 [SPARK-36269][SQL] Fix only set data columns to Hive column names config
### What changes were proposed in this pull request?

When reading Hive table, we set the Hive column id and column name configs (`hive.io.file.readcolumn.ids` and `hive.io.file.readcolumn.names`). We should set non-partition columns (data columns) for both configs, as Spark always [appends partition columns in its own Hive reader](https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/TableReader.scala#L240). The column id config has only non-partition columns, but column name config has both partition and non-partition columns. We should keep them to be consistent with only non-partition columns. This does not cause issue for public OSS Hive file format for now. But for customized internal Hive file format, it causes the issue as we are expecting these two configs to be same.
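
A minimal sketch of the intended invariant, using the two config keys named above; the helper function is hypothetical and not code from this PR:

```scala
import org.apache.hadoop.conf.Configuration

// Hedged sketch: both configs should carry only the non-partition (data) columns, since Spark
// appends partition columns itself in its own Hive reader.
def setReadColumns(conf: Configuration, dataColIds: Seq[Int], dataColNames: Seq[String]): Unit = {
  conf.set("hive.io.file.readcolumn.ids", dataColIds.mkString(","))
  conf.set("hive.io.file.readcolumn.names", dataColNames.mkString(","))
}
```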

### Why are the changes needed?

Fix the code logic to be more consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing Hive tests.

Closes #33489 from c21/hive-col.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e5616e32ee)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 18:48:26 +08:00
michaelzhang-db ec91818e14 [SPARK-36105][SQL] OptimizeLocalShuffleReader support reading data of multiple mappers in one task
### What changes were proposed in this pull request?
Added another partition spec to allow the OptimizeLocalShuffleReader rule to read data from multiple mappers if the parallelism is less than the number of mappers.

### Why are the changes needed?
Optimization to the OptimizeLocalShuffleReader rule

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #33310 from michaelzhang-db/supportDataFromMultipleMappers.

Authored-by: michaelzhang-db <michael.zhang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 094ae3708f)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 17:57:15 +08:00
Huaxin Gao b1f522cf97 [SPARK-34952][SQL] DSv2 Aggregate push down APIs
### What changes were proposed in this pull request?
Add interfaces and APIs to push down Aggregates to V2 Data Source

### Why are the changes needed?
improve performance

### Does this PR introduce _any_ user-facing change?
SQLConf.PARQUET_AGGREGATE_PUSHDOWN_ENABLED was added. If this is set to true, Aggregates are pushed down to Data Source.
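
A hedged usage sketch in spark-shell; the config key is assumed from the `PARQUET_AGGREGATE_PUSHDOWN_ENABLED` name above, and the path is a placeholder:

```scala
import org.apache.spark.sql.functions.max

// Hedged sketch: with the flag on, simple aggregates such as MAX/MIN/COUNT may be evaluated
// by the Parquet scan itself instead of after reading all rows.
spark.conf.set("spark.sql.parquet.aggregatePushdown", "true")  // assumed key for the new conf
spark.read.parquet("/tmp/events").agg(max("id")).explain()
```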

### How was this patch tested?
New tests were added to test aggregate push down in https://github.com/apache/spark/pull/32049. The original PR was split into two PRs; this PR doesn't contain new tests.

Closes #33352 from huaxingao/aggPushDownInterface.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c561ee6865)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 16:01:43 +08:00
Liang-Chi Hsieh a6418a3463 [SPARK-36270][BUILD] Change memory settings for enabling GA
### What changes were proposed in this pull request?

Trying to adjust build memory settings and serial execution to re-enable GA.

### Why are the changes needed?

GA tests have failed recently with return code 137. We need to adjust the build settings to make GA work.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

GA

Closes #33447 from viirya/test-ga.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fd36ed4550)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 19:11:09 +09:00
Gengliang Wang c5697d0f4a [SPARK-36257][SQL][3.2] Updated the version of TimestampNTZ related changes as 3.3.0
### What changes were proposed in this pull request?

As we decided to release TimestampNTZ type in Spark 3.3, we should update the versions of TimestampNTZ related changes as 3.3.0.

### Why are the changes needed?

Correct the versions in documentation/code comment.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33480 from gengliangwang/updateVersion3.2.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-22 18:21:28 +03:00
Kousuke Saruta 3ee9a0db3a [SPARK-35815][SQL] Allow delayThreshold for watermark to be represented as ANSI interval literals
### What changes were proposed in this pull request?

This PR extends the way to represent `delayThreshold` with ANSI interval literals for watermark.

### Why are the changes needed?

A `delayThreshold` is semantically an interval value, so it should be representable as an ANSI interval literal as well as in the conventional `1 second` form.

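A hedged spark-shell sketch; the ANSI interval form accepted after this change is assumed:

```scala
import spark.implicits._

// Hedged sketch: both the conventional string form and, after this change, an ANSI interval
// literal should be accepted as the watermark delay threshold.
val events = Seq(("a", java.sql.Timestamp.valueOf("2021-07-01 00:00:00"))).toDF("id", "eventTime")
events.withWatermark("eventTime", "1 second")            // conventional form
events.withWatermark("eventTime", "INTERVAL '1' SECOND") // ANSI interval literal (assumed)
```
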
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33456 from sarutak/delayThreshold-interval.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 07fa38e2c1)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-22 17:36:52 +03:00
Angerszhuuuu 4a6f7d6c82 [SPARK-36156][SQL] SCRIPT TRANSFORM ROW FORMAT DELIMITED should respect NULL DEFINED AS and default value should be \N
### What changes were proposed in this pull request?
SCRIPT TRANSFORM ROW FORMAT DELIMITED should respect `NULL DEFINED AS` and default value should be `\N`
![image](https://user-images.githubusercontent.com/46485123/125775377-611d4f06-f9e5-453a-990d-5a0018774f43.png)
![image](https://user-images.githubusercontent.com/46485123/125775387-6618bd0c-78d8-4457-bcc2-12dd70522946.png)

### Why are the changes needed?
Keep consistency with Hive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33363 from AngersZhuuuu/SPARK-36156.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit bb09bd2e2d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-22 17:28:52 +08:00
allisonwang-db 31bb9e04ad [SPARK-36063][SQL] Optimize OneRowRelation subqueries
### What changes were proposed in this pull request?
This PR adds optimization for scalar and lateral subqueries with OneRowRelation as leaf nodes. It inlines such subqueries before decorrelation to avoid rewriting them as left outer joins. It also introduces a flag to turn on/off this optimization: `spark.sql.optimizer.optimizeOneRowRelationSubquery` (default: True).

For example:
```sql
select (select c1) from t
```
Analyzed plan:
```
Project [scalar-subquery#17 [c1#18] AS scalarsubquery(c1)#22]
:  +- Project [outer(c1#18)]
:     +- OneRowRelation
+- LocalRelation [c1#18, c2#19]
```

Optimized plan before this PR:
```
Project [c1#18#25 AS scalarsubquery(c1)#22]
+- Join LeftOuter, (c1#24 <=> c1#18)
   :- LocalRelation [c1#18]
   +- Aggregate [c1#18], [c1#18 AS c1#18#25, c1#18 AS c1#24]
      +- LocalRelation [c1#18]
```

Optimized plan after this PR:
```
LocalRelation [scalarsubquery(c1)#22]
```
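
For reference, a hedged snippet toggling the flag named above (for example, to fall back to the old left-outer-join rewrite while debugging a plan):

```scala
// Hedged sketch: the optimization is on by default and can be disabled via the new flag.
spark.conf.set("spark.sql.optimizer.optimizeOneRowRelationSubquery", "false")
```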

### Why are the changes needed?
To optimize query plans.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new unit tests.

Closes #33284 from allisonwang-db/spark-36063-optimize-subquery-one-row-relation.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit de8e4be92c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-22 10:48:48 +08:00
Kousuke Saruta 468165ae52 [SPARK-36208][SQL][3.2] SparkScriptTransformation should support ANSI interval types
### What changes were proposed in this pull request?

This PR changes `BaseScriptTransformationExec` for `SparkScriptTransformationExec` to support ANSI interval types.

### Why are the changes needed?

`SparkScriptTransformationExec` supports `CalendarIntervalType`, so it's better to support ANSI interval types as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit f56c7b71ff)
Signed-off-by: Max Gekk <max.gekk@gmail.com>

Closes #33463 from MaxGekk/sarutak_script-transformation-interval-3.2.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-21 20:54:18 +03:00
Gengliang Wang 99eb3ff226 [SPARK-36227][SQL][3.2] Remove TimestampNTZ type support in Spark 3.2
### What changes were proposed in this pull request?

Remove TimestampNTZ type support in the production code of Spark 3.2.
To achieve this goal, this PR adds the check `Utils.isTesting` in the following code branches:
- keyword "timestamp_ntz" and "timestamp_ltz" in parser
- New expressions from https://issues.apache.org/jira/browse/SPARK-35662
- Using java.time.localDateTime as the external type for TimestampNTZType
- `SQLConf.timestampType` which determines the default timestamp type of Spark SQL.

This is to minimize the code difference from the master branch, so that future users won't think TimestampNTZ is already available in Spark 3.2.
The downside is that users can still find TimestampNTZType under the package `org.apache.spark.sql.types`. There should be nothing left other than this.

### Why are the changes needed?

As of now, there are some blockers for delivering the TimestampNTZ project in Spark 3.2:

- In the Hive Thrift server, both TimestampType and TimestampNTZType are mapped to the same timestamp type, which can cause confusion for users.
- For the Parquet data source, the new written TimestampNTZType Parquet columns will be read as TimestampType in old Spark releases. Also, we need to decide the merge schema for files mixed with TimestampType and TimestampNTZ type.
- The type coercion rules for TimestampNTZType are incomplete. For example, what should the data type of the IN clause `IN(Timestamp'2020-01-01 00:00:00', TimestampNtz'2020-01-01 00:00:00')` be?
- It is tricky to support TimestampNTZType in JSON/CSV data readers. We need to avoid regressions as much as possible.

There are 10 days left for the expected 3.2 RC date. So, I propose to **release the TimestampNTZ type in Spark 3.3 instead of Spark 3.2**. So that we have enough time to make considerate designs for the issues.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing Unit tests + manual tests from spark-shell to validate the changes are gone.
New functions
```
spark.sql("select to_timestamp_ntz'2021-01-01 00:00:00'").show()
spark.sql("select to_timestamp_ltz'2021-01-01 00:00:00'").show()
spark.sql("select make_timestamp_ntz(1,1,1,1,1,1)").show()
spark.sql("select make_timestamp_ltz(1,1,1,1,1,1)").show()
spark.sql("select localtimestamp()").show()
```
The SQL configuration `spark.sql.timestampType` should not work in 3.2
```
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
spark.sql("select make_timestamp(1,1,1,1,1,1)").schema
spark.sql("select to_timestamp('2021-01-01 00:00:00')").schema
spark.sql("select timestamp'2021-01-01 00:00:00'").schema
Seq((1, java.sql.Timestamp.valueOf("2021-01-01 00:00:00"))).toDF("i", "ts").write.partitionBy("ts").parquet("/tmp/test")
spark.read.parquet("/tmp/test").schema
```
LocalDateTime is not supported as a built-in external type:
```
Seq(LocalDateTime.now()).toDF()
org.apache.spark.sql.catalyst.expressions.Literal(java.time.LocalDateTime.now())
org.apache.spark.sql.catalyst.expressions.Literal(0L, TimestampNTZType)
```

Closes #33444 from gengliangwang/banNTZ.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-21 09:55:09 -07:00
Kent Yao 7d363733ac [SPARK-36213][SQL] Normalize PartitionSpec for Describe Table Command with PartitionSpec
### What changes were proposed in this pull request?

This fixes a case sensitivity issue for desc table commands with partition specified.

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

yes, but it's a bugfix

### How was this patch tested?

new tests

#### before
```
+-- !query
+DESC EXTENDED t PARTITION (C='Us', D=1)
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.sql.AnalysisException
+Partition spec is invalid. The spec (C, D) must match the partition spec (c, d) defined in table '`default`.`t`'
+
```

#### after

https://github.com/apache/spark/pull/33424/files#diff-554189c49950974a948f99fa9b7436f615052511660c6a0ae3062fa8ca0a327cR328

Closes #33424 from yaooqinn/SPARK-36213.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 4cd6cfc773)
Signed-off-by: Kent Yao <yao@apache.org>
2021-07-22 00:53:12 +08:00
Shardul Mahadik 1ce678b2aa [SPARK-28266][SQL] convertToLogicalRelation should not interpret path property when reading Hive tables
### What changes were proposed in this pull request?

For non-datasource Hive tables, e.g. tables written outside of Spark (through Hive or Trino), we have certain optimizations in Spark where we use Spark ORC and Parquet datasources to read these tables ([Ref](fbf53dee37/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala (L128))) rather than using the Hive serde.
If such a table contains a `path` property, Spark will try to list this path property in addition to the table location when creating an `InMemoryFileIndex`. ([Ref](fbf53dee37/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L575))) This can lead to wrong data if `path` property points to a directory location or an error if `path` is not a location. A concrete example is provided in [SPARK-28266 (comment)](https://issues.apache.org/jira/browse/SPARK-28266?focusedCommentId=17380170&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17380170).

Since these tables were not written through Spark, Spark should not interpret this `path` property as it can be set by an external system with a different meaning.

### Why are the changes needed?

For better compatibility with Hive tables generated by other platforms (non-Spark)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit test

Closes #33328 from shardulm94/spark-28266.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 685c3fd05b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-21 22:40:59 +08:00
Wenchen Fan f4291e373e [SPARK-36228][SQL] Skip splitting a skewed partition when some map outputs are removed
### What changes were proposed in this pull request?

Sometimes, AQE skew join optimization can fail with an NPE. This is because AQE tries to get the shuffle block sizes, but some map outputs are missing due to executor loss or similar reasons.

This PR fixes this bug by skipping skew join handling if some map outputs are missing in the `MapOutputTracker`.

### Why are the changes needed?

bug fix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

a new UT

Closes #33445 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 9c8a3d3975)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-21 22:18:14 +08:00
Wenchen Fan b5c0f6c774 [SPARK-36020][SQL][FOLLOWUP] RemoveRedundantProjects should retain the LOGICAL_PLAN_TAG tag
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/33222 .

https://github.com/apache/spark/pull/33222 made a mistake that, `RemoveRedundantProjects` may lose the `LOGICAL_PLAN_TAG` tag, even though the logical plan link is retained. This was actually caught by the test `LogicalPlanTagInSparkPlanSuite`, but was not being taken care of.

There is no problem so far, but losing information can always lead to potential bugs.

### Why are the changes needed?

fix a mistake

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #33442 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 94aece4325)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-21 14:03:22 +08:00
Rahul Mahadev 0d60cb51c0 [SPARK-36132][SS][SQL] Support initial state for batch mode of flatMapGroupsWithState
### What changes were proposed in this pull request?
Adding support for accepting an initial state with flatMapGroupsWithState in batch mode.
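
A hedged spark-shell sketch of the batch usage; the overload taking an `initialState` is assumed from SPARK-35897, and all names and types here are illustrative:

```scala
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}
import spark.implicits._

// Hedged sketch: a *batch* Dataset grouped by key and seeded with an initial per-key state.
// Before this change, this call on a batch Dataset threw UnsupportedOperationException.
val input = Seq(("k1", 1), ("k1", 2), ("k2", 5)).toDS()
val initialState = Seq(("k1", 100)).toDS().groupByKey(_._1).mapValues(_._2)

val result = input
  .groupByKey(_._1)
  .flatMapGroupsWithState(OutputMode.Update, GroupStateTimeout.NoTimeout, initialState) {
    (key: String, rows: Iterator[(String, Int)], state: GroupState[Int]) =>
      val sum = state.getOption.getOrElse(0) + rows.map(_._2).sum
      state.update(sum)
      Iterator((key, sum))
  }
result.show()  // expected: k1 -> 103 (100 initial + 1 + 2), k2 -> 5
```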

### Why are the changes needed?
SPARK-35897 added support for accepting an initial state for streaming queries using flatMapGroupsWithState. The code flow is separate for batch and streaming and required a different PR.

### Does this PR introduce _any_ user-facing change?

Yes. As discussed above, flatMapGroupsWithState in batch mode can accept an initialState; previously this would throw an UnsupportedOperationException.

### How was this patch tested?

Added relevant unit tests in FlatMapGroupsWithStateSuite and modified the tests in `JavaDatasetSuite`.

Closes #33336 from rahulsmahadev/flatMapGroupsWithStateBatch.

Authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
(cherry picked from commit efcce23b91)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2021-07-21 01:51:01 -04:00
Liang-Chi Hsieh 0b14ab12a2 [SPARK-36030][SQL][FOLLOW-UP][3.2] Remove duplicated test suite
### What changes were proposed in this pull request?

Removes `FileFormatDataWriterMetricSuite`, which is a duplicate.

### Why are the changes needed?

`FileFormatDataWriterMetricSuite` should have been renamed to `InMemoryTableMetricSuite`, but it was wrongly copied instead.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33454 from viirya/SPARK-36030-followup-3.2.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-20 22:29:57 -07:00
Hyukjin Kwon 6041d1c51b [SPARK-36030][SQL][FOLLOW-UP] Avoid procedure syntax deprecated in Scala 2.13
### What changes were proposed in this pull request?

This PR avoid using procedure syntax deprecated in Scala 2.13.

https://github.com/apache/spark/runs/3120481756?check_suite_focus=true

```
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileFormatDataWriterMetricSuite.scala:44:90: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testMetricOnDSv2`'s return type
[error]   private def testMetricOnDSv2(func: String => Unit, checker: Map[Long, String] => Unit) {
[error]                                                                                          ^
[error] /home/runner/work/spark/spark/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/InMemoryTableMetricSuite.scala:44:90: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `testMetricOnDSv2`'s return type
[error]   private def testMetricOnDSv2(func: String => Unit, checker: Map[Long, String] => Unit) {
[error]                                                                                          ^
[warn] 100 warnings found
[error] two errors found
[error] (sql / Test / compileIncremental) Compilation failed
[error] Total time: 579 s (09:39), completed Jul 21, 2021 4:14:26 AM
```

### Why are the changes needed?

To make the build compatible with Scala 2.13 in Spark.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested:

```bash
./dev/change-scala-version.sh 2.13
./build/mvn -DskipTests -Phive-2.3 -Phive clean package -Pscala-2.13
```

Closes #33452 from HyukjinKwon/SPARK-36030.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 99006e515b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-21 14:09:35 +09:00
Liang-Chi Hsieh 86d1fb4698 [SPARK-36030][SQL] Support DS v2 metrics at writing path
### What changes were proposed in this pull request?

We added the interface for DS v2 metrics in SPARK-34366, but only for the reading path. This patch extends the metrics interface to the writing path.

### Why are the changes needed?

Complete DS v2 metrics interface support in writing path.

### Does this PR introduce _any_ user-facing change?

No. For developer, yes, as this adds metrics support at DS v2 writing path.

### How was this patch tested?

Added test.

Closes #33239 from viirya/v2-write-metrics.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 2653201b0a)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-20 20:20:48 -07:00
gengjiaan 9a7c59c99c [SPARK-36222][SQL] Step by days in the Sequence expression for dates
### What changes were proposed in this pull request?
The current implementation of the `Sequence` expression does not support stepping by days for dates.
```
spark-sql> select sequence(date'2021-07-01', date'2021-07-10', interval '3' day);
Error in query: cannot resolve 'sequence(DATE '2021-07-01', DATE '2021-07-10', INTERVAL '3' DAY)' due to data type mismatch:
sequence uses the wrong parameter type. The parameter type must conform to:
1. The start and stop expressions must resolve to the same type.
2. If start and stop expressions resolve to the 'date' or 'timestamp' type
then the step expression must resolve to the 'interval' or
'interval year to month' or 'interval day to second' type,
otherwise to the same type as the start and stop expressions.
         ; line 1 pos 7;
'Project [unresolvedalias(sequence(2021-07-01, 2021-07-10, Some(INTERVAL '3' DAY), Some(Europe/Moscow)), None)]
+- OneRowRelation
```
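
After the change, the same query is expected to succeed; a hedged sketch with the result derived from the 3-day step shown as a comment:

```scala
// Hedged sketch: stepping a date sequence by a day-time interval of whole days.
spark.sql("SELECT sequence(date'2021-07-01', date'2021-07-10', interval '3' day)").show(false)
// expected (assumed): [2021-07-01, 2021-07-04, 2021-07-07, 2021-07-10]
```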

### Why are the changes needed?
A `DayTimeInterval` with day granularity should be allowed as the step for dates.

### Does this PR introduce _any_ user-facing change?
'Yes'.
The Sequence expression will support stepping by a day-granularity `DayTimeInterval` for dates.

### How was this patch tested?
New tests.

Closes #33439 from beliefer/SPARK-36222.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit c0d84e6cf1)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-20 19:17:09 +03:00
Koert Kuipers a864388b5a [SPARK-36210][SQL] Preserve column insertion order in Dataset.withColumns
### What changes were proposed in this pull request?
Preserve the insertion order of columns in Dataset.withColumns

### Why are the changes needed?
It is the expected behavior. We preserve insertion order in all other places.

### Does this PR introduce _any_ user-facing change?
No. Currently Dataset.withColumns is not actually used anywhere to insert more than one column. This change is to make sure it behaves as expected when it is used for that purpose in future.

### How was this patch tested?
Added test in DatasetSuite

Closes #33423 from koertkuipers/feat-withcolumns-preserve-order.

Authored-by: Koert Kuipers <koert@tresata.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit bf680bf25a)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-20 09:09:34 -07:00
Karen Feng f55f8820fc [SPARK-36079][SQL] Null-based filter estimate should always be in the range [0, 1]
### What changes were proposed in this pull request?

Forces the selectivity estimate for null-based filters to be in the range `[0,1]`.

### Why are the changes needed?

I noticed in a few TPC-DS query tests that the column statistic null count can be higher than the table statistic row count. In the current implementation, the selectivity estimate for `IsNotNull` is negative.
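
A minimal plain-Scala sketch of the clamping (not the actual Catalyst code):

```scala
// Hedged sketch: with nullCount > rowCount, 1 - nullCount/rowCount goes negative,
// so the estimate is clamped into [0, 1].
def isNotNullSelectivity(nullCount: BigInt, rowCount: BigInt): Double = {
  val raw = 1.0 - nullCount.toDouble / rowCount.toDouble
  math.max(0.0, math.min(1.0, raw))
}

isNotNullSelectivity(nullCount = 120, rowCount = 100)  // 0.0 instead of -0.2
```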

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #33286 from karenfeng/bound-selectivity-est.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ddc61e62b9)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-20 21:32:30 +08:00
gengjiaan 0f6cf8abe3 [SPARK-36046][SQL][FOLLOWUP] Implement prettyName for MakeTimestampNTZ and MakeTimestampLTZ
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/33299 and implements `prettyName` for `MakeTimestampNTZ` and `MakeTimestampLTZ` based on the discussion shown below:
https://github.com/apache/spark/pull/33299/files#r668423810

### Why are the changes needed?
This PR fixes the incorrect alias use case.

### Does this PR introduce _any_ user-facing change?
'No'.
Modifications are transparent to users.

### How was this patch tested?
Jenkins test.

Closes #33430 from beliefer/SPARK-36046-followup.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 033a5731b4)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-20 21:31:34 +08:00
Angerszhuuuu 7cd89efca5 [SPARK-36201][SQL][FOLLOWUP] Schema check should check inner field too
### What changes were proposed in this pull request?
When an inner field has a wrong schema, the field name check should cover inner field names too.
![image](https://user-images.githubusercontent.com/46485123/126101009-c192d87f-1e18-4355-ad53-1419dacdeb76.png)

### Why are the changes needed?
Check early, fail early.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #33409 from AngersZhuuuu/SPARK-36201.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 251885772d)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-20 21:08:36 +08:00
ulysses-you 677104f495 [SPARK-36221][SQL] Make sure CustomShuffleReaderExec has at least one partition
### What changes were proposed in this pull request?

* Add non-empty partition check in `CustomShuffleReaderExec`
* Make sure `OptimizeLocalShuffleReader` doesn't return empty partition

### Why are the changes needed?

Since SPARK-32083, AQE coalescing always returns at least one partition, so it is more robust to add a non-empty check in `CustomShuffleReaderExec`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed.

Closes #33431 from ulysses-you/non-empty-partition.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b70c25881c)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-20 20:48:51 +08:00
Kent Yao 782dc9a795 [SPARK-36179][SQL] Support TimestampNTZType in SparkGetColumnsOperation
### What changes were proposed in this pull request?

Support TimestampNTZType in SparkGetColumnsOperation

### Why are the changes needed?

TimestampNTZType coverage

### Does this PR introduce _any_ user-facing change?

Yes, JDBC end-users will be aware of TimestampNTZType.

### How was this patch tested?

add new test

Closes #33393 from yaooqinn/SPARK-36179.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 0c76fb9c01)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 09:49:25 +09:00
gengjiaan ab4c160880 [SPARK-36091][SQL] Support TimestampNTZ type in expression TimeWindow
### What changes were proposed in this pull request?
The current implementation of `TimeWindow` only supports `TimestampType`. Spark added a new type, `TimestampNTZType`, so we should support it in the `TimeWindow` expression.

### Why are the changes needed?
`TimestampNTZType` is similar to `TimestampType`, so we should support `TimestampNTZType` in the `TimeWindow` expression.

### Does this PR introduce _any_ user-facing change?
'Yes'.
`TimeWindow` will accept `TimestampNTZType`.

### How was this patch tested?
New tests.

Closes #33341 from beliefer/SPARK-36091.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 7aa01798c5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-19 19:24:01 +08:00
Angerszhuuuu 84a6fa22b3 [SPARK-36093][SQL] RemoveRedundantAliases should not change Command's parameter's expression's name
### What changes were proposed in this pull request?
RemoveRedundantAliases may change a DataWritingCommand parameter's attribute name.
In the UT's case, the partitionColumns value is `CAL_DT` before RemoveRedundantAliases; the rule changes it to `cal_dt`, which causes the wrong results shown below.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
For below SQL case
```
sql("create table t1(cal_dt date) using parquet")
sql("insert into t1 values (date'2021-06-27'),(date'2021-06-28'),(date'2021-06-29'),(date'2021-06-30')")
sql("create view t1_v as select * from t1")
sql("CREATE TABLE t2 USING PARQUET PARTITIONED BY (CAL_DT) AS SELECT 1 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-27' AND '2021-06-28'")
sql("INSERT INTO t2 SELECT 2 AS FLAG,CAL_DT FROM t1_v WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'")
```

Before this pr
```
sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
+----+------+
|FLAG|CAL_DT|
+----+------+
+----+------+
sql("SELECT * FROM t2 ").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   1|2021-06-27|
|   1|2021-06-28|
+----+----------+
```

After this pr
```
sql("SELECT * FROM t2 WHERE CAL_DT BETWEEN '2021-06-29' AND '2021-06-30'").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   2|2021-06-29|
|   2|2021-06-30|
+----+----------+
sql("SELECT * FROM t2 ").show
+----+----------+
|FLAG|    CAL_DT|
+----+----------+
|   1|2021-06-27|
|   1|2021-06-28|
|   2|2021-06-29|
|   2|2021-06-30|
+----+----------+
```

### How was this patch tested?
Added UT

Closes #33324 from AngersZhuuuu/SPARK-36093.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 313f3c5460)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-19 16:22:47 +08:00
Kent Yao d0c4d224e0 [SPARK-36197][SQL] Use PartitionDesc instead of TableDesc for reading hive partitioned tables
### What changes were proposed in this pull request?

A Hive partition can have `PartitionDesc`s that differ from the `TableDesc` in describing Serde/InputFormatClass/OutputFormatClass. For a Hive partitioned table, we shall respect those in `PartitionDesc`.

### Why are the changes needed?

In many cases, Spark reading Hive tables could produce surprising results because of this issue.

### Does this PR introduce _any_ user-facing change?

Yes, Hive partitioned tables whose partitions have different serde/input/output formats can now be recognized by Spark.

### How was this patch tested?

new test added

Closes #33406 from yaooqinn/SPARK-36197.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit ef80356614)
Signed-off-by: Kent Yao <yao@apache.org>
2021-07-19 16:00:12 +08:00
Wenchen Fan 5b98ec2527 [SPARK-36184][SQL] Use ValidateRequirements instead of EnsureRequirements to skip AQE rules that adds extra shuffles
### What changes were proposed in this pull request?

Currently, two AQE rules `OptimizeLocalShuffleReader` and `OptimizeSkewedJoin` run `EnsureRequirements` at the end to check if there are extra shuffles in the optimized plan and revert the optimization if extra shuffles are introduced.

This PR proposes to run `ValidateRequirements` instead, which is much simpler than `EnsureRequirements`. This PR also moves this check to `AdaptiveSparkPlanExec`, so that it's centralized instead of in each rule. After centralization, the batch name of optimizing the final stage is the same as normal stages, which makes more sense.

### Why are the changes needed?

`EnsureRequirements` is a big rule and even contains optimizations (remove unnecessary shuffles). `ValidateRequirements` is much faster to run and can avoid potential bugs as it has no optimization and is a pure check.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests.

Closes #33396 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8396a70ddc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-19 14:14:58 +08:00
gengjiaan 85f70a1181 [SPARK-36090][SQL] Support TimestampNTZType in expression Sequence
### What changes were proposed in this pull request?
The current implementation of `Sequence` accepts `TimestampType`, `DateType` and `IntegralType`. This PR will let `Sequence` accept `TimestampNTZType`.

### Why are the changes needed?
We can generate sequence for timestamp without time zone.

### Does this PR introduce _any_ user-facing change?
'Yes'.
This PR will let `Sequence` accept `TimestampNTZType`.

### How was this patch tested?
New tests.

Closes #33360 from beliefer/SPARK-36090.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 42275bb20d)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-18 20:46:37 +03:00
Kousuke Saruta f7ed6fc6c6 [SPARK-36170][SQL] Change quoted interval literal (interval constructor) to be converted to ANSI interval types
### What changes were proposed in this pull request?

This PR changes the behavior of the quoted interval literals like `SELECT INTERVAL '1 year 2 month'` to be converted to ANSI interval types.

### Why are the changes needed?

The unit-to-unit interval literals and the unit list interval literals are converted to ANSI interval types, but quoted interval literals are still converted to CalendarIntervalType.

```
-- Unit list interval literals
spark-sql> select interval 1 year 2 month;
1-2
-- Quoted interval literals
spark-sql> select interval '1 year 2 month';
1 years 2 months
```

### Does this PR introduce _any_ user-facing change?

Yes but the following sentence in `sql-migration-guide.md` seems to cover this change.
```
  - In Spark 3.2, the unit list interval literals can not mix year-month fields (YEAR and MONTH) and day-time fields (WEEK, DAY, ..., MICROSECOND).
For example, `INTERVAL 1 day 1 hour` is invalid in Spark 3.2. In Spark 3.1 and earlier,
there is no such limitation and the literal returns value of `CalendarIntervalType`.
To restore the behavior before Spark 3.2, you can set `spark.sql.legacy.interval.enabled` to `true`.
```

### How was this patch tested?

Modified existing tests and add new tests.

Closes #33380 from sarutak/fix-interval-constructor.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 71ea25d4f5)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-17 12:23:50 +03:00
Liang-Chi Hsieh 3d423b94a1 [SPARK-35785][SS][FOLLOWUP] Remove ignored test from RocksDBSuite
### What changes were proposed in this pull request?

This patch removes an ignored test from `RocksDBSuite`.

### Why are the changes needed?

The removed test is currently ignored, and the test itself doesn't seem to make sense. For example, the condition for capturing the exception is never matched. The test runs updates to RocksDB instances at the same remote dir with the same versions, which doesn't look like a case that will occur in practice.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33401 from viirya/remove-ignore-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 8009f0dd92)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-17 02:05:06 -07:00
Chao Sun a7c576ee14 [SPARK-36128][SQL] Apply spark.sql.hive.metastorePartitionPruning for non-Hive tables that uses Hive metastore for partition management
### What changes were proposed in this pull request?

In `CatalogFileIndex.filterPartitions`, check the config `spark.sql.hive.metastorePartitionPruning` and don't pushdown predicates to remote HMS if it is false. Instead, fallback to the `listPartitions` API and do the filtering on the client side.

### Why are the changes needed?

Currently the config `spark.sql.hive.metastorePartitionPruning` is only effective for Hive tables, and for non-Hive tables we'd always use the `listPartitionsByFilter` API from HMS client. On the other hand, by default all data source tables also manage their partitions through HMS, when the config `spark.sql.hive.manageFilesourcePartitions` is turned on. Therefore, it seems reasonable to extend the above config for non-Hive tables as well.

In certain cases the remote HMS service could throw exceptions when using the `listPartitionsByFilter` API, which, on the Spark side, is unrecoverable at the current state. Therefore it would be better to allow users to disable the API by using the above config.

For instance, HMS only allows pushing down date columns when direct SQL is used instead of JDO for interacting with the underlying RDBMS, and will throw an exception otherwise. Even though the Spark Hive client will attempt to recover itself when the exception happens, it only does so when the config `hive.metastore.try.direct.sql` from the remote HMS is `false`. There could be cases where the value of `hive.metastore.try.direct.sql` is true but the remote HMS still throws an exception.

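A hedged snippet using the config keys named above:

```scala
// Hedged sketch: turn off server-side partition pruning so Spark lists all partitions and
// filters them on the client side. With spark.sql.hive.manageFilesourcePartitions (default
// true), this now also applies to data source tables whose partitions are kept in the HMS.
spark.conf.set("spark.sql.hive.metastorePartitionPruning", "false")
```
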
### Does this PR introduce _any_ user-facing change?

Yes now the config `spark.sql.hive.metastorePartitionPruning` is extended for non-Hive tables which use HMS to manage their partition metadata.

### How was this patch tested?

Added a new unit test:
```
build/sbt "hive/testOnly *PruneFileSourcePartitionsSuite -- -z SPARK-36128"
```

Closes #33348 from sunchao/SPARK-36128-by-filter.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 37dc3f9ea7)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-16 13:32:45 -07:00
Jungtaek Lim 4bfcdf38cf [SPARK-34893][SS] Support session window natively
Introduction: this PR is the last part of SPARK-10816 (EventTime based sessionization (session window)). Please refer to #31937 to see the overall view of the code change. (Note that the code diff could have diverged a bit.)

### What changes were proposed in this pull request?

This PR proposes to support native session windows. Please refer to the comments/design doc in SPARK-10816 for more details on the rationale and design (it could be a bit outdated compared to the PR).

The boundary of a "session window" is defined as [the timestamp of the start event ~ the timestamp of the last event + gap duration). That said, unlike a time window, a session window is a dynamic window which can expand if a new input row is added to the session. To handle expansion of session windows, Spark defines a session window per input row and "merges" windows if they can be merged (their boundaries overlap).

This PR leverages two different approaches on merging session windows:

1. merging session windows with Spark's aggregation logic (a variant of sort aggregation)
2. updating session window for all rows bound to the same session, and applying aggregation logic afterwards

The first one is preferable as it outperforms the second, though it can only be used if merging session windows can be applied together with aggregation. It is not applicable in all cases, so the second one is used to cover the remaining cases.

This PR also applies an optimization for merging input rows and existing sessions while retaining the order (group keys + start timestamp of the session window), leveraging the fact that the number of existing sessions per group key won't be huge.

The state format is versioned, so that we can bring a new state format if we find a better one.

### Why are the changes needed?

For now, to deal with sessionization, Spark requires end users to play with (flat)MapGroupsWithState directly which has a couple of major drawbacks:

1. (flat)MapGroupsWithState is lower level API and end users have to code everything in details for defining session window and merging windows
2. built-in aggregate functions cannot be used and end users have to deal with aggregation by themselves
3. (flat)MapGroupsWithState is only available in Scala/Java.

With native support of session window, end users simply use "session_window" like they use "window" for tumbling/sliding window, and leverage built-in aggregate functions as well as UDAFs to simply define aggregations.

Quoting the query example from test suite:

```
    val inputData = MemoryStream[(String, Long)]

    // Split the lines into words, treat words as sessionId of events
    val events = inputData.toDF()
      .select($"_1".as("value"), $"_2".as("timestamp"))
      .withColumn("eventTime", $"timestamp".cast("timestamp"))
      .selectExpr("explode(split(value, ' ')) AS sessionId", "eventTime")
      .withWatermark("eventTime", "30 seconds")

    val sessionUpdates = events
      .groupBy(session_window($"eventTime", "10 seconds") as 'session, 'sessionId)
      .agg(count("*").as("numEvents"))
      .selectExpr("sessionId", "CAST(session.start AS LONG)", "CAST(session.end AS LONG)",
        "CAST(session.end AS LONG) - CAST(session.start AS LONG) AS durationMs",
        "numEvents")
```

which is same as StructuredSessionization (native session window is shorter and clearer even ignoring model classes).

39542bb81f/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala (L66-L105)

(Worth noting that the code in StructuredSessionization only works with processing time. The code doesn't consider that an old event can update the start time of an old session.)

### Does this PR introduce _any_ user-facing change?

Yes. This PR brings a new feature to support session windows on both batch and streaming queries, adding a new function "session_window" whose usage is similar to "window".

### How was this patch tested?

New test suites. Also tested with benchmark code.

Closes #33081 from HeartSaVioR/SPARK-34893-SPARK-10816-PR-31570-part-5.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit f2bf8b051b)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-07-16 20:38:35 +09:00
Ke Jia de3b8b996f [SPARK-35710][SQL] Support DPP + AQE when there is no reused broadcast exchange
### What changes were proposed in this pull request?
This PR adds DPP + AQE support for the case when Spark can't reuse the broadcast exchange but executing the DPP subquery is cheaper.

### Why are the changes needed?
Improve AQE + DPP

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new UT.

Closes #32861 from JkSelf/supportDPP3.

Lead-authored-by: Ke Jia <ke.a.jia@intel.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c1b3f86c58)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-16 16:01:23 +08:00
Steven Aerts 109247f02e [SPARK-35985][SQL] push partitionFilters for empty readDataSchema
This commit makes sure that, for File Source V2, partition filters are
also taken into account when the readDataSchema is empty.
This is the case for queries like:

    SELECT count(*) FROM tbl WHERE partition=foo
    SELECT input_file_name() FROM tbl WHERE partition=foo

### What changes were proposed in this pull request?

As described in SPARK-35985, there is a bug in the File Datasource V2 which prevents it from pushing partition filters down to the file scanner for queries like the ones listed above.

### Why are the changes needed?

If partition filters are not pushed down, the whole dataset will be scanned even though only one partition is of interest.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

An extra test was added which relies on the output of explain, as is done in other places.

Closes #33191 from steven-aerts/SPARK-35985.

Authored-by: Steven Aerts <steven.aerts@airties.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit f06aa4a3f3)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-16 04:53:23 +00:00
Max Gekk 57849a54da [SPARK-36034][SQL] Rebase datetime in pushed down filters to parquet
### What changes were proposed in this pull request?
In the PR, I propose to propagate either the SQL config `spark.sql.parquet.datetimeRebaseModeInRead` or/and Parquet option `datetimeRebaseMode` to `ParquetFilters`. The `ParquetFilters` class uses the settings in conversions of dates/timestamps instances from datasource filters to values pushed via `FilterApi` to the `parquet-column` lib.

Before the changes, date/timestamp values expressed as days/microseconds/milliseconds are interpreted as offsets in the Proleptic Gregorian calendar and pushed to the Parquet library as is. That works fine if timestamp/date values in Parquet files were saved in the `CORRECTED` mode, but in the `LEGACY` mode the filter values may not match the actual values.

After the changes, timestamp/date values of filters pushed down to the Parquet libs, such as `FilterApi.eq(col1, -719162)`, are rebased according to the rebase settings. For example, if the rebase mode is `CORRECTED`, **-719162** is pushed down as is, but if the current rebase mode is `LEGACY`, the number of days is rebased to **-719164**. For more context, the PR description https://github.com/apache/spark/pull/28067 shows the diffs between the two calendars.

### Why are the changes needed?
The changes fix the bug portrayed by the following example from SPARK-36034:
```scala
In [27]: spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
>>> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
>>> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show()
+----+
|date|
+----+
+----+
```
The result must have the date value `0001-01-01`.

### Does this PR introduce _any_ user-facing change?
In some sense, yes. Query results can be different in some cases. For the example above:
```scala
scala> spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
scala> spark.sql("SELECT DATE '0001-01-01' AS date").write.mode("overwrite").parquet("date_written_by_spark3_legacy")
scala> spark.read.parquet("date_written_by_spark3_legacy").where("date = '0001-01-01'").show(false)
+----------+
|date      |
+----------+
|0001-01-01|
+----------+
```

### How was this patch tested?
By running the modified test suite `ParquetFilterSuite`:
```
$ build/sbt "test:testOnly *ParquetV1FilterSuite"
$ build/sbt "test:testOnly *ParquetV2FilterSuite"
```

Closes #33347 from MaxGekk/fix-parquet-ts-filter-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit b09b7f7cc0)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 22:22:09 +03:00
Gengliang Wang fce19ab31a [SPARK-36135][SQL] Support TimestampNTZ type in file partitioning
### What changes were proposed in this pull request?

Support TimestampNTZ type in file partitioning
* When there is no provided schema and the default Timestamp type is TimestampNTZ, Spark should infer and parse the timestamp partition values as TimestampNTZ.
* When the provided Partition schema is TimestampNTZ, Spark should be able to parse the TimestampNTZ type partition column.

### Why are the changes needed?

File partitioning is an important feature and Spark should support TimestampNTZ type in it.

### Does this PR introduce _any_ user-facing change?

Yes, Spark supports TimestampNTZ type in file partitioning

### How was this patch tested?

Unit tests

Closes #33344 from gengliangwang/partition.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 96c2919988)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-16 01:13:49 +08:00
Yuming Wang 1ed72e2e8e [SPARK-32792][SQL][FOLLOWUP] Fix Parquet filter pushdown NOT IN predicate
### What changes were proposed in this pull request?

This PR fixes Parquet filter pushdown for the `NOT IN` predicate when the number of its values exceeds `spark.sql.parquet.pushdown.inFilterThreshold`. For example: `Not(In(a, Array(2, 3, 7)))`. We cannot push it down as `not(and(gteq(a, 2), lteq(a, 7)))`.

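A plain-Scala sketch (not the Parquet `FilterApi` code) of why negating the range rewrite is wrong:

```scala
// Hedged sketch: rewriting In(a, {2, 3, 7}) as the range [2, 7] over-approximates, which is
// fine for IN but not for NOT IN: a = 4 satisfies NOT IN (2, 3, 7) yet falls inside [2, 7],
// so not(and(gteq(a, 2), lteq(a, 7))) would wrongly drop that row.
def inAsRange(a: Int): Boolean = a >= 2 && a <= 7
def notInViaNegatedRange(a: Int): Boolean = !inAsRange(a)

val a = 4
assert(!Seq(2, 3, 7).contains(a))          // the row satisfies NOT IN (2, 3, 7)
assert(notInViaNegatedRange(a) == false)   // ...but the pushed-down filter rejects it
```
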
### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #33365 from wangyum/SPARK-32792-3.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 0062c03c15)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 18:52:06 +03:00
PengLei 1a3c56d3ea [SPARK-29519][SQL][FOLLOWUP] Keep output is deterministic for show tblproperties
### What changes were proposed in this pull request?
Keep the output order deterministic for `SHOW TBLPROPERTIES`.

### Why are the changes needed?
[#33343](https://github.com/apache/spark/pull/33343#issue-689828187).
Keeping the output order deterministic is meaningful.

Since the properties are sorted before comparing results in the test case for `SHOW TBLPROPERTIES`, the test does not fail, but ideally the output should be ordered and deterministic.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #33353 from Peng-Lei/order-ouput-properties.

Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e05441c223)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-15 21:44:27 +08:00
Kousuke Saruta 42710991e2 [SPARK-33898][SQL][FOLLOWUP] Fix the behavior of SHOW CREATE TABLE to output deterministic results
### What changes were proposed in this pull request?

This PR fixes a behavior of `SHOW CREATE TABLE` added in `SPARK-33898` (#32931) to output deterministic result.
A test `SPARK-33898: SHOW CREATE TABLE` in `DataSourceV2SQLSuite` compares two `CREATE TABLE` statements. One is generated by `SHOW CREATE TABLE` against a created table and the other is expected `CREATE TABLE` statement.

The created table has options `from` and `to`, and they are declared in this order.
```
CREATE TABLE $t (
  a bigint NOT NULL,
  b bigint,
  c bigint,
  `extra col` ARRAY<INT>,
  `<another>` STRUCT<x: INT, y: ARRAY<BOOLEAN>>
)
USING foo
OPTIONS (
  from = 0,
  to = 1)
COMMENT 'This is a comment'
TBLPROPERTIES ('prop1' = '1')
PARTITIONED BY (a)
LOCATION '/tmp'
```

And the expected `CREATE TABLE` in the test code is like as follows.
```
"CREATE TABLE testcat.ns1.ns2.tbl (",
"`a` BIGINT NOT NULL,",
"`b` BIGINT,",
"`c` BIGINT,",
"`extra col` ARRAY<INT>,",
"`<another>` STRUCT<`x`: INT, `y`: ARRAY<BOOLEAN>>)",
"USING foo",
"OPTIONS(",
"'from' = '0',",
"'to' = '1')",
"PARTITIONED BY (a)",
"COMMENT 'This is a comment'",
"LOCATION '/tmp'",
"TBLPROPERTIES(",
"'prop1' = '1')"
```
As you can see, the order of `from` and `to` is expected.
But options are implemented as a `Map`, so the order of keys cannot be kept.

In fact, this test fails with Scala 2.13.
```
[info] - SPARK-33898: SHOW CREATE TABLE *** FAILED *** (515 milliseconds)
[info]   Array("CREATE TABLE testcat.ns1.ns2.tbl (", "`a` BIGINT NOT NULL,", "`b` BIGINT,", "`c` BIGINT,", "`extra col` ARRAY<INT>,", "`<another>` STRUCT<`x`: INT, `y`: ARRAY<BOOLEAN>>)", "USING foo", "OPTIONS(", "'to' = '1',", "'from' = '0')", "PARTITIONED BY (a)", "COMMENT 'This is a comment'", "LOCATION '/tmp'", "TBLPROPERTIES(", "'prop1' = '1')") did not equal Array("CREATE TABLE testcat.ns1.ns2.tbl (", "`a` BIGINT NOT NULL,", "`b` BIGINT,", "`c` BIGINT,", "`extra col` ARRAY<INT>,", "`<another>` STRUCT<`x`: INT, `y`: ARRAY<BOOLEAN>>)", "USING foo", "OPTIONS(", "'from' = '0',", "'to' = '1')", "PARTITIONED BY (a)", "COMMENT 'This is a comment'", "LOCATION '/tmp'", "TBLPROPERTIES(", "'prop1' = '1')") (DataSourceV2SQLSuite.scala:1997)
```
In the current master, the test doesn't fail with Scala 2.12 but it's still non-deterministic.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that the modified test passed with both Scala 2.12 and Scala 2.13 with this change.

Closes #33343 from sarutak/fix-show-create-table-test.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f95ca31c0f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-15 20:53:32 +09:00
Gengliang Wang 75ff69a994 [SPARK-36037][TESTS][FOLLOWUP] Avoid wrong test results on daylight saving time
### What changes were proposed in this pull request?

Only use zone ids that have no daylight saving time for testing `localtimestamp`.

### Why are the changes needed?

In https://github.com/apache/spark/pull/33346#discussion_r670135296, MaxGekk suggests that we should avoid wrong results if possible.

### Does this PR introduce _any_ user-facing change?

No
### How was this patch tested?

Unit test

Closes #33354 from gengliangwang/FIxDST.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 564d3de7c6)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 11:41:04 +03:00
Gengliang Wang 83b37dfaf0 [SPARK-36037][SQL][FOLLOWUP] Fix flaky test for datetime function localtimestamp
### What changes were proposed in this pull request?

The threshold in the test case "datetime function localtimestamp" is too small, which leads to flaky test results:
https://github.com/gengliangwang/spark/runs/3067396143?check_suite_focus=true

This PR increases the threshold for checking the difference between the two current local datetimes from 5 ms to 1 second. (The test case for current_timestamp uses 5 seconds.)
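A rough sketch of the relaxed tolerance (illustrative names, not the suite's code): two observations of the current local time taken within one test run must differ by less than one second, instead of the previous 5 ms.

```scala
import java.time.{Duration, LocalDateTime}

val first  = LocalDateTime.now()
val second = LocalDateTime.now()
val toleranceMillis = 1000L  // raised from 5 ms to 1 s to avoid flakiness

assert(Duration.between(first, second).toMillis < toleranceMillis)
```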
### Why are the changes needed?

Fix flaky test
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #33346 from gengliangwang/fixFlaky.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 0973397721)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-07-15 11:32:35 +08:00
Karen Feng 8b35bc4d2b [SPARK-36106][SQL][CORE] Label error classes for subset of QueryCompilationErrors
### What changes were proposed in this pull request?

Adds error classes to some of the exceptions in QueryCompilationErrors.

### Why are the changes needed?

Improves auditing for developers and adds useful fields for users (error class and SQLSTATE).
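A purely illustrative sketch of the error-class idea (not Spark's actual API; the class name, SQLSTATE mapping, and helper are assumptions): each error gets a stable class name and a SQLSTATE, and the user-facing message is built from a parameterized template.

```scala
final case class ErrorInfo(sqlState: String, messageFormat: String)

// Hypothetical registry of error classes.
val errorClasses: Map[String, ErrorInfo] = Map(
  "MISSING_COLUMN" -> ErrorInfo("42703", "Column '%s' does not exist.")
)

def compilationError(errorClass: String, params: String*): Exception = {
  val info = errorClasses(errorClass)
  new IllegalArgumentException(
    s"[$errorClass] ${info.messageFormat.format(params: _*)} (SQLSTATE: ${info.sqlState})")
}

// compilationError("MISSING_COLUMN", "people.name")
```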

### Does this PR introduce _any_ user-facing change?

Yes, fills in missing error class and SQLSTATE fields.

### How was this patch tested?

Existing tests and new unit tests.

Closes #33309 from karenfeng/group-compilation-errors-1.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e92b8ea6f8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-15 11:43:32 +09:00
ulysses-you 0da71548a5 [SPARK-35639][SQL][FOLLOWUP] Make hasCoalescedPartition return true if something was actually coalesced
### What changes were proposed in this pull request?

Add a `CoalescedPartitionSpec(0, 0, _)` check when deciding whether a `CoalescedPartitionSpec` was actually coalesced.

### Why are the changes needed?

Fix corner case.

### Does this PR introduce _any_ user-facing change?

Yes, the UI may change.

### How was this patch tested?

Add test

Closes #33342 from ulysses-you/SPARK-35639-FOLLOW.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 3819641201)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 22:05:05 +08:00
Chao Sun 4cc7d9b8f1 [SPARK-36123][SQL] Parquet vectorized reader doesn't skip null values correctly
### What changes were proposed in this pull request?

Fix the value-skipping logic in the Parquet vectorized reader when the column index is effective, by considering nulls and only calling `ParquetVectorUpdater.skipValues` for non-null values.

### Why are the changes needed?

Currently, the Parquet vectorized reader may not work correctly if column index filtering is effective, and the data page contains null values. For instance, let's say we have two columns `c1: BIGINT` and `c2: STRING`, and the following pages:
```
   * c1        500       500       500       500
   *  |---------|---------|---------|---------|
   *  |-------|-----|-----|---|---|---|---|---|
   * c2     400   300   300 200 200 200 200 200
```

and suppose we have a query like the following:
```sql
SELECT * FROM t WHERE c1 = 500
```

this will create a Parquet row range `[500, 1000)` which, when applied to `c2`, requires us to skip all the rows in `[400, 500)`. However, the current logic skips rows via `updater.skipValues(n, valueReader)`, which is incorrect since it skips the next `n` non-null values. When nulls are present, this does not work correctly.
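A simplified sketch of the corrected idea (assumed names; the real reader is Java and works page by page on definition levels): when skipping `total` rows, only the rows whose definition level marks them as defined have a physically stored value, so only that many values may be skipped in the value reader.

```scala
def skipRows(total: Int, definitionLevels: Array[Int], maxDefinitionLevel: Int,
             skipValues: Int => Unit): Unit = {
  var nonNullCount = 0
  var i = 0
  while (i < total) {
    // A definition level below the max means the row is null and has no stored value.
    if (definitionLevels(i) == maxDefinitionLevel) nonNullCount += 1
    i += 1
  }
  skipValues(nonNullCount) // skip only the non-null values
}
```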

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a new test in `ParquetColumnIndexSuite`.

Closes #33330 from sunchao/SPARK-36123-skip-nulls.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit e980c7a840)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 18:14:33 +08:00
Linhong Liu c9813f74e9 [SPARK-35780][SQL] Support DATE/TIMESTAMP literals across the full range
### What changes were proposed in this pull request?
DATE/TIMESTAMP literals support years 0000 to 9999. However, internally we support a range that is much larger.
We can add or subtract large intervals from a date/timestamp and the system will happily process and display large negative and positive dates.

Since we obviously cannot put this genie back into the bottle, the only thing we can do is allow matching DATE/TIMESTAMP literals.

### Why are the changes needed?
Makes Spark more usable and fixes a bug.

### Does this PR introduce _any_ user-facing change?
Yes, after this PR, the SQL below will return different results:
```sql
select cast('-10000-1-2' as date) as date_col
-- before PR: NULL
-- after PR: -10000-1-2
```

```sql
select cast('2021-4294967297-11' as date) as date_col
-- before PR: 2021-01-11
-- after PR: NULL
```

### How was this patch tested?
newly added test cases

Closes #32959 from linhongliu-db/SPARK-35780-full-range-datetime.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b86645776b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 18:11:53 +08:00
Jungtaek Lim dcee7a65fd [SPARK-34892][SS] Introduce MergingSortWithSessionWindowStateIterator sorting input rows and rows in state efficiently
Introduction: this PR is part of SPARK-10816 (EventTime based sessionization (session window)). Please refer to #31937 for an overall view of the code change. (Note that the code diff could diverge a bit.)

### What changes were proposed in this pull request?

This PR introduces MergingSortWithSessionWindowStateIterator, which does "merge sort" between input rows and sessions in state based on group key and session's start time.

Note that the iterator performs the merge sort among input rows and sessions grouped by the grouping key. The iterator doesn't provide sessions in state whose keys don't exist in the input rows. For input rows, the iterator provides all rows regardless of whether matching sessions exist in state.

MergingSortWithSessionWindowStateIterator works on the precondition that the given iterator is sorted by "group keys + start time of session window", and the resulting iterator retains that sort characteristic.
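A hand-rolled sketch of the merge under simplifying assumptions (none of these names come from the actual Spark iterator): both inputs are already sorted by (group key, session start) and are merged lazily; the real iterator additionally skips state sessions whose keys never appear in the input.

```scala
case class Session(groupKey: String, sessionStart: Long)

def mergeSorted(input: Iterator[Session], state: Iterator[Session]): Iterator[Session] = {
  val a = input.buffered
  val b = state.buffered
  val ord = Ordering[(String, Long)]
  new Iterator[Session] {
    def hasNext: Boolean = a.hasNext || b.hasNext
    def next(): Session =
      if (!b.hasNext) a.next()
      else if (!a.hasNext) b.next()
      else if (ord.lteq((a.head.groupKey, a.head.sessionStart),
                        (b.head.groupKey, b.head.sessionStart))) a.next()
      else b.next()
  }
}
```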

### Why are the changes needed?

This part is one of the required pieces for implementing SPARK-10816.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UT added.

Closes #33077 from HeartSaVioR/SPARK-34892-SPARK-10816-PR-31570-part-4.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 12a576f175)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-07-14 18:48:05 +09:00
Fu Chen 5bc06fd7d9 [SPARK-36130][SQL] UnwrapCastInBinaryComparison should skip In expression when in.list contains an expression that is not literal
### What changes were proposed in this pull request?

Fix [comment](https://github.com/apache/spark/pull/32488#issuecomment-879315179).
This PR fixes a bug in the `UnwrapCastInBinaryComparison` rule: the rule should skip an `In` expression when `in.list` contains an expression that is not a literal (a simplified sketch of the guard follows the examples below).

- In

Before this PR, the following example would throw an exception.
```scala
  withTable("tbl") {
    sql("CREATE TABLE tbl (d decimal(33, 27)) USING PARQUET")
    sql("SELECT d FROM tbl WHERE d NOT IN (d + 1)")
  }
```
- InSet

The analyzer guarantees that all elements in `inSet.hset` are literals, so this is not an issue for `InSet`.

fbf53dee37/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (L264-L279)
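An illustrative-only guard with simplified Catalyst-like shapes (not the real rule's code): the rewrite must bail out on an `In` expression unless every element of its value list is a literal, since elements like `d + 1` cannot be safely unwrapped.

```scala
sealed trait Expr
case class Literal(value: Any) extends Expr
case class Attribute(name: String) extends Expr
case class Add(left: Expr, right: Expr) extends Expr
case class In(value: Expr, list: Seq[Expr]) extends Expr

def canUnwrapCastIn(in: In): Boolean = in.list.forall(_.isInstanceOf[Literal])

// canUnwrapCastIn(In(Attribute("d"), Seq(Add(Attribute("d"), Literal(1))))) == false
```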

### Does this PR introduce _any_ user-facing change?

No, only bug fix.

### How was this patch tested?

New test.

Closes #33335 from cfmcgrady/SPARK-36130.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 103d16e868)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 15:57:23 +08:00
Eugene Koifman 78796349d9 [SPARK-35639][SQL] Make hasCoalescedPartition return true if something was actually coalesced
### What changes were proposed in this pull request?
Fix `CustomShuffleReaderExec.hasCoalescedPartition` so that it returns true only if some original partitions got combined

### Why are the changes needed?
Without this change, the `CustomShuffleReaderExec` description can report `coalesced` even though the partitions are unchanged.

### Does this PR introduce _any_ user-facing change?
Yes, the `Arguments` in the node description is now accurate:
```
(16) CustomShuffleReader
Input [3]: [registration#4, sum#85, count#86L]
Arguments: coalesced
```

### How was this patch tested?
Existing tests

Closes #32872 from ekoifman/PRISM-77023-fix-hasCoalescedPartition.

Authored-by: Eugene Koifman <eugene.koifman@workday.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 4033b2a3f4)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 15:48:24 +08:00
gengjiaan 1686cff9a1 [SPARK-36037][SQL] Support ANSI SQL LOCALTIMESTAMP datetime value function
### What changes were proposed in this pull request?
`LOCALTIMESTAMP()` is a datetime value function from ANSI SQL.
The syntax is shown below:
```
<datetime value function> ::=
    <current date value function>
  | <current time value function>
  | <current timestamp value function>
  | <current local time value function>
  | <current local timestamp value function>
<current date value function> ::=
CURRENT_DATE
<current time value function> ::=
CURRENT_TIME [ <left paren> <time precision> <right paren> ]
<current local time value function> ::=
LOCALTIME [ <left paren> <time precision> <right paren> ]
<current timestamp value function> ::=
CURRENT_TIMESTAMP [ <left paren> <timestamp precision> <right paren> ]
<current local timestamp value function> ::=
LOCALTIMESTAMP [ <left paren> <timestamp precision> <right paren> ]
```

`LOCALTIMESTAMP()` returns the current timestamp at the start of query evaluation as TIMESTAMP WITHOUT TIME ZONE. This is similar to `CURRENT_TIMESTAMP()`.
Note we need to update the optimization rule `ComputeCurrentTime` so that Spark returns the same result in a single query if the function is called multiple times.
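A small usage sketch (assumes a running SparkSession named `spark`): because `ComputeCurrentTime` evaluates the function once at the start of query evaluation, both columns below should hold the same timestamp-without-time-zone value.

```scala
val df = spark.sql("SELECT LOCALTIMESTAMP() AS t1, LOCALTIMESTAMP() AS t2")
df.show(truncate = false)  // t1 and t2 contain the same value within one query
```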

### Why are the changes needed?
`CURRENT_TIMESTAMP()` returns the current timestamp at the start of query evaluation.
`LOCALTIMESTAMP()` returns the current timestamp without time zone at the start of query evaluation.
The `LOCALTIMESTAMP` function is part of ANSI SQL and is very useful.

### Does this PR introduce _any_ user-facing change?
Yes. It adds support for the new function `LOCALTIMESTAMP()`.

### How was this patch tested?
New tests.

Closes #33258 from beliefer/SPARK-36037.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit b4f7758944)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-14 15:39:02 +08:00
Chao Sun b4355608e0 [SPARK-36131][SQL][TEST] Refactor ParquetColumnIndexSuite
### What changes were proposed in this pull request?

Refactor `ParquetColumnIndexSuite` and allow better code reuse.

### Why are the changes needed?

A few methods in the test suite can share the same utility method `checkUnalignedPages` so it's better to do that and remove code duplication.

Additionally, `parquet.enable.dictionary` is now tested with both `true` and `false`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33334 from sunchao/SPARK-35743-test-refactoring.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 7a7b086534)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-13 22:50:07 -07:00
Jungtaek Lim fa8c37acb1 [SPARK-34891][SS] Introduce state store manager for session window in streaming query
Introduction: this PR is part of SPARK-10816 (`EventTime based sessionization (session window)`). Please refer to #31937 for an overall view of the code change. (Note that the code diff could diverge a bit.)

### What changes were proposed in this pull request?

This PR introduces a state store manager for session windows in streaming queries. Session windows in batch queries don't need to leverage the state store manager.

This PR ensures versioning of the state format for the state store manager, so that we can apply further optimizations in later Spark releases. StreamingSessionWindowStateManager is a trait defining the methods available in the session window state store manager; its subclasses implement the trait per format version.

The format of version 1 leverages the new feature of "prefix match scan" to represent the session windows:

* full key : [ group keys, start time in session window ]
* prefix key : [ group keys ]
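A simplified sketch of this version-1 key layout (names are assumed, not Spark's): the full key appends the session start time to the grouping keys, so a prefix match scan on the grouping keys alone returns every session window stored for that group.

```scala
case class SessionStateKey(groupKeys: Seq[Any], sessionStart: Long)

// Prefix used for "prefix match scan": only the grouping keys.
def prefixKey(key: SessionStateKey): Seq[Any] = key.groupKeys

// Emulate a prefix scan over an in-memory collection of keys.
def sessionsForGroup(keys: Seq[SessionStateKey], group: Seq[Any]): Seq[SessionStateKey] =
  keys.filter(_.groupKeys == group).sortBy(_.sessionStart)
```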

### Why are the changes needed?

This part is one of the required pieces for implementing SPARK-10816.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test suite added

Closes #31989 from HeartSaVioR/SPARK-34891-SPARK-10816-PR-31570-part-3.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 0fe2d809d6)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-07-13 08:58:42 -07:00
Gengliang Wang 3ace01b25b [SPARK-36120][SQL] Support TimestampNTZ type in cache table
### What changes were proposed in this pull request?

Support TimestampNTZ type columns in the SQL command `CACHE TABLE`.

### Why are the changes needed?

Cache table should support the new timestamp type.

### Does this PR introduce _any_ user-facing change?

Yes, TimestampNTZ type columns can be used in `CACHE TABLE`.
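A minimal usage sketch (assumes a SparkSession `spark`; the table names are made up): a TimestampNTZ column produced by `localtimestamp()` can now be cached.

```scala
spark.sql("CREATE TABLE t_ntz USING parquet AS SELECT localtimestamp() AS ts")
spark.sql("CACHE TABLE cached_ntz AS SELECT ts FROM t_ntz")
spark.sql("SELECT * FROM cached_ntz").show()
```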

### How was this patch tested?

Unit test

Closes #33322 from gengliangwang/cacheTable.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 067432705f)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-13 17:24:03 +03:00
Wenchen Fan a0f61ccfe4 [SPARK-36033][SQL][TEST] Validate partitioning requirements in TPCDS tests
### What changes were proposed in this pull request?

Make sure all physical plans of TPCDS queries are valid (satisfy the partitioning requirement).
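A rough sketch of the kind of validation being added (assumed shape, not the test's exact code): every operator's children must satisfy the distribution that the operator requires of them.

```scala
import org.apache.spark.sql.execution.SparkPlan

def assertPartitioningValid(plan: SparkPlan): Unit = {
  plan.foreach { node =>
    node.children.zip(node.requiredChildDistribution).foreach {
      case (child, required) =>
        assert(child.outputPartitioning.satisfies(required),
          s"${child.nodeName} does not satisfy $required required by ${node.nodeName}")
    }
  }
}
```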

### Why are the changes needed?

improve test coverage

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33248 from cloud-fan/aqe2.

Lead-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 583173b7cc)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-13 21:17:25 +08:00
Wenchen Fan 017b7d3f0b [SPARK-36074][SQL] Add error class for StructType.findNestedField
### What changes were proposed in this pull request?

This PR adds an INVALID_FIELD_NAME error class for the errors in `StructType.findNestedField`. It also cleans up the code there and adds UT for this method.

### Why are the changes needed?

follow the new error message framework

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33282 from cloud-fan/error.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-13 21:15:00 +08:00
Max Gekk b4692949f8 [SPARK-35735][SQL][FOLLOWUP] Remove unused method IntervalUtils.checkIntervalStringDataType()
### What changes were proposed in this pull request?
Remove the private method `checkIntervalStringDataType()` from `IntervalUtils` since it is no longer used after https://github.com/apache/spark/pull/33242.

### Why are the changes needed?
To improve code maintenance.

### Does this PR introduce _any_ user-facing change?
No. The method is private and existed in the code base only for a short time.

### How was this patch tested?
By existing GAs/tests.

Closes #33321 from MaxGekk/SPARK-35735-remove-unused-method.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 1ba3982d16)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-13 15:11:32 +03:00
Kousuke Saruta 1f8e72f9b1 [SPARK-35749][SPARK-35773][SQL] Parse unit list interval literals as tightest year-month/day-time interval types
### What changes were proposed in this pull request?

This PR allows the parser to parse unit list interval literals like `'3' day '10' hours '3' seconds` or `'8' years '3' months` as `DayTimeIntervalType` or `YearMonthIntervalType`, respectively.
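A usage sketch (assumes a SparkSession `spark`; follows the literal form shown above): unit list interval literals now resolve to the tightest ANSI interval types instead of `CalendarIntervalType`.

```scala
spark.sql("SELECT INTERVAL '3' DAY '10' HOURS '3' SECONDS AS dt").printSchema()
// dt resolves to a day-time interval (DAY TO SECOND)

spark.sql("SELECT INTERVAL '8' YEARS '3' MONTHS AS ym").printSchema()
// ym resolves to a year-month interval (YEAR TO MONTH)
```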

### Why are the changes needed?

For ANSI compliance.

### Does this PR introduce _any_ user-facing change?

Yes. I noted the following things in the `sql-migration-guide.md`.

* Unit list interval literals are parsed as `YearMonthIntervalType` or `DayTimeIntervalType` instead of `CalendarIntervalType`.
* `WEEK`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND` are not valid units for unit list interval literals.
* Units of year-month and day-time cannot be mixed like `1 YEAR 2 MINUTES`.

### How was this patch tested?

New tests and modified tests.

Closes #32949 from sarutak/day-time-multi-units.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e92ef825a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-13 18:55:22 +08:00