ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Wenchen Fan	f7995c576a	Revert "[SPARK-32677][SQL] Load function resource before create" This reverts commit `05fcf26b79`.	2020-09-09 18:15:22 +00:00
Tathagata Das	e4237bbda6	[SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources ### What changes were proposed in this pull request? Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same. ### Why are the changes needed? The Structured Streaming micro-batch engine has a contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of circumstances violates this contract. It occurs only when - The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append mode, mapGroupsWithState with timeouts). - These queries can execute a batch even without new data when the previous batch updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches. - The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch. This occurs because no-data-batches have the same start and end offsets, and when a batch is executed with identical start and end offsets, the call to `source.getBatch` is skipped because the generated plan is assumed to be empty. This only affects V1 data sources like Delta and Autoloader, which rely on this invariant to detect in the source whether the query is being started from scratch or restarted. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New unit test with a mock V1 source that fails without the fix. Closes #29651 from tdas/SPARK-32794. Authored-by: Tathagata Das <tathagata.das1565@gmail.com> Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>	2020-09-09 13:35:51 -04:00
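The skipped call described in SPARK-32794 can be pictured with a toy model. This is only a sketch under stated assumptions: `MockSource`, `runBatch`, and the `alwaysCall` flag are hypothetical names invented for illustration and are not Spark's `Source` or `MicroBatchExecution` APIs; offsets are simplified to plain `Long` values.

```scala
object NoDataBatchSketch {
  // Hypothetical stand-in for a V1 source; it only counts getBatch calls.
  final class MockSource {
    var getBatchCalls: Int = 0

    def getBatch(start: Long, end: Long): List[Long] = {
      getBatchCalls += 1
      (start until end).toList // empty when start == end
    }
  }

  // alwaysCall = false models the behaviour described as the bug:
  // a batch whose start and end offsets are equal never reaches the source.
  // alwaysCall = true models the fix: the source is always called.
  def runBatch(source: MockSource, start: Long, end: Long, alwaysCall: Boolean): List[Long] =
    if (start == end && !alwaysCall) Nil
    else source.getBatch(start, end)
}
```

In this model the returned data is empty either way; the only observable difference is whether the source sees the call, which is the invariant the commit message says V1 sources rely on.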
yangjie01	513d51a2c5	[SPARK-32808][SQL] Fix some test cases of `sql/core` module in scala 2.13 ### What changes were proposed in this pull request? The purpose of this pr is to partial resolve [SPARK-32808](https://issues.apache.org/jira/browse/SPARK-32808), total of 26 failed test cases were fixed, the related suite as follow: - `StreamingAggregationSuite` related test cases (2 FAILED -> Pass) - `GeneratorFunctionSuite` related test cases (2 FAILED -> Pass) - `UDFSuite` related test cases (2 FAILED -> Pass) - `SQLQueryTestSuite` related test cases (5 FAILED -> Pass) - `WholeStageCodegenSuite` related test cases (1 FAILED -> Pass) - `DataFrameSuite` related test cases (3 FAILED -> Pass) - `OrcV1QuerySuite\OrcV2QuerySuite` related test cases (4 FAILED -> Pass) - `ExpressionsSchemaSuite` related test cases (1 FAILED -> Pass) - `DataFrameStatSuite` related test cases (1 FAILED -> Pass) - `JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite` related test cases (6 FAILED -> Pass) The main change of this pr as following: - Fix Scala 2.13 compilation problems in `ShuffleBlockFetcherIterator` and `Analyzer` - Specified `Seq` to `scala.collection.Seq` in `objects.scala` and `GenericArrayData` because internal use `Seq` maybe `mutable.ArraySeq` and not easy to call `.toSeq` - Should specified `Seq` to `scala.collection.Seq` when we call `Row.getAs[Seq]` and `Row.get(i).asInstanceOf[Seq]` because the data maybe `mutable.ArraySeq` but `Seq` is `immutable.Seq` in Scala 2.13 - Use a compatible way to let `+` and `-` method of `Decimal` having the same behavior in Scala 2.12 and Scala 2.13 - Call `toList` in `RelationalGroupedDataset.toDF` method when `groupingExprs` is `Stream` type because `Stream` can't serialize in Scala 2.13 - Add a manual sort to `classFunsMap` in `ExpressionsSchemaSuite` because `Iterable.groupBy` in Scala 2.13 has different result with `TraversableLike.groupBy` in Scala 2.12 ### Why are the changes needed? We need to support a Scala 2.13 build. 
### Does this PR introduce _any_ user-facing change? Should specified `Seq` to `scala.collection.Seq` when we call `Row.getAs[Seq]` and `Row.get(i).asInstanceOf[Seq]` because the data maybe `mutable.ArraySeq` but the `Seq` is `immutable.Seq` in Scala 2.13 ### How was this patch tested? - Scala 2.12: Pass the Jenkins or GitHub Action - Scala 2.13: Do the following: ``` dev/change-scala-version.sh 2.13 mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am mvn test -pl sql/core -Pscala-2.13 ``` Before ``` Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0 * 319 TESTS FAILED * ``` After ``` Tests: succeeded 8204, failed 286, canceled 1, ignored 52, pending 0 * 286 TESTS FAILED * ``` Closes #29660 from LuciferYang/SPARK-32808. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-09 08:53:44 -05:00
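The `Seq` difference mentioned in SPARK-32808 can be reproduced without Spark. The sketch below is an illustration only: `SeqCompatSketch` and its members are hypothetical names, and an untyped `Any` value stands in for what `Row.get(i)` might return.

```scala
import scala.collection.mutable

object SeqCompatSketch {
  // Stand-in for a value stored in a row: a mutable.ArraySeq typed as Any.
  val stored: Any = mutable.ArraySeq(1, 2, 3)

  // Casting to scala.collection.Seq accepts both mutable and immutable sequences,
  // so it behaves the same way on Scala 2.12 and 2.13.
  def asGeneralSeq(value: Any): scala.collection.Seq[Int] =
    value.asInstanceOf[scala.collection.Seq[Int]]

  // In Scala 2.13 the unqualified `Seq` is an alias for immutable.Seq,
  // and a mutable.ArraySeq is not an instance of it.
  def isImmutableSeq(value: Any): Boolean =
    value.isInstanceOf[scala.collection.immutable.Seq[_]]
}
```

A cast to the unqualified `Seq[Int]` would compile on both versions, but on 2.13 it would fail at runtime for this value, which is why the commit spells out `scala.collection.Seq`.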
Liang-Chi Hsieh	de0dc52a84	[SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession ### What changes were proposed in this pull request? If no active SparkSession is available, let `FileSourceScanExec.needsUnsafeRowConversion` look at default SQL config of ParquetSource vectorized reader instead of failing the query execution. ### Why are the changes needed? Fix a bug that if no active SparkSession is available, file-based data source scan for Parquet Source will throw exception. ### Does this PR introduce _any_ user-facing change? Yes, this change fixes the bug. ### How was this patch tested? Unit test. Closes #29667 from viirya/SPARK-32813. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-09 12:23:05 +09:00
Max Gekk	adc8d687ce	[SPARK-32810][SQL][TESTS][FOLLOWUP] Check path globbing in JSON/CSV datasources v1 and v2 ### What changes were proposed in this pull request? In the PR, I propose to move the test `SPARK-32810: CSV and JSON data sources should be able to read files with escaped glob metacharacter in the paths` from `DataFrameReaderWriterSuite` to `CSVSuite` and to `JsonSuite`. This will allow to run the same test in `CSVv1Suite`/`CSVv2Suite` and in `JsonV1Suite`/`JsonV2Suite`. ### Why are the changes needed? To improve test coverage by checking JSON/CSV datasources v1 and v2. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running affected test suites: ``` $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.csv." $ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json." ``` Closes #29684 from MaxGekk/globbing-paths-when-inferring-schema-dsv2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-09 10:29:58 +09:00
manuzhang	96ff87dce8	[SPARK-32753][SQL][FOLLOWUP] Fix indentation and clean up view in test ### What changes were proposed in this pull request? Fix indentation and clean up view in the test added by https://github.com/apache/spark/pull/29593. ### Why are the changes needed? Address review comments in https://github.com/apache/spark/pull/29665. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated test. Closes #29682 from manuzhang/spark-32753-followup. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-09 10:20:21 +09:00
Zhenhua Wang	e7d9a24565	[SPARK-32817][SQL] DPP throws error when broadcast side is empty ### What changes were proposed in this pull request? In `SubqueryBroadcastExec.relationFuture`, if the `broadcastRelation` is an `EmptyHashedRelation`, then `broadcastRelation.keys()` will throw `UnsupportedOperationException`. ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a new test. Closes #29671 from wzhfy/dpp_empty_broadcast. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-09-08 21:36:21 +09:00
sychen	bd3dc2f54d	[SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe ### What changes were proposed in this pull request? Before SPARK-31511 is fixed, `BytesToBytesMap` iterator() is not thread-safe and may cause data inaccuracy. We need to add a unit test. ### Why are the changes needed? Increase test coverage to ensure that iterator() is thread-safe. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? add ut Closes #29669 from cxzl25/SPARK-31511-test. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-08 11:54:04 +00:00
Zhenhua Wang	55d38a479b	[SPARK-32748][SQL] Revert "Support local property propagation in SubqueryBroadcastExec" ### What changes were proposed in this pull request? This reverts commit `04f7f6dac0` due to the discussion in [comment](https://github.com/apache/spark/pull/29589#discussion_r484657207). ### Why are the changes needed? Based on the discussion in [comment](https://github.com/apache/spark/pull/29589#discussion_r484657207), propagation for thread local properties in `SubqueryBroadcastExec` is not necessary, since they will be propagated by broadcast exchange threads anyway. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Also revert the added test. Closes #29674 from wzhfy/revert_dpp_thread_local. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-09-08 20:20:16 +09:00
Wenchen Fan	4144b6da52	[SPARK-32764][SQL] -0.0 should be equal to 0.0 ### What changes were proposed in this pull request? This is a Spark 3.0 regression introduced by https://github.com/apache/spark/pull/26761. We missed a corner case that `java.lang.Double.compare` treats 0.0 and -0.0 as different, which breaks SQL semantic. This PR adds back the `OrderingUtil`, to provide custom compare methods that take care of 0.0 vs -0.0 ### Why are the changes needed? Fix a correctness bug. ### Does this PR introduce _any_ user-facing change? Yes, now `SELECT 0.0 > -0.0` returns false correctly as Spark 2.x. ### How was this patch tested? new tests Closes #29647 from cloud-fan/float. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-09-07 20:43:43 -07:00
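The corner case in SPARK-32764 is easy to see in isolation. The following is a minimal sketch of a zero-aware comparison, assuming a hypothetical `ZeroAwareOrdering` object; it illustrates the idea behind a custom compare method and is not presented as the code of Spark's `OrderingUtil`.

```scala
object ZeroAwareOrdering {
  // java.lang.Double.compare orders -0.0 before 0.0, while the primitive
  // comparison `0.0 == -0.0` is true. Checking `==` first makes the two zeros
  // compare as equal; every other case (including NaN) falls through to
  // java.lang.Double.compare.
  def compareDoubles(x: Double, y: Double): Int =
    if (x == y) 0 else java.lang.Double.compare(x, y)
}
```

Because `NaN == NaN` is false, NaN values still go through `java.lang.Double.compare`, which treats NaN as equal to itself and greater than any other value.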
Max Gekk	954cd9feaa	[SPARK-32810][SQL] CSV/JSON data sources should avoid globbing paths when inferring schema ### What changes were proposed in this pull request? In the PR, I propose to fix an issue with the CSV and JSON data sources in Spark SQL when both of the following are true: * no user specified schema * some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*` etc. ### Why are the changes needed? To fix the issue when the follow two queries try to read from paths `[abc].csv` and `[abc].json`: ```scala spark.read.csv("""/tmp/\[abc\].csv""").show spark.read.json("""/tmp/\[abc\].json""").show ``` but would end up hitting an exception: ``` org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/[abc].csv; at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:722) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244) at scala.collection.immutable.List.foreach(List.scala:392) ``` ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added new test cases in `DataFrameReaderWriterSuite`. Closes #29659 from MaxGekk/globbing-paths-when-inferring-schema. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-08 09:42:59 +09:00
manuzhang	c43460cf82	[SPARK-32753][SQL] Only copy tags to node with no tags ### What changes were proposed in this pull request? Only copy tags to node with no tags when transforming plans. ### Why are the changes needed? cloud-fan [made a good point](https://github.com/apache/spark/pull/29593#discussion_r482013121) that it doesn't make sense to append tags to existing nodes when nodes are removed. That will cause such bugs as duplicate rows when deduplicating and repartitioning by the same column with AQE. ``` spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1") val df = spark.sql("select id from v1 group by id distribute by id") println(df.collect().toArray.mkString(",")) println(df.queryExecution.executedPlan) // With AQE [4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9] AdaptiveSparkPlan(isFinalPlan=true) +- CustomShuffleReader local +- ShuffleQueryStage 0 +- Exchange hashpartitioning(id#183L, 10), true +- (3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L]) +- Union :- (1) Range (0, 10, step=1, splits=2) +- (2) Range (0, 10, step=1, splits=2) // Without AQE [4],[7],[0],[6],[8],[3],[2],[5],[1],[9] (4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Exchange hashpartitioning(id#206L, 10), true +- (3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L]) +- Union :- (1) Range (0, 10, step=1, splits=2) +- *(2) Range (0, 10, step=1, splits=2) ``` It's too expensive to detect node removal so we make a compromise only to copy tags to node with no tags. ### Does this PR introduce _any_ user-facing change? Yes. Fix a bug. ### How was this patch tested? Add test. Closes #29593 from manuzhang/spark-32753. Authored-by: manuzhang <owenzhang1990@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-07 16:08:57 +00:00
Zhenhua Wang	04f7f6dac0	[SPARK-32748][SQL] Support local property propagation in SubqueryBroadcastExec ### What changes were proposed in this pull request? Since [SPARK-22590](`2854091d12`), local property propagation is supported through `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and `SubqueryExec` when computing `relationFuture`. This pr adds the support in `SubqueryBroadcastExec`. ### Why are the changes needed? Local property propagation is missed in `SubqueryBroadcastExec`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a new test. Closes #29589 from wzhfy/thread_local. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-07 06:26:14 +00:00
ulysses	05fcf26b79	[SPARK-32677][SQL] Load function resource before create ### What changes were proposed in this pull request? Change `CreateFunctionCommand` to check that the function class can be loaded before creating the function. ### Why are the changes needed? Creating a permanent function and creating a temporary function behave differently when the function class is invalid, e.g., ``` create function f as 'test.non.exists.udf'; -- Time taken: 0.104 seconds create temporary function f as 'test.non.exists.udf' -- Error in query: Can not load class 'test.non.exists.udf' when registering the function 'f', please make sure it is on the classpath; ``` Hive fails both of these statements. ### Does this PR introduce _any_ user-facing change? Yes, users now get an exception when creating an invalid UDF. ### How was this patch tested? New test. Closes #29502 from ulysses-you/function. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-07 06:00:23 +00:00
Kent Yao	de44e9cfa0	[SPARK-32785][SQL] Interval with dangling parts should not results null ### What changes were proposed in this pull request? Bugfix for incomplete interval values, e.g. interval '1' and interval '1 day 2'. Currently these cases result in NULL, but they should fail with an IllegalArgumentException instead. ### Why are the changes needed? Correctness. ### Does this PR introduce _any_ user-facing change? Yes, incomplete intervals now throw an exception. #### before ``` bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'" NULL NULL NULL ``` #### after ``` -- !query select interval '1' -- !query schema struct<> -- !query output org.apache.spark.sql.catalyst.parser.ParseException Cannot parse the INTERVAL value: 1(line 1, pos 7) == SQL == select interval '1' ``` ### How was this patch tested? Unit tests added. Closes #29635 from yaooqinn/SPARK-32785. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-07 05:11:30 +00:00
Eren Avsarogullari	f5360e761e	[SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API ### What changes were proposed in this pull request? Currently, Spark Public Rest APIs support Application attemptId except SQL API. This causes `no such app: application_X` issue when the application has `attemptId` (e.g: YARN cluster mode). Please find existing and supported Rest endpoints with attemptId. ``` // Existing Rest Endpoints applications/{appId}/sql applications/{appId}/sql/{executionId} // Rest Endpoints required support applications/{appId}/{attemptId}/sql applications/{appId}/{attemptId}/sql/{executionId} ``` Also fixing following compile warning on `SqlResourceSuite`: ``` [WARNING] [Warn] ~/spark/sql/core/src/test/scala/org/apache/spark/status/api/v1/sql/SqlResourceSuite.scala:67: Reference to uninitialized value edges ``` ### Why are the changes needed? This causes `no such app: application_X` issue when the application has `attemptId`. ### Does this PR introduce _any_ user-facing change? Not yet because SQL Rest API is being planned to release with `Spark 3.1`. ### How was this patch tested? 1. New Unit tests are added for existing Rest endpoints. `attemptId` seems not coming in `local-mode` and coming in `YARN cluster mode` so could not be added for `attemptId` case (Suggestions are welcome). 2. Also, patch has been tested manually through both Spark Core and History Server Rest APIs. Closes #29364 from erenavsarogullari/SPARK-32548. Authored-by: Eren Avsarogullari <erenavsarogullari@gmail.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-09-06 19:23:12 +08:00
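The endpoint shapes listed in SPARK-32548 differ only in the optional `attemptId` and `executionId` segments. A small sketch of composing them follows; `SqlRestPaths.sqlPath` is a hypothetical helper written for illustration and is not part of Spark's REST API code.

```scala
object SqlRestPaths {
  // Builds the relative path of a SQL REST endpoint.
  // attemptId is present only for applications that have attempts
  // (e.g. YARN cluster mode, per the commit message).
  def sqlPath(
      appId: String,
      attemptId: Option[String] = None,
      executionId: Option[Long] = None): String = {
    val segments =
      List("applications", appId) ++
        attemptId.toList ++
        List("sql") ++
        executionId.map(_.toString).toList
    segments.mkString("/")
  }
}
```

For example, `SqlRestPaths.sqlPath("app-1", Some("1"), Some(7L))` follows the `applications/{appId}/{attemptId}/sql/{executionId}` shape, and omitting both options yields the pre-existing `applications/{appId}/sql` form.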
Yuming Wang	0b3bb45b89	[SPARK-32791][SQL] Non-partitioned table metric should not have dynamic partition pruning time ### What changes were proposed in this pull request? This PR removes the dynamic partition pruning time from the metrics of non-partitioned tables. ### Why are the changes needed? The metric is meaningless for a non-partitioned table. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Before this PR: ![image](https://user-images.githubusercontent.com/5399861/92141803-87fed380-ee45-11ea-9784-09625b246fea.png) After this PR: ![image](https://user-images.githubusercontent.com/5399861/92141774-7c131180-ee45-11ea-8a9e-6775c592f496.png) Closes #29641 from wangyum/SPARK-32791. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-09-05 23:49:17 +08:00
yangjie	1de272f98d	[SPARK-32762][SQL][TEST] Enhance the verification of ExpressionsSchemaSuite to sql-expression-schema.md ### What changes were proposed in this pull request? `sql-expression-schema.md` is generated automatically by `ExpressionsSchemaSuite`, but `ExpressionsSchemaSuite` checks only the expression entries. If the file is modified manually, `ExpressionsSchemaSuite` therefore does not always guarantee its correctness. For example, [Spark-24884](https://github.com/apache/spark/pull/27507) added support for the `regexp_extract_all` expression and modified `sql-expression-schema.md` manually without updating `Number of queries`, which left the file content inconsistent. Some additional checks have been added to `ExpressionsSchemaSuite` to strengthen the correctness guarantee for `sql-expression-schema.md`: - `Number of queries` should equal the size of `expressions entries` in `sql-expression-schema.md` - `Number of expressions that missing example` should equal the size of `Expressions missing examples` in `sql-expression-schema.md` - `MissExamples` from the test case should be the same as `expectedMissingExamples` from `sql-expression-schema.md` ### Why are the changes needed? Ensure the correctness of the `sql-expression-schema.md` content. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Enhanced ExpressionsSchemaSuite Closes #29608 from LuciferYang/sql-expression-schema. Authored-by: yangjie <yangjie@MacintoshdeMacBook-Pro.local> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-09-04 09:40:35 +09:00
Zhenhua Wang	e693df2a07	[SPARK-32786][SQL][TEST] Improve performance for some slow DPP tests ### What changes were proposed in this pull request? The whole `DynamicPartitionPruningSuite` takes about 2 min on my laptop (with AE either on or off). The slowest tests are `test("simple inner join triggers DPP with mock-up tables")` and `test("cleanup any DPP filter that isn't pushed down due to expression id clashes")`, which together take about 1 min. We can reuse existing test tables or use smaller tables to reduce the cost. After that, the two tests take only about 1 sec in total, leading to a roughly 2x speedup for the suite. ### Why are the changes needed? To speed up the DPP test suites. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified two existing tests. Closes #29636 from wzhfy/improve_dpp_test. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-09-04 09:33:20 +09:00
Wenchen Fan	76330e0295	[SPARK-32788][SQL] non-partitioned table scan should not have partition filter ### What changes were proposed in this pull request? This PR fixes a bug in `FileSourceStrategy`, which generates partition filters even if the table is not partitioned. This can confuse `FileSourceScanExec`, which mistakenly thinks the table is partitioned, tries to update the `numPartitions` metrics, and fails. We should not generate partition filters for non-partitioned tables. ### Why are the changes needed? The bug was exposed by https://github.com/apache/spark/pull/29436. ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug. ### How was this patch tested? New test. Closes #29637 from cloud-fan/refactor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-09-03 23:49:17 +08:00
Takeshi Yamamuro	a6114d8fb8	[SPARK-32638][SQL] Corrects references when adding aliases in WidenSetOperationTypes ### What changes were proposed in this pull request? This PR intends to fix a bug where references can be missing when adding aliases to widen data types in `WidenSetOperationTypes`. For example, ``` CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v); SELECT t.v FROM ( SELECT v FROM t3 UNION ALL SELECT v + v AS v FROM t3 ) t; org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1]. Attribute(s) with the same name appear in the operation: v. Please check if the right attribute(s) are used.;; !Project [v#1] <------ the reference got missing +- SubqueryAlias t +- Union :- Project [cast(v#1 as decimal(11,0)) AS v#3] : +- Project [v#1] : +- SubqueryAlias t3 : +- SubqueryAlias tbl : +- LocalRelation [v#1] +- Project [v#2] +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0)))), DecimalType(11,0), true) AS v#2] +- SubqueryAlias t3 +- SubqueryAlias tbl +- LocalRelation [v#1] ``` In the case, `WidenSetOperationTypes` added the alias `cast(v#1 as decimal(11,0)) AS v#3`, then the reference in the top `Project` got missing. This PR correct the reference (`exprId` and widen `dataType`) after adding aliases in the rule. ### Why are the changes needed? bugfixes ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests Closes #29485 from maropu/SPARK-32638. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-03 14:48:26 +00:00
Peter Toth	ffd5227543	[SPARK-32730][SQL] Improve LeftSemi and Existence SortMergeJoin right side buffering ### What changes were proposed in this pull request? LeftSemi and Existence SortMergeJoin should not buffer all matching right side rows when bound condition is empty, this is unnecessary and can lead to performance degradation especially when spilling happens. ### Why are the changes needed? Performance improvement. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New UT and TPCDS benchmarks. Closes #29572 from peter-toth/SPARK-32730-improve-leftsemi-sortmergejoin. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-03 14:17:34 +00:00
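The buffering point in SPARK-32730 can be illustrated on plain sorted key sequences: with no extra join condition, a left semi join only needs to know whether some matching right row exists, so matches never have to be collected. The sketch below is a simplified two-pointer illustration under that assumption; `SemiJoinSketch.leftSemiSorted` is a hypothetical name and does not reflect the structure of `SortMergeJoinExec`.

```scala
object SemiJoinSketch {
  // Left semi join of two key sequences, both sorted ascending.
  // Emits each left key that has at least one equal key on the right.
  // The right side is scanned once and no matching right rows are buffered.
  def leftSemiSorted(left: Seq[Int], right: IndexedSeq[Int]): List[Int] = {
    val out = List.newBuilder[Int]
    var j = 0
    for (key <- left) {
      // Skip right keys that are smaller than the current left key.
      while (j < right.length && right(j) < key) j += 1
      // An existence check is enough; j stays on the match so that
      // duplicate left keys also see it.
      if (j < right.length && right(j) == key) out += key
    }
    out.result()
  }
}
```

With `left = 1, 2, 2, 4, 5` and `right = 2, 2, 2, 3, 5`, the three right rows with key 2 are never stored; the pointer simply rests on the first of them while both left rows with key 2 are emitted.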
Ali Afroozeh	0a6043f683	[SPARK-32755][SQL] Maintain the order of expressions in AttributeSet and ExpressionSet ### What changes were proposed in this pull request? This PR changes `AttributeSet` and `ExpressionSet` to maintain the insertion order of the elements. More specifically, we: - change the underlying data structure of `AttributeSet` from `HashSet` to `LinkedHashSet` to maintain the insertion order. - `ExpressionSet` already uses a list to keep track of the expressions; however, since it extends Scala's `immutable.Set` class, operations such as map and flatMap are delegated to `immutable.Set` itself. This means that the result of these operations is no longer an instance of ExpressionSet, but an implementation picked by the parent class. We therefore remove the inheritance from `immutable.Set` and implement the needed methods directly. ExpressionSet has very specific semantics, and it does not make sense for it to extend `immutable.Set` anyway. - change the `PlanStabilitySuite` to not sort the attributes, to be able to catch changes in the order of expressions in different runs. ### Why are the changes needed? Expression identity is based on the `ExprId`, which is an auto-incremented number. This means that the same query can yield a query plan with different expression ids in different runs. `AttributeSet` and `ExpressionSet` internally use a `HashSet` as the underlying data structure, and therefore cannot guarantee a fixed order of operations in different runs. This can be problematic when we want to check for plan changes across runs. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Passes `PlanStabilitySuite` after regenerating the golden files. Closes #29598 from dbaliafroozeh/FixOrderOfExpressions. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2020-09-03 13:56:03 +02:00
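The ordering guarantee that SPARK-32755 relies on comes from the collection itself. As a minimal sketch, assuming plain `Long` ids in place of Spark's `ExprId` (the `InsertionOrderSketch` object is hypothetical), a `LinkedHashSet` returns elements in insertion order regardless of their values:

```scala
import scala.collection.mutable

object InsertionOrderSketch {
  // Deduplicates ids while keeping the order in which they were first inserted.
  // A plain HashSet makes no such promise: its iteration order depends on
  // hash codes, so it can change when the ids change between runs.
  def orderOf(ids: Seq[Long]): List[Long] = {
    val set = mutable.LinkedHashSet.empty[Long]
    ids.foreach(id => set += id)
    set.toList
  }
}
```

Two runs that insert differently numbered ids in the same sequence therefore iterate in the same relative order, which is the property the commit uses to keep plans comparable across runs.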
Yuanjian Li	95f1e9549b	[SPARK-32782][SS] Refactor StreamingRelationV2 and move it to catalyst ### What changes were proposed in this pull request? Move StreamingRelationV2 to the catalyst module and bind it to the Table interface. ### Why are the changes needed? Currently, StreamingRelationV2 is bound to TableProvider. Since the V2 relation is not bound to `DataSource`, it should be moved to the catalyst module and bound to the Table interface to make it more flexible and more extensible. We did a similar thing for DataSourceV2Relation. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UT. Closes #29633 from xuanyuanking/SPARK-32782. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-03 16:04:36 +09:00
Kent Yao	1fba286407	[SPARK-32781][SQL] Non-ASCII characters are mistakenly omitted in the middle of intervals ### What changes were proposed in this pull request? This PR fails the interval values parsing when they contain non-ASCII characters which are silently omitted right now. e.g. the case below should be invalid ``` select interval 'interval中文 1 day' ``` ### Why are the changes needed? bugfix, intervals should fail when containing invalid characters ### Does this PR introduce _any_ user-facing change? yes, #### before select interval 'interval中文 1 day' results 1 day, now it fails with ``` org.apache.spark.sql.catalyst.parser.ParseException Cannot parse the INTERVAL value: interval中文 1 day ``` ### How was this patch tested? new tests Closes #29632 from yaooqinn/SPARK-32781. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-03 04:56:40 +00:00
angerszhu	5e6173ebef	[SPARK-31670][SQL] Trim unnecessary Struct field alias in Aggregate/GroupingSets ### What changes were proposed in this pull request? Struct field both in GROUP BY and Aggregate Expresison with CUBE/ROLLUP/GROUPING SET will failed when analysis. ``` test("SPARK-31670") { withTable("t1") { sql( """ \|CREATE TEMPORARY VIEW t(a, b, c) AS \|SELECT * FROM VALUES \|('A', 1, NAMED_STRUCT('row_id', 1, 'json_string', '{"i": 1}')), \|('A', 2, NAMED_STRUCT('row_id', 2, 'json_string', '{"i": 1}')), \|('A', 2, NAMED_STRUCT('row_id', 2, 'json_string', '{"i": 2}')), \|('B', 1, NAMED_STRUCT('row_id', 3, 'json_string', '{"i": 1}')), \|('C', 3, NAMED_STRUCT('row_id', 4, 'json_string', '{"i": 1}')) """.stripMargin) checkAnswer( sql( """ \|SELECT a, c.json_string, SUM(b) \|FROM t \|GROUP BY a, c.json_string \|WITH CUBE \|""".stripMargin), Row("A", "{\"i\": 1}", 3) :: Row("A", "{\"i\": 2}", 2) :: Row("A", null, 5) :: Row("B", "{\"i\": 1}", 1) :: Row("B", null, 1) :: Row("C", "{\"i\": 1}", 3) :: Row("C", null, 3) :: Row(null, "{\"i\": 1}", 7) :: Row(null, "{\"i\": 2}", 2) :: Row(null, null, 9) :: Nil) } } ``` Error ``` [info] - SPARK-31670 * FAILED * (2 seconds, 857 milliseconds) [info] Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 't.`c`' is neither present in the group by, nor is it an aggregate function. 
Add to group by or wrap in first() (or first_value) if you don't care which value you get.;; [info] Aggregate [a#247, json_string#248, spark_grouping_id#246L], [a#247, c#223.json_string AS json_string#241, sum(cast(b#222 as bigint)) AS sum(b)#243L] [info] +- Expand [List(a#221, b#222, c#223, a#244, json_string#245, 0), List(a#221, b#222, c#223, a#244, null, 1), List(a#221, b#222, c#223, null, json_string#245, 2), List(a#221, b#222, c#223, null, null, 3)], [a#221, b#222, c#223, a#247, json_string#248, spark_grouping_id#246L] [info] +- Project [a#221, b#222, c#223, a#221 AS a#244, c#223.json_string AS json_string#245] [info] +- SubqueryAlias t [info] +- Project [col1#218 AS a#221, col2#219 AS b#222, col3#220 AS c#223] [info] +- Project [col1#218, col2#219, col3#220] [info] +- LocalRelation [col1#218, col2#219, col3#220] [info] ``` For Struct type Field, when we resolve it, it will construct with Alias. When struct field in GROUP BY with CUBE/ROLLUP etc, struct field in groupByExpression and aggregateExpression will be resolved with different exprId as below ``` 'Aggregate [cube(a#221, c#223.json_string AS json_string#240)], [a#221, c#223.json_string AS json_string#241, sum(cast(b#222 as bigint)) AS sum(b)#243L] +- SubqueryAlias t +- Project [col1#218 AS a#221, col2#219 AS b#222, col3#220 AS c#223] +- Project [col1#218, col2#219, col3#220] +- LocalRelation [col1#218, col2#219, col3#220] ``` This makes `ResolveGroupingAnalytics.constructAggregateExprs()` failed to replace aggreagteExpression use expand groupByExpression attribute since there exprId is not same. then error happened. ### Why are the changes needed? Fix analyze bug ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Added UT Closes #28490 from AngersZhuuuu/SPARK-31670. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-02 13:49:09 +00:00
Zhenhua Wang	03afbc8820	[SPARK-32739][SQL] Support prune right for left semi join in DPP ### What changes were proposed in this pull request? Currently in DPP, left semi can only prune left, this pr makes it also support prune right. ### Why are the changes needed? A minor improvement for DPP. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add a test case. Closes #29582 from wzhfy/dpp_support_leftsemi_pruneRight. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-09-02 21:34:49 +08:00
Karol Chmist	7511e43c50	[SPARK-32756][SQL] Fix CaseInsensitiveMap usage for Scala 2.13 ### What changes were proposed in this pull request? This is a follow-up of #29160. This allows Spark SQL project to compile for Scala 2.13. ### Why are the changes needed? It's needed for #28545 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I compiled with Scala 2.13. It fails in `Spark REPL` project, which will be fixed by #28545 Closes #29584 from karolchmist/SPARK-32364-scala-2.13. Authored-by: Karol Chmist <info+github@chmist.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-09-02 08:27:00 -05:00
angerszhu	55ce49ed28	[SPARK-32400][SQL][TEST][FOLLOWUP][TEST-MAVEN] Fix resource loading error in HiveScripTransformationSuite ### What changes were proposed in this pull request? #29401 move `test_script.py` from sql/hive module to sql/core module, cause HiveScripTransformationSuite load resource issue. ### Why are the changes needed? This issue cause jenkins test failed in mvn spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/ spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/ spark-master-test-maven-hadoop-3.2-hive-2.3: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3/ ![image](https://user-images.githubusercontent.com/46485123/91681585-71285a80-eb81-11ea-8519-99fc9783d6b9.png) ![image](https://user-images.githubusercontent.com/46485123/91681010-aaf86180-eb7f-11ea-8dbb-61365a3b0ab4.png) Error as below: ``` Exception thrown while executing Spark plan: HiveScriptTransformation [a#349299, b#349300, c#349301, d#349302, e#349303], python /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py, [a#349309, b#349310, c#349311, d#349312, e#349313], ScriptTransformationIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false) +- Project [_1#349288 AS a#349299, _2#349289 AS b#349300, _3#349290 AS c#349301, _4#349291 AS d#349302, 
_5#349292 AS e#349303] +- LocalTableScan [_1#349288, _2#349289, _3#349290, _4#349291, _5#349292] == Exception == org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18021.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18021.0 (TID 37324) (192.168.10.31 executor driver): org.apache.spark.SparkException: Subprocess exited with status 2. Error: python: can't open file '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py': [Errno 2] No such file or directory at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate(BaseScriptTransformationExec.scala:180) at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate$(BaseScriptTransformationExec.scala:157) at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec.checkFailureAndPropagate(HiveScriptTransformationExec.scala:49) at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec$$anon$1.hasNext(HiveScriptTransformationExec.scala:110) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) at o 
``` ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? Existing UT Closes #29588 from AngersZhuuuu/SPARK-32400-FOLLOWUP. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-02 18:27:29 +09:00
liwensun	f0851e95c6	[SPARK-32776][SS] Limit in streaming should not be optimized away by PropagateEmptyRelation ### What changes were proposed in this pull request? PropagateEmptyRelation will not be applied to LIMIT operators in streaming queries. ### Why are the changes needed? Right now, the limit operator in a streaming query may get optimized away when the relation is empty. This can be problematic for stateful streaming, as this empty batch will not write any state store files, and the next batch will fail when trying to read these state store files and throw a file not found error. We should not let PropagateEmptyRelation optimize away the Limit operator for streaming queries. This PR is intended as a small and safe fix for PropagateEmptyRelation. A fundamental fix that can prevent this from happening again in the future and in other optimizer rules is more desirable, but that's a much larger task. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? unit tests. Closes #29623 from liwensun/spark-32776. Authored-by: liwensun <liwen.sun@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-02 18:05:06 +09:00
Yuming Wang	54348dbd21	[SPARK-32767][SQL] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number ### What changes were proposed in this pull request? Bucket join should work if `spark.sql.shuffle.partitions` larger than bucket number, such as: ```scala spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1") spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2") sql("set spark.sql.shuffle.partitions=600") sql("set spark.sql.autoBroadcastJoinThreshold=-1") sql("select * from t1 join t2 on t1.id = t2.id").explain() ``` Before this pr: ``` == Physical Plan == (5) SortMergeJoin [id#26L], [id#27L], Inner :- (2) Sort [id#26L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#26L, 600), true : +- (1) Filter isnotnull(id#26L) : +- (1) ColumnarToRow : +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432 +- (4) Sort [id#27L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#27L, 600), true +- (3) Filter isnotnull(id#27L) +- (3) ColumnarToRow +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34 ``` After this pr: ``` == Physical Plan == (4) SortMergeJoin [id#26L], [id#27L], Inner :- (1) Sort [id#26L ASC NULLS FIRST], false, 0 : +- (1) Filter isnotnull(id#26L) : +- (1) ColumnarToRow : +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432 +- (3) Sort [id#27L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#27L, 432), true +- (2) Filter isnotnull(id#27L) +- (2) ColumnarToRow +- FileScan parquet 
default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34 ``` ### Why are the changes needed? Spark 2.4 supports this. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #29612 from wangyum/SPARK-32767. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-02 04:16:20 +00:00
Kousuke Saruta	812d0918a8	[SPARK-32771][DOCS] The example of expressions.Aggregator in Javadoc / Scaladoc is wrong ### What changes were proposed in this pull request? This PR modifies an example for `expressions.Aggregator` in Javadoc and Scaladoc. The definitions of `bufferEncoder` and `outputEncoder` are added. ### Why are the changes needed? To correct the example. The current example is wrong and doesn't work because `bufferEncoder` and `outputEncoder` are not defined. ### Does this PR introduce _any_ user-facing change? Yes. Before this change, the Scaladoc and Javadoc looked as follows. ![wrong-example-java](https://user-images.githubusercontent.com/4736016/91897528-5ebf3580-ecd5-11ea-8d7b-e846b776ebbb.png) ![wrong-example](https://user-images.githubusercontent.com/4736016/91897509-58c95480-ecd5-11ea-81a3-98774083b689.png) After this change, the docs look as follows. ![fixed-example-java](https://user-images.githubusercontent.com/4736016/91897592-78607d00-ecd5-11ea-9e55-03fd9c9c6b54.png) ![fixed-example](https://user-images.githubusercontent.com/4736016/91897609-7c8c9a80-ecd5-11ea-837e-9dbcada6cd53.png) ### How was this patch tested? Built with `build/sbt unidoc`, confirmed the generated javadoc/scaladoc, and took the screenshots above. Closes #29617 from sarutak/fix-aggregator-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-02 10:03:07 +09:00
Linhong Liu	a410658c9b	[SPARK-32761][SQL] Allow aggregating multiple foldable distinct expressions ### What changes were proposed in this pull request? For queries with multiple foldable distinct columns, since they will be eliminated during execution, it's not mandatory to let `RewriteDistinctAggregates` handle this case. In the current code, `RewriteDistinctAggregates` does miss some "aggregating with multiple foldable distinct expressions" cases. For example, `select count(distinct 2), count(distinct 2, 3)` will be missed. In the planner, however, this triggers an error that "multiple distinct expressions" are not allowed. As the foldable distinct columns are eventually eliminated, we can allow this in the aggregation planner check. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added test case Closes #29607 from linhongliu-db/SPARK-32761. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 13:04:24 +00:00
Wenchen Fan	fea9360ae7	[SPARK-32757][SQL][FOLLOW-UP] Use child's output for canonicalization in SubqueryBroadcastExec ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/29601 , to fix a small mistake in `SubqueryBroadcastExec`. `SubqueryBroadcastExec.doCanonicalize` should canonicalize the build keys with the query output, not the `SubqueryBroadcastExec.output`. ### Why are the changes needed? fix mistake ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing test Closes #29610 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 12:54:40 +00:00
Huaxin Gao	e1dbc85c72	[SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder ### What changes were proposed in this pull request? Add JDBCScan, JDBCScanBuilder, JDBCWriteBuilder in Datasource V2 JDBC ### Why are the changes needed? Complete Datasource V2 JDBC implementation ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? new tests Closes #29396 from huaxingao/v2jdbc. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 07:23:20 +00:00
Wenchen Fan	d2a5dad97c	[SPARK-32757][SQL] Physical InSubqueryExec should be consistent with logical InSubquery ### What changes were proposed in this pull request? `InSubquery` can be in either single-column mode or multi-column mode, depending on the output length of the subquery. For multi-column mode, the length of input `values` must match the subquery output length. However, `InSubqueryExec` doesn't follow this and is always executed in single-column mode. This is OK as it's only used by DPP, which looks up one key in one `InSubqueryExec`, so the multi-column mode is not needed. But it's better to make the physical and logical nodes consistent. This PR updates `InSubqueryExec` to support multi-column mode, and also fixes `SubqueryBroadcastExec` to report output correctly. ### Why are the changes needed? Fix a potential bug. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #29601 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 07:19:43 +00:00
Yuming Wang	a701bc79e3	[SPARK-32659][SQL][FOLLOWUP] Improve test for pruning DPP on non-atomic type ### What changes were proposed in this pull request? Improve test for pruning DPP on non-atomic type: - Avoid creating new partition tables. This may take 30 seconds. - Add a test for the `array` type. ### Why are the changes needed? Improve test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #29595 from wangyum/SPARK-32659-test. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 05:51:04 +00:00
Huaxin Gao	806140de40	[SPARK-32592][SQL] Make DataFrameReader.table take the specified options ### What changes were proposed in this pull request? Pass the options specified in `DataFrameReader.table` to `JDBCTableCatalog.loadTable`. ### Why are the changes needed? Currently, `DataFrameReader.table` ignores the specified options. Options specified as in the following example are lost. ``` val df = spark.read .option("partitionColumn", "id") .option("lowerBound", "0") .option("upperBound", "3") .option("numPartitions", "2") .table("h2.test.people") ``` We need to make `DataFrameReader.table` take the specified options. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually tested for now. Will add a test after V2 JDBC read is implemented. Closes #29535 from huaxingao/table_options. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-31 13:21:15 +00:00
Cheng Su	ce473b223a	[SPARK-32740][SQL] Refactor common partitioning/distribution logic to BaseAggregateExec ### What changes were proposed in this pull request? All three aggregate physical operators, `HashAggregateExec`, `ObjectHashAggregateExec` and `SortAggregateExec`, have the same `outputPartitioning` and `requiredChildDistribution` logic. Refactor this shared logic into their superclass `BaseAggregateExec` to avoid code duplication and future bugs (similar to `HashJoin` and `ShuffledJoin`). ### Why are the changes needed? Reduce duplicated code across classes and prevent future bugs if we only update one class but forget another. We already did similar refactoring for join (`HashJoin` and `ShuffledJoin`). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests, as this is pure refactoring and no new logic is added. Closes #29583 from c21/aggregate-refactor. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-08-31 15:43:13 +09:00
Udbhav30	065f17386d	[SPARK-32481][CORE][SQL] Support truncate table to move data to trash ### What changes were proposed in this pull request? Instead of deleting the data, we can move the data to trash. Based on the configuration provided by the user, it will be deleted permanently from the trash. ### Why are the changes needed? Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently. ### Does this PR introduce _any_ user-facing change? Yes. After a truncate table, the data is no longer permanently deleted right away. It is first moved to the trash and then deleted permanently after the configured time. ### How was this patch tested? New UTs added. Closes #29552 from Udbhav30/truncate. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-30 10:25:32 -07:00
Cheng Su	cfe012a431	[SPARK-32629][SQL] Track metrics of BitSet/OpenHashSet in full outer SHJ ### What changes were proposed in this pull request? This is a follow-up to https://github.com/apache/spark/pull/29342, and does two things: * Per https://github.com/apache/spark/pull/29342#discussion_r470153323, change from Java `HashSet` to Spark's in-house `OpenHashSet` to track matched rows for non-unique join keys. I checked the `OpenHashSet` implementation, which is built from a key index (`OpenHashSet._bitset` as `BitSet`) and a key array (`OpenHashSet._data` as `Array`). Java `HashSet` is built from `HashMap`, which stores values in a `Node` linked list and in theory should take more memory than `OpenHashSet`. Reran the same benchmark query used in https://github.com/apache/spark/pull/29342, and verified the query has similar performance with `HashSet` and `OpenHashSet`. * Track metrics of the extra data structure `BitSet`/`OpenHashSet` for full outer SHJ. This depends on the change above, because there seems to be no easy way to get the memory size of a Java `HashSet`. ### Why are the changes needed? To surface the memory usage of full outer SHJ more accurately. This can help users/developers to debug/improve full outer SHJ. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a unit test in `SQLMetricsSuite.scala`. Closes #29566 from c21/add-metrics. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-08-30 07:01:33 +09:00
Wenchen Fan	ccc0250a08	[SPARK-32718][SQL] Remove unnecessary keywords for interval units ### What changes were proposed in this pull request? Remove the YEAR, MONTH, DAY, HOUR, MINUTE, SECOND keywords. They are not useful in the parser, as we need to support plural like YEARS, so the parser has to accept the general identifier as interval unit anyway. ### Why are the changes needed? These keywords are reserved in ANSI. If Spark has these keywords, then they become reserved under ANSI mode. This makes Spark not able to run TPCDS queries as they use YEAR as alias name. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added `TPCDSQueryANSISuite`, to make sure Spark with ANSI mode can run TPCDS queries. Closes #29560 from cloud-fan/keyword. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-29 14:06:01 -07:00
Chen Zhang	58f87b3178	[SPARK-32639][SQL] Support GroupType parquet mapkey field ### What changes were proposed in this pull request? Remove the assertion in ParquetSchemaConverter that the parquet mapKey field must be PrimitiveType. ### Why are the changes needed? There is a parquet file in the attachment of [SPARK-32639](https://issues.apache.org/jira/browse/SPARK-32639), and the MessageType recorded in the file is: ``` message parquet_schema { optional group value (MAP) { repeated group key_value { required group key { optional binary first (UTF8); optional binary middle (UTF8); optional binary last (UTF8); } optional binary value (UTF8); } } } ``` Use `spark.read.parquet("000.snappy.parquet")` to read the file. Spark will throw an exception when converting Parquet MessageType to Spark SQL StructType: > AssertionError(Map key type is expected to be a primitive type, but found...) Use `spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")` to read the file, spark returns the correct result . According to the parquet project document (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), the mapKey in the parquet format does not need to be a primitive type. Note: This parquet file is not written by spark, because spark will write additional sparkSchema string information in the parquet file. When Spark reads, it will directly use the additional sparkSchema information in the file instead of converting Parquet MessageType to Spark SQL StructType. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test case Closes #29451 from izchen/SPARK-32639. Authored-by: Chen Zhang <izchen@126.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-28 16:51:00 +00:00
Takeshi Yamamuro	0cb91b8c18	[SPARK-32704][SQL] Logging plan changes for execution ### What changes were proposed in this pull request? Since we only log plan changes for analyzer/optimizer now, this PR intends to add code to log plan changes in the preparation phase in `QueryExecution` for execution. ``` scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN") scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan ... 20/08/26 09:32:36 WARN PlanChangeLogger: === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages === !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) (1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L]) +- (1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L]) ! +- Range (0, 10, step=1, splits=4) +- (1) Range (0, 10, step=1, splits=4) 20/08/26 09:32:36 WARN PlanChangeLogger: === Result of Batch Preparations === !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) (1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L]) +- (1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L]) ! +- Range (0, 10, step=1, splits=4) +- (1) Range (0, 10, step=1, splits=4) ``` ### Why are the changes needed? Easy debugging for executed plans ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests. Closes #29544 from maropu/PlanLoggingInPreparations. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-28 16:35:47 +00:00
yi.wu	c3b9404253	[SPARK-32717][SQL] Add a AQEOptimizer for AdaptiveSparkPlanExec ### What changes were proposed in this pull request? This PR proposes to add a specific `AQEOptimizer` for the `AdaptiveSparkPlanExec` instead of implementing an anonymous `RuleExecutor`. At the same time, this PR also adds the configuration `spark.sql.adaptive.optimizer.excludedRules`, which follows the same pattern as `Optimizer`, to make the `AQEOptimizer` more flexible for users and developers. ### Why are the changes needed? Currently, `AdaptiveSparkPlanExec` implements an anonymous `RuleExecutor` to apply the AQE optimization rules on the plan. However, an anonymous class is usually inconvenient to maintain and extend in the long term. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? It's a pure refactor, so passing the existing tests should be sufficient. Closes #29559 from Ngone51/impro-aqe-optimizer. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 21:23:53 +09:00
Jungtaek Lim (HeartSaVioR)	73bfed3633	[SPARK-28612][SQL][FOLLOWUP] Correct method doc of DataFrameWriterV2.replace() ### What changes were proposed in this pull request? This patch corrects the method doc of DataFrameWriterV2.replace(), in which the explanation of the exception is stated the opposite way round. ### Why are the changes needed? The method doc is incorrect. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Only doc change. Closes #29568 from HeartSaVioR/SPARK-28612-FOLLOWUP-fix-doc-nit. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 15:14:57 +09:00
Liang-Chi Hsieh	d6c095c92c	[SPARK-32693][SQL] Compare two dataframes with same schema except nullable property ### What changes were proposed in this pull request? This PR changes the key data type check in `HashJoin` to use `sameType`. ### Why are the changes needed? Looking at the resolution condition of `SetOperation`, it only requires each left data type to be `sameType` as the corresponding right one. Logically, the `EqualTo` expression in an equi-join also only requires the left data type to be `sameType` as the right data type. `HashJoin`, however, requires the left keys' data types to be exactly the same as the right keys' data types, which does not look reasonable. This produces inconsistent results when doing `except` between two dataframes. If the two dataframes don't have nested fields, `HashJoin` passes the key type check even if their fields' nullable properties differ, because it checks each field individually and so the nullable property is ignored. If the two dataframes have nested fields such as structs, `HashJoin` fails the key type check because it now compares two struct types, where the nullable property matters. ### Does this PR introduce _any_ user-facing change? Yes. Makes the `except` operation between dataframes consistent. ### How was this patch tested? Unit test. Closes #29555 from viirya/SPARK-32693. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-08-28 10:32:23 +09:00
xuewei.linxuewei	eb379766f4	[SPARK-32705][SQL] Fix serialization issue for EmptyHashedRelation ### What changes were proposed in this pull request? Currently, EmptyHashedRelation and HashedRelationWithAllNullKeys are objects, which causes a Java deserialization exception as follows ``` 20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost task 34.0 in stage 57.0 (TID 18076, emr-worker-5.cluster-183257, executor 18): java.io.InvalidClassException: org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid constructor at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169) at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328) ``` This PR includes * Using case object instead to fix the serialization issue. * Also changing EmptyHashedRelation not to extend NullAwareHashedRelation, since it's already being used in other non-NAAJ joins. ### Why are the changes needed? Without this fix, BHJ fails when buildSide is empty, and BHJ(NAAJ) fails when buildSide has null partition keys. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? * Existing UT. * Run entire TPCDS for E2E coverage. Closes #29547 from leanken/leanken-SPARK-32705. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-27 06:24:42 +00:00
Terry Kim	baaa756dee	[SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path parameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start() ### What changes were proposed in this pull request? This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`. ### Why are the changes needed? The current behavior silently overwrites the `path` option if path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`. For example, ``` Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2") ``` will write the result to `/tmp/path2`. ### Does this PR introduce _any_ user-facing change? Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`: ``` scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2") org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.; ``` The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`. ### How was this patch tested? Added new tests. Closes #29543 from imback82/path_option. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-27 06:21:04 +00:00
Dongjoon Hyun	2dee4352a0	Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to trash" This reverts commit `5c077f0580`.	2020-08-26 11:24:35 -07:00
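
The SPARK-32516 follow-up in the log above describes a rule: a `path` option and a path parameter must not both be set, unless the legacy flag restores the old behavior in which the parameter silently wins. The following is a minimal sketch of that rule in plain Scala, written only to illustrate the decision logic. `PathOptionCheck`, `resolvePath` and `legacyBehavior` are hypothetical names, the exception type is a stand-in for `AnalysisException`, and none of this is Spark's actual implementation.

```scala
// Hypothetical model of the 'path' option check described in SPARK-32516.
// Not Spark code: names and the exception type are assumptions for illustration.
object PathOptionCheck {

  // Decide which path a writer would use, given the options map and the
  // explicit path parameter passed to a save()-like call.
  def resolvePath(
      options: Map[String, String],
      pathParam: Option[String],
      legacyBehavior: Boolean = false): Option[String] =
    (options.get("path"), pathParam) match {
      // Both sources are set and the legacy flag is off: reject the call.
      case (Some(_), Some(_)) if !legacyBehavior =>
        throw new IllegalArgumentException(
          "There is a 'path' option set and a path parameter was also given.")
      // Parameter present: it is used (under the legacy flag it overrides the option).
      case (_, Some(param)) => Some(param)
      // No parameter: fall back to the option, if any.
      case (fromOption, None) => fromOption
    }

  def main(args: Array[String]): Unit = {
    val opts = Map("path" -> "/tmp/path1")

    // Legacy behavior: the parameter silently overrides the option.
    println(resolvePath(opts, Some("/tmp/path2"), legacyBehavior = true))

    // Only the option is set: no conflict.
    println(resolvePath(opts, None))

    // New behavior: the conflicting call is rejected.
    val rejected =
      try { resolvePath(opts, Some("/tmp/path2")); false }
      catch { case _: IllegalArgumentException => true }
    println(s"conflict rejected: $rejected")
  }
}
```

In Spark itself the legacy behavior is controlled by `spark.sql.legacy.pathOptionBehavior.enabled`, as the commit message states; the boolean parameter here only stands in for that setting.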

7102 commits