Commit graph

10704 commits

Author SHA1 Message Date
Dereck Li 1288ad814e [SPARK-34067][SQL] PartitionPruning push down pruningHasBenefit function into insertPredicate function to decrease calculate time
### What changes were proposed in this pull request?
Push the `pruningHasBenefit` check in `PartitionPruning` down into the `insertPredicate` function to reduce computation time.

### Why are the changes needed?
To speed up the pruning computation in `PartitionPruning`.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Existing unit tests.

Closes #31122 from monkeyboy123/optimize-dynamic-pruning.

Authored-by: Dereck Li <monkeyboy.ljh@gmail.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-01-14 16:28:06 +08:00
Yuming Wang d3ea308c8f [SPARK-34081][SQL] Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join
### What changes were proposed in this pull request?

We should not push LeftSemi/LeftAnti down through an Aggregate in some cases.

```scala
spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.saveAsTable("t1")
spark.range(40000000L).selectExpr("id % 8000 as c", "id % 8000 as d").write.saveAsTable("t2")
spark.sql("SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM t2").explain
```

Before this pr:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[a#16L, b#17L], functions=[])
   +- HashAggregate(keys=[a#16L, b#17L], functions=[])
      +- HashAggregate(keys=[a#16L, b#17L], functions=[])
         +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#72]
            +- HashAggregate(keys=[a#16L, b#17L], functions=[])
               +- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L)], LeftSemi
                  :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) ASC NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS FIRST], false, 0
                  :  +- Exchange hashpartitioning(coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, [id=#65]
                  :     +- FileScan parquet default.t1[a#16L,b#17L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
                  +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) ASC NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS FIRST], false, 0
                     +- Exchange hashpartitioning(coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), ENSURE_REQUIREMENTS, [id=#66]
                        +- HashAggregate(keys=[c#18L, d#19L], functions=[])
                           +- Exchange hashpartitioning(c#18L, d#19L, 5), ENSURE_REQUIREMENTS, [id=#61]
                              +- HashAggregate(keys=[c#18L, d#19L], functions=[])
                                 +- FileScan parquet default.t2[c#18L,d#19L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:bigint,d:bigint>
```

After this pr:
```
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[a#16L, b#17L], functions=[])
   +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#74]
      +- HashAggregate(keys=[a#16L, b#17L], functions=[])
         +- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L)], LeftSemi
            :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) ASC NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, [id=#67]
            :     +- HashAggregate(keys=[a#16L, b#17L], functions=[])
            :        +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#61]
            :           +- HashAggregate(keys=[a#16L, b#17L], functions=[])
            :              +- FileScan parquet default.t1[a#16L,b#17L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint>
            +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) ASC NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), ENSURE_REQUIREMENTS, [id=#68]
                  +- HashAggregate(keys=[c#18L, d#19L], functions=[])
                     +- Exchange hashpartitioning(c#18L, d#19L, 5), ENSURE_REQUIREMENTS, [id=#63]
                        +- HashAggregate(keys=[c#18L, d#19L], functions=[])
                           +- FileScan parquet default.t2[c#18L,d#19L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:bigint,d:bigint>
```

### Why are the changes needed?

1. Pushing LeftSemi/LeftAnti down through an Aggregate can hurt performance.
2. It removes the user-added DISTINCT operator, e.g.: [q38](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q38.sql), [q87](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q87.sql).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and benchmark test.

SQL | Before this PR (seconds) | After this PR (seconds)
-- | -- | --
q14a | 660 | 594
q14b | 660 | 600
q38 | 55 | 29
q87 | 66 | 35

Before this pr:
![image](https://user-images.githubusercontent.com/5399861/104452849-8789fc80-55de-11eb-88da-44059899f9a9.png)

After this pr:
![image](https://user-images.githubusercontent.com/5399861/104452899-9a043600-55de-11eb-9286-d8f3a23ca3b8.png)

Closes #31145 from wangyum/SPARK-34081.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-14 04:37:55 +00:00
Gengliang Wang 467d758973 [SPARK-34075][SQL][CORE] Hidden directories are being listed for partition inference
### What changes were proposed in this pull request?

Fix a regression from https://github.com/apache/spark/pull/29959.

In Spark, the following file paths are considered hidden and are ignored on file reads:
1. starts with "_" and doesn't contain "="
2. starts with "."

However, after the refactoring PR https://github.com/apache/spark/pull/29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426

This PR fixes the bug. To achieve this, the method `InMemoryFileIndex.shouldFilterOut` is refactored into `HadoopFSUtils.shouldFilterOutPathName`.
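
As a rough illustration of the rule described above, here is a minimal sketch of the hidden-path check (a simplified version for illustration only, not the exact Spark implementation, which also handles special cases such as `_metadata` files):

```scala
// Simplified sketch of the hidden-path rule: filter out names starting with "."
// or names starting with "_" that do not contain "=".
def shouldFilterOutPathName(pathName: String): Boolean = {
  val hiddenUnderscore = pathName.startsWith("_") && !pathName.contains("=")
  val hiddenDot = pathName.startsWith(".")
  hiddenUnderscore || hiddenDot
}
```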

### Why are the changes needed?

Bugfix

### Does this PR introduce _any_ user-facing change?

Yes, it fixes a bug for reading file paths with partitions.

### How was this patch tested?

Unit test

Closes #31169 from gengliangwang/fileListingBug.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-14 09:39:38 +09:00
Kousuke Saruta b7da108cae [SPARK-33690][SQL][FOLLOWUP] Escape further meta-characters in showString
### What changes were proposed in this pull request?

This is a follow-up PR for SPARK-33690 (#30647).
In addition to the original PR, this PR escapes the following meta-characters in `Dataset#showString` (see the sketch after the list below):

  * `\r` (carriage return)
  * `\f` (form feed)
  * `\b` (backspace)
  * `\u000B` (vertical tab)
  * `\u0007` (bell)
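
A minimal sketch of how such an escaping step could look (illustrative only, not Spark's actual `showString` code; the escape representations are assumptions):

```scala
// Map each meta-character to a visible escape sequence before rendering a cell,
// so the characters cannot break the table layout.
val metaCharacterEscapes = Map(
  "\r" -> "\\r",      // carriage return
  "\f" -> "\\f",      // form feed
  "\b" -> "\\b",      // backspace
  "\u000B" -> "\\v",  // vertical tab
  "\u0007" -> "\\a"   // bell
)

def escapeMetaCharacters(cell: String): String =
  metaCharacterEscapes.foldLeft(cell) { case (acc, (ch, replacement)) =>
    acc.replace(ch, replacement)
  }
```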

### Why are the changes needed?

To avoid breaking the layout of `Dataset#showString`.
`\u0007` does not break the layout of `Dataset#showString`, but it is noisy (it beeps for each row), so it should also be escaped.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified the existing tests.
I also built the documentation and checked the generated HTML for `sql-migration-guide.md`.

Closes #31144 from sarutak/escape-metacharacters-in-getRows.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:13:01 -06:00
Kousuke Saruta 62d8466c74 [SPARK-34051][SQL] Support 32-bit unicode escape in string literals
### What changes were proposed in this pull request?
This PR adds support for 32-bit Unicode escapes in string literals, as PostgreSQL and some modern programming languages do (e.g., Python 3, C++11, and Rust).
In addition to the existing 16-bit Unicode escapes like `"\u0041"`, users can express characters like `"\U00020BB7"` with this change.
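
A hedged usage sketch, assuming a `SparkSession` named `spark`; with this change the non-BMP character U+20BB7 can be written with a single 32-bit escape instead of the surrogate-pair form `\uD842\uDFB7`:

```scala
// The string literal reaches the SQL parser as '\U00020BB7' and is decoded into
// the single character U+20BB7 (the triple-quoted Scala string keeps the backslash).
spark.sql("""SELECT '\U00020BB7' AS c""").show()
```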

### Why are the changes needed?
Users can express Unicode characters directly, without surrogate pairs.

### Does this PR introduce _any_ user-facing change?
Yes. Users can express all Unicode characters directly.

### How was this patch tested?
Added new assertions to the existing test case.

Closes #31096 from sarutak/32-bit-unicode-escape.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:10:03 -06:00
yangjie01 8b1ba233f1 [SPARK-34068][CORE][SQL][MLLIB][GRAPHX] Remove redundant collection conversion
### What changes were proposed in this pull request?
There are some redundant collection conversions that can be removed; for version compatibility, these were cleaned up and checked with the Scala 2.13 profile.
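
A hypothetical example of the kind of redundant conversion removed here (illustrative, not an actual occurrence from the codebase):

```scala
// The value is already a Seq, so the extra .toSeq adds nothing.
val xs: Seq[Int] = Seq(1, 2, 3)
val before = xs.toSeq.map(_ + 1)  // redundant conversion
val after  = xs.map(_ + 1)        // equivalent, without the conversion
```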

### Why are the changes needed?
To remove redundant collection conversions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass the Jenkins or GitHub Actions build
- Manual tests of `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`, `kafka-0-10` in Scala 2.13 passed

Closes #31125 from LuciferYang/SPARK-34068.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:07:02 -06:00
yangjie01 8c5fecda73 [SPARK-34070][CORE][SQL] Replaces find and emptiness check with exists
### What changes were proposed in this pull request?
This PR uses `exists` to simplify `find` + emptiness check; it is semantically equivalent but simpler.

**Before**

```
seq.find(p).isDefined

or

seq.find(p).isEmpty
```

**After**

```
seq.exists(p)

or

!seq.exists(p)
```
### Why are the changes needed?
Code simplification.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Actions build

Closes #31130 from LuciferYang/SPARK-34070.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 10:42:24 -06:00
Chao Sun 62d82b5b27 [SPARK-34076][SQL] SQLContext.dropTempTable fails if cache is non-empty
### What changes were proposed in this pull request?

This changes `CatalogImpl.dropTempView` and `CatalogImpl.dropGlobalTempView` to use the analyzed logical plan instead of `viewDef`, which is unresolved.

### Why are the changes needed?

Currently, `CatalogImpl.dropTempView` is implemented as following:

```scala
override def dropTempView(viewName: String): Boolean = {
  sparkSession.sessionState.catalog.getTempView(viewName).exists { viewDef =>
    sparkSession.sharedState.cacheManager.uncacheQuery(
      sparkSession, viewDef, cascade = false)
    sessionCatalog.dropTempView(viewName)
  }
}
```

Here, the logical plan `viewDef` is not resolved, and when it is passed to `uncacheQuery`, the call can fail in the `sameResult` check, where canonicalized plans are compared. The error message looks like:
```
Invalid call to qualifier on unresolved object, tree: 'key
```

This can be reproduced via:
```scala
sql(s"CREATE TEMPORARY VIEW $v AS SELECT key FROM src LIMIT 10")
sql(s"CREATE TABLE $t AS SELECT * FROM src")
sql(s"CACHE TABLE $t")
dropTempTable(v)
```

### Does this PR introduce _any_ user-facing change?

The only user-facing change is that `SQLContext.dropTempTable`, which could previously fail in the above scenario, now works with this fix.

### How was this patch tested?

Added new unit tests.

Closes #31136 from sunchao/SPARK-34076.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-13 13:22:21 +00:00
LantaoJin f1b21ba505 [SPARK-34064][SQL] Cancel the running broadcast sub-jobs when SQL statement is cancelled
### What changes were proposed in this pull request?
#24595 introduced `private val runId: UUID = UUID.randomUUID` in `BroadcastExchangeExec` to cancel the broadcast execution in the Future when a timeout happens. Since the runId is a random UUID instead of inheriting the job group id, the broadcast sub-jobs keep executing when a SQL statement is cancelled. This PR uses the job group id of the outside thread as the `runId`, so these broadcast sub-jobs are aborted when the SQL statement is cancelled.
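
A hedged sketch of the cancellation mechanism this relies on, assuming a `SparkContext` named `sc` and an illustrative group id; jobs submitted under the same job group can be cancelled together, which is why the broadcast future should run under the statement's job group id rather than a random UUID:

```scala
// All jobs started under this group id, including broadcast sub-jobs, can be
// cancelled with a single cancelJobGroup call.
sc.setJobGroup("statement-1", "query with broadcast sub-jobs", interruptOnCancel = true)
// ... submit the SQL statement; its broadcast sub-jobs run under "statement-1" ...
sc.cancelJobGroup("statement-1")  // cancels every job in the group
```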

### Why are the changes needed?
When broadcasting a table takes too long and the SQL statement is cancelled, the background Spark job keeps running and wastes resources.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual test.
Broadcasting a table is too fast to cancel in a unit test, but it is easy to verify manually:
1. Start a Spark thrift-server with less resource in YARN.
2. When the driver is running but no executors are launched, submit a SQL which will broadcast tables from beeline.
3. Cancel the SQL in beeline

Without the patch, broadcast sub-jobs won't be cancelled.
![Screen Shot 2021-01-11 at 12 03 13 PM](https://user-images.githubusercontent.com/1853780/104150975-ab024b00-5416-11eb-8bf9-b5167bdad80a.png)

With this patch, broadcast sub-jobs will be cancelled.
![Screen Shot 2021-01-11 at 11 43 40 AM](https://user-images.githubusercontent.com/1853780/104150994-be151b00-5416-11eb-80ff-313d423c8a2e.png)

Closes #31119 from LantaoJin/SPARK-34064.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-13 12:58:27 +00:00
Kent Yao 04f031acb3 [SPARK-34086][SQL] RaiseError generates too much code and may fail codegen in length check for char varchar
### What changes were proposed in this pull request?

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/

We can reduce more than 8000 bytes by removing the unnecessary CONCAT expression.

With this fix, for q41 in TPC-DS with [Using TPCDS original definitions for char/varchar columns](https://github.com/apache/spark/pull/31012) applied, we can reduce the stage codegen size from 22523 to 14369 bytes:
```
14369  - 22523 = - 8154
```

### Why are the changes needed?

Fix the performance regression (other improvements are still needed for q41 to work well); there would be a huge performance regression if codegen failed.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Modified unit tests.

Closes #31150 from yaooqinn/SPARK-34086.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-13 09:52:36 +00:00
Max Gekk 861f8bb5fb [SPARK-34071][SQL][TESTS] Check stats of cached v1 tables after altering
### What changes were proposed in this pull request?
Port the test added by https://github.com/apache/spark/pull/31112 to:
1. v1 In-Memory catalog for `ALTER TABLE .. DROP PARTITION`
2. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. ADD PARTITION`
3. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. RENAME PARTITION`

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite"
```

Closes #31131 from MaxGekk/cache-stats-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-13 04:58:01 +00:00
Takuya UESHIN ad8e40e2ab [SPARK-32338][SQL][PYSPARK][FOLLOW-UP][TEST] Add more tests for slice function
### What changes were proposed in this pull request?

This PR is a follow-up of #29138 and #29195 to add more tests for `slice` function.

### Why are the changes needed?

The original PRs are missing tests with column-based arguments instead of literals.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests and existing tests.

Closes #31159 from ueshin/issues/SPARK-32338/slice_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-13 09:56:38 +09:00
yi.wu 0099715aae [SPARK-34091][SQL] Shuffle batch fetch should be able to disable after it's been enabled
### What changes were proposed in this pull request?

Fix the setting issue of shuffle batch fetch in `ShuffledRowRDD`.

### Why are the changes needed?

Currently, we cannot disable shuffle batch fetch once it has been enabled. This PR fixes the issue so that `ShuffledRowRDD` respects `spark.sql.adaptive.fetchShuffleBlocksInBatch` at runtime.
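
A hedged usage sketch, assuming a `SparkSession` named `spark`; after this fix, turning the config off again actually takes effect:

```scala
// Enable batch fetch, run a query, then disable it; subsequent queries should
// no longer fetch shuffle blocks in batch.
spark.conf.set("spark.sql.adaptive.fetchShuffleBlocksInBatch", "true")
// ... run an AQE query ...
spark.conf.set("spark.sql.adaptive.fetchShuffleBlocksInBatch", "false")
// ... run another query: the setting is now respected at runtime ...
```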

### Does this PR introduce _any_ user-facing change?

Yes. Before this PR, users could not disable batch fetch once it had been enabled. After this PR, they can.

### How was this patch tested?

Added unit test.

Closes #31155 from Ngone51/fix-batchfetch-set-issue.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-12 15:45:15 +00:00
Max Gekk 6c047958f9 [SPARK-34084][SQL] Fix auto updating of table stats in ALTER TABLE .. ADD PARTITION
### What changes were proposed in this pull request?
Fix an issue in `ALTER TABLE .. ADD PARTITION` which happens when:
- A table doesn't have stats
- `spark.sql.statistics.size.autoUpdate.enabled` is `true`

In that case, `ALTER TABLE .. ADD PARTITION` does not update table stats automatically.

### Why are the changes needed?
The changes fix the issue demonstrated by the example:
```sql
spark-sql> create table tbl (col0 int, part int) partitioned by (part);
spark-sql> insert into tbl partition (part = 0) select 0;
spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true;
spark-sql> alter table tbl add partition (part = 1);
```
the `add partition` command should update the table stats but it does not. There are no stats in the output of:
```
spark-sql> describe table extended tbl;
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, `ALTER TABLE .. ADD PARTITION` updates stats even when the table does not have them before the command:
```sql
spark-sql> alter table tbl add partition (part = 1);
spark-sql> describe table extended tbl;
col0	int	NULL
part	int	NULL
# Partition Information
# col_name	data_type	comment
part	int	NULL

# Detailed Table Information
...
Statistics	2 bytes
```

### How was this patch tested?
By running new UT and existing test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31149 from MaxGekk/fix-stats-in-add-partition.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-12 14:34:17 +00:00
Kent Yao 99f84892a5 [SPARK-34003][SQL][FOLLOWUP] Avoid pushing modified Char/Varchar sort attributes into aggregate for existing ones
### What changes were proposed in this pull request?

In 0f8e5dd445, we only partially fixed the rule conflict between `PaddingAndLengthCheckForCharVarchar` and `ResolveAggregateFunctions`; an error still occurs for SQL like

```SELECT substr(v, 1, 2), sum(i) FROM t GROUP BY v ORDER BY substr(v, 1, 2)```

```sql
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 'spark_catalog.default.t.`v`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
[info]   Project [substr(v, 1, 2)#100, sum(i)#101L]
[info]   +- Sort [aggOrder#102 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#106], [substr(v#106, 1, 2) AS substr(v, 1, 2)#100, sum(cast(i#98 as bigint)) AS sum(i)#101L, substr(v#103, 1, 2) AS aggOrder#102
[info]         +- SubqueryAlias spark_catalog.default.t
[info]            +- Project [if ((length(v#97) <= 3)) v#97 else if ((length(rtrim(v#97, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#97) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#97, None), 3,  ) AS v#106, i#98]
[info]               +- Relation[v#97,i#98] parquet
[info]
[info]   Project [substr(v, 1, 2)#100, sum(i)#101L]
[info]   +- Sort [aggOrder#102 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#106], [substr(v#106, 1, 2) AS substr(v, 1, 2)#100, sum(cast(i#98 as bigint)) AS sum(i)#101L, substr(v#103, 1, 2) AS aggOrder#102
[info]         +- SubqueryAlias spark_catalog.default.t
[info]            +- Project [if ((length(v#97) <= 3)) v#97 else if ((length(rtrim(v#97, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#97) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#97, None), 3,  ) AS v#106, i#98]
[info]               +- Relation[v#97,i#98] parquet

```
In this PR, we resolve the full set of attributes (the original `Aggregate` expressions and the candidates in `SortOrder`) together, and then use the newly re-resolved `Aggregate` expressions to determine which candidate in the `SortOrder` should be pushed. This avoids the mismatch that occurs without this change, because the expressions returned by `executeSameContext` change once `PaddingAndLengthCheckForCharVarchar` takes effect; with this change, the expressions can be matched correctly.

For those that remain unmatched, we need to look recursively into the children to find char/varchars rather than checking only the expression itself.

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

add new tests

Closes #31129 from yaooqinn/SPARK-34003-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-12 08:20:39 +00:00
Gengliang Wang 02a17e92f1 [SPARK-28646][SQL][FOLLOWUP] Add legacy config for allowing parameterless count
### What changes were proposed in this pull request?

Add a legacy configuration `spark.sql.legacy.allowParameterlessCount` in case users need the parameterless count.
This is a follow-up for https://github.com/apache/spark/pull/30541.
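
A hedged usage sketch, assuming a `SparkSession` named `spark`:

```scala
// Re-enable the legacy parameterless count() via the new flag.
spark.conf.set("spark.sql.legacy.allowParameterlessCount", "true")
spark.sql("SELECT count() FROM range(10)").show()  // allowed again under the legacy flag
```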

### Why are the changes needed?

Some users may depend on the legacy behavior, so we need a legacy flag for it.

### Does this PR introduce _any_ user-facing change?

Yes, adding a legacy flag `spark.sql.legacy.allowParameterlessCount`.

### How was this patch tested?

Unit tests

Closes #31143 from gengliangwang/countLegacy.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-12 16:31:22 +09:00
Max Gekk f7cbeec487 [SPARK-34074][SQL] Update stats only when table size changes
### What changes were proposed in this pull request?
Do not alter the table stats if they are the same as in the catalog (as of the most recent retrieval).
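
A hedged sketch of the idea (identifiers are illustrative, not Spark's actual code):

```scala
// Skip the expensive external-catalog call when the newly computed size matches
// the stats we already retrieved from the catalog.
def updateStatsIfChanged(oldSizeInBytes: Option[BigInt], newSizeInBytes: BigInt)
                        (alterStats: BigInt => Unit): Unit = {
  if (!oldSizeInBytes.contains(newSizeInBytes)) {
    alterStats(newSizeInBytes)  // only call the Hive external catalog when needed
  }
}
```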

### Why are the changes needed?
The changes reduce the number of calls to Hive external catalog.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31135 from MaxGekk/optimize-updateTableStats.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-12 03:28:28 +00:00
Dongjoon Hyun 3556929c43 [SPARK-33970][SQL][TEST][FOLLOWUP] Use String comparison
### What changes were proposed in this pull request?

This is a follow-up to replace `version.toDouble > 2` with `version >= "2.0"`

### Why are the changes needed?

`toDouble` makes assumptions about the version format and can throw `java.lang.NumberFormatException`.
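
A hedged illustration of the failure mode, with an illustrative version string:

```scala
import scala.util.Try

Try("2.3.0".toDouble)  // Failure(java.lang.NumberFormatException)
"2.3.0" >= "2.0"       // true, using the string comparison adopted by this follow-up
```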

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #31134 from dongjoon-hyun/SPARK-33970-FOLLOWUP.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-11 13:40:03 -08:00
Liang-Chi Hsieh 0bcbafb4b8 [SPARK-34002][SQL] Fix the usage of encoder in ScalaUDF
### What changes were proposed in this pull request?

This patch fixes a few issues when using encoders to serialize input/output in `ScalaUDF`.

### Why are the changes needed?

This fixes a bug when using encoders in Scala UDF. First, the output data type should be corrected to the corresponding data type of the object serializer. Second, `catalystConverter` should not serialize `Option[_]` as the ordinary row because in `ScalaUDF` case it is serialized to a column, not the top-level row. Otherwise, there will be a redundant `value` struct wrapping the serialized `Option[_]` object.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a bug of `ScalaUDF`.

### How was this patch tested?

Unit test.

Closes #31103 from viirya/SPARK-34002.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-11 11:31:35 -08:00
yi.wu 4afca0f706 [SPARK-31952][SQL] Fix incorrect memory spill metric when doing Aggregate
### What changes were proposed in this pull request?

This PR takes over https://github.com/apache/spark/pull/28780.

1. Counted the spilled memory size when creating the `UnsafeExternalSorter` with the existing `InMemorySorter`

2. Accumulate the `totalSpillBytes` when merging two `UnsafeExternalSorter`

### Why are the changes needed?

As mentioned in https://github.com/apache/spark/pull/28780:

> It happens when hash aggregate downgrades to sort-based aggregate.
`UnsafeExternalSorter.createWithExistingInMemorySorter` calls spill on an `InMemorySorter` immediately, but the memory pointed to by the `InMemorySorter` is acquired by the outside `BytesToBytesMap`, not the allocatedPages in `UnsafeExternalSorter`. So the memory spill bytes metric is always 0, while the disk spill bytes metric is correct.

Besides, this PR also fixes the `UnsafeExternalSorter.merge` by accumulating the `totalSpillBytes` of two sorters. Thus, we can report the correct spilled size in `HashAggregateExec.finishAggregate`.

Issues can be reproduced by the following step by checking the SQL metrics in UI:

```
bin/spark-shell --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.default.parallelism=1"
scala> sql("select id, count(1) from range(10000000) group by id").write.csv("/tmp/result.json")
```

Before:

<img width="200" alt="WeChatfe5146180d91015e03b9a27852e9a443" src="https://user-images.githubusercontent.com/16397174/103625414-e6fc6280-4f75-11eb-8b93-c55095bdb5b8.png">

After:

<img width="200" alt="WeChat42ab0e73c5fbc3b14c12ab85d232071d" src="https://user-images.githubusercontent.com/16397174/103625420-e8c62600-4f75-11eb-8e1f-6f5e8ab561b9.png">

### Does this PR introduce _any_ user-facing change?

Yes, users can see the correct spill metrics after this PR.

### How was this patch tested?

Tested manually and added UTs.

Closes #31035 from Ngone51/SPARK-31952.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 07:15:28 +00:00
Max Gekk d97e99157e [SPARK-34060][SQL] Fix Hive table caching while updating stats by ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
Fix canonicalisation of `HiveTableRelation` by normalisation of `CatalogTable`, and exclude table stats and temporary fields from the canonicalized plan.

### Why are the changes needed?
This fixes the issue demonstrated by the example below:
```scala
scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true)
scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)")
scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0")
scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1")
scala> sql("CACHE TABLE tbl")
scala> sql("SELECT * FROM tbl").show(false)
+---+----+
|id |part|
+---+----+
|0  |0   |
|1  |1   |
+---+----+

scala> spark.catalog.isCached("tbl")
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = false
```
`ALTER TABLE .. DROP PARTITION` must keep the table in the cache.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats:
```scala
scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)")
scala> spark.catalog.isCached("tbl")
res19: Boolean = true
```

### How was this patch tested?
By running new UT in `AlterTableDropPartitionSuite`.

Closes #31112 from MaxGekk/fix-caching-hive-table-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 07:03:44 +00:00
Max Gekk 664ef184c1 [SPARK-34055][SQL][TESTS][FOLLOWUP] Check partition adding to cached Hive table
### What changes were proposed in this pull request?
Replace `USING parquet` with `$defaultUsing`, which is `USING parquet` for the v1 In-Memory catalog and `USING hive` for the v1 Hive external catalog.

### Why are the changes needed?
PR https://github.com/apache/spark/pull/31101 added a unit test, but it checks only the v1 In-Memory catalog. This PR runs the test against the Hive external catalog as well to improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31117 from MaxGekk/add-partition-refresh-cache-2-followup-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 07:02:49 +00:00
Yuming Wang f77eeb0451 [SPARK-33970][SQL][TEST] Add test default partition in metastoredirectsql
### What changes were proposed in this pull request?

This PR adds a test for the default partition when Hive metastore direct SQL is used.

### Why are the changes needed?

Improve test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #31109 from wangyum/SPARK-33970.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-11 14:19:53 +09:00
Terry Kim 8391a4a687 [SPARK-34057][SQL] UnresolvedTableOrView should retain SQL text position for DDL commands
### What changes were proposed in this pull request?

Currently, there are many DDL commands where the position of the unresolved identifier is incorrect:
```
scala> sql("DROP TABLE unknown")
org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 0;
```
, whereas the `pos` should be `11`.

This PR proposes to fix this issue for commands using `UnresolvedTableOrView`:
```
DROP TABLE unknown
DESCRIBE TABLE unknown
ANALYZE TABLE unknown COMPUTE STATISTICS
ANALYZE TABLE unknown COMPUTE STATISTICS FOR COLUMNS col
ANALYZE TABLE unknown COMPUTE STATISTICS FOR ALL COLUMNS
SHOW CREATE TABLE unknown
REFRESH TABLE unknown
SHOW COLUMNS FROM unknown
SHOW COLUMNS FROM unknown IN db
ALTER TABLE unknown RENAME TO t
ALTER VIEW unknown RENAME TO v
```

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, now the above example will print the following:
```
org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 11;
```

### How was this patch tested?

Add a new test.

Closes #31106 from imback82/unresolved_table_or_view_message.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-11 04:28:39 +00:00
HyukjinKwon 830249284d [SPARK-34059][SQL][CORE] Use for/foreach rather than map to make sure execute it eagerly
### What changes were proposed in this pull request?

This PR is basically a followup of https://github.com/apache/spark/pull/14332.
Calling `map` alone might leave it unexecuted due to lazy evaluation, e.g.:

```
scala> val foo = Seq(1,2,3)
foo: Seq[Int] = List(1, 2, 3)

scala> foo.map(println)
1
2
3
res0: Seq[Unit] = List((), (), ())

scala> foo.view.map(println)
res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...)

scala> foo.view.foreach(println)
1
2
3
```

We should use `foreach` instead, to make sure it is executed where the output is unused or `Unit`.

### Why are the changes needed?

To prevent potential issues caused by `map` not being executed.

### Does this PR introduce _any_ user-facing change?

No, the current code does not appear to cause any problems for now.

### How was this patch tested?

I found these items by running IntelliJ inspections, double-checked them one by one, and fixed them. Ideally these should be all the instances across the codebase.

Closes #31110 from HyukjinKwon/SPARK-34059.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-01-10 15:22:24 -08:00
Max Gekk 9a8d275226 [SPARK-34055][SQL][TESTS][FOLLOWUP] Increase the expected number of calls to Hive external catalog in partition adding
### What changes were proposed in this pull request?
Increase the expected number of calls to the Hive external catalog in the test for `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?
There is a logical conflict between https://github.com/apache/spark/pull/31101 and https://github.com/apache/spark/pull/31092. The first one fixes a caching issue and increases the number of calls to Hive external catalog.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
```

Closes #31111 from MaxGekk/add-partition-refresh-cache-2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-10 18:29:02 +09:00
ulysses-you 48b9611ba3 [SPARK-32668][SQL] HiveGenericUDTF initialize UDTF should use StructObjectInspector method
### What changes were proposed in this pull request?

Use `initialize(StructObjectInspector argOIs)` instead of `initialize(ObjectInspector[] args)` in `HiveGenericUDTF`.

### Why are the changes needed?

In our case, we implement a Hive `GenericUDTF` and override `initialize(StructObjectInspector argOIs)`. It executes fine with Hive, but fails with Spark SQL. Here is the Spark SQL error message:
```
No handler for UDF/UDAF/UDTF 'com.xxxx.xxxUDTF': java.lang.IllegalStateException:
Should not be called directly Please make sure your function overrides
`public StructObjectInspector initialize(ObjectInspector[] args)`.
```

The reason is that Spark's `HiveGenericUDTF` calls `initialize(ObjectInspector[] argOIs)` to initialize a UDTF, but that is a deprecated method:
```
    public StructObjectInspector initialize(StructObjectInspector argOIs) throws UDFArgumentException {
        List<? extends StructField> inputFields = argOIs.getAllStructFieldRefs();
        ObjectInspector[] udtfInputOIs = new ObjectInspector[inputFields.size()];

        for(int i = 0; i < inputFields.size(); ++i) {
            udtfInputOIs[i] = ((StructField)inputFields.get(i)).getFieldObjectInspector();
        }

        return this.initialize(udtfInputOIs);
    }

    @Deprecated
    public StructObjectInspector initialize(ObjectInspector[] argOIs) throws UDFArgumentException {
        throw new IllegalStateException("Should not be called directly");
    }
```

We should use `initialize(StructObjectInspector argOIs)` instead, so that we are compatible with both methods, the same as Hive does.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the UDTF initialize method.

### How was this patch tested?

Manual test, and passed `HiveUDFDynamicLoadSuite`.

Closes #29490 from ulysses-you/SPARK-32668.

Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-01-10 13:19:04 +08:00
Max Gekk e0e06c18fd [SPARK-34055][SQL] Refresh cache in ALTER TABLE .. ADD PARTITION
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`.

### Why are the changes needed?
This fixes the issue demonstrated by the example:
```sql
spark-sql> create table tbl (col int, part int) using parquet partitioned by (part);
spark-sql> insert into tbl partition (part=0) select 0;
spark-sql> cache table tbl;
spark-sql> select * from tbl;
0	0
spark-sql> show table extended like 'tbl' partition(part=0);
default	tbl	false	Partition Values: [part=0]
Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0
...
```
Create new partition by copying the existing one:
```
$ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1
```
```sql
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
```

The last query should also return `0	1`, since that row's partition was added by `ALTER TABLE .. ADD PARTITION`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1';
spark-sql> select * from tbl;
0	0
0	1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
```

Closes #31101 from MaxGekk/add-partition-refresh-cache-2.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-10 14:06:17 +09:00
HyukjinKwon 105ba6e5f0 Revert "[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE"
This reverts commit d36cdd5541.
2021-01-10 13:52:48 +09:00
ulysses-you 48cd11c483 [SPARK-34030][SQL] Fold RepartitionByExpression num partition should at Optimizer
### What changes were proposed in this pull request?

Move the code that folds the `RepartitionByExpression` partition number into a new rule in the `Optimizer`.

### Why are the changes needed?

We hit a problem when backporting SPARK-33806, because `UnresolvedFunction.foldable` throws an exception. It is fine on the master branch, but it is better to do this in the Optimizer. Reasons:

1. It is not always safe to call `Expression.foldable` before analysis.
2. Folding the partition number to 1 is more of an optimizer behavior.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #31077 from ulysses-you/SPARK-34030.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-10 13:00:40 +09:00
Max Gekk 0af387480c [SPARK-34048][SQL][TESTS] Check the amount of calls to Hive external catalog
### What changes were proposed in this pull request?
Add new tests to the unified test suites to check the total number of calls made via the Hive client.

### Why are the changes needed?
1. To improve test coverage
2. To make foundation for future optimizations

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites like:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31092 from MaxGekk/access-to-catalog-refreshTable.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-09 15:33:08 -08:00
Anton Okolnychyi 6b34745cb9 [SPARK-34049][SS] DataSource V2: Use Write abstraction in StreamExecution
### What changes were proposed in this pull request?

This PR makes `StreamExecution` use the `Write` abstraction introduced in SPARK-33779.

Note: we will need separate plans for streaming writes in order to support the required distribution and ordering in SS. This change only migrates to the `Write` abstraction.

### Why are the changes needed?

These changes prevent exceptions from data sources that implement only the `build` method in `WriteBuilder`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31093 from aokolnychyi/spark-34049.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-08 20:37:35 -08:00
Kousuke Saruta 0781ed4f5b [MINOR][SQL][TESTS] Fix the incorrect unicode escape test in ParserUtilsSuite
### What changes were proposed in this pull request?

This PR fixes an incorrect unicode literal test in `ParserUtilsSuite`.
In that suite, string literals in queries have unicode escape characters like `\u7328` but the backslash should be escaped because
the queriy strings are given as Java strings.

### Why are the changes needed?

Correct the test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran `ParserUtilsSuite` and it passed.

Closes #31088 from sarutak/fix-incorrect-unicode-test.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-08 09:44:33 -06:00
Max Gekk 157b72ac9f [SPARK-33591][SQL] Recognize null in partition spec values
### What changes were proposed in this pull request?
1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values.
2. For the V1 catalog: replace `null` with `__HIVE_DEFAULT_PARTITION__`.
3. For V2 catalogs: pass `null` as is, and let catalog implementations decide how to handle `null`s as partition values in the spec.

### Why are the changes needed?
Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example:
```sql
spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1);
spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0;
spark-sql> SELECT isnull(p1) FROM tbl5;
false
```
Even though we inserted a row into the partition with the `null` value, **the resulting table doesn't contain `null`**.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the example above works as expected:
```sql
spark-sql> SELECT isnull(p1) FROM tbl5;
true
```

### How was this patch tested?
1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`.
2. Compiling by Scala 2.13:
```
$  ./dev/change-scala-version.sh 2.13
$ ./build/sbt -Pscala-2.13 compile
```

Closes #30538 from MaxGekk/partition-spec-value-null.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-08 14:14:27 +00:00
Kent Yao 0f8e5dd445 [SPARK-34003][SQL] Fix Rule conflicts between PaddingAndLengthCheckForCharVarchar and ResolveAggregateFunctions
### What changes were proposed in this pull request?

`ResolveAggregateFunctions` is a hacky rule: it calls `executeSameContext` to generate a `resolved agg` in order to determine which unresolved sort attribute should be pushed into the agg. However, after we add the `PaddingAndLengthCheckForCharVarchar` rule, which rewrites the query output, the `resolved agg` can no longer match the original attributes.

This causes an unrelated sort attribute to be pushed in, which fails the query:

``` logtalk
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 'testcat.t1.`v`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
[info]   Project [v#14, sum(i)#11L]
[info]   +- Sort [aggOrder#12 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12]
[info]         +- SubqueryAlias testcat.t1
[info]            +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3,  ) AS v#14, i#7]
[info]               +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1
[info]
[info]   Project [v#14, sum(i)#11L]
[info]   +- Sort [aggOrder#12 ASC NULLS FIRST], true
[info]      +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12]
[info]         +- SubqueryAlias testcat.t1
[info]            +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string),  exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3,  ) AS v#14, i#7]
[info]               +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1
```

### Why are the changes needed?

bugfix
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #31027 from yaooqinn/SPARK-34003.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-08 09:05:22 +00:00
Gengliang Wang b95a847ce1 [SPARK-34046][SQL][TESTS] Use join hint for constructing joins in JoinSuite and WholeStageCodegenSuite
### What changes were proposed in this pull request?

There are some existing test cases that construct various joins by tuning SQL configurations such as AUTO_BROADCASTJOIN_THRESHOLD, PREFER_SORTMERGEJOIN, and SHUFFLE_PARTITIONS.

This can be tricky and is not straightforward; in future development we might have to tweak the configurations again.
This PR constructs specific joins by using join hints in test cases.
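
A hedged sketch of the approach, with illustrative data and names:

```scala
// Pin the join strategy with hints instead of tuning thresholds and partition configs.
val left  = spark.range(100).withColumnRenamed("id", "k")
val right = spark.range(100).withColumnRenamed("id", "k")
left.hint("SHUFFLE_MERGE").join(right, "k").explain()  // forces a sort-merge join
left.join(right.hint("BROADCAST"), "k").explain()      // forces a broadcast hash join
```
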
### Why are the changes needed?

Make the join test cases simpler and more robust.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #31087 from gengliangwang/joinhintInTest.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-08 07:52:39 +00:00
Chao Sun 0de7f2ff1e [SPARK-34039][SQL] ReplaceTable should invalidate cache
### What changes were proposed in this pull request?

This changes `ReplaceTableExec`/`AtomicReplaceTableExec`, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the `uncacheTable` method to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec.

### Why are the changes needed?

Similar to SPARK-33492 (#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results.

### Does this PR introduce _any_ user-facing change?

Yes. Now, when a data source v2 table is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.

### How was this patch tested?

Added a new unit test.

Closes #31081 from sunchao/SPARK-34039.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-07 21:13:22 -08:00
fwang12 7b06acc28b [SPARK-33100][SQL][FOLLOWUP] Find correct bound of bracketed comment in spark-sql
### What changes were proposed in this pull request?

This PR helps find the correct bound of a bracketed comment in spark-sql.

Here is the log for the SPARK-33100 UT in CliSuite before the fix:
```
2021-01-05 13:22:34.768 - stdout> spark-sql> /* SELECT 'test';*/ SELECT 'test';
2021-01-05 13:22:41.523 - stderr> Time taken: 6.716 seconds, Fetched 1 row(s)
2021-01-05 13:22:41.599 - stdout> test
2021-01-05 13:22:41.6 - stdout> spark-sql> ;;/* SELECT 'test';*/ SELECT 'test';
2021-01-05 13:22:41.709 - stdout> test
2021-01-05 13:22:41.709 - stdout> spark-sql> /* SELECT 'test';*/;; SELECT 'test';
2021-01-05 13:22:41.902 - stdout> spark-sql> SELECT 'test'; -- SELECT 'test';
2021-01-05 13:22:41.902 - stderr> Time taken: 0.129 seconds, Fetched 1 row(s)
2021-01-05 13:22:41.902 - stderr> Error in query:
2021-01-05 13:22:41.902 - stderr> mismatched input '<EOF>' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 19)
2021-01-05 13:22:42.006 - stderr>
2021-01-05 13:22:42.006 - stderr> == SQL ==
2021-01-05 13:22:42.006 - stderr> /* SELECT 'test';*/
2021-01-05 13:22:42.006 - stderr> -------------------^^^
2021-01-05 13:22:42.006 - stderr>
2021-01-05 13:22:42.006 - stderr> Time taken: 0.226 seconds, Fetched 1 row(s)
2021-01-05 13:22:42.006 - stdout> test
```
The root cause is that the `insideBracketedComment` flag is not accurate.

For `/* comment */`, the last character `/` is not considered inside the bracketed comment, and it would be treated as the beginning of a statement.

This PR fixes that issue.

### Why are the changes needed?
To fix the issue described above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #31054 from turboFei/SPARK-33100-followup.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-07 20:49:37 +09:00
Yu Zhong d36cdd5541 [SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE
### What changes were proposed in this pull request?
In `AdaptiveSparkPlanExec.getFinalPhysicalPlan`, when new stages are generated, sort the new stages by class type to make sure `BroadcastQueryStage` precedes the others.
This makes sure the broadcast jobs are submitted before the map jobs, to avoid waiting on job scheduling and causing broadcast timeouts.

### Why are the changes needed?
With AQE enabled, in `getFinalPhysicalPlan`, Spark traverses the physical plan bottom-up, creates query stages for the materialized parts via `createQueryStages`, and materializes those newly created query stages to submit map stages or broadcasts. When a `ShuffleQueryStage` is materialized before a `BroadcastQueryStage`, the map job and the broadcast job are submitted at almost the same time, but the map job will hold all the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) before `spark.sql.broadcastTimeout`, so the whole job fails (introduced in SPARK-31475).
The workaround of increasing `spark.sql.broadcastTimeout` is neither sensible nor graceful, because the data to broadcast is very small.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
1. Add UT
2. Test the code using dev environment in https://issues.apache.org/jira/browse/SPARK-33933

Closes #30998 from zhongyu09/aqe-broadcast.

Authored-by: Yu Zhong <yzhong@freewheel.tv>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-07 08:59:26 +00:00
Dongjoon Hyun 194edc86a2 Revert "[SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider"
This reverts commit 8bb70bf0d6.
2021-01-06 23:41:27 -08:00
Yuming Wang aa509c1eee [SPARK-34031][SQL] Union operator missing rowCount when CBO enabled
### What changes were proposed in this pull request?

This PR adds the row count to the `Union` operator when CBO is enabled.
```scala
spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B)
:- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
```

After this pr:
```
== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20)
:- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
```

### Why are the changes needed?

Improve query performance; [`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31068 from wangyum/SPARK-34031.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 14:41:10 +09:00
Yuming Wang 3aa4e113c5 [SPARK-33861][SQL][FOLLOWUP] Simplify conditional in predicate should consider deterministic
### What changes were proposed in this pull request?

This PR addresses https://github.com/apache/spark/pull/30865#pullrequestreview-562344089: simplifying conditionals in predicates should take determinism into account.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31067 from wangyum/SPARK-33861-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 14:28:30 +09:00
yangjie01 26b603992c [SPARK-34028][SQL] Cleanup "unreachable code" compilation warning
### What changes were proposed in this pull request?
There is one compilation warning as follows:

```
[WARNING] [Warn] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1555: [other-match-analysis  org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction.catalogFunction] unreachable code
```

This compilation warning occurs because `NoSuchPermanentFunctionException` is a subclass of `AnalysisException`: if a `NoSuchPermanentFunctionException` is thrown, it will be caught by `case _: AnalysisException => failFunctionLookup(name)`, so `case _: NoSuchPermanentFunctionException => failFunctionLookup(name)` is unreachable code.

This PR removes `case _: NoSuchPermanentFunctionException => failFunctionLookup(name)` directly, because both branches handle the exception in the same way: `failFunctionLookup(name)`. A minimal illustration is sketched below.
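
For illustration, a tiny standalone sketch (the class names are stand-ins, not the real Spark hierarchy) that triggers the same warning:

```scala
// Minimal illustration of the warning: the second case can never match because
// the "NoSuchPermanentFunctionException" stand-in extends the "AnalysisException" stand-in.
class AnalysisException(msg: String) extends Exception(msg)
class NoSuchPermanentFunctionException(msg: String) extends AnalysisException(msg)

def handle(e: Exception): String = e match {
  case _: AnalysisException                => "failFunctionLookup"
  case _: NoSuchPermanentFunctionException => "failFunctionLookup" // unreachable code
  case _                                   => "rethrow"
}
```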

### Why are the changes needed?
Cleanup "unreachable code" compilation warnings.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31064 from LuciferYang/SPARK-34028.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 14:26:04 +09:00
ulysses-you f9daf035f4 [SPARK-33806][SQL][FOLLOWUP] Fold RepartitionExpression num partition should check if partition expression is empty
### What changes were proposed in this pull request?

Add a check for whether the partition expressions are empty.

### Why are the changes needed?

We should keep `spark.range(1).hint("REPARTITION_BY_RANGE")` using the default shuffle partition number instead of 1.
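
A rough way to observe this (illustrative only; assumes a SparkSession `spark` with the default shuffle partition number):

```scala
import org.apache.spark.sql.functions.col

// With no partition expressions, the hint should keep the default shuffle
// partition number rather than being folded down to 1...
spark.range(1).hint("REPARTITION_BY_RANGE").explain()
// ...while with an explicit expression, folding the single-row child is still fine.
spark.range(1).hint("REPARTITION_BY_RANGE", col("id")).explain()
```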

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Add test.

Closes #31074 from ulysses-you/SPARK-33806-FOLLOWUP.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-06 17:22:14 -08:00
Dongjoon Hyun 8bb70bf0d6 [SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider
### What changes were proposed in this pull request?

This PR aims to add a basis for a columnar encryption test framework by adding `OrcEncryptionSuite` and `FakeKeyProvider`.

Please note that we will improve more in both Apache Spark and Apache ORC in Apache Spark 3.2.0 timeframe.

### Why are the changes needed?

Apache ORC 1.6 supports columnar encryption.

### Does this PR introduce _any_ user-facing change?

No. This is for a test case.

### How was this patch tested?

Pass the newly added test suite.

Closes #31065 from dongjoon-hyun/SPARK-34029.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-06 12:59:47 -08:00
Kazuaki Ishizaki a0269bb419 [SPARK-34022][DOCS][FOLLOW-UP] Fix typo in SQL built-in function docs
### What changes were proposed in this pull request?

This PR is a follow-up of #31061. It fixes a typo in a document: `Finctions` -> `Functions`

### Why are the changes needed?

Make the change better documented.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #31069 from kiszk/SPARK-34022-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-06 09:28:22 -08:00
HyukjinKwon 0d86a02ffb [SPARK-34022][DOCS] Support latest mkdocs in SQL built-in function docs
### What changes were proposed in this pull request?

This PR adds support for the latest mkdocs and makes the sidebar show properly. It works in lower versions too.

Before:

![Screen Shot 2021-01-06 at 5 11 56 PM](https://user-images.githubusercontent.com/6477701/103745131-4e7fe400-5042-11eb-9c09-84f9f95e9fb9.png)

After:

![Screen Shot 2021-01-06 at 5 10 53 PM](https://user-images.githubusercontent.com/6477701/103745139-5049a780-5042-11eb-8ded-30b6f7ef48aa.png)

### Why are the changes needed?

This is a regression in the documentation.

### Does this PR introduce _any_ user-facing change?

Technically no, it's not released yet. It fixes the sidebar list so that it appears properly.

### How was this patch tested?

Manually built the docs via `./sql/create-docs.sh` and `open ./sql/site/index.html`

Closes #31061 from HyukjinKwon/SPARK-34022.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 20:31:27 +09:00
gengjiaan 26d8df300a [SPARK-33938][SQL] Optimize Like Any/All by LikeSimplification
### What changes were proposed in this pull request?
We should optimize Like Any/All by LikeSimplification to improve performance.
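
A hedged sketch of the intended effect (the exact optimized-plan output may differ):

```scala
// Illustrative only: simple patterns should be rewritten by LikeSimplification into
// StartsWith/EndsWith/Contains combined with OR (for ANY) or AND (for ALL),
// instead of being evaluated through the regex-based LikeAny/LikeAll machinery.
spark.sql("""
  SELECT * FROM VALUES ('abcdef'), ('xyz') AS t(s)
  WHERE s LIKE ANY ('abc%', '%def')
""").explain()
```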

### Why are the changes needed?
Optimize Like Any/All

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #30975 from beliefer/SPARK-33938.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-06 08:25:34 +00:00
yangjie01 45a4ff8e54 [SPARK-33948][SQL] Fix CodeGen error of MapObjects.doGenCode method in Scala 2.13
### What changes were proposed in this pull request?
`MapObjects.doGenCode` method will generate wrong code when `inputDataType` is `ArrayBuffer`.

For example, for `encode/decode for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (codegen path)` in `ExpressionEncoderSuite`, the erroneous part of the generated code is as follows:

```
/* 126 */   private scala.collection.mutable.ArrayBuffer MapObjects_0(InternalRow i) {
/* 127 */     boolean isNull_4 = i.isNullAt(1);
/* 128 */     ArrayData value_4 = isNull_4 ?
/* 129 */     null : (i.getArray(1));
/* 130 */     scala.collection.mutable.ArrayBuffer value_3 = null;
/* 131 */
/* 132 */     if (!isNull_4) {
/* 133 */
/* 134 */       int dataLength_0 = value_4.numElements();
/* 135 */
/* 136 */       scala.Tuple2[] convertedArray_0 = null;
/* 137 */       convertedArray_0 = new scala.Tuple2[dataLength_0];
/* 138 */
/* 139 */
/* 140 */       int loopIndex_0 = 0;
/* 141 */
/* 142 */       while (loopIndex_0 < dataLength_0) {
/* 143 */         value_MapObject_lambda_variable_1 = (InternalRow) (value_4.getStruct(loopIndex_0, 2));
/* 144 */         isNull_MapObject_lambda_variable_1 = value_4.isNullAt(loopIndex_0);
/* 145 */
/* 146 */         boolean isNull_5 = false;
/* 147 */         scala.Tuple2 value_5 = null;
/* 148 */         if (!false && isNull_MapObject_lambda_variable_1) {
/* 149 */
/* 150 */           isNull_5 = true;
/* 151 */           value_5 = ((scala.Tuple2)null);
/* 152 */         } else {
/* 153 */           scala.Tuple2 value_13 = NewInstance_0(i);
/* 154 */           isNull_5 = false;
/* 155 */           value_5 = value_13;
/* 156 */         }
/* 157 */         if (isNull_5) {
/* 158 */           convertedArray_0[loopIndex_0] = null;
/* 159 */         } else {
/* 160 */           convertedArray_0[loopIndex_0] = value_5;
/* 161 */         }
/* 162 */
/* 163 */         loopIndex_0 += 1;
/* 164 */       }
/* 165 */
/* 166 */       value_3 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 167 */     }
/* 168 */     globalIsNull_0 = isNull_4;
/* 169 */     return value_3;
/* 170 */   }

```

Line 166 in the generated code tries to assign a `GenericArrayData` to `value_3` (an `ArrayBuffer`), because the `ArrayBuffer` type no longer matches the `s.c.i.Seq` branch of `MapObjects.doGenCode` in Scala 2.13.

So this PR changes the code to use `s.c.Seq` instead of the `Seq` alias, so that the `ArrayBuffer` type enters the same branch as in Scala 2.12.
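
The alias difference can be seen with a small standalone snippet (not Spark code):

```scala
import scala.collection.mutable.ArrayBuffer

val buf = ArrayBuffer(1, 2, 3)

buf match {
  // Scala 2.12: scala.Seq is scala.collection.Seq, so an ArrayBuffer matches here.
  // Scala 2.13: scala.Seq is scala.collection.immutable.Seq, so it falls through.
  case _: Seq[_] => println("matched the plain Seq alias")
  case _         => println("did not match")
}

buf match {
  // Matching against scala.collection.Seq explicitly works in both 2.12 and 2.13.
  case _: scala.collection.Seq[_] => println("matched s.c.Seq")
}
```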

After this PR, the generated code when `inputDataType` is `ArrayBuffer` is as follows:

```
/* 126 */   private scala.collection.mutable.ArrayBuffer MapObjects_0(InternalRow i) {
/* 127 */     boolean isNull_4 = i.isNullAt(1);
/* 128 */     ArrayData value_4 = isNull_4 ?
/* 129 */     null : (i.getArray(1));
/* 130 */     scala.collection.mutable.ArrayBuffer value_3 = null;
/* 131 */
/* 132 */     if (!isNull_4) {
/* 133 */
/* 134 */       int dataLength_0 = value_4.numElements();
/* 135 */
/* 136 */       scala.collection.mutable.Builder collectionBuilder_0 = scala.collection.mutable.ArrayBuffer$.MODULE$.newBuilder();
/* 137 */       collectionBuilder_0.sizeHint(dataLength_0);
/* 138 */
/* 139 */
/* 140 */       int loopIndex_0 = 0;
/* 141 */
/* 142 */       while (loopIndex_0 < dataLength_0) {
/* 143 */         value_MapObject_lambda_variable_1 = (InternalRow) (value_4.getStruct(loopIndex_0, 2));
/* 144 */         isNull_MapObject_lambda_variable_1 = value_4.isNullAt(loopIndex_0);
/* 145 */
/* 146 */         boolean isNull_5 = false;
/* 147 */         scala.Tuple2 value_5 = null;
/* 148 */         if (!false && isNull_MapObject_lambda_variable_1) {
/* 149 */
/* 150 */           isNull_5 = true;
/* 151 */           value_5 = ((scala.Tuple2)null);
/* 152 */         } else {
/* 153 */           scala.Tuple2 value_13 = NewInstance_0(i);
/* 154 */           isNull_5 = false;
/* 155 */           value_5 = value_13;
/* 156 */         }
/* 157 */         if (isNull_5) {
/* 158 */           collectionBuilder_0.$plus$eq(null);
/* 159 */         } else {
/* 160 */           collectionBuilder_0.$plus$eq(value_5);
/* 161 */         }
/* 162 */
/* 163 */         loopIndex_0 += 1;
/* 164 */       }
/* 165 */
/* 166 */       value_3 = (scala.collection.mutable.ArrayBuffer) collectionBuilder_0.result();
/* 167 */     }
/* 168 */     globalIsNull_0 = isNull_4;
/* 169 */     return value_3;
/* 170 */   }
```

### Why are the changes needed?
Bug fix in Scala 2.13

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Manual test `sql/catalyst` and `sql/core` in Scala 2.13 passed

```
mvn clean test -pl sql/catalyst -Pscala-2.13

Run completed in 11 minutes, 23 seconds.
Total number of tests run: 4711
Suites: completed 261, aborted 0
Tests: succeeded 4711, failed 0, canceled 0, ignored 5, pending 0
All tests passed.
```

- Manually cherry-picked this PR to branch 3.1, and testing `sql/catalyst` in Scala 2.13 passed

```
mvn clean test -pl sql/catalyst -Pscala-2.13

Run completed in 11 minutes, 18 seconds.
Total number of tests run: 4655
Suites: completed 256, aborted 0
Tests: succeeded 4655, failed 0, canceled 0, ignored 5, pending 0
```

Closes #31055 from LuciferYang/SPARK-33948.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 23:11:23 -08:00
angerszhu c0d0dbabdb [SPARK-33934][SQL][FOLLOW-UP] Use SubProcessor's exit code as assert condition to fix flaky test
### What changes were proposed in this pull request?
Follow the review comment and fix the flaky test https://github.com/apache/spark/pull/30973#issuecomment-754852130.
This flaky test is similar as https://github.com/apache/spark/pull/30896

Some tasks fail with a root cause, but the driver may return an error without the root cause. Change the UT to assert on the sub-process's exit status code, since the exit codes for different root causes are not the same.

### Why are the changes needed?
Fix flaky test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #31046 from AngersZhuuuu/SPARK-33934-FOLLOW-UP.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 22:33:15 -08:00
gengjiaan 2ab77d634f [SPARK-34004][SQL] Change FrameLessOffsetWindowFunction as sealed abstract class
### What changes were proposed in this pull request?
Change `FrameLessOffsetWindowFunction` to a sealed abstract class so as to simplify pattern matching.

### Why are the changes needed?
Simplify pattern match

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Jenkins test

Closes #31026 from beliefer/SPARK-30789-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 20:45:19 -08:00
Max Gekk b77d11dfd9 [SPARK-34011][SQL] Refresh cache in ALTER TABLE .. RENAME TO PARTITION
### What changes were proposed in this pull request?
1. Invoke `refreshTable()` from `AlterTableRenamePartitionCommand.run()` after renaming partitions. In particular, this re-creates the cache associated with the modified table.
2. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. RENAME TO PARTITION` command.

### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
spark-sql> CACHE TABLE tbl1;
spark-sql> SELECT * FROM tbl1;
0	0
1	1
spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
spark-sql> SELECT * FROM tbl1;
0	0
1	1
```
The last query must not return `0	0` for the first row, since that partition was renamed to `part0=2` by the previous command, so the correct result is `0	2`.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
spark-sql> SELECT * FROM tbl1;
0	2
1	1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31044 from MaxGekk/rename-partition-refresh-cache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 11:19:44 +09:00
angerszhu e279ed3044 [SPARK-34012][SQL] Keep behavior consistent when conf spark.sql.legacy.parser.havingWithoutGroupByAsWhere is true with migration guide
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/22696 we made HAVING without GROUP BY mean a global aggregate.
But since HAVING used to be treated as a Filter, that caused a lot of analysis errors; after https://github.com/apache/spark/pull/28294 we use `UnresolvedHaving` instead of `Filter` to solve that problem, but it broke the original legacy behavior of treating `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`.
This PR fixes this issue and adds a UT; see the sketch below.
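
A hedged illustration of the two behaviors (assumes a SparkSession `spark`):

```scala
// Default: HAVING without GROUP BY is treated as a global aggregate -> a single row.
spark.sql("SELECT 1 FROM range(10) HAVING true").show()

// With the legacy conf, it is treated as WHERE again -> 10 rows.
spark.sql("SET spark.sql.legacy.parser.havingWithoutGroupByAsWhere=true")
spark.sql("SELECT 1 FROM range(10) HAVING true").show()
```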

### Why are the changes needed?
Keep consistent behavior of migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-06 08:48:24 +09:00
gengjiaan cc1d9d25fb [SPARK-33542][SQL] Group exception messages in catalyst/catalog
### What changes were proposed in this pull request?
This PR groups the exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #30870 from beliefer/SPARK-33542.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 16:15:33 +00:00
HyukjinKwon 8d09f96495 [SPARK-34010][SQL][DOCS] Use python3 instead of python in SQL documentation build
### What changes were proposed in this pull request?

This PR proposes to use python3 instead of python in SQL documentation build.
After SPARK-29672, we use `python3` everywhere else in Spark dev. We should fix it in `sql/create-docs.sh` too.
This blocks release because the release container does not have `python` but only `python3`.

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I manually ran the script

Closes #31041 from HyukjinKwon/SPARK-34010.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 19:48:10 +09:00
Max Gekk 122f8f0fdb [SPARK-33919][SQL][TESTS] Unify v1 and v2 SHOW NAMESPACES tests
### What changes were proposed in this pull request?
1. Port DS V2 tests from `DataSourceV2SQLSuite` to the base test suite `ShowNamespacesSuiteBase` to run those tests for v1 catalogs.
2. Port DS v1 tests from `DDLSuite` to `ShowNamespacesSuiteBase` to run the tests for v2 catalogs too.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowNamespacesSuite"
```

Closes #30937 from MaxGekk/unify-show-namespaces-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 07:30:59 +00:00
tanel.kiis@gmail.com f252a9334e [SPARK-33935][SQL] Fix CBO cost function
### What changes were proposed in this pull request?

Changed the cost function in CBO to match documentation.

### Why are the changes needed?

The parameter `spark.sql.cbo.joinReorder.card.weight` is documented as:
```
The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight).
```
The implementation in `JoinReorderDP.betterThan` does not match this documentation:
```
def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
      if (other.planCost.card == 0 || other.planCost.size == 0) {
        false
      } else {
        val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
        val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
        relativeRows * conf.joinReorderCardWeight +
          relativeSize * (1 - conf.joinReorderCardWeight) < 1
      }
    }
```

This different implementation has an unfortunate consequence:
given two plans A and B, both A betterThan B and B betterThan A might give the same result. This happens when one plan has many rows with small sizes and the other has few rows with large sizes.

Example values that exhibit this phenomenon with the default weight value (0.7):
A.card = 500, B.card = 300
A.size = 30, B.size = 80
Both A betterThan B and B betterThan A would have a score above 1 and would return false.
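
A quick standalone check of those numbers against the current formula (plain `Double`s here instead of the `BigDecimal`s used in Spark):

```scala
// score < 1 means "better than the other plan" under the current implementation.
val w = 0.7
def score(card: Double, size: Double, otherCard: Double, otherSize: Double): Double =
  (card / otherCard) * w + (size / otherSize) * (1 - w)

println(score(500, 30, 300, 80))  // A vs B: ~1.28, so A betterThan B is false
println(score(300, 80, 500, 30))  // B vs A:  1.22, so B betterThan A is false
```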

This happens with several of the TPCDS queries.

The new implementation does not have this behavior.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New and existing UTs

Closes #30965 from tanelk/SPARK-33935_cbo_cost_function.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-05 16:00:24 +09:00
fwang12 a071826f72 [SPARK-33100][SQL] Ignore a semicolon inside a bracketed comment in spark-sql
### What changes were proposed in this pull request?
Currently spark-sql does not support parsing SQL statements that contain bracketed comments.
For the sql statements:
```
/* SELECT 'test'; */
SELECT 'test';
```
would be split into two statements:
The first one: `/* SELECT 'test'`
The second one: `*/ SELECT 'test'`

Then it would throw an exception because the first one is illegal.
In this PR, we ignore the content in bracketed comments while splitting the SQL statements.
Besides, we ignore comments without any content.

### Why are the changes needed?
spark-sql might split statements inside bracketed comments, which is not correct.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #29982 from turboFei/SPARK-33110.

Lead-authored-by: fwang12 <fwang12@ebay.com>
Co-authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-05 15:55:30 +09:00
Kent Yao f0ffe0cd65 [SPARK-33992][SQL] override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer
### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/29643, we moved the plan rewriting methods to QueryPlan. We need to override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer,
because it and resolveOperatorsUpWithNewOutput are called in the analyzer.
For example,

PaddingAndLengthCheckForCharVarchar could fail a query in resolveOperatorsUpWithNewOutput with:
```logtalk
[info] - char/varchar resolution in sub query  *** FAILED *** (367 milliseconds)
[info]   java.lang.RuntimeException: This method should not be called in the analyzer
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule(AnalysisHelper.scala:150)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule$(AnalysisHelper.scala:146)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.assertNotAnalysisRule(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:161)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:160)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$updateOuterReferencesInSubquery(QueryPlan.scala:267)
```
### Why are the changes needed?

trivial bugfix
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #31013 from yaooqinn/SPARK-33992.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:34:11 +00:00
Terry Kim 15a863fd54 [SPARK-34001][SQL][TESTS] Remove unused runShowTablesSql() in DataSourceV2SQLSuite.scala
### What changes were proposed in this pull request?

After #30287, `runShowTablesSql()` in `DataSourceV2SQLSuite.scala` is no longer used. This PR removes the unused method.

### Why are the changes needed?

To remove unused method.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test.

Closes #31022 from imback82/33382-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 21:32:49 -08:00
Terry Kim 6b00fdc756 [SPARK-33998][SQL] Provide an API to create an InternalRow in V2CommandExec
### What changes were proposed in this pull request?

There are many v2 commands such as `SHOW TABLES`, `DESCRIBE TABLE`, etc. that require creating `InternalRow`s. Currently, the code to create `InternalRow`s is duplicated across many commands, and it can be moved into `V2CommandExec` to remove the duplication.

### Why are the changes needed?

To clean up duplicate code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test since this is just refactoring.

Closes #31020 from imback82/refactor_v2_command.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:32:36 +00:00
Chongguang LIU 976e97a80d [SPARK-33794][SQL] NextDay expression throw runtime IllegalArgumentException when receiving invalid input under ANSI mode
### What changes were proposed in this pull request?

Instead of returning NULL, the next_day function throws a runtime IllegalArgumentException when ANSI mode is enabled and it receives invalid input for the dayOfWeek parameter.

### Why are the changes needed?

For ansiMode.

### Does this PR introduce _any_ user-facing change?

Yes.
When spark.sql.ansi.enabled = true, the next_day function will throw IllegalArgumentException when receiving invalid input of the dayOfWeek parameter.
When spark.sql.ansi.enabled = false, same behaviour as before.
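
For example (illustrative only; `'xx'` is a made-up invalid day-of-week value, and a SparkSession `spark` is assumed):

```scala
spark.sql("SET spark.sql.ansi.enabled=false")
spark.sql("SELECT next_day('2021-01-05', 'xx')").show()  // NULL, same as before

spark.sql("SET spark.sql.ansi.enabled=true")
spark.sql("SELECT next_day('2021-01-05', 'xx')").show()  // throws IllegalArgumentException
```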

### How was this patch tested?

Ansi mode is tested with existing tests.
End-to-end tests have been added.

Closes #30807 from chongguang/SPARK-33794.

Authored-by: Chongguang LIU <chongguang.liu@laposte.fr>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:20:16 +00:00
tanel.kiis@gmail.com bb6d6b5602 [SPARK-33964][SQL] Combine distinct unions in more cases
### What changes were proposed in this pull request?

Added the `RemoveNoopOperators` rule to optimization batch `Union`.  Also made sure that the `RemoveNoopOperators` would be idempotent.

### Why are the changes needed?

In several TPCDS queries the `CombineUnions` rule does not manage to combine unions, because they have noop `Project`s between them.
The `Project`s will be removed by `RemoveNoopOperators`, but by then `ReplaceDistinctWithAggregate` has been applied and there are aggregates between the unions. Adding a copy of `RemoveNoopOperators` earlier in the optimization chain allows `CombineUnions` to work on more queries.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UTs and the output of `PlanStabilitySuite`

Closes #30996 from tanelk/SPARK-33964_combine_unions.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 11:01:31 +09:00
Max Gekk 84c1f43669 [SPARK-33987][SQL] Refresh cache in v2 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
1. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. DROP PARTITION` command.
2. Port the test for v1 catalogs to the base suite to run it for v2 table catalog.

### Why are the changes needed?
The changes fix incorrect query results from cached V2 table altered by `ALTER TABLE .. DROP PARTITION`, see the added test and SPARK-33987.

### Does this PR introduce _any_ user-facing change?
Yes, it could if users have v2 table catalogs.

### How was this patch tested?
By running unified tests for `ALTER TABLE .. DROP PARTITION`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31017 from MaxGekk/drop-partition-refresh-cache-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 15:00:48 -08:00
Kent Yao ac4651a7d1 [SPARK-33980][SS] Invalidate char/varchar in spark.readStream.schema
### What changes were proposed in this pull request?

Invalidate char/varchar in `spark.readStream.schema`, just like what we've done for `spark.read.schema` in da72b87374.

### Why are the changes needed?

Bugfix: char/varchar is only allowed in table schemas when `spark.sql.legacy.charVarcharAsString=false`.

### Does this PR introduce _any_ user-facing change?

Yes, char/varchar will fail to define Structured Streaming readers when `spark.sql.legacy.charVarcharAsString=false`.
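
For example, a sketch of the new behavior (the path and field names are made up; a SparkSession `spark` is assumed):

```scala
// With spark.sql.legacy.charVarcharAsString=false (the default), defining a streaming
// reader schema with char/varchar should now fail, just like spark.read.schema(...).
spark.readStream
  .schema("id INT, name CHAR(5)")
  .format("parquet")
  .load("/tmp/some/path")   // expected: an analysis error about char/varchar in the schema
```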

### How was this patch tested?

new tests

Closes #31003 from yaooqinn/SPARK-33980.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 12:59:45 -08:00
Takeshi Yamamuro 414d323d6c [SPARK-33988][SQL][TEST] Add an option to enable CBO in TPCDSQueryBenchmark
### What changes were proposed in this pull request?

This PR intends to add a new option `--cbo` to enable CBO in TPCDSQueryBenchmark. I think this option is useful so as to monitor performance changes with CBO enabled.

### Why are the changes needed?

To monitor performance changes with CBO enabled.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #31011 from maropu/AddOptionForCBOInTPCDSBenchmark.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:31:20 -08:00
Max Gekk fc3f22645e [SPARK-33990][SQL][TESTS] Remove partition data by v2 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests.

### Why are the changes needed?
This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` does not remove the data belonging to the dropped partition. As a consequence, a `SELECT` query returns the removed data.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running tests suites for v1 and v2 catalogs:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31014 from MaxGekk/fix-drop-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 10:26:39 -08:00
Terry Kim ddc0d5148a [SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables
### What changes were proposed in this pull request?

This PR proposes to implement `DESCRIBE COLUMN` for v2 tables.

Note that the `isExtended` option is not implemented in this PR.

### Why are the changes needed?

Parity with v1 tables.

### Does this PR introduce _any_ user-facing change?

Yes, now, `DESCRIBE COLUMN` works for v2 tables.
```scala
sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo")
sql("DESCRIBE testcat.tbl data").show
```
```
+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|      data|
|data_type|    string|
|  comment|     hello|
+---------+----------+
```

Before this PR, the command would fail with: `Describing columns is not supported for v2 tables.`

### How was this patch tested?

Added new test.

Closes #30881 from imback82/describe_col_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 16:14:33 +00:00
angerszhu 8583a4605f [SPARK-33844][SQL] InsertIntoHiveDir command should check col name too
### What changes were proposed in this pull request?

In hive-1.2.1, the Hive serde just splits `serdeConstants.LIST_COLUMNS` and `serdeConstants.LIST_COLUMN_TYPES` by comma.

When we run the following UT with Spark 2.4:
```
  test("insert overwrite directory with comma col name") {
    withTempDir { dir =>
      val path = dir.toURI.getPath

      val v1 =
        s"""
           | INSERT OVERWRITE DIRECTORY '${path}'
           | STORED AS TEXTFILE
           | SELECT 1 as a, 'c' as b, if(1 = 1, "true", "false")
         """.stripMargin

      sql(v1).explain(true)

      sql(v1).show()
    }
  }
```
it failed as below, since a column name contains `,`, so the numbers of column names and column types are not equal.
```
19:56:05.618 ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter:  [ angerszhu ] Aborting job dd774f18-93fa-431f-9468-3534c7d8acda.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.hadoop.hive.serde2.SerDeException: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe: columns has 5 elements while columns.types has 3 elements!
	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.extractColumnInfo(LazySerDeParameters.java:145)
	at org.apache.hadoop.hive.serde2.lazy.LazySerDeParameters.<init>(LazySerDeParameters.java:85)
	at org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe.initialize(LazySimpleSerDe.java:125)
	at org.apache.spark.sql.hive.execution.HiveOutputWriter.<init>(HiveFileFormat.scala:119)
	at org.apache.spark.sql.hive.execution.HiveFileFormat$$anon$1.newInstance(HiveFileFormat.scala:103)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:287)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:219)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:218)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$12.apply(Executor.scala:461)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

```

Since hive-2.3, COLUMN_NAME_DELIMITER is set to a special char when a column name contains ',':
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1180-L1188)
6f4c35c9e9/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java (L1044-L1075)

And in script transform, we parse column names to avoid this problem:
554600c2af/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationExec.scala (L257-L261)

So I think `InsertIntoHiveDirCommand` should do the same thing too. I have verified this method makes Spark 2.4 work well.

### Why are the changes needed?
Safer use of the serde.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Closes #30850 from AngersZhuuuu/SPARK-33844.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 09:43:15 +00:00
Dongjoon Hyun 271c4f6e00 [SPARK-33978][SQL] Support ZSTD compression in ORC data source
### What changes were proposed in this pull request?

This PR aims to support ZSTD compression in ORC data source.

### Why are the changes needed?

Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost.
- https://issues.apache.org/jira/browse/ORC-363

**BEFORE**
```scala
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none.
```

**AFTER**
```scala
scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd")
```

```bash
$ orc-tools meta /tmp/zstd
Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230]
Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc
File Version: 0.12 with ORC_14
Rows: 1
Compression: ZSTD
Compression size: 262144
Calendar: Julian/Gregorian
Type: struct<id:bigint>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 1 hasNull: false
    Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

File Statistics:
  Column 0: count: 1 hasNull: false
  Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9

Stripes:
  Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section DATA start: 38 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 230 bytes
Padding length: 0 bytes
Padding ratio: 0%

User Metadata:
  org.apache.spark.version=3.2.0
```

### Does this PR introduce _any_ user-facing change?

Yes, this is a new feature.

### How was this patch tested?

Pass the newly added test case.

Closes #31002 from dongjoon-hyun/SPARK-33978.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 00:54:47 -08:00
Max Gekk 8b3fb43f40 [SPARK-33965][SQL][TESTS] Recognize spark_catalog by CACHE TABLE in Hive table names
### What changes were proposed in this pull request?
Remove the special handling of `CacheTable` in `TestHiveQueryExecution.analyzed` because it prevents supporting `spark_catalog` in Hive table names. `spark_catalog` can be handled by the few lines below:
```scala
      case UnresolvedRelation(ident, _, _) =>
        if (ident.length > 1 && ident.head.equalsIgnoreCase(CatalogManager.SESSION_CATALOG_NAME)) {
```
added by https://github.com/apache/spark/pull/30883.
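
With that handling, a `spark_catalog`-qualified Hive table name should work in `CACHE TABLE` as well, e.g. (illustrative only; assumes a Hive-enabled SparkSession `spark`):

```scala
spark.sql("CREATE TABLE spark_catalog.default.src (id INT) USING hive")
spark.sql("CACHE TABLE spark_catalog.default.src")
spark.sql("SELECT * FROM spark_catalog.default.src").show()
```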

### Why are the changes needed?
1. To have feature parity with v1 In-Memory catalog.
2. To be able to write unified tests for In-Memory and Hive external catalogs.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
By running the test suite with new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #30997 from MaxGekk/cache-table-spark_catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 08:28:26 +00:00
Hoa 0b647fe69c [SPARK-33888][SQL] JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis
### What changes were proposed in this pull request?
The JDBC SQL TIME type is incorrectly represented as TimestampType; we change it to a physical Int in millis for now.

### Why are the changes needed?
Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark TimestampType. It should be represented as a physical int in millis: a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond, stored as the number of milliseconds after midnight, 00:00:00.000.
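
A small sketch of what "physical int in millis" means (not the actual JDBC data source code; the helper name is made up):

```scala
import java.sql.Time

// Convert a java.sql.Time into the number of milliseconds after midnight.
def timeToMillisOfDay(t: Time): Int =
  (t.toLocalTime.toNanoOfDay / 1000000L).toInt

println(timeToMillisOfDay(Time.valueOf("01:02:03")))  // 3723000
```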

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

Close #30902

Closes #30902 from saikocat/SPARK-33888.

Lead-authored-by: Hoa <hoameomu@gmail.com>
Co-authored-by: Hoa <saikocatz@gmail.com>
Co-authored-by: Duc Hoa, Nguyen <hoa.nd@teko.vn>
Co-authored-by: Duc Hoa, Nguyen <hoameomu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 06:53:12 +00:00
angerszhu adac633f93 [SPARK-33934][SQL] Add SparkFile's root dir to env property PATH
### What changes were proposed in this pull request?
In Hive we always use
```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
Since in Spark we wrap the script command with `/bin/bash -c`, in this case it will throw `script.py: command not found`.

This PR adds the SparkFiles root dir path to the execution env property `PATH`, so the sub-process will find `script.py` as a program under `PATH`.

### Why are the changes needed?
Support SQL migration from Hive to Spark.

### Does this PR introduce _any_ user-facing change?
Users can directly use the script file name as the program in script transform SQL.

```
add file /path/to/script.py;
select transform(col1, col2, ..)
using 'script.py' as (col1, col2, ...)
from ...
```
### How was this patch tested?
UT

Closes #30973 from AngersZhuuuu/SPARK-33934.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-04 15:46:49 +09:00
Yuming Wang 2a68ed71e4 [SPARK-33954][SQL] Some operator missing rowCount when enable CBO
### What changes were proposed in this pull request?

This PR fixes some operators missing rowCount when CBO is enabled, e.g.:
```scala
spark.range(1000).selectExpr("id as a", "id as b").write.saveAsTable("t1")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("set spark.sql.cbo.planStats.enabled=true")
spark.sql("select * from (select * from t1 distribute by a limit 100) distribute by b").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
   +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB)
      +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB)
         +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
```

After this pr:
```
== Optimized Logical Plan ==
RepartitionByExpression [b#2129L], Statistics(sizeInBytes=2.3 KiB, rowCount=100)
+- GlobalLimit 100, Statistics(sizeInBytes=2.3 KiB, rowCount=100)
   +- LocalLimit 100, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
      +- RepartitionByExpression [a#2128L], Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)
         +- Relation[a#2128L,b#2129L] parquet, Statistics(sizeInBytes=23.4 KiB, rowCount=1.00E+3)

```

### Why are the changes needed?

[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30987 from wangyum/SPARK-33954.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 05:53:14 +00:00
gengjiaan b037930952 [SPARK-33951][SQL] Distinguish the error between filter and distinct
### What changes were proposed in this pull request?
The error messages for specifying filter and distinct for the aggregate function are mixed together and should be separated. This can increase readability and ease of use.

### Why are the changes needed?
increase readability and ease of use.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
Jenkins test

Closes #30982 from beliefer/SPARK-33951.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 05:44:00 +00:00
Max Gekk 67195d0d97 [SPARK-33950][SQL] Refresh cache in v1 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after dropping partitions. In particular, this invalidates the cache associated with the modified table.

### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
spark-sql> CACHE TABLE tbl1;
spark-sql> SELECT * FROM tbl1;
0	0
1	1
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
0	0
1	1
```
The last query must not return `0	0` since that row was deleted by the previous command.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0);
spark-sql> SELECT * FROM tbl1;
1	1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30983 from MaxGekk/drop-partition-refresh-cache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-04 04:11:39 +00:00
Liang-Chi Hsieh 963c60fe49 [SPARK-33955][SS] Add latest offsets to source progress
### What changes were proposed in this pull request?

This patch proposes to add latest offset to source progress for streaming queries.

### Why are the changes needed?

Currently we record start and end offsets per source during streaming processing. The latest offset is important information for a streaming process, but the progress lacks this info. We can use it to track the processing lag and adjust streaming queries. We should add the latest offset to the source progress.

### Does this PR introduce _any_ user-facing change?

Yes, there is a new metric for the latest source offset in the source progress.

### How was this patch tested?

Unit test. Manually test in Spark cluster:

```
    "description" : "KafkaV2[Subscribe[page_view_events]]",
    "startOffset" : {
      "page_view_events" : {
        "2" : 582370921,
        "4" : 391910836,
        "1" : 631009201,
        "3" : 406601346,
        "0" : 195799112
      }
    },
    "endOffset" : {
      "page_view_events" : {
        "2" : 583764414,
        "4" : 392338002,
        "1" : 632183480,
        "3" : 407101489,
        "0" : 197304028
      }
    },
    "latestOffset" : {
      "page_view_events" : {
        "2" : 589852545,
        "4" : 394204277,
        "1" : 637313869,
        "3" : 409286602,
        "0" : 203878962
      }
    },
    "numInputRows" : 4999997,
    "inputRowsPerSecond" : 29287.70501405811,
```

Closes #30988 from viirya/latest-offset.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-03 01:31:38 -08:00
Max Gekk fc7d0165d2 [SPARK-33963][SQL] Canonicalize HiveTableRelation w/o table stats
### What changes were proposed in this pull request?
Skip table stats when canonicalizing `HiveTableRelation`.

### Why are the changes needed?
The changes fix a regression compared to Spark 3.0, see SPARK-33963.

### Does this PR introduce _any_ user-facing change?
Yes. After changes Spark behaves as in the version 3.0.1.

### How was this patch tested?
By running new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite"
```

Closes #30995 from MaxGekk/fix-caching-hive-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-03 11:23:46 +09:00
Yuming Wang 6c5ba8169a [SPARK-33959][SQL] Improve the statistics estimation of the Tail
### What changes were proposed in this pull request?

This PR improves the statistics estimation of `Tail`:

```scala
spark.sql("set spark.sql.cbo.enabled=true")
spark.range(100).selectExpr("id as a", "id as b", "id as c", "id as e").write.saveAsTable("t1")
println(Tail(Literal(5), spark.sql("SELECT * FROM t1").queryExecution.logical).queryExecution.stringWithStats)
```

Before this pr:
```
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=3.8 KiB)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
```

After this pr:
```
== Optimized Logical Plan ==
Tail 5, Statistics(sizeInBytes=200.0 B, rowCount=5)
+- Relation[a#24L,b#25L,c#26L,e#27L] parquet, Statistics(sizeInBytes=3.8 KiB)
```

### Why are the changes needed?

Improve statistics estimation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30991 from wangyum/SPARK-33959.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-03 10:59:12 +09:00
Yuming Wang 4cd680581a [SPARK-33956][SQL] Add rowCount for Range operator
### What changes were proposed in this pull request?

This PR adds rowCount for the `Range` operator:
```scala
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("select id from range(100)").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B)
```

After this pr:
```
== Optimized Logical Plan ==
Range (0, 100, step=1, splits=None), Statistics(sizeInBytes=800.0 B, rowCount=100)
```

### Why are the changes needed?

[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30989 from wangyum/SPARK-33956.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-02 08:58:48 -08:00
Kent Yao ed9f728801 [SPARK-33944][SQL] Incorrect logging for warehouse keys in SharedState options
### What changes were proposed in this pull request?

While using SparkSession's initial options to generate the sharable Spark conf and Hadoop conf in SharedState, we should put the log inside the code block where the warehouse keys are being handled.

### Why are the changes needed?

Bugfix: remove the ambiguous log that appears when setting spark.sql.warehouse.dir in SparkSession.builder.config but only warns about setting hive.metastore.warehouse.dir.
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

 new tests

Closes #30978 from yaooqinn/SPARK-33944.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:20:31 -08:00
angerszhu 771c538620 [SPARK-33084][SQL][TESTS][FOLLOW-UP] Fix Scala 2.13 UT failure
### What changes were proposed in this pull request?
Fix UT according to  https://github.com/apache/spark/pull/29966#issuecomment-752830046

Change the StructType construction from
```
def inputSchema: StructType = StructType(StructField("inputColumn", LongType) :: Nil)
```
to
```
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)
```
The whole udf class is :

```
package org.apache.spark.examples.sql

import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row

class Spark33084 extends UserDefinedAggregateFunction {
  // Data types of input arguments of this aggregate function
  def inputSchema: StructType = new StructType().add("inputColumn", LongType)

  // Data types of values in the aggregation buffer
  def bufferSchema: StructType =
    new StructType().add("sum", LongType).add("count", LongType)
  // The data type of the returned value
  def dataType: DataType = DoubleType
  // Whether this function always returns the same output on the identical input
  def deterministic: Boolean = true
  // Initializes the given aggregation buffer. The buffer itself is a `Row` that in addition to
  // standard methods like retrieving a value at an index (e.g., get(), getBoolean()), provides
  // the opportunity to update its values. Note that arrays and maps inside the buffer are still
  // immutable.
  def initialize(buffer: MutableAggregationBuffer): Unit = {
    buffer(0) = 0L
    buffer(1) = 0L
  }
  // Updates the given aggregation buffer `buffer` with new input data from `input`
  def update(buffer: MutableAggregationBuffer, input: Row): Unit = {
    if (!input.isNullAt(0)) {
      buffer(0) = buffer.getLong(0) + input.getLong(0)
      buffer(1) = buffer.getLong(1) + 1
    }
  }
  // Merges two aggregation buffers and stores the updated buffer values back to `buffer1`
  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit = {
    buffer1(0) = buffer1.getLong(0) + buffer2.getLong(0)
    buffer1(1) = buffer1.getLong(1) + buffer2.getLong(1)
  }
  // Calculates the final result
  def evaluate(buffer: Row): Double = buffer.getLong(0).toDouble / buffer.getLong(1)
}
```

### Why are the changes needed?
Fix the UT for Scala 2.13.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT

Closes #30980 from AngersZhuuuu/spark-33084-followup.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-31 13:18:31 -08:00
Liang-Chi Hsieh f38265ddda [SPARK-33907][SQL] Only prune columns of from_json if parsing options is empty
### What changes were proposed in this pull request?

As a follow-up task to SPARK-32958, this patch takes a safer approach and only prunes columns from JsonToStructs if the parsing options are empty. This avoids unexpected behavior changes regarding parsing.

This patch also adds a few e2e tests to make sure failfast parsing behavior is not changed.
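
A hedged example of the behavior the e2e tests guard (assumes a SparkSession `spark`; the sample JSON is made up):

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

val schema = new StructType().add("a", IntegerType).add("b", IntegerType)
val df = Seq("""{"a": 1, "b": "not a number"}""").toDF("json")

// Only field "a" is selected, but because FAILFAST is a non-empty parsing option,
// pruning is skipped: the whole struct is still parsed and the bad "b" value fails fast.
df.select(from_json($"json", schema, Map("mode" -> "FAILFAST")).getField("a")).show()
```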

### Why are the changes needed?

It is to avoid unexpected behavior change regarding parsing.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30970 from viirya/SPARK-33907-3.2.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-12-30 09:57:15 -08:00
gengjiaan ba974ea8e4 [SPARK-30789][SQL] Support (IGNORE | RESPECT) NULLS for LEAD/LAG/NTH_VALUE/FIRST_VALUE/LAST_VALUE
### What changes were proposed in this pull request?
All of `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE` should support IGNORE NULLS | RESPECT NULLS. For example:
```
LEAD (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )
```

```
LAG (value_expr [, offset ])
[ IGNORE NULLS | RESPECT NULLS ]
OVER ( [ PARTITION BY window_partition ] ORDER BY window_ordering )
```

```
NTH_VALUE (expr, offset)
[ IGNORE NULLS | RESPECT NULLS ]
OVER
( [ PARTITION BY window_partition ]
[ ORDER BY window_ordering
 frame_clause ] )
```
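
A hedged usage sketch (the `events` table and its columns are made up for illustration; a SparkSession `spark` is assumed):

```scala
// LAG ... IGNORE NULLS looks back to the most recent non-null value,
// while RESPECT NULLS (the default) just takes the previous row's value.
spark.sql("""
  SELECT ts, v,
         LAG(v) IGNORE NULLS  OVER (ORDER BY ts) AS prev_non_null,
         LAG(v) RESPECT NULLS OVER (ORDER BY ts) AS prev_raw
  FROM events
""").show()
```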

Mainstream databases and engines that support this syntax include:
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html

**Presto**
https://prestodb.io/docs/current/functions/window.html

**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm

**Teradata**
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/lead.html
https://docs.snowflake.com/en/sql-reference/functions/lag.html
https://docs.snowflake.com/en/sql-reference/functions/nth_value.html
https://docs.snowflake.com/en/sql-reference/functions/first_value.html
https://docs.snowflake.com/en/sql-reference/functions/last_value.html

**Exasol**
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lead.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/lag.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/nth_value.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/first_value.htm
https://docs.exasol.com/sql_references/functions/alphabeticallistfunctions/last_value.htm

### Why are the changes needed?
Support `(IGNORE | RESPECT) NULLS` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE `is very useful.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Jenkins test

Closes #30943 from beliefer/SPARK-30789.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 13:14:31 +00:00
Max Gekk 2afd1fb492 [SPARK-33904][SQL] Recognize spark_catalog in saveAsTable() and insertInto()
### What changes were proposed in this pull request?
In the `saveAsTable()` and `insertInto()` methods of `DataFrameWriter`, recognize `spark_catalog` as the default session catalog in table names.

### Why are the changes needed?
1. To simplify writing of unified v1 and v2 tests
2. To improve Spark SQL user experience. `insertInto()` should have feature parity with the `INSERT INTO` sql command. Currently, `insertInto()` fails on a table from a namespace in `spark_catalog`:
```scala
scala> sql("CREATE NAMESPACE spark_catalog.ns")
scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
  at org.apache.spark.sql.DataFrameWriter.saveAsTable(DataFrameWriter.scala:629)
  ... 47 elided
scala> Seq(0).toDF().write.insertInto("spark_catalog.ns.tbl")
org.apache.spark.sql.AnalysisException: Couldn't find a catalog to handle the identifier spark_catalog.ns.tbl.
  at org.apache.spark.sql.DataFrameWriter.insertInto(DataFrameWriter.scala:498)
  ... 47 elided
```
but `INSERT INTO` succeeds:
```sql
spark-sql> create table spark_catalog.ns.tbl (c int);
spark-sql> insert into spark_catalog.ns.tbl select 0;
spark-sql> select * from spark_catalog.ns.tbl;
0
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```scala
scala> Seq(0).toDF().write.saveAsTable("spark_catalog.ns.tbl")
scala> Seq(1).toDF().write.insertInto("spark_catalog.ns.tbl")
scala> spark.table("spark_catalog.ns.tbl").show(false)
+-----+
|value|
+-----+
|0    |
|1    |
+-----+
```

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.FileFormatWriterSuite"
```

Closes #30919 from MaxGekk/insert-into-spark_catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 07:56:34 +00:00
Max Gekk 0eb4961ca8 [SPARK-33926][SQL] Improve the error message from resolving of v1 database name
### What changes were proposed in this pull request?
1. Replace `SessionCatalogAndNamespace` by `DatabaseInSessionCatalog` in resolving database name from v1 session catalog.
2. Throw more precise errors from `DatabaseInSessionCatalog`
3. Fix expected error messages in `v1.ShowTablesSuiteBase`

Closes #30947

### Why are the changes needed?
The current error message "multi-part identifier cannot be empty" may confuse users. And this error message is just a consequence of an "incorrectly" applied implicit class. For example, `SHOW TABLES IN spark_catalog`:

1. Spark cuts off `spark_catalog` from namespaces in `SessionCatalogAndNamespace`, so, `ns == Seq.empty` here: 0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L365)
2. Then `ns.length != 1` is `true` and Spark tries to raise the exception at 0617dfce7b/sql/core/src/main/scala/org/apache/spark/sql/catalyst/analysis/ResolveSessionCatalog.scala (L367)
3.  ... but `ns.quoted` triggers implicit wrapping `Seq.empty` by `MultipartIdentifierHelper`, and hit to the second check `if (parts.isEmpty)` at 156704ba0d/sql/catalyst/src/main/scala/org/apache/spark/sql/connector/catalog/CatalogV2Implicits.scala (L120-L122)

So, Spark throws the exception at the third step instead of `new AnalysisException(s"The database name is not valid: $quoted")` at the second step. And even at the second step, the exception doesn't show the actual reason, as it is pretty generic.

### Does this PR introduce _any_ user-facing change?
Yes, in the case of v1 DDL commands when a database is not specified or nested databases are specified.

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DDLSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #30963 from MaxGekk/database-in-session-catalog.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 07:52:34 +00:00
gengjiaan 687f465244 [SPARK-33890][SQL] Improve the implementation of trim/trimleft/trimright
### What changes were proposed in this pull request?
The current implementation of trim/trimleft/trimright is somewhat redundant.

### Why are the changes needed?
Improve the implementation of trim/trimleft/trimright.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test

Closes #30905 from beliefer/SPARK-33890.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 06:06:17 +00:00
angerszhu 49aa6ebef1 [SPARK-32684][SQL][TESTS] Add a test case to check if null value is same as Hive's '\\N' in script transformation
### What changes were proposed in this pull request?
In Hive script transform serde mode, the default NULL format is `\\N`:
```
String nullString = tbl.getProperty(
    serdeConstants.SERIALIZATION_NULL_FORMAT, "\\N");
nullSequence = new Text(nullString);
```

Spark's code has a mistake here that we need to fix so it keeps the same behavior as Hive, so this adds some test cases to show the issue.

### Why are the changes needed?
add UT

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30946 from AngersZhuuuu/SPARK-32684.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-30 05:28:01 +00:00
Max Gekk 2b6836cdc2 [SPARK-33936][SQL] Add the version when connector's methods and interfaces were updated
### What changes were proposed in this pull request?
Add the `since` tag to methods and interfaces added recently.
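
As an illustration of the convention (a sketch only; the real connector interfaces are Java, and the trait name below is made up so it does not clash with the actual one):
```scala
import org.apache.spark.sql.catalyst.InternalRow

trait PartitionManagementSketch {
  /**
   * Renames an existing partition of the table.
   *
   * @since 3.2.0
   */
  def renamePartition(from: InternalRow, to: InternalRow): Boolean
}
```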

### Why are the changes needed?
1. To follow the existing convention for Spark API.
2. To inform devs when Spark API was changed.

### Does this PR introduce _any_ user-facing change?
Should not.

### How was this patch tested?
`dev/scalastyle`

Closes #30966 from MaxGekk/spark-23889-interfaces-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-29 12:26:25 -08:00
Yuming Wang c42502493a [SPARK-33847][SQL][FOLLOWUP] Remove the CaseWhen should consider deterministic
### What changes were proposed in this pull request?

This PR fixes the removal of `CaseWhen` when elseValue is empty and the other outputs are null: the removal should take determinism into account.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30960 from wangyum/SPARK-33847-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 14:35:01 +00:00
Max Gekk 16c594de79 [SPARK-33859][SQL][FOLLOWUP] Add version to SupportsPartitionManagement.renamePartition()
### What changes were proposed in this pull request?
Add the version 3.2.0 to new method `renamePartition()` in the `SupportsPartitionManagement` interface.

### Why are the changes needed?
To inform Spark devs when the method appears in the interface.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`./dev/scalastyle`

Closes #30964 from MaxGekk/alter-table-rename-partition-v2-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 14:30:37 +00:00
angerszhu aadda4b561 [SPARK-33930][SQL] Script Transform default FIELD DELIMIT should be \u0001 for no serde
### What changes were proposed in this pull request?
For the same SQL:
```
SELECT TRANSFORM(a, b, c, null)
ROW FORMAT DELIMITED
USING 'cat'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '&'
FROM (select 1 as a, 2 as b, 3  as c) t
```
In hive:
```
hive> SELECT TRANSFORM(a, b, c, null)
    > ROW FORMAT DELIMITED
    > USING 'cat'
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '&'
    > FROM (select 1 as a, 2 as b, 3  as c) t;
OK
123\N	NULL
Time taken: 14.519 seconds, Fetched: 1 row(s)
```

In Spark
```
Spark master: local[*], Application Id: local-1609225830376
spark-sql> SELECT TRANSFORM(a, b, c, null)
         > ROW FORMAT DELIMITED
         > USING 'cat'
         > ROW FORMAT DELIMITED
         > FIELDS TERMINATED BY '&'
         > FROM (select 1 as a, 2 as b, 3  as c) t;
1	2	3	null	NULL
Time taken: 4.297 seconds, Fetched 1 row(s)
spark-sql>
```
We should keep the same behavior. Change the default ROW FORMAT field delimiter to `\u0001`.

In Hive, the default value is '1', which as a char is '\u0001':
```
bucket_count -1
column.name.delimiter ,
columns
columns.comments
columns.types
file.inputformat org.apache.hadoop.hive.ql.io.NullRowsInputFormat
```

### Why are the changes needed?
Keep the same behavior as Hive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30958 from AngersZhuuuu/SPARK-33930.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 23:26:27 +09:00
Yuming Wang 872107f67f [SPARK-33848][SQL][FOLLOWUP] Introduce allowList for push into (if / case) branches
### What changes were proposed in this pull request?

Introduce an allowList for pushing into (if / case) branches to fix a potential bug.

### Why are the changes needed?

Fix a potential bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test.

Closes #30955 from wangyum/SPARK-33848-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:34:43 +00:00
ulysses-you 3b1b209e90 [SPARK-33909][SQL] Check rand functions seed is legal at analyzer side
### What changes were proposed in this pull request?

Move the seed legality check to `CheckAnalysis`.

### Why are the changes needed?

It's better to check that the seed expression is legal at the analyzer side instead of at execution time, so users can get the exception as soon as possible.
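
A hedged sketch of the intended behavior (the exact exception type and message are assumed):
```scala
import scala.util.Try

// with this change, an illegal seed should be rejected while analyzing the query
// (spark.sql analyzes the plan eagerly), not when the task actually runs
val bad = Try(spark.sql("SELECT rand('not-a-seed')"))
println(bad.isFailure)                  // expected: true (analysis error)
spark.sql("SELECT rand(42)").show()     // a legal integer seed still works
```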

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30923 from ulysses-you/SPARK-33909.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:33:06 +00:00
Max Gekk e0d2ffec31 [SPARK-33859][SQL] Support V2 ALTER TABLE .. RENAME PARTITION
### What changes were proposed in this pull request?
1. Add `renamePartition()` to the `SupportsPartitionManagement`
2. Implement `renamePartition()` in `InMemoryPartitionTable`
3. Add v2 execution node `AlterTableRenamePartitionExec`
4. Resolve the logical node `AlterTableRenamePartition` to `AlterTableRenamePartitionExec` for v2 tables that support `SupportsPartitionManagement`
5. Move v1 tests to the base suite `org.apache.spark.sql.execution.command.AlterTableRenamePartitionSuiteBase` to run them for v2 table catalogs.

### Why are the changes needed?
To have feature parity with Datasource V1.
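
A hedged example of the DDL this enables for a v2 table whose catalog implements `SupportsPartitionManagement` (catalog, table, and partition names are illustrative):
```scala
// assumes `testcat` is a registered v2 catalog backed by partition-aware tables
spark.sql("ALTER TABLE testcat.ns.tbl PARTITION (p = 1) RENAME TO PARTITION (p = 2)")
```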

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running the unified tests:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #30935 from MaxGekk/alter-table-rename-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 13:29:48 +00:00
Liang-Chi Hsieh f9fe742442 [SPARK-32968][SQL] Prune unnecessary columns from CsvToStructs
### What changes were proposed in this pull request?

This patch proposes to do column pruning for CsvToStructs expression if we only require some fields from it.

### Why are the changes needed?

`CsvToStructs` takes a schema parameter used to tell the CSV parser what fields need to be parsed. If `CsvToStructs` is followed by `GetStructField`, we can prune the schema to only parse certain fields.
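
A minimal sketch of such a query (schema and column names are assumed); after this change, only the referenced field needs to be parsed:
```scala
import org.apache.spark.sql.functions.from_csv
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
import spark.implicits._

val schema = StructType(Seq(StructField("a", IntegerType), StructField("b", StringType)))
val df = Seq("1,x", "2,y").toDF("value")
// only field `a` is required downstream, so the parsing schema can be pruned to `a`
df.select(from_csv($"value", schema, Map.empty[String, String]).getField("a")).explain()
```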

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #30912 from viirya/SPARK-32968.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-29 21:37:17 +09:00
Yuming Wang f7bdea334a [SPARK-33884][SQL] Simplify CaseWhen clauses with (true and false) and (false and true)
### What changes were proposed in this pull request?

This PR simplifies `CaseWhen` clauses with (true and false) and (false and true); see the sketch after the table:

Expression | cond.nullable | After simplify
-- | -- | --
case when cond then true else false end | true | cond <=> true
case when cond then true else false end | false | cond
case when cond then false else true end | true | !(cond <=> true)
case when cond then false else true end | false | !cond
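
For example, a minimal sketch with a non-nullable condition (table and column names are assumed):
```scala
spark.range(10).createOrReplaceTempView("t_bool")
// `id` is non-nullable, so the boolean CASE below should collapse to the condition itself
spark.sql("SELECT CASE WHEN id > 5 THEN true ELSE false END AS flag FROM t_bool").explain()
// expected optimized form: roughly SELECT (id > 5) AS flag FROM t_bool
```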

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30898 from wangyum/SPARK-33884.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 07:09:11 +00:00
Max Gekk 379afcd2ce [SPARK-33924][SQL][TESTS] Preserve partition metadata by INSERT INTO in v2 table catalog
### What changes were proposed in this pull request?
For `InMemoryPartitionTable` used in tests, set empty partition metadata only when a partition doesn't exist.

### Why are the changes needed?
This bug fix is needed to use `INSERT INTO .. PARTITION` in other tests.

### Does this PR introduce _any_ user-facing change?
No. It affects only the v2 table catalog used in tests.

### How was this patch tested?
Added new UT to `DataSourceV2SQLSuite`, and run the affected test suite by:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly org.apache.spark.sql.connector.DataSourceV2SQLSuite"
```

Closes #30952 from MaxGekk/fix-insert-into-partition-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-29 06:49:26 +00:00
HyukjinKwon b33fa53385 [SPARK-33925][CORE] Remove unused SecurityManager in Utils.fetchFile
### What changes were proposed in this pull request?

This is kind of a followup of https://github.com/apache/spark/pull/24033.
The first and last usage of that argument `SecurityManager` was removed in https://github.com/apache/spark/pull/24033.
After that,  we don't need to pass `SecurityManager` anymore in `Utils.fetchFile` and related code paths.

This PR proposes to remove it out.

### Why are the changes needed?

For better readability of codes.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually compiled. GitHub Actions and Jenkins builds should test it out as well.

Closes #30945 from HyukjinKwon/SPARK-33925.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-28 16:58:42 -08:00
Wenchen Fan c2eac1de02 [SPARK-33845][SQL][FOLLOWUP] fix SimplifyConditionals
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30849, to fix a correctness issue caused by null value handling.

### Why are the changes needed?

Fix a correctness issue. `If(null, true, false)` should return false, not true.
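
A short sketch of the expected semantics:
```scala
// a null predicate in IF must take the false branch
spark.sql("SELECT IF(null, true, false) AS v").show()   // expected: false
```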

### Does this PR introduce _any_ user-facing change?

Yes, but the bug only exists in the master branch.

### How was this patch tested?

updated tests.

Closes #30953 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-28 16:44:57 -08:00
Max Gekk 0617dfce7b [SPARK-33899][SQL] Fix assert failure in v1 SHOW TABLES/VIEWS on spark_catalog
### What changes were proposed in this pull request?
Remove `assert(ns.nonEmpty)` in `ResolveSessionCatalog` for:
- `SHOW TABLES`
- `SHOW TABLE EXTENDED`
- `SHOW VIEWS`

### Why are the changes needed?
Spark SQL shouldn't fail with internal assert failures even for invalid user inputs. For instance:
```sql
spark-sql> show tables in spark_catalog;
20/12/24 11:19:46 ERROR SparkSQLDriver: Failed in [show tables in spark_catalog]
java.lang.AssertionError: assertion failed
	at scala.Predef$.assert(Predef.scala:208)
	at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:366)
	at org.apache.spark.sql.catalyst.analysis.ResolveSessionCatalog$$anonfun$apply$1.applyOrElse(ResolveSessionCatalog.scala:49)
	at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsUp$3(AnalysisHelper.scala:90)
	at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:73)
```

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, for the example above:
```sql
spark-sql> show tables in spark_catalog;
Error in query: multi-part identifier cannot be empty.
```

### How was this patch tested?
Added new UT to `v1/ShowTablesSuite`.

Closes #30915 from MaxGekk/remove-assert-ns-nonempty.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 09:07:21 +00:00
angerszhu fc508d1898 [SPARK-32685][SQL] When specifying serde, default field.delim is '\t'
### What changes were proposed in this pull request?
In Hive script transform, when we use a specified serde, the `field.delim` is '\t':
![image](https://user-images.githubusercontent.com/46485123/103187960-7dd77800-4901-11eb-8241-f4636e66fbc8.png)
If we change to another serde and explain the query plan, `field.delim` is the same.

In Spark's current code, the result is as below:
![image](https://user-images.githubusercontent.com/46485123/103187999-95aefc00-4901-11eb-9850-5c385000b78c.png)

We should keep same as hive.

Note: the difference in the result's NULL value is another issue: https://issues.apache.org/jira/browse/SPARK-32684

### Why are the changes needed?
Keep the same behavior as the Hive serde.

### Does this PR introduce _any_ user-facing change?
In script transform, when `field.delim` is not specified, it keeps the same default as Hive: `\t`.

### How was this patch tested?
UT added

Closes #30942 from AngersZhuuuu/SPARK-32685.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 08:23:01 +00:00
yi.wu 00fa49aeaa [SPARK-33923][SQL][TESTS] Fix some tests with AQE enabled
### What changes were proposed in this pull request?

* Remove the explicit AQE disable confs
* Use `AdaptiveSparkPlanHelper` to check plans
* No longer extend `DisableAdaptiveExecutionSuite` for `BucketedReadSuite`, but only disable AQE for two specific tests there.

### Why are the changes needed?

Some tests that were fixed in https://github.com/apache/spark/pull/30655 don't really require AQE to be off. Instead, they can use `AdaptiveSparkPlanHelper` to pass with AQE on. It's better to run tests with AQE on since we've turned it on by default.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass all tests and the updated tests.

Closes #30941 from Ngone51/SPARK-33680-follow-up.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-28 00:03:45 -08:00
Liang-Chi Hsieh c75f779fd7 [SPARK-33827][SS] Unload inactive state store as soon as possible
### What changes were proposed in this pull request?

This patch proposes to unload inactive state stores as soon as possible. The unloading happens when we load an active state store provider at the executors: at that time, the state store coordinator returns a list of store providers that are already loaded by other executors in the new batch, and each state store provider in that list is then unloaded.

### Why are the changes needed?

Per the discussion at #30770, it makes sense to unload inactive state stores asap. Currently we run a maintenance task periodically to unload inactive state stores, so there is some delay between a state store becoming inactive and it being unloaded.

However, we can force Spark to always allocate a state store to the same executor by using the task locality configuration. This can reduce the possibility of having inactive state stores.

Normally, with the locality configuration, we generally won't see inactive state stores. There is still a chance that an executor fails and is reallocated, but in that case the inactive state store is lost too, so it is not an issue.

Making the driver-executor communication bi-directional for unloading inactive state stores looks non-trivial and, considering what we can do with locality, does not seem worth it.

This proposes a simpler but effective approach. We can check whether a loaded state store is already loaded at another executor when reporting active state stores to the coordinator. If so, the loaded store is now inactive and would otherwise be unloaded by the next maintenance task; instead, we unload that store immediately.

How do we make sure the loaded state store in previous batch is loaded at other executor in this batch before reporting in this executor? With task locality and preferred location, once an executor is ready to be scheduled, Spark should assign the state store provider previously loaded at the executor. So when this executor gets a new assignment other than previously loaded state store, it means the previously loaded one is already assigned to other executor.

There is still a delay between the state store being loaded at the other executor and unloading it when reporting active state stores at this executor, but it should be minimized now. And there won't be multiple state stores belonging to the same operator loaded at the same time at one single executor, because once the executor reports any active store, it unloads all inactive stores. This should not be an issue IMHO.

This is a minimal change to unload inactive state store asap without significant change.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30827 from viirya/SPARK-33827.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2020-12-28 16:52:56 +09:00
Max Gekk 4a61fc1a92 [SPARK-33914][SQL][DOCS] Describe the structure of unified DS v1 and v2 tests
### What changes were proposed in this pull request?
Add comments for the unified datasource tests, describe what kind of tests they contain, and put refs to other test suites.

### Why are the changes needed?
To improve code maintenance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`.

Closes #30929 from MaxGekk/doc-unified-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 07:03:29 +00:00
angerszhu 0a3f3d609d [SPARK-33908][CORE] Refactor SparkSubmitUtils.resolveMavenCoordinates() 's return parameter
### What changes were proposed in this pull request?
Per the discussion in https://github.com/apache/spark/pull/29966#discussion_r531917374,
we'd better change `SparkSubmitUtils.resolveMavenCoordinates()`'s return value to `Seq[String]`.

### Why are the changes needed?
refactor code

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30922 from AngersZhuuuu/SPARK-33908.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 16:00:24 +09:00
Kent Yao 3fdbc48373 [SPARK-33901][SQL] Fix Char and Varchar display error after DDLs
### What changes were proposed in this pull request?

After CTAS / CREATE TABLE LIKE / CVAS / ALTER TABLE ADD COLUMNS, the target tables will display string instead of char/varchar.
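
A hedged sketch of the behavior being fixed (table names are illustrative):
```scala
spark.sql("CREATE TABLE char_src (c CHAR(5), v VARCHAR(10)) USING parquet")
spark.sql("CREATE TABLE char_dst LIKE char_src")
// after this fix, the copied table should still report char(5)/varchar(10), not string
spark.sql("DESC char_dst").show()
```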

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #30918 from yaooqinn/SPARK-33901.

Lead-authored-by: Kent Yao <yao@apache.org>
Co-authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 06:48:27 +00:00
yangjie01 1be9e7e40b [SPARK-33801][CORE][SQL] Fix compilation warnings about 'Unicode escapes in triple quoted strings are deprecated'
### What changes were proposed in this pull request?
There are total 15 compilation warnings about `Unicode escapes in triple quoted strings are deprecated` in Spark code now:
```
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2930: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2931: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2932: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2933: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2934: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2935: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2936: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/core/src/main/scala/org/apache/spark/util/Utils.scala:2937: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtils.scala:82: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:32: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/csv/CSVExprUtilsSuite.scala:79: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:97: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ParserUtilsSuite.scala:101: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:76: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
[WARNING] /spark-source/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonParsingOptionsSuite.scala:83: Unicode escapes in triple quoted strings are deprecated, use the literal character instead
```

This PR tries to fix these warnings.

### Why are the changes needed?
Cleanup compilation warnings about `Unicode escapes in triple quoted strings are deprecated`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30926 from LuciferYang/SPARK-33801.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 15:29:09 +09:00
Terry Kim fe33262c91 [SPARK-33918][SQL] UnresolvedView should retain SQL text position for DDL commands
### What changes were proposed in this pull request?

Currently, there are many DDL commands where the position of the unresolved identifier is incorrect:
```
scala> sql("DROP VIEW unknown")
org.apache.spark.sql.AnalysisException: View not found: unknown; line 1 pos 0;
```
, whereas the `pos` should be `10`.

This PR proposes to fix this issue for commands using `UnresolvedView`:
```
DROP VIEW v
ALTER VIEW v SET TBLPROPERTIES ('k'='v')
ALTER VIEW v UNSET TBLPROPERTIES ('k')
ALTER VIEW v AS SELECT 1
```

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, now the above example will print the following:
```
org.apache.spark.sql.AnalysisException: View not found: unknown; line 1 pos 10;
```

### How was this patch tested?

Add a new suite of tests.

Closes #30936 from imback82/position_view_fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-28 05:45:40 +00:00
yangjie01 e6f019836c [SPARK-33532][SQL] Add comments to an unreachable branch in SpecificParquetRecordReaderBase.initialize method
### What changes were proposed in this pull request?
This PR mainly adds a comment for the `rowGroupOffsets != null` branch in `SpecificParquetRecordReaderBase.initialize(InputSplit, TaskAttemptContext)` to indicate that the Spark Parquet read path will not enter this branch after SPARK-13883 and SPARK-13989. The branch is not deleted because PARQUET-131 wants to move `SpecificParquetRecordReaderBase` into the parquet-mr project.

### Why are the changes needed?
Add a useful comment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30484 from LuciferYang/SPARK-33532.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-28 14:07:50 +09:00
yangjie01 37ae0a6086 [SPARK-33560][TEST-MAVEN][BUILD] Add "unused-import" check to Maven compilation process
### What changes were proposed in this pull request?

Similar to SPARK-33441, this PR adds an `unused-import` check to the Maven compilation process. After this PR, an unused import will trigger a Maven compilation error.

For the Scala 2.13 profile, this PR also leaves a TODO(SPARK-33499), similar to SPARK-33441, because `scala.language.higherKinds` no longer needs to be imported explicitly since Scala 2.13.1.

### Why are the changes needed?
Let Maven build also check for unused imports as compilation error.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Local manual test: add an unused import intentionally to trigger a Maven compilation error.

Closes #30784 from LuciferYang/SPARK-33560.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-26 17:40:19 -06:00
kozakana 2553d53dc8 [SPARK-33897][SQL] Can't set option 'cross' in join method
### What changes were proposed in this pull request?

[The PySpark documentation](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) says "Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti."
However, I get the following error when I set the cross option.

```
scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b")))
df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C")))
df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string]

scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = "cross").show()
java.lang.IllegalArgumentException: requirement failed: Unsupported using join type Cross
  at scala.Predef$.require(Predef.scala:281)
  at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106)
  at org.apache.spark.sql.Dataset.join(Dataset.scala:1025)
  ... 53 elided
```

### Why are the changes needed?

The documentation says the cross option can be set, but when I try to set it, I get a java.lang.IllegalArgumentException.

### Does this PR introduce _any_ user-facing change?

With this fix, the behavior will match the documentation.

### How was this patch tested?

There is already a test for [JoinTypes](1b9fd67904/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/JoinTypesTest.scala), but I can't find a test for the join option itself.

Closes #30803 from kozakana/allow_cross_option.

Authored-by: kozakana <goki727@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-26 16:30:50 +09:00
angerszhu 10b6466e91 [SPARK-33084][CORE][SQL] Add jar support ivy path
### What changes were proposed in this pull request?
Support add jar with ivy path

### Why are the changes needed?
Since submitting an app already supports Ivy coordinates, ADD JAR can support Ivy now as well.

### Does this PR introduce _any_ user-facing change?
Users can add a jar with SQL like:
```
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=true
add jar ivy://group:artifact:version?exclude=xxx,xxx&transitive=false
```

core api
```
sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=true")
sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=false")
```

#### Doc Update snapshot
![image](https://user-images.githubusercontent.com/46485123/101227738-de451200-36d3-11eb-813d-78a8b879da4f.png)

### How was this patch tested?
Added UT

Closes #29966 from AngersZhuuuu/support-add-jar-ivy.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-25 09:07:48 +09:00
Takeshi Yamamuro 65a9ac2ff4 [SPARK-30027][SQL] Support codegen for aggregate filters in HashAggregateExec
### What changes were proposed in this pull request?

This pr intends to support code generation for `HashAggregateExec` with filters.

Quick benchmark results:
```
$ ./bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v

scala> spark.range(100000000).selectExpr("id % 3 as k1", "id % 5 as k2", "rand() as v1", "rand() as v2").write.saveAsTable("t")
scala> sql("SELECT k1, k2, AVG(v1) FILTER (WHERE v2 > 0.5) FROM t GROUP BY k1, k2").write.format("noop").mode("overwrite").save()

>> Before this PR
Elapsed time: 16.170697619s

>> After this PR
Elapsed time: 6.7825313s
```

The query above is compiled into the code below:

```
...
/* 285 */   private void agg_doAggregate_avg_0(boolean agg_exprIsNull_2_0, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_0, double agg_expr_2_0) throws java.io.IOException {
/* 286 */     // evaluate aggregate function for avg
/* 287 */     boolean agg_isNull_10 = true;
/* 288 */     double agg_value_12 = -1.0;
/* 289 */     boolean agg_isNull_11 = agg_unsafeRowAggBuffer_0.isNullAt(0);
/* 290 */     double agg_value_13 = agg_isNull_11 ?
/* 291 */     -1.0 : (agg_unsafeRowAggBuffer_0.getDouble(0));
/* 292 */     if (!agg_isNull_11) {
/* 293 */       agg_agg_isNull_12_0 = true;
/* 294 */       double agg_value_14 = -1.0;
/* 295 */       do {
/* 296 */         if (!agg_exprIsNull_2_0) {
/* 297 */           agg_agg_isNull_12_0 = false;
/* 298 */           agg_value_14 = agg_expr_2_0;
/* 299 */           continue;
/* 300 */         }
/* 301 */
/* 302 */         if (!false) {
/* 303 */           agg_agg_isNull_12_0 = false;
/* 304 */           agg_value_14 = 0.0D;
/* 305 */           continue;
/* 306 */         }
/* 307 */
/* 308 */       } while (false);
/* 309 */
/* 310 */       agg_isNull_10 = false; // resultCode could change nullability.
/* 311 */
/* 312 */       agg_value_12 = agg_value_13 + agg_value_14;
/* 313 */
/* 314 */     }
/* 315 */     boolean agg_isNull_15 = false;
/* 316 */     long agg_value_17 = -1L;
/* 317 */     if (!false && agg_exprIsNull_2_0) {
/* 318 */       boolean agg_isNull_18 = agg_unsafeRowAggBuffer_0.isNullAt(1);
/* 319 */       long agg_value_20 = agg_isNull_18 ?
/* 320 */       -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
/* 321 */       agg_isNull_15 = agg_isNull_18;
/* 322 */       agg_value_17 = agg_value_20;
/* 323 */     } else {
/* 324 */       boolean agg_isNull_19 = true;
/* 325 */       long agg_value_21 = -1L;
/* 326 */       boolean agg_isNull_20 = agg_unsafeRowAggBuffer_0.isNullAt(1);
/* 327 */       long agg_value_22 = agg_isNull_20 ?
/* 328 */       -1L : (agg_unsafeRowAggBuffer_0.getLong(1));
/* 329 */       if (!agg_isNull_20) {
/* 330 */         agg_isNull_19 = false; // resultCode could change nullability.
/* 331 */
/* 332 */         agg_value_21 = agg_value_22 + 1L;
/* 333 */
/* 334 */       }
/* 335 */       agg_isNull_15 = agg_isNull_19;
/* 336 */       agg_value_17 = agg_value_21;
/* 337 */     }
/* 338 */     // update unsafe row buffer
/* 339 */     if (!agg_isNull_10) {
/* 340 */       agg_unsafeRowAggBuffer_0.setDouble(0, agg_value_12);
/* 341 */     } else {
/* 342 */       agg_unsafeRowAggBuffer_0.setNullAt(0);
/* 343 */     }
/* 344 */
/* 345 */     if (!agg_isNull_15) {
/* 346 */       agg_unsafeRowAggBuffer_0.setLong(1, agg_value_17);
/* 347 */     } else {
/* 348 */       agg_unsafeRowAggBuffer_0.setNullAt(1);
/* 349 */     }
/* 350 */   }
...
```

### Why are the changes needed?

For high performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27019 from maropu/AggregateFilterCodegen.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-24 14:44:16 -08:00
ulysses-you 9c30116fb4 [SPARK-33857][SQL] Unify the default seed of random functions
### What changes were proposed in this pull request?

Unify the seed of random functions
1. Add a placeholder expression `UnresolvedSeed` as the default seed.
2. Change the default seed of `Rand`, `Randn`, `Uuid`, `Shuffle` to `UnresolvedSeed`.
3. Replace `UnresolvedSeed` with the real seed in the `ResolveRandomSeed` rule.

### Why are the changes needed?

`Uuid` and `Shuffle` use the `ResolveRandomSeed` rule to set the seed if the user doesn't give a seed value. `Rand` and `Randn` do this at construction time.

It's better to unify the default seed at the analyzer side since we already use `ExpressionWithRandomSeed` for streaming queries.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests and add new tests.

Closes #30864 from ulysses-you/SPARK-33857.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-24 14:30:34 -08:00
Kent Yao 29cca68e9e [SPARK-33892][SQL] Display char/varchar in DESC and SHOW CREATE TABLE
### What changes were proposed in this pull request?

Display char/varchar in
  - DESC table
  - DESC column
  - SHOW CREATE TABLE

### Why are the changes needed?

show the correct definition for users
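
A hedged example (table name is illustrative):
```scala
spark.sql("CREATE TABLE char_show (c CHAR(5), v VARCHAR(10)) USING parquet")
// after this change, both commands should print CHAR(5)/VARCHAR(10) instead of STRING
spark.sql("DESC char_show").show()
spark.sql("SHOW CREATE TABLE char_show").show(truncate = false)
```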

### Does this PR introduce _any_ user-facing change?

yes, char/varchar columns will print char/varchar instead of string
### How was this patch tested?

new tests

Closes #30908 from yaooqinn/SPARK-33892.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:56:02 +00:00
Max Gekk 54a67842e6 [SPARK-33881][SQL][TESTS] Check null and empty string as partition values in DS v1 and v2 tests
### What changes were proposed in this pull request?
Add tests to check handling `null` and `''` (empty string) as partition values in commands `SHOW PARTITIONS`, `ALTER TABLE .. ADD PARTITION`, `ALTER TABLE .. DROP PARTITION`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite"
```

Closes #30893 from MaxGekk/partition-value-empty-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:54:53 +00:00
gengjiaan 3e9821edfd [SPARK-33443][SQL] LEAD/LAG should support [ IGNORE NULLS | RESPECT NULLS ]
### What changes were proposed in this pull request?
Mainstream databases support `[ IGNORE NULLS | RESPECT NULLS ]` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE`.
But the current implementation of `LEAD`/`LAG` doesn't support this syntax.

**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/LEAD.html#GUID-0A0481F1-E98F-4535-A739-FCCA8D1B5B77

**Presto**
https://prestodb.io/docs/current/functions/window.html

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LEAD.html

**DB2**
https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm

**Teradata**
https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w

**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/lead.html
https://docs.snowflake.com/en/sql-reference/functions/lag.html

### Why are the changes needed?
Supporting `[ IGNORE NULLS | RESPECT NULLS ]` for `LEAD`/`LAG` is very useful; see the sketch below.
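
A hedged sketch of the new syntax (inline data; column names are assumed):
```scala
spark.sql("""
  SELECT id, v,
         lead(v) IGNORE NULLS OVER (ORDER BY id) AS next_non_null,
         lag(v)  RESPECT NULLS OVER (ORDER BY id) AS prev_as_is
  FROM VALUES (1, 'a'), (2, NULL), (3, 'c') AS t(id, v)
""").show()
```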

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
Jenkins test.

Closes #30387 from beliefer/SPARK-33443.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:13:48 +00:00
Yuming Wang 32d4a2b062 [SPARK-33861][SQL] Simplify conditional in predicate
### What changes were proposed in this pull request?

This PR simplifies conditionals in predicates; after this change we can push down the filter to the data source (see the sketch after the table):

Expression | After simplify
-- | --
IF(cond, trueVal, false)                   | AND(cond, trueVal)
IF(cond, trueVal, true)                    | OR(NOT(cond), trueVal)
IF(cond, false, falseVal)                  | AND(NOT(cond), elseVal)
IF(cond, true, falseVal)                   | OR(cond, elseVal)
CASE WHEN cond THEN trueVal ELSE false END | AND(cond, trueVal)
CASE WHEN cond THEN trueVal END            | AND(cond, trueVal)
CASE WHEN cond THEN trueVal ELSE null END  | AND(cond, trueVal)
CASE WHEN cond THEN trueVal ELSE true END  | OR(NOT(cond), trueVal)
CASE WHEN cond THEN false ELSE elseVal END | AND(NOT(cond), elseVal)
CASE WHEN cond THEN false END              | false
CASE WHEN cond THEN true ELSE elseVal END  | OR(cond, elseVal)
CASE WHEN cond THEN true END               | cond
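
A minimal sketch of a predicate that becomes pushable (table and column names are assumed):
```scala
spark.range(10).write.mode("overwrite").saveAsTable("t_cond")
// IF(cond, trueVal, false) in a WHERE clause should simplify to cond AND trueVal,
// which the planner can then push down to the data source
spark.sql("SELECT * FROM t_cond WHERE IF(id > 5, id < 9, false)").explain()
```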

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30865 from wangyum/SPARK-33861.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 08:10:28 +00:00
Kent Yao d7dc42d5f6 [SPARK-33895][SQL] Char and Varchar fail in MetaOperation of ThriftServer
### What changes were proposed in this pull request?

```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: CHAR(10)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:187)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:203)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:195)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4(SparkGetColumnsOperation.scala:99)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4$adapted(SparkGetColumnsOperation.scala:98)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
```

The meta operation targets the raw table schema, so we need to handle these types there.

### Why are the changes needed?

bugfix, see the above case
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

locally

![image](https://user-images.githubusercontent.com/8326978/103069196-cdfcc480-45f9-11eb-9c6a-d4c42123c6e3.png)

Closes #30914 from yaooqinn/SPARK-33895.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 07:40:38 +00:00
Terry Kim f1d3797291 [SPARK-33886][SQL] UnresolvedTable should retain SQL text position for DDL commands
### What changes were proposed in this pull request?

Currently, there are many DDL commands where the position of the unresolved identifier is incorrect:
```
scala> sql("MSCK REPAIR TABLE unknown")
org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 0;
```
, whereas the `pos` should be 18.

This PR proposes to fix this issue for commands using `UnresolvedTable`:
```
MSCK REPAIR TABLE t
LOAD DATA LOCAL INPATH 'filepath' INTO TABLE t
TRUNCATE TABLE t
SHOW PARTITIONS t
ALTER TABLE t RECOVER PARTITIONS
ALTER TABLE t ADD PARTITION (p=1)
ALTER TABLE t PARTITION (p=1) RENAME TO PARTITION (p=2)
ALTER TABLE t DROP PARTITION (p=1)
ALTER TABLE t SET SERDEPROPERTIES ('a'='b')
COMMENT ON TABLE t IS 'hello'
```

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

Yes, now the above example will print the following:
```
org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 18;
```

### How was this patch tested?

Add a new suite of tests.

Closes #30900 from imback82/position_Fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-24 05:21:39 +00:00
Yuanjian Li 86c1cfc579 [SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API
### What changes were proposed in this pull request?
Follow up work for #30521, document the following behaviors in the API doc:

- Figure out the effects when configurations (provider/partitionBy) conflict with the existing table.
- Document the lack of functionality for creating a v2 table, and guide users to ensure the table is created beforehand to avoid an unintended or insufficient table being created (see the sketch below).
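
A minimal sketch of the API being documented (source, checkpoint path, and table name are illustrative):
```scala
// per this doc change, ensure the target table exists beforehand when a
// v2 catalog/table is involved, since the API cannot create it for you yet
val query = spark.readStream
  .format("rate")
  .load()
  .writeStream
  .option("checkpointLocation", "/tmp/checkpoints/rate_events")
  .toTable("rate_events")
```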

### Why are the changes needed?
We don't have full support for creating a V2 table from this API yet (TODO: SPARK-33638).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Document only.

Closes #30885 from xuanyuanking/SPARK-33659.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-24 12:44:37 +09:00
Takuya UESHIN 5c9b421c37 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends
### What changes were proposed in this pull request?

This is a retry of #30177.

This is not a complete fix, but the complete fix would take a long time (#30242).
As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases.

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.

### Why are the changes needed?

Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

E.g.,:

```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and manually.

Closes #30899 from ueshin/issues/SPARK-33277/context_aware_iterator.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-23 14:48:01 -08:00
Yuming Wang 7ffcfcf7db [SPARK-33847][SQL] Simplify CaseWhen if elseValue is None
### What changes were proposed in this pull request?

1. Enhance `ReplaceNullWithFalseInPredicate` to replace a `None` elseValue inside `CaseWhen` with `FalseLiteral` if all branches are `FalseLiteral`. The use case is:
```sql
create table t1 using parquet as select id from range(10);
explain select id from t1 where (CASE WHEN id = 1 THEN 'a' WHEN id = 3 THEN 'b' end) = 'c';
```

Before this pr:
```
== Physical Plan ==
*(1) Filter CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

```

After this pr:
```
== Physical Plan ==
LocalTableScan <empty>, [id#1L]
```

2. Enhance `SimplifyConditionals` if elseValue is None and all outputs are null.

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30852 from wangyum/SPARK-33847.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 14:35:46 +00:00
Max Gekk 303df64b46 [SPARK-33889][SQL] Fix NPE from SHOW PARTITIONS on V2 tables
### What changes were proposed in this pull request?
At `ShowPartitionsExec.run()`, check that a row returned by `listPartitionIdentifiers()` contains a `null` field, and convert it to `"null"`.

### Why are the changes needed?
Because `SHOW PARTITIONS` throws NPE on V2 table with `null` partition values.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added new UT to `v2.ShowPartitionsSuite`.

Closes #30904 from MaxGekk/fix-npe-show-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 14:34:01 +00:00
Max Gekk cc23581e26 [SPARK-33858][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests
### What changes were proposed in this pull request?
1. Move the `ALTER TABLE .. RENAME PARTITION` parsing tests to `AlterTableRenamePartitionParserSuite`
2. Place the v1 tests for `ALTER TABLE .. RENAME PARTITION` from `DDLSuite` to `v1.AlterTableRenamePartitionSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` to `v2.AlterTableRenamePartitionSuite`, so, the tests will run for V1, Hive V1 and V2 DS.

### Why are the changes needed?
- The unification will allow to run common `ALTER TABLE .. RENAME PARTITION` tests for both DSv1 and Hive DSv1, DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #30863 from MaxGekk/unify-rename-partition-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 12:19:07 +00:00
ulysses-you f421c172d9 [SPARK-33497][SQL] Override maxRows in some LogicalPlan
### What changes were proposed in this pull request?

This PR aims to override the maxRows method in the following `LogicalPlan`s:
* `ReturnAnswer`
* `Join`
* `Range`
* `Sample`
* `RepartitionOperation`
* `Deduplicate`
* `LocalRelation`
* `Window`

### Why are the changes needed?

1. Logically, we know the max-rows info for these `LogicalPlan`s.
2. Before this PR, some `LogicalPlan`s already provide max rows, so expanding the coverage lets us eliminate limits in more cases (see the sketch below).
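
A small hedged example of the limit elimination this enables (optimizer behavior assumed):
```scala
// Range knows it produces exactly 5 rows, so the larger LIMIT should be removable
spark.range(5).limit(10).explain()
```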

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #30443 from ulysses-you/SPARK-33497.

Lead-authored-by: ulysses-you <youxiduo@weidian.com>
Co-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 09:20:49 +00:00
Max Gekk 34bfb3a31d [SPARK-33787][SQL] Allow partition purge for v2 tables
### What changes were proposed in this pull request?
1. Add new methods `purgePartition()`/`purgePartitions()` to the interfaces `SupportsPartitionManagement`/`SupportsAtomicPartitionManagement`.
2. The default implementations of the new methods throw `UnsupportedOperationException`.
3. Add tests for new methods to `SupportsPartitionManagementSuite`/`SupportsAtomicPartitionManagementSuite`.
4. Add `ALTER TABLE .. DROP PARTITION` tests for DS v1 and v2.

Closes #30776
Closes #30821

### Why are the changes needed?
Currently, the `PURGE` option that a user can set in `ALTER TABLE .. DROP PARTITION` is completely ignored. We should pass this flag to the catalog implementation so the catalog can decide how to handle it.
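
A hedged example of the statement whose `PURGE` flag is now forwarded to the v2 catalog (catalog and table names are illustrative):
```scala
spark.sql("ALTER TABLE testcat.ns.tbl DROP PARTITION (p = 1) PURGE")
```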

### Does this PR introduce _any_ user-facing change?
The changes can impact on behavior of `ALTER TABLE .. DROP PARTITION` for v2 tables.

### How was this patch tested?
By running the affected test suites, for instance:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30886 from MaxGekk/purge-partition.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-23 09:09:48 +00:00
Kent Yao 2287f56a3e [SPARK-33879][SQL] Char Varchar values fails w/ match error as partition columns
### What changes were proposed in this pull request?

```sql
spark-sql> select * from t10 where c0='abcd';
20/12/22 15:43:38 ERROR SparkSQLDriver: Failed in [select * from t10 where c0='abcd']
scala.MatchError: CharType(10) (of class org.apache.spark.sql.types.CharType)
	at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:815)
	at org.apache.spark.sql.catalyst.expressions.CastBase.cast$lzycompute(Cast.scala:842)
	at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:842)
	at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:844)
	at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476)
	at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.$anonfun$toRow$2(interface.scala:164)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at org.apache.spark.sql.types.StructType.map(StructType.scala:102)
	at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.toRow(interface.scala:158)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3(ExternalCatalogUtils.scala:157)
	at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3$adapted(ExternalCatalogUtils.scala:156)
```
c0 is a partition column; it fails in the partition pruning rule.

In this PR, we replace char/varchar with string type before the CAST happens.

### Why are the changes needed?

bugfix, see the case above

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

yes, new tests

Closes #30887 from yaooqinn/SPARK-33879.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 16:14:27 +09:00
ulysses-you e853f068f6 [SPARK-33526][SQL][FOLLOWUP] Fix flaky test due to timeout and fix docs
### What changes were proposed in this pull request?

Make test stable and fix docs.

### Why are the changes needed?

The query sometimes times out because we set another config after setting the query timeout.
```
sbt.ForkMain$ForkError: java.sql.SQLTimeoutException: Query timed out after 0 seconds
	at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381)
	at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254)
	at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$13(ThriftServerWithSparkContextSuite.scala:107)
	at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$13$adapted(ThriftServerWithSparkContextSuite.scala:106)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$12(ThriftServerWithSparkContextSuite.scala:106)
	at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$12$adapted(ThriftServerWithSparkContextSuite.scala:89)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$withJdbcStatement$4(SharedThriftServer.scala:95)
	at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$withJdbcStatement$4$adapted(SharedThriftServer.scala:95)
```

The reason is:
1. We execute `set spark.sql.thriftServer.queryTimeout = 1`; then all subsequent operations are limited to 1s.
2. We execute `set spark.sql.thriftServer.interruptOnCancel = false/true`. This SQL will get a timeout exception if anything hangs within 1s, which is not what we expect.

Resetting the timeout before step 2 avoids this problem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Fix test.

Closes #30897 from ulysses-you/SPARK-33526-followup.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-22 22:43:03 -08:00
Takeshi Yamamuro ea37717f7c [SPARK-32106][SQL][FOLLOWUP] Fix flaky tests in transform.sql
### What changes were proposed in this pull request?

This PR intends to fix flaky GitHub Actions (GA) tests below in `transform.sql` (this flakiness does not seem to happen in the Jenkins tests):
- https://github.com/apache/spark/runs/1592987501
- https://github.com/apache/spark/runs/1593196242
- https://github.com/apache/spark/runs/1595496305
- https://github.com/apache/spark/runs/1596309555

This is because the error message differs between test runs in GA (the error message seems to be truncated non-deterministically), e.g.,
```
# https://github.com/apache/spark/runs/1592987501
Expected "...h status 127. Error:[ /bin/bash: some_non_existent_command: command not found]", but got "...h status 127. Error:[]" Result did not match for query #2

# https://github.com/apache/spark/runs/1593196242
Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2
```
Though the root cause of this non-deterministic behaviour happening only in GA is not clear, the test throws SparkException consistently even in GA. So, this PR proposes to make the test just check that the exception is thrown when running it.

This PR comes from the dongjoon-hyun comment: https://github.com/apache/spark/pull/29414/files#r547414513

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #30896 from maropu/SPARK-32106-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 13:50:05 +09:00
Wenchen Fan ec1560af25 [SPARK-33364][SQL][FOLLOWUP] Refine the catalog v2 API to purge a table
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30267

Inspired by https://github.com/apache/spark/pull/30886, it's better to have two methods, `def dropTable` and `def purgeTable`, than the overloads `def dropTable(ident)` and `def dropTable(ident, purge)`.
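
A rough sketch of the resulting shape (simplified and illustrative; the real interface is `TableCatalog`, a Java interface with more methods and documentation):
```scala
import org.apache.spark.sql.connector.catalog.Identifier

// Simplified, illustrative trait mirroring the two orthogonal methods described above.
trait SimplifiedTableCatalog {
  /** Drop a table without purging its underlying data. */
  def dropTable(ident: Identifier): Boolean

  /** Drop a table and purge its underlying data; optional for implementations. */
  def purgeTable(ident: Identifier): Boolean =
    throw new UnsupportedOperationException("purgeTable is not supported")
}
```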

### Why are the changes needed?

1. It makes the APIs orthogonal. Previously, `def dropTable(ident, purge)` called `def dropTable(ident)` and was a superset of it.
2. It simplifies catalog implementations a little bit. Now the `if (purge) ... else ...` check is done on the Spark side.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

existing tests

Closes #30890 from cloud-fan/purgeTable.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-23 11:47:13 +09:00
Erik Krogen 303b8c8773 [SPARK-23862][SQL] Support Java enums from Scala Dataset API
### What changes were proposed in this pull request?
Add support for Java Enums (`java.lang.Enum`) from the Scala typed Dataset APIs. This involves adding an implicit for `Encoder` creation in `SQLImplicits`, and updating `ScalaReflection` to handle Java Enums on the serialization and deserialization pathways.

Enums are mapped to a `StringType`, which is just the name of the enum value.
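
A minimal usage sketch (assuming a local session; `java.time.DayOfWeek` stands in for any Java enum):
```scala
import java.time.DayOfWeek
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("enum-demo").master("local[*]").getOrCreate()
import spark.implicits._

// The enum values are stored as a StringType column holding each value's name.
val ds = Seq(DayOfWeek.MONDAY, DayOfWeek.FRIDAY).toDS()
ds.printSchema()  // value: string
ds.show()         // MONDAY, FRIDAY
```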

### Why are the changes needed?
In [SPARK-21255](https://issues.apache.org/jira/browse/SPARK-21255), support for (de)serialization of Java Enums was added, but only when called from Java code. It is common for Scala code to rely on Java libraries that are outside the control of the Scala developer. Today, if there is a dependency on some Java code which defines an Enum, it is necessary to define a corresponding Scala class. This change brings the Scala and Java APIs closer to feature parity.

### Does this PR introduce _any_ user-facing change?
Yes, previously something like:
```
val ds = Seq(MyJavaEnum.VALUE1, MyJavaEnum.VALUE2).toDS
// or
val ds = Seq(CaseClass(MyJavaEnum.VALUE1), CaseClass(MyJavaEnum.VALUE2)).toDS
```
would fail. Now, it will succeed.

### How was this patch tested?
Additional unit tests are added in `DatasetSuite`. Tests include validating top-level enums, enums inside of case classes, enums inside of arrays, and validating that the Enum is stored as the expected string.

Closes #30877 from xkrogen/xkrogen-SPARK-23862-scalareflection-java-enums.

Lead-authored-by: Erik Krogen <xkrogen@apache.org>
Co-authored-by: Fangshi Li <fli@linkedin.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-22 09:55:33 -08:00
Kent Yao 6da5cdf1db [SPARK-33876][SQL] Add length-check for reading char/varchar from tables w/ a external location
### What changes were proposed in this pull request?
This PR adds the length check to the existing ApplyCharPadding rule. Tables will have external locations when users execute SET LOCATION or CREATE TABLE ... LOCATION. If the location contains over-length values, we should fail on read.

### Why are the changes needed?

```sql
spark-sql> INSERT INTO t2 VALUES ('1', 'b12345');
Time taken: 0.141 seconds
spark-sql> alter table t set location '/tmp/hive_one/t2';
Time taken: 0.095 seconds
spark-sql> select * from t;
1 b1234
```
The above case should fail rather than implicitly applying truncation.
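
For reference, a plausible setup behind this example (the table schemas are not given in the description, so they are assumptions; an active `spark` session is assumed):
```scala
// Assumed setup: t2 writes plain strings to /tmp/hive_one/t2, while t declares
// a CHAR(5) column; after ALTER TABLE t SET LOCATION '/tmp/hive_one/t2', the
// 6-character value 'b12345' is over-length when read back through t.
spark.sql("CREATE TABLE t2 (i STRING, c STRING) USING parquet LOCATION '/tmp/hive_one/t2'")
spark.sql("CREATE TABLE t (i STRING, c CHAR(5)) USING parquet")
```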

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #30882 from yaooqinn/SPARK-33876.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 14:24:12 +00:00
Max Gekk 84bf07bbd7 [SPARK-33878][SQL][TESTS] Fix resolving of spark_catalog in v1 Hive catalog tests
### What changes were proposed in this pull request?
1. Recognize `spark_catalog` as the default session catalog in the checks of `TestHiveQueryExecution`.
2. Move v2 and v1 in-memory catalog test `"SPARK-33305: DROP TABLE should also invalidate cache"` to the common trait `command/DropTableSuiteBase`, and run it with v1 Hive external catalog.

### Why are the changes needed?
To run In-memory catalog tests in Hive catalog.

### Does this PR introduce _any_ user-facing change?
No, the changes only affect tests.

### How was this patch tested?
By running the affected test suites for `DROP TABLE`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableSuite"
```

Closes #30883 from MaxGekk/fix-spark_catalog-hive-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 12:37:16 +00:00
Jacob Kim 43a562035c [SPARK-33846][SQL] Include Comments for a nested schema in StructType.toDDL
### What changes were proposed in this pull request?

```scala
val nestedStruct = new StructType()
  .add(StructField("b", StringType).withComment("Nested comment"))
val struct = new StructType()
  .add(StructField("a", nestedStruct).withComment("comment"))

struct.toDDL
```

Currently, returns:
```
`a` STRUCT<`b`: STRING> COMMENT 'comment'
```

With this PR, the code above returns:
```
`a` STRUCT<`b`: STRING COMMENT 'Nested comment'> COMMENT 'comment'
```

### Why are the changes needed?

My team uses nested columns as first-class citizens, and I thought it would be nice to have comments for nested columns.

### Does this PR introduce _any_ user-facing change?

Now, when users call something like this,
```scala
spark.table("foo.bar").schema.fields.map(_.toDDL).mkString(", ")
```
they will get comments for the nested columns.

### How was this patch tested?

I added unit tests under `org.apache.spark.sql.types.StructTypeSuite`. They test if nested StructType's comment is included in the DDL string.

Closes #30851 from jacobhjkim/structtype-toddl.

Authored-by: Jacob Kim <me@jacobkim.io>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-22 17:55:16 +09:00
Anton Okolnychyi 7bbcbb84c2 [SPARK-33784][SQL] Rename dataSourceRewriteRules batch
### What changes were proposed in this pull request?

This PR tries to rename `dataSourceRewriteRules` to something more generic.

### Why are the changes needed?

These changes are needed to address the post-review discussion [here](https://github.com/apache/spark/pull/30558#discussion_r533885837).

### Does this PR introduce _any_ user-facing change?

Yes but the changes haven't been released yet.

### How was this patch tested?

Existing tests.

Closes #30808 from aokolnychyi/spark-33784.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 08:29:22 +00:00
Anton Okolnychyi 2562183987 [SPARK-33808][SQL] DataSource V2: Build logical writes in the optimizer
### What changes were proposed in this pull request?

This PR adds logic to build logical writes introduced in SPARK-33779.

Note: This PR contains a subset of changes discussed in PR #29066.

### Why are the changes needed?

These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #30806 from aokolnychyi/spark-33808.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 08:23:56 +00:00
ulysses-you 1dd63dccd8 [SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst match special Array value
### What changes were proposed in this pull request?

Add cases to match arrays whose element type is primitive.

### Why are the changes needed?

We get an exception when using `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))`:
```
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found.
	at scala.Predef$.require(Predef.scala:281)
	at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215)
	at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292)
	at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140)
```
The same problem occurs with other arrays whose elements are primitive.
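
After the change, the same call succeeds. A minimal sketch:
```scala
import org.apache.spark.sql.catalyst.expressions.Literal
import org.apache.spark.sql.types.{ArrayType, IntegerType}

// Previously this threw IllegalArgumentException because the primitive int[]
// was not converted to a Catalyst array value; now the conversion handles it.
val lit = Literal.create(Array(1, 2, 3), ArrayType(IntegerType))
```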

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Add test.

Closes #30868 from ulysses-you/SPARK-33860.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-22 15:10:46 +09:00
yangjie01 b88745565b [SPARK-33700][SQL] Avoid file meta reading when enableFilterPushDown is true and filters is empty for Orc
### What changes were proposed in this pull request?
ORC supports the filter push-down optimization, but this optimization reads file meta from external storage even if the filter list is empty.

This PR adds an extra `filters.nonEmpty` check when `spark.sql.orc.filterPushdown` is true.
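
The added guard is essentially the following (a sketch with illustrative names; the real change lives in the ORC read path):
```scala
import org.apache.spark.sql.sources.Filter

// Illustrative guard: only pay the cost of reading ORC file meta for
// predicate push-down when the feature is enabled AND there is at least
// one filter to push.
def shouldReadMetaForPushDown(orcFilterPushDown: Boolean, filters: Seq[Filter]): Boolean =
  orcFilterPushDown && filters.nonEmpty
```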

### Why are the changes needed?
The ORC filter push-down operation should only be triggered when `filters.nonEmpty` is true.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30663 from LuciferYang/pushdownfilter-when-filter-nonempty.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 20:24:23 -08:00
Kent Yao f5fd10b1bc [SPARK-33834][SQL] Verify ALTER TABLE CHANGE COLUMN with Char and Varchar
### What changes were proposed in this pull request?

Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected change
For v1 tables, changing the type is not allowed; we fix a regression that used the replaced string instead of the original char/varchar type when altering char/varchar columns.

For v2 tables, the valid changes are:
 - char/varchar to string
 - char(x) to char(x)
 - char(x)/varchar(x) to varchar(y) if x <= y

Other changes are invalid (see the sketch below).
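
Illustratively, using the `ALTER COLUMN ... TYPE` form of the command (catalog, namespace, table, and column names are assumptions, and an active `spark` session is assumed):
```scala
// Assumes a v2 catalog `testcat` with a table `ns.t` whose column `c` is CHAR(5).
spark.sql("ALTER TABLE testcat.ns.t ALTER COLUMN c TYPE STRING")       // char(5) -> string     : allowed
spark.sql("ALTER TABLE testcat.ns.t ALTER COLUMN c TYPE VARCHAR(10)")  // char(5) -> varchar(10): allowed (5 <= 10)
spark.sql("ALTER TABLE testcat.ns.t ALTER COLUMN c TYPE CHAR(10)")     // char(5) -> char(10)   : rejected
spark.sql("ALTER TABLE testcat.ns.t ALTER COLUMN c TYPE VARCHAR(3)")   // char(5) -> varchar(3) : rejected
```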

### Why are the changes needed?

Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected change

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new test

Closes #30833 from yaooqinn/SPARK-33834.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-22 03:07:26 +00:00
angerszhu 7466031632 [SPARK-32106][SQL] Implement script transform in sql/core
### What changes were proposed in this pull request?

 * Implement `SparkScriptTransformationExec` based on `BaseScriptTransformationExec`
 * Implement `SparkScriptTransformationWriterThread` based on `BaseScriptTransformationWriterThread` for writing data
 * Add rule `SparkScripts` to support converting script LogicalPlans to SparkPlans in Spark SQL (without Hive mode)
 * Add `SparkScriptTransformationSuite` to test Spark-specific cases
 * Add tests in `SQLQueryTestSuite`

And we will close #29085.

### Why are the changes needed?
Allow users to use script transform without Hive.

### Does this PR introduce _any_ user-facing change?
Users can use script transformation without Hive in no-serde mode.
For example:
**default no serde**
```
SELECT TRANSFORM(a, b, c)
USING 'cat' AS (a int, b string, c long)
FROM testData
```
**no serde with a specified ROW FORMAT DELIMITED**
```
SELECT TRANSFORM(a, b, c)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\u0002'
MAP KEYS TERMINATED BY '\u0003'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'null'
USING 'cat' AS (a, b, c)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY '\u0004'
MAP KEYS TERMINATED BY '\u0005'
LINES TERMINATED BY '\n'
NULL DEFINED AS 'NULL'
FROM testData
```

### How was this patch tested?
Added UT

Closes #29414 from AngersZhuuuu/SPARK-32106-MINOR.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-22 11:37:59 +09:00
Yuming Wang 1c77605682 [SPARK-33848][SQL] Push the UnaryExpression into (if / case) branches
### What changes were proposed in this pull request?

This PR pushes the `UnaryExpression` into (if / case) branches. The use case is:
```sql
create table t1 using parquet as select id from range(10);
explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN '2' end) > 3;
```

Before this pr:
```
== Physical Plan ==
*(1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>

```

After this pr (pushing the cast into each branch lets the comparison constant-fold; no branch can exceed 3, so the filter never matches and the plan collapses to an empty local scan):
```
== Physical Plan ==
LocalTableScan <empty>, [id#1L]
```

This change can also improve this case:
a78d6ce376/sql/core/src/test/resources/tpcds/q62.sql (L5-L22)

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30853 from wangyum/SPARK-33848.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 10:25:23 -08:00
Max Gekk 661ac10901 [SPARK-33838][SQL][DOCS] Comment the PURGE option in the DropTable and in AlterTableDropPartition commands
### What changes were proposed in this pull request?
Add comments for the `PURGE` option to the logical nodes `DropTable` and `AlterTableDropPartition`.

### Why are the changes needed?
To improve code maintenance.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`

Closes #30837 from MaxGekk/comment-purge-logical-node.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-21 14:06:31 +00:00
Takeshi Yamamuro 69aa727ff4 [SPARK-33124][SQL] Fills missing group tags and re-categorizes all the group tags for built-in functions
### What changes were proposed in this pull request?

This PR proposes to fill missing group tags and re-categorize all the group tags for built-in functions.
New groups below are added in this PR:
 - binary_funcs
 - bitwise_funcs
 - collection_funcs
 - predicate_funcs
 - conditional_funcs
 - conversion_funcs
 - csv_funcs
 - generator_funcs
 - hash_funcs
 - lambda_funcs
 - math_funcs
 - misc_funcs
 - string_funcs
 - struct_funcs
 - xml_funcs

A basic policy to re-categorize functions is that functions in the same file are categorized into the same group. For example, all the functions in `hash.scala` are categorized into `hash_funcs`. But, there are some exceptional/ambiguous cases when categorizing them. Here are some special notes:
 - All the aggregate functions are categorized into `agg_funcs`.
 - `array_funcs` and `map_funcs` are sub-groups of `collection_funcs`. For example, `array_contains` is used only for arrays, so it is assigned to `array_funcs`. On the other hand, `reverse` is used for both arrays and strings, so it is assigned to `collection_funcs`.
 - Some functions logically belong to multiple groups. In this case, these functions are categorized based on the file that they belong to. For example, `schema_of_csv` can be grouped into both `csv_funcs` and `struct_funcs` in terms of input types, but it is assigned to `csv_funcs` because it belongs to the `csvExpressions.scala` file that holds the other CSV-related functions.
 - Functions in `nullExpressions.scala`, `complexTypeCreator.scala`, `randomExpressions.scala`, and `regexExpressions.scala` are categorized based on their functionalities. For example:
   - `isnull` in `nullExpressions` is assigned to `predicate_funcs` because this is a predicate function.
   - `array` in `complexTypeCreator.scala` is assigned to `array_funcs` based on its output type (the other functions in `array_funcs` are categorized based on their input types, though).
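
For context, each built-in function declares its group through the `group` field of its `ExpressionDescription` annotation; a simplified, hypothetical sketch:
```scala
import org.apache.spark.sql.catalyst.expressions.ExpressionDescription

// Hypothetical class showing only where the `group` tag is declared; real
// built-in functions attach this annotation to their Expression case classes.
@ExpressionDescription(
  usage = "_FUNC_(expr) - Returns a hash of `expr`.",
  group = "hash_funcs",
  since = "3.1.0")
class MyHashDocumentationOnly
```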

A category list (after this PR) is as follows (the list below includes the exprs that already have a group tag in the current master):
|group|name|class|
|-----|----|-----|
|agg_funcs|any|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|approx_count_distinct|org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus|
|agg_funcs|approx_percentile|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile|
|agg_funcs|avg|org.apache.spark.sql.catalyst.expressions.aggregate.Average|
|agg_funcs|bit_and|org.apache.spark.sql.catalyst.expressions.aggregate.BitAndAgg|
|agg_funcs|bit_or|org.apache.spark.sql.catalyst.expressions.aggregate.BitOrAgg|
|agg_funcs|bit_xor|org.apache.spark.sql.catalyst.expressions.aggregate.BitXorAgg|
|agg_funcs|bool_and|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd|
|agg_funcs|bool_or|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|collect_list|org.apache.spark.sql.catalyst.expressions.aggregate.CollectList|
|agg_funcs|collect_set|org.apache.spark.sql.catalyst.expressions.aggregate.CollectSet|
|agg_funcs|corr|org.apache.spark.sql.catalyst.expressions.aggregate.Corr|
|agg_funcs|count_if|org.apache.spark.sql.catalyst.expressions.aggregate.CountIf|
|agg_funcs|count_min_sketch|org.apache.spark.sql.catalyst.expressions.aggregate.CountMinSketchAgg|
|agg_funcs|count|org.apache.spark.sql.catalyst.expressions.aggregate.Count|
|agg_funcs|covar_pop|org.apache.spark.sql.catalyst.expressions.aggregate.CovPopulation|
|agg_funcs|covar_samp|org.apache.spark.sql.catalyst.expressions.aggregate.CovSample|
|agg_funcs|cube|org.apache.spark.sql.catalyst.expressions.Cube|
|agg_funcs|every|org.apache.spark.sql.catalyst.expressions.aggregate.BoolAnd|
|agg_funcs|first_value|org.apache.spark.sql.catalyst.expressions.aggregate.First|
|agg_funcs|first|org.apache.spark.sql.catalyst.expressions.aggregate.First|
|agg_funcs|grouping_id|org.apache.spark.sql.catalyst.expressions.GroupingID|
|agg_funcs|grouping|org.apache.spark.sql.catalyst.expressions.Grouping|
|agg_funcs|kurtosis|org.apache.spark.sql.catalyst.expressions.aggregate.Kurtosis|
|agg_funcs|last_value|org.apache.spark.sql.catalyst.expressions.aggregate.Last|
|agg_funcs|last|org.apache.spark.sql.catalyst.expressions.aggregate.Last|
|agg_funcs|max_by|org.apache.spark.sql.catalyst.expressions.aggregate.MaxBy|
|agg_funcs|max|org.apache.spark.sql.catalyst.expressions.aggregate.Max|
|agg_funcs|mean|org.apache.spark.sql.catalyst.expressions.aggregate.Average|
|agg_funcs|min_by|org.apache.spark.sql.catalyst.expressions.aggregate.MinBy|
|agg_funcs|min|org.apache.spark.sql.catalyst.expressions.aggregate.Min|
|agg_funcs|percentile_approx|org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile|
|agg_funcs|percentile|org.apache.spark.sql.catalyst.expressions.aggregate.Percentile|
|agg_funcs|rollup|org.apache.spark.sql.catalyst.expressions.Rollup|
|agg_funcs|skewness|org.apache.spark.sql.catalyst.expressions.aggregate.Skewness|
|agg_funcs|some|org.apache.spark.sql.catalyst.expressions.aggregate.BoolOr|
|agg_funcs|stddev_pop|org.apache.spark.sql.catalyst.expressions.aggregate.StddevPop|
|agg_funcs|stddev_samp|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|stddev|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|std|org.apache.spark.sql.catalyst.expressions.aggregate.StddevSamp|
|agg_funcs|sum|org.apache.spark.sql.catalyst.expressions.aggregate.Sum|
|agg_funcs|var_pop|org.apache.spark.sql.catalyst.expressions.aggregate.VariancePop|
|agg_funcs|var_samp|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp|
|agg_funcs|variance|org.apache.spark.sql.catalyst.expressions.aggregate.VarianceSamp|
|array_funcs|array_contains|org.apache.spark.sql.catalyst.expressions.ArrayContains|
|array_funcs|array_distinct|org.apache.spark.sql.catalyst.expressions.ArrayDistinct|
|array_funcs|array_except|org.apache.spark.sql.catalyst.expressions.ArrayExcept|
|array_funcs|array_intersect|org.apache.spark.sql.catalyst.expressions.ArrayIntersect|
|array_funcs|array_join|org.apache.spark.sql.catalyst.expressions.ArrayJoin|
|array_funcs|array_max|org.apache.spark.sql.catalyst.expressions.ArrayMax|
|array_funcs|array_min|org.apache.spark.sql.catalyst.expressions.ArrayMin|
|array_funcs|array_position|org.apache.spark.sql.catalyst.expressions.ArrayPosition|
|array_funcs|array_remove|org.apache.spark.sql.catalyst.expressions.ArrayRemove|
|array_funcs|array_repeat|org.apache.spark.sql.catalyst.expressions.ArrayRepeat|
|array_funcs|array_union|org.apache.spark.sql.catalyst.expressions.ArrayUnion|
|array_funcs|arrays_overlap|org.apache.spark.sql.catalyst.expressions.ArraysOverlap|
|array_funcs|arrays_zip|org.apache.spark.sql.catalyst.expressions.ArraysZip|
|array_funcs|array|org.apache.spark.sql.catalyst.expressions.CreateArray|
|array_funcs|flatten|org.apache.spark.sql.catalyst.expressions.Flatten|
|array_funcs|sequence|org.apache.spark.sql.catalyst.expressions.Sequence|
|array_funcs|shuffle|org.apache.spark.sql.catalyst.expressions.Shuffle|
|array_funcs|slice|org.apache.spark.sql.catalyst.expressions.Slice|
|array_funcs|sort_array|org.apache.spark.sql.catalyst.expressions.SortArray|
|bitwise_funcs|&|org.apache.spark.sql.catalyst.expressions.BitwiseAnd|
|bitwise_funcs|^|org.apache.spark.sql.catalyst.expressions.BitwiseXor|
|bitwise_funcs|bit_count|org.apache.spark.sql.catalyst.expressions.BitwiseCount|
|bitwise_funcs|shiftrightunsigned|org.apache.spark.sql.catalyst.expressions.ShiftRightUnsigned|
|bitwise_funcs|shiftright|org.apache.spark.sql.catalyst.expressions.ShiftRight|
|bitwise_funcs|~|org.apache.spark.sql.catalyst.expressions.BitwiseNot|
|collection_funcs|cardinality|org.apache.spark.sql.catalyst.expressions.Size|
|collection_funcs|concat|org.apache.spark.sql.catalyst.expressions.Concat|
|collection_funcs|reverse|org.apache.spark.sql.catalyst.expressions.Reverse|
|collection_funcs|size|org.apache.spark.sql.catalyst.expressions.Size|
|conditional_funcs|coalesce|org.apache.spark.sql.catalyst.expressions.Coalesce|
|conditional_funcs|ifnull|org.apache.spark.sql.catalyst.expressions.IfNull|
|conditional_funcs|if|org.apache.spark.sql.catalyst.expressions.If|
|conditional_funcs|nanvl|org.apache.spark.sql.catalyst.expressions.NaNvl|
|conditional_funcs|nullif|org.apache.spark.sql.catalyst.expressions.NullIf|
|conditional_funcs|nvl2|org.apache.spark.sql.catalyst.expressions.Nvl2|
|conditional_funcs|nvl|org.apache.spark.sql.catalyst.expressions.Nvl|
|conditional_funcs|when|org.apache.spark.sql.catalyst.expressions.CaseWhen|
|conversion_funcs|bigint|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|binary|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|boolean|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|cast|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|date|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|decimal|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|double|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|float|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|int|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|smallint|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|string|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|timestamp|org.apache.spark.sql.catalyst.expressions.Cast|
|conversion_funcs|tinyint|org.apache.spark.sql.catalyst.expressions.Cast|
|csv_funcs|from_csv|org.apache.spark.sql.catalyst.expressions.CsvToStructs|
|csv_funcs|schema_of_csv|org.apache.spark.sql.catalyst.expressions.SchemaOfCsv|
|csv_funcs|to_csv|org.apache.spark.sql.catalyst.expressions.StructsToCsv|
|datetime_funcs|add_months|org.apache.spark.sql.catalyst.expressions.AddMonths|
|datetime_funcs|current_date|org.apache.spark.sql.catalyst.expressions.CurrentDate|
|datetime_funcs|current_timestamp|org.apache.spark.sql.catalyst.expressions.CurrentTimestamp|
|datetime_funcs|current_timezone|org.apache.spark.sql.catalyst.expressions.CurrentTimeZone|
|datetime_funcs|date_add|org.apache.spark.sql.catalyst.expressions.DateAdd|
|datetime_funcs|date_format|org.apache.spark.sql.catalyst.expressions.DateFormatClass|
|datetime_funcs|date_from_unix_date|org.apache.spark.sql.catalyst.expressions.DateFromUnixDate|
|datetime_funcs|date_part|org.apache.spark.sql.catalyst.expressions.DatePart|
|datetime_funcs|date_sub|org.apache.spark.sql.catalyst.expressions.DateSub|
|datetime_funcs|date_trunc|org.apache.spark.sql.catalyst.expressions.TruncTimestamp|
|datetime_funcs|datediff|org.apache.spark.sql.catalyst.expressions.DateDiff|
|datetime_funcs|dayofmonth|org.apache.spark.sql.catalyst.expressions.DayOfMonth|
|datetime_funcs|dayofweek|org.apache.spark.sql.catalyst.expressions.DayOfWeek|
|datetime_funcs|dayofyear|org.apache.spark.sql.catalyst.expressions.DayOfYear|
|datetime_funcs|day|org.apache.spark.sql.catalyst.expressions.DayOfMonth|
|datetime_funcs|extract|org.apache.spark.sql.catalyst.expressions.Extract|
|datetime_funcs|from_unixtime|org.apache.spark.sql.catalyst.expressions.FromUnixTime|
|datetime_funcs|from_utc_timestamp|org.apache.spark.sql.catalyst.expressions.FromUTCTimestamp|
|datetime_funcs|hour|org.apache.spark.sql.catalyst.expressions.Hour|
|datetime_funcs|last_day|org.apache.spark.sql.catalyst.expressions.LastDay|
|datetime_funcs|make_date|org.apache.spark.sql.catalyst.expressions.MakeDate|
|datetime_funcs|make_interval|org.apache.spark.sql.catalyst.expressions.MakeInterval|
|datetime_funcs|make_timestamp|org.apache.spark.sql.catalyst.expressions.MakeTimestamp|
|datetime_funcs|minute|org.apache.spark.sql.catalyst.expressions.Minute|
|datetime_funcs|months_between|org.apache.spark.sql.catalyst.expressions.MonthsBetween|
|datetime_funcs|month|org.apache.spark.sql.catalyst.expressions.Month|
|datetime_funcs|next_day|org.apache.spark.sql.catalyst.expressions.NextDay|
|datetime_funcs|now|org.apache.spark.sql.catalyst.expressions.Now|
|datetime_funcs|quarter|org.apache.spark.sql.catalyst.expressions.Quarter|
|datetime_funcs|second|org.apache.spark.sql.catalyst.expressions.Second|
|datetime_funcs|timestamp_micros|org.apache.spark.sql.catalyst.expressions.MicrosToTimestamp|
|datetime_funcs|timestamp_millis|org.apache.spark.sql.catalyst.expressions.MillisToTimestamp|
|datetime_funcs|timestamp_seconds|org.apache.spark.sql.catalyst.expressions.SecondsToTimestamp|
|datetime_funcs|to_date|org.apache.spark.sql.catalyst.expressions.ParseToDate|
|datetime_funcs|to_timestamp|org.apache.spark.sql.catalyst.expressions.ParseToTimestamp|
|datetime_funcs|to_unix_timestamp|org.apache.spark.sql.catalyst.expressions.ToUnixTimestamp|
|datetime_funcs|to_utc_timestamp|org.apache.spark.sql.catalyst.expressions.ToUTCTimestamp|
|datetime_funcs|trunc|org.apache.spark.sql.catalyst.expressions.TruncDate|
|datetime_funcs|unix_date|org.apache.spark.sql.catalyst.expressions.UnixDate|
|datetime_funcs|unix_micros|org.apache.spark.sql.catalyst.expressions.UnixMicros|
|datetime_funcs|unix_millis|org.apache.spark.sql.catalyst.expressions.UnixMillis|
|datetime_funcs|unix_seconds|org.apache.spark.sql.catalyst.expressions.UnixSeconds|
|datetime_funcs|unix_timestamp|org.apache.spark.sql.catalyst.expressions.UnixTimestamp|
|datetime_funcs|weekday|org.apache.spark.sql.catalyst.expressions.WeekDay|
|datetime_funcs|weekofyear|org.apache.spark.sql.catalyst.expressions.WeekOfYear|
|datetime_funcs|year|org.apache.spark.sql.catalyst.expressions.Year|
|generator_funcs|explode_outer|org.apache.spark.sql.catalyst.expressions.Explode|
|generator_funcs|explode|org.apache.spark.sql.catalyst.expressions.Explode|
|generator_funcs|inline_outer|org.apache.spark.sql.catalyst.expressions.Inline|
|generator_funcs|inline|org.apache.spark.sql.catalyst.expressions.Inline|
|generator_funcs|posexplode_outer|org.apache.spark.sql.catalyst.expressions.PosExplode|
|generator_funcs|posexplode|org.apache.spark.sql.catalyst.expressions.PosExplode|
|generator_funcs|stack|org.apache.spark.sql.catalyst.expressions.Stack|
|hash_funcs|crc32|org.apache.spark.sql.catalyst.expressions.Crc32|
|hash_funcs|hash|org.apache.spark.sql.catalyst.expressions.Murmur3Hash|
|hash_funcs|md5|org.apache.spark.sql.catalyst.expressions.Md5|
|hash_funcs|sha1|org.apache.spark.sql.catalyst.expressions.Sha1|
|hash_funcs|sha2|org.apache.spark.sql.catalyst.expressions.Sha2|
|hash_funcs|sha|org.apache.spark.sql.catalyst.expressions.Sha1|
|hash_funcs|xxhash64|org.apache.spark.sql.catalyst.expressions.XxHash64|
|json_funcs|from_json|org.apache.spark.sql.catalyst.expressions.JsonToStructs|
|json_funcs|get_json_object|org.apache.spark.sql.catalyst.expressions.GetJsonObject|
|json_funcs|json_array_length|org.apache.spark.sql.catalyst.expressions.LengthOfJsonArray|
|json_funcs|json_object_keys|org.apache.spark.sql.catalyst.expressions.JsonObjectKeys|
|json_funcs|json_tuple|org.apache.spark.sql.catalyst.expressions.JsonTuple|
|json_funcs|schema_of_json|org.apache.spark.sql.catalyst.expressions.SchemaOfJson|
|json_funcs|to_json|org.apache.spark.sql.catalyst.expressions.StructsToJson|
|lambda_funcs|aggregate|org.apache.spark.sql.catalyst.expressions.ArrayAggregate|
|lambda_funcs|array_sort|org.apache.spark.sql.catalyst.expressions.ArraySort|
|lambda_funcs|exists|org.apache.spark.sql.catalyst.expressions.ArrayExists|
|lambda_funcs|filter|org.apache.spark.sql.catalyst.expressions.ArrayFilter|
|lambda_funcs|forall|org.apache.spark.sql.catalyst.expressions.ArrayForAll|
|lambda_funcs|map_filter|org.apache.spark.sql.catalyst.expressions.MapFilter|
|lambda_funcs|map_zip_with|org.apache.spark.sql.catalyst.expressions.MapZipWith|
|lambda_funcs|transform_keys|org.apache.spark.sql.catalyst.expressions.TransformKeys|
|lambda_funcs|transform_values|org.apache.spark.sql.catalyst.expressions.TransformValues|
|lambda_funcs|transform|org.apache.spark.sql.catalyst.expressions.ArrayTransform|
|lambda_funcs|zip_with|org.apache.spark.sql.catalyst.expressions.ZipWith|
|map_funcs|element_at|org.apache.spark.sql.catalyst.expressions.ElementAt|
|map_funcs|map_concat|org.apache.spark.sql.catalyst.expressions.MapConcat|
|map_funcs|map_entries|org.apache.spark.sql.catalyst.expressions.MapEntries|
|map_funcs|map_from_arrays|org.apache.spark.sql.catalyst.expressions.MapFromArrays|
|map_funcs|map_from_entries|org.apache.spark.sql.catalyst.expressions.MapFromEntries|
|map_funcs|map_keys|org.apache.spark.sql.catalyst.expressions.MapKeys|
|map_funcs|map_values|org.apache.spark.sql.catalyst.expressions.MapValues|
|map_funcs|map|org.apache.spark.sql.catalyst.expressions.CreateMap|
|map_funcs|str_to_map|org.apache.spark.sql.catalyst.expressions.StringToMap|
|math_funcs|%|org.apache.spark.sql.catalyst.expressions.Remainder|
|math_funcs|*|org.apache.spark.sql.catalyst.expressions.Multiply|
|math_funcs|+|org.apache.spark.sql.catalyst.expressions.Add|
|math_funcs|-|org.apache.spark.sql.catalyst.expressions.Subtract|
|math_funcs|/|org.apache.spark.sql.catalyst.expressions.Divide|
|math_funcs|abs|org.apache.spark.sql.catalyst.expressions.Abs|
|math_funcs|acosh|org.apache.spark.sql.catalyst.expressions.Acosh|
|math_funcs|acos|org.apache.spark.sql.catalyst.expressions.Acos|
|math_funcs|asinh|org.apache.spark.sql.catalyst.expressions.Asinh|
|math_funcs|asin|org.apache.spark.sql.catalyst.expressions.Asin|
|math_funcs|atan2|org.apache.spark.sql.catalyst.expressions.Atan2|
|math_funcs|atanh|org.apache.spark.sql.catalyst.expressions.Atanh|
|math_funcs|atan|org.apache.spark.sql.catalyst.expressions.Atan|
|math_funcs|bin|org.apache.spark.sql.catalyst.expressions.Bin|
|math_funcs|bround|org.apache.spark.sql.catalyst.expressions.BRound|
|math_funcs|cbrt|org.apache.spark.sql.catalyst.expressions.Cbrt|
|math_funcs|ceiling|org.apache.spark.sql.catalyst.expressions.Ceil|
|math_funcs|ceil|org.apache.spark.sql.catalyst.expressions.Ceil|
|math_funcs|conv|org.apache.spark.sql.catalyst.expressions.Conv|
|math_funcs|cosh|org.apache.spark.sql.catalyst.expressions.Cosh|
|math_funcs|cos|org.apache.spark.sql.catalyst.expressions.Cos|
|math_funcs|cot|org.apache.spark.sql.catalyst.expressions.Cot|
|math_funcs|degrees|org.apache.spark.sql.catalyst.expressions.ToDegrees|
|math_funcs|div|org.apache.spark.sql.catalyst.expressions.IntegralDivide|
|math_funcs|expm1|org.apache.spark.sql.catalyst.expressions.Expm1|
|math_funcs|exp|org.apache.spark.sql.catalyst.expressions.Exp|
|math_funcs|e|org.apache.spark.sql.catalyst.expressions.EulerNumber|
|math_funcs|factorial|org.apache.spark.sql.catalyst.expressions.Factorial|
|math_funcs|floor|org.apache.spark.sql.catalyst.expressions.Floor|
|math_funcs|greatest|org.apache.spark.sql.catalyst.expressions.Greatest|
|math_funcs|hex|org.apache.spark.sql.catalyst.expressions.Hex|
|math_funcs|hypot|org.apache.spark.sql.catalyst.expressions.Hypot|
|math_funcs|least|org.apache.spark.sql.catalyst.expressions.Least|
|math_funcs|ln|org.apache.spark.sql.catalyst.expressions.Log|
|math_funcs|log10|org.apache.spark.sql.catalyst.expressions.Log10|
|math_funcs|log1p|org.apache.spark.sql.catalyst.expressions.Log1p|
|math_funcs|log2|org.apache.spark.sql.catalyst.expressions.Log2|
|math_funcs|log|org.apache.spark.sql.catalyst.expressions.Logarithm|
|math_funcs|mod|org.apache.spark.sql.catalyst.expressions.Remainder|
|math_funcs|negative|org.apache.spark.sql.catalyst.expressions.UnaryMinus|
|math_funcs|pi|org.apache.spark.sql.catalyst.expressions.Pi|
|math_funcs|pmod|org.apache.spark.sql.catalyst.expressions.Pmod|
|math_funcs|positive|org.apache.spark.sql.catalyst.expressions.UnaryPositive|
|math_funcs|power|org.apache.spark.sql.catalyst.expressions.Pow|
|math_funcs|pow|org.apache.spark.sql.catalyst.expressions.Pow|
|math_funcs|radians|org.apache.spark.sql.catalyst.expressions.ToRadians|
|math_funcs|randn|org.apache.spark.sql.catalyst.expressions.Randn|
|math_funcs|random|org.apache.spark.sql.catalyst.expressions.Rand|
|math_funcs|rand|org.apache.spark.sql.catalyst.expressions.Rand|
|math_funcs|rint|org.apache.spark.sql.catalyst.expressions.Rint|
|math_funcs|round|org.apache.spark.sql.catalyst.expressions.Round|
|math_funcs|shiftleft|org.apache.spark.sql.catalyst.expressions.ShiftLeft|
|math_funcs|signum|org.apache.spark.sql.catalyst.expressions.Signum|
|math_funcs|sign|org.apache.spark.sql.catalyst.expressions.Signum|
|math_funcs|sinh|org.apache.spark.sql.catalyst.expressions.Sinh|
|math_funcs|sin|org.apache.spark.sql.catalyst.expressions.Sin|
|math_funcs|sqrt|org.apache.spark.sql.catalyst.expressions.Sqrt|
|math_funcs|tanh|org.apache.spark.sql.catalyst.expressions.Tanh|
|math_funcs|tan|org.apache.spark.sql.catalyst.expressions.Tan|
|math_funcs|unhex|org.apache.spark.sql.catalyst.expressions.Unhex|
|math_funcs|width_bucket|org.apache.spark.sql.catalyst.expressions.WidthBucket|
|misc_funcs|assert_true|org.apache.spark.sql.catalyst.expressions.AssertTrue|
|misc_funcs|current_catalog|org.apache.spark.sql.catalyst.expressions.CurrentCatalog|
|misc_funcs|current_database|org.apache.spark.sql.catalyst.expressions.CurrentDatabase|
|misc_funcs|input_file_block_length|org.apache.spark.sql.catalyst.expressions.InputFileBlockLength|
|misc_funcs|input_file_block_start|org.apache.spark.sql.catalyst.expressions.InputFileBlockStart|
|misc_funcs|input_file_name|org.apache.spark.sql.catalyst.expressions.InputFileName|
|misc_funcs|java_method|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection|
|misc_funcs|monotonically_increasing_id|org.apache.spark.sql.catalyst.expressions.MonotonicallyIncreasingID|
|misc_funcs|raise_error|org.apache.spark.sql.catalyst.expressions.RaiseError|
|misc_funcs|reflect|org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection|
|misc_funcs|spark_partition_id|org.apache.spark.sql.catalyst.expressions.SparkPartitionID|
|misc_funcs|typeof|org.apache.spark.sql.catalyst.expressions.TypeOf|
|misc_funcs|uuid|org.apache.spark.sql.catalyst.expressions.Uuid|
|misc_funcs|version|org.apache.spark.sql.catalyst.expressions.SparkVersion|
|predicate_funcs|!|org.apache.spark.sql.catalyst.expressions.Not|
|predicate_funcs|<=>|org.apache.spark.sql.catalyst.expressions.EqualNullSafe|
|predicate_funcs|<=|org.apache.spark.sql.catalyst.expressions.LessThanOrEqual|
|predicate_funcs|<|org.apache.spark.sql.catalyst.expressions.LessThan|
|predicate_funcs|==|org.apache.spark.sql.catalyst.expressions.EqualTo|
|predicate_funcs|=|org.apache.spark.sql.catalyst.expressions.EqualTo|
|predicate_funcs|>=|org.apache.spark.sql.catalyst.expressions.GreaterThanOrEqual|
|predicate_funcs|>|org.apache.spark.sql.catalyst.expressions.GreaterThan|
|predicate_funcs|and|org.apache.spark.sql.catalyst.expressions.And|
|predicate_funcs|in|org.apache.spark.sql.catalyst.expressions.In|
|predicate_funcs|isnan|org.apache.spark.sql.catalyst.expressions.IsNaN|
|predicate_funcs|isnotnull|org.apache.spark.sql.catalyst.expressions.IsNotNull|
|predicate_funcs|isnull|org.apache.spark.sql.catalyst.expressions.IsNull|
|predicate_funcs|like|org.apache.spark.sql.catalyst.expressions.Like|
|predicate_funcs|not|org.apache.spark.sql.catalyst.expressions.Not|
|predicate_funcs|or|org.apache.spark.sql.catalyst.expressions.Or|
|predicate_funcs|regexp_like|org.apache.spark.sql.catalyst.expressions.RLike|
|predicate_funcs|rlike|org.apache.spark.sql.catalyst.expressions.RLike|
|string_funcs|ascii|org.apache.spark.sql.catalyst.expressions.Ascii|
|string_funcs|base64|org.apache.spark.sql.catalyst.expressions.Base64|
|string_funcs|bit_length|org.apache.spark.sql.catalyst.expressions.BitLength|
|string_funcs|char_length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|character_length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|char|org.apache.spark.sql.catalyst.expressions.Chr|
|string_funcs|chr|org.apache.spark.sql.catalyst.expressions.Chr|
|string_funcs|concat_ws|org.apache.spark.sql.catalyst.expressions.ConcatWs|
|string_funcs|decode|org.apache.spark.sql.catalyst.expressions.Decode|
|string_funcs|elt|org.apache.spark.sql.catalyst.expressions.Elt|
|string_funcs|encode|org.apache.spark.sql.catalyst.expressions.Encode|
|string_funcs|find_in_set|org.apache.spark.sql.catalyst.expressions.FindInSet|
|string_funcs|format_number|org.apache.spark.sql.catalyst.expressions.FormatNumber|
|string_funcs|format_string|org.apache.spark.sql.catalyst.expressions.FormatString|
|string_funcs|initcap|org.apache.spark.sql.catalyst.expressions.InitCap|
|string_funcs|instr|org.apache.spark.sql.catalyst.expressions.StringInstr|
|string_funcs|lcase|org.apache.spark.sql.catalyst.expressions.Lower|
|string_funcs|left|org.apache.spark.sql.catalyst.expressions.Left|
|string_funcs|length|org.apache.spark.sql.catalyst.expressions.Length|
|string_funcs|levenshtein|org.apache.spark.sql.catalyst.expressions.Levenshtein|
|string_funcs|locate|org.apache.spark.sql.catalyst.expressions.StringLocate|
|string_funcs|lower|org.apache.spark.sql.catalyst.expressions.Lower|
|string_funcs|lpad|org.apache.spark.sql.catalyst.expressions.StringLPad|
|string_funcs|ltrim|org.apache.spark.sql.catalyst.expressions.StringTrimLeft|
|string_funcs|octet_length|org.apache.spark.sql.catalyst.expressions.OctetLength|
|string_funcs|overlay|org.apache.spark.sql.catalyst.expressions.Overlay|
|string_funcs|parse_url|org.apache.spark.sql.catalyst.expressions.ParseUrl|
|string_funcs|position|org.apache.spark.sql.catalyst.expressions.StringLocate|
|string_funcs|printf|org.apache.spark.sql.catalyst.expressions.FormatString|
|string_funcs|regexp_extract_all|org.apache.spark.sql.catalyst.expressions.RegExpExtractAll|
|string_funcs|regexp_extract|org.apache.spark.sql.catalyst.expressions.RegExpExtract|
|string_funcs|regexp_replace|org.apache.spark.sql.catalyst.expressions.RegExpReplace|
|string_funcs|repeat|org.apache.spark.sql.catalyst.expressions.StringRepeat|
|string_funcs|replace|org.apache.spark.sql.catalyst.expressions.StringReplace|
|string_funcs|right|org.apache.spark.sql.catalyst.expressions.Right|
|string_funcs|rpad|org.apache.spark.sql.catalyst.expressions.StringRPad|
|string_funcs|rtrim|org.apache.spark.sql.catalyst.expressions.StringTrimRight|
|string_funcs|sentences|org.apache.spark.sql.catalyst.expressions.Sentences|
|string_funcs|soundex|org.apache.spark.sql.catalyst.expressions.SoundEx|
|string_funcs|space|org.apache.spark.sql.catalyst.expressions.StringSpace|
|string_funcs|split|org.apache.spark.sql.catalyst.expressions.StringSplit|
|string_funcs|substring_index|org.apache.spark.sql.catalyst.expressions.SubstringIndex|
|string_funcs|substring|org.apache.spark.sql.catalyst.expressions.Substring|
|string_funcs|substr|org.apache.spark.sql.catalyst.expressions.Substring|
|string_funcs|translate|org.apache.spark.sql.catalyst.expressions.StringTranslate|
|string_funcs|trim|org.apache.spark.sql.catalyst.expressions.StringTrim|
|string_funcs|ucase|org.apache.spark.sql.catalyst.expressions.Upper|
|string_funcs|unbase64|org.apache.spark.sql.catalyst.expressions.UnBase64|
|string_funcs|upper|org.apache.spark.sql.catalyst.expressions.Upper|
|struct_funcs|named_struct|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct|
|struct_funcs|struct|org.apache.spark.sql.catalyst.expressions.CreateNamedStruct|
|window_funcs|cume_dist|org.apache.spark.sql.catalyst.expressions.CumeDist|
|window_funcs|dense_rank|org.apache.spark.sql.catalyst.expressions.DenseRank|
|window_funcs|lag|org.apache.spark.sql.catalyst.expressions.Lag|
|window_funcs|lead|org.apache.spark.sql.catalyst.expressions.Lead|
|window_funcs|nth_value|org.apache.spark.sql.catalyst.expressions.NthValue|
|window_funcs|ntile|org.apache.spark.sql.catalyst.expressions.NTile|
|window_funcs|percent_rank|org.apache.spark.sql.catalyst.expressions.PercentRank|
|window_funcs|rank|org.apache.spark.sql.catalyst.expressions.Rank|
|window_funcs|row_number|org.apache.spark.sql.catalyst.expressions.RowNumber|
|xml_funcs|xpath_boolean|org.apache.spark.sql.catalyst.expressions.xml.XPathBoolean|
|xml_funcs|xpath_double|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble|
|xml_funcs|xpath_float|org.apache.spark.sql.catalyst.expressions.xml.XPathFloat|
|xml_funcs|xpath_int|org.apache.spark.sql.catalyst.expressions.xml.XPathInt|
|xml_funcs|xpath_long|org.apache.spark.sql.catalyst.expressions.xml.XPathLong|
|xml_funcs|xpath_number|org.apache.spark.sql.catalyst.expressions.xml.XPathDouble|
|xml_funcs|xpath_short|org.apache.spark.sql.catalyst.expressions.xml.XPathShort|
|xml_funcs|xpath_string|org.apache.spark.sql.catalyst.expressions.xml.XPathString|
|xml_funcs|xpath|org.apache.spark.sql.catalyst.expressions.xml.XPathList|

Closes #30040

NOTE: The original author of this PR is tanelk, so the credit should be given to tanelk.

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a test to check if exprs have a group tag in `ExpressionInfoSuite`.

Closes #30867 from maropu/pr30040.

Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 04:24:04 -08:00
Yuming Wang 4b19f49dd0 [SPARK-33845][SQL] Remove unnecessary if when trueValue and falseValue are foldable boolean types
### What changes were proposed in this pull request?

Improve `SimplifyConditionals` (sketched below):
 - Simplify `If(cond, TrueLiteral, FalseLiteral)` to `cond`.
 - Simplify `If(cond, FalseLiteral, TrueLiteral)` to `Not(cond)`.
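
A standalone sketch of the rewrite (illustrative only; in Spark these are extra cases inside `SimplifyConditionals`, which also has to account for the nullability of `cond`):
```scala
import org.apache.spark.sql.catalyst.expressions.{If, Not}
import org.apache.spark.sql.catalyst.expressions.Literal.{FalseLiteral, TrueLiteral}
import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
import org.apache.spark.sql.catalyst.rules.Rule

// Standalone illustration of the two simplifications described above.
object SimplifyBooleanIf extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan transformAllExpressions {
    case If(cond, TrueLiteral, FalseLiteral) => cond       // if(c, true, false) => c
    case If(cond, FalseLiteral, TrueLiteral) => Not(cond)  // if(c, false, true) => not(c)
  }
}
```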

The use case is:
```sql
create table t1 using parquet as select id from range(10);
select if (id > 2, false, true) from t1;
```
Before this pr:
```
== Physical Plan ==
*(1) Project [if ((id#1L > 2)) false else true AS (IF((id > CAST(2 AS BIGINT)), false, true))#2]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
```
After this pr:
```
== Physical Plan ==
*(1) Project [(id#1L <= 2) AS (IF((id > CAST(2 AS BIGINT)), false, true))#2]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30849 from wangyum/SPARK-33798-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 04:15:29 -08:00
Wenchen Fan b4bea1aa89 [SPARK-28863][SQL][FOLLOWUP] Make sure optimized plan will not be re-analyzed
### What changes were proposed in this pull request?

It's a known problem that re-analyzing an optimized plan can lead to various issues. We made several attempts to avoid this, but the current solution, `AlreadyOptimized`, is still not 100% safe, as people can inject catalyst rules that call the analyzer directly.

This PR proposes a simpler and safer idea: we set the `analyzed` flag to true after optimization, and the analyzer will skip processing plans whose `analyzed` flag is true.
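
Conceptually, the mechanism looks like this (an illustrative sketch only; the names are not Spark's real fields or methods):
```scala
// Illustrative-only sketch: once a plan has been optimized it is flagged as
// analyzed, and the analyzer becomes a no-op for it.
final class SketchPlan(var analyzed: Boolean = false)

def analyze(plan: SketchPlan): SketchPlan = {
  if (!plan.analyzed) {
    // ... run analysis rules ...
    plan.analyzed = true
  }
  plan
}

def optimize(plan: SketchPlan): SketchPlan = {
  // ... run optimizer rules ...
  plan.analyzed = true // re-analysis of the optimized plan is now skipped
  plan
}
```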

### Why are the changes needed?

make the code simpler and safer

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests.

Closes #30777 from cloud-fan/ds.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-21 20:59:33 +09:00
Max Gekk cdd1752ad1 [SPARK-33862][SQL] Throw PartitionAlreadyExistsException if the target partition exists while renaming
### What changes were proposed in this pull request?
Throw `PartitionAlreadyExistsException` from `ALTER TABLE .. RENAME TO PARTITION` for a table from Hive V1 External Catalog in the case when the target partition already exists.
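
A minimal reproduction sketch against a Hive-backed table (table and partition names are illustrative; a Hive-enabled `spark` session is assumed):
```scala
spark.sql("CREATE TABLE t (id INT) PARTITIONED BY (p INT) STORED AS PARQUET")
spark.sql("ALTER TABLE t ADD PARTITION (p = 1) PARTITION (p = 2)")
// The target partition already exists, so this now fails with
// PartitionAlreadyExistsException instead of surfacing an internal Hive error.
spark.sql("ALTER TABLE t PARTITION (p = 1) RENAME TO PARTITION (p = 2)")
```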

### Why are the changes needed?
1. To have the same behavior of V1 In-Memory and Hive External Catalog.
2. To avoid propagating internal Hive exceptions to users.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes, the partition renaming command throws `PartitionAlreadyExistsException` for tables from the Hive catalog.

### How was this patch tested?
Added new UT:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *HiveCatalogedDDLSuite"
```

Closes #30866 from MaxGekk/throw-PartitionAlreadyExistsException.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 03:37:30 -08:00
Kousuke Saruta f4e1069bb8 [SPARK-33853][SQL] EXPLAIN CODEGEN and BenchmarkQueryTest don't show subquery code
### What changes were proposed in this pull request?

This PR fixes an issue where `EXPLAIN CODEGEN` and `BenchmarkQueryTest` don't show the corresponding code for subqueries.

The following example is about `EXPLAIN CODEGEN`.
```
spark.conf.set("spark.sql.adaptive.enabled", "false")
val df = spark.range(1, 100)
df.createTempView("df")
spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")

scala> spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("CODEGEN")
Found 1 WholeStageCodegen subtrees.
== Subtree 1 / 1 (maxMethodCodeSize:55; maxConstantPoolSize:97(0.15% used); numInnerClasses:0) ==
*(1) Project [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L]
:  +- Subquery scalar-subquery#3, [id=#24]
:     +- *(2) HashAggregate(keys=[], functions=[min(id#0L)], output=[v#2L])
:        +- Exchange SinglePartition, ENSURE_REQUIREMENTS, [id=#20]
:           +- *(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
:              +- *(1) Range (1, 100, step=1, splits=12)
+- *(1) Scan OneRowRelation[]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private scala.collection.Iterator rdd_input_0;
/* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] project_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 011 */
/* 012 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
/* 013 */     this.references = references;
/* 014 */   }
/* 015 */
/* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 017 */     partitionIndex = index;
/* 018 */     this.inputs = inputs;
/* 019 */     rdd_input_0 = inputs[0];
/* 020 */     project_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 021 */
/* 022 */   }
/* 023 */
/* 024 */   private void project_doConsume_0() throws java.io.IOException {
/* 025 */     // common sub-expressions
/* 026 */
/* 027 */     project_mutableStateArray_0[0].reset();
/* 028 */
/* 029 */     if (false) {
/* 030 */       project_mutableStateArray_0[0].setNullAt(0);
/* 031 */     } else {
/* 032 */       project_mutableStateArray_0[0].write(0, 1L);
/* 033 */     }
/* 034 */     append((project_mutableStateArray_0[0].getRow()));
/* 035 */
/* 036 */   }
/* 037 */
/* 038 */   protected void processNext() throws java.io.IOException {
/* 039 */     while ( rdd_input_0.hasNext()) {
/* 040 */       InternalRow rdd_row_0 = (InternalRow) rdd_input_0.next();
/* 041 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 042 */       project_doConsume_0();
/* 043 */       if (shouldStop()) return;
/* 044 */     }
/* 045 */   }
/* 046 */
/* 047 */ }
```

After this change, the corresponding code for subqueries is shown.
```
Found 3 WholeStageCodegen subtrees.
== Subtree 1 / 3 (maxMethodCodeSize:282; maxConstantPoolSize:206(0.31% used); numInnerClasses:0) ==
*(1) HashAggregate(keys=[], functions=[partial_min(id#0L)], output=[min#8L])
+- *(1) Range (1, 100, step=1, splits=12)

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=1
/* 006 */ final class GeneratedIteratorForCodegenStage1 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private boolean agg_initAgg_0;
/* 010 */   private boolean agg_bufIsNull_0;
/* 011 */   private long agg_bufValue_0;
/* 012 */   private boolean range_initRange_0;
/* 013 */   private long range_nextIndex_0;
/* 014 */   private TaskContext range_taskContext_0;
/* 015 */   private InputMetrics range_inputMetrics_0;
/* 016 */   private long range_batchEnd_0;
/* 017 */   private long range_numElementsTodo_0;
/* 018 */   private boolean agg_agg_isNull_2_0;
/* 019 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] range_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[3];
/* 020 */
/* 021 */   public GeneratedIteratorForCodegenStage1(Object[] references) {
/* 022 */     this.references = references;
/* 023 */   }
/* 024 */
/* 025 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 026 */     partitionIndex = index;
/* 027 */     this.inputs = inputs;
/* 028 */
/* 029 */     range_taskContext_0 = TaskContext.get();
/* 030 */     range_inputMetrics_0 = range_taskContext_0.taskMetrics().inputMetrics();
/* 031 */     range_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 032 */     range_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 033 */     range_mutableStateArray_0[2] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 034 */
/* 035 */   }
/* 036 */
/* 037 */   private void agg_doAggregateWithoutKey_0() throws java.io.IOException {
/* 038 */     // initialize aggregation buffer
/* 039 */     agg_bufIsNull_0 = true;
/* 040 */     agg_bufValue_0 = -1L;
/* 041 */
/* 042 */     // initialize Range
/* 043 */     if (!range_initRange_0) {
/* 044 */       range_initRange_0 = true;
/* 045 */       initRange(partitionIndex);
/* 046 */     }
/* 047 */
/* 048 */     while (true) {
/* 049 */       if (range_nextIndex_0 == range_batchEnd_0) {
/* 050 */         long range_nextBatchTodo_0;
/* 051 */         if (range_numElementsTodo_0 > 1000L) {
/* 052 */           range_nextBatchTodo_0 = 1000L;
/* 053 */           range_numElementsTodo_0 -= 1000L;
/* 054 */         } else {
/* 055 */           range_nextBatchTodo_0 = range_numElementsTodo_0;
/* 056 */           range_numElementsTodo_0 = 0;
/* 057 */           if (range_nextBatchTodo_0 == 0) break;
/* 058 */         }
/* 059 */         range_batchEnd_0 += range_nextBatchTodo_0 * 1L;
/* 060 */       }
/* 061 */
/* 062 */       int range_localEnd_0 = (int)((range_batchEnd_0 - range_nextIndex_0) / 1L);
/* 063 */       for (int range_localIdx_0 = 0; range_localIdx_0 < range_localEnd_0; range_localIdx_0++) {
/* 064 */         long range_value_0 = ((long)range_localIdx_0 * 1L) + range_nextIndex_0;
/* 065 */
/* 066 */         agg_doConsume_0(range_value_0);
/* 067 */
/* 068 */         // shouldStop check is eliminated
/* 069 */       }
/* 070 */       range_nextIndex_0 = range_batchEnd_0;
/* 071 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(range_localEnd_0);
/* 072 */       range_inputMetrics_0.incRecordsRead(range_localEnd_0);
/* 073 */       range_taskContext_0.killTaskIfInterrupted();
/* 074 */     }
/* 075 */
/* 076 */   }
/* 077 */
/* 078 */   private void initRange(int idx) {
/* 079 */     java.math.BigInteger index = java.math.BigInteger.valueOf(idx);
/* 080 */     java.math.BigInteger numSlice = java.math.BigInteger.valueOf(12L);
/* 081 */     java.math.BigInteger numElement = java.math.BigInteger.valueOf(99L);
/* 082 */     java.math.BigInteger step = java.math.BigInteger.valueOf(1L);
/* 083 */     java.math.BigInteger start = java.math.BigInteger.valueOf(1L);
/* 084 */     long partitionEnd;
/* 085 */
/* 086 */     java.math.BigInteger st = index.multiply(numElement).divide(numSlice).multiply(step).add(start);
/* 087 */     if (st.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 088 */       range_nextIndex_0 = Long.MAX_VALUE;
/* 089 */     } else if (st.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 090 */       range_nextIndex_0 = Long.MIN_VALUE;
/* 091 */     } else {
/* 092 */       range_nextIndex_0 = st.longValue();
/* 093 */     }
/* 094 */     range_batchEnd_0 = range_nextIndex_0;
/* 095 */
/* 096 */     java.math.BigInteger end = index.add(java.math.BigInteger.ONE).multiply(numElement).divide(numSlice)
/* 097 */     .multiply(step).add(start);
/* 098 */     if (end.compareTo(java.math.BigInteger.valueOf(Long.MAX_VALUE)) > 0) {
/* 099 */       partitionEnd = Long.MAX_VALUE;
/* 100 */     } else if (end.compareTo(java.math.BigInteger.valueOf(Long.MIN_VALUE)) < 0) {
/* 101 */       partitionEnd = Long.MIN_VALUE;
/* 102 */     } else {
/* 103 */       partitionEnd = end.longValue();
/* 104 */     }
/* 105 */
/* 106 */     java.math.BigInteger startToEnd = java.math.BigInteger.valueOf(partitionEnd).subtract(
/* 107 */       java.math.BigInteger.valueOf(range_nextIndex_0));
/* 108 */     range_numElementsTodo_0  = startToEnd.divide(step).longValue();
/* 109 */     if (range_numElementsTodo_0 < 0) {
/* 110 */       range_numElementsTodo_0 = 0;
/* 111 */     } else if (startToEnd.remainder(step).compareTo(java.math.BigInteger.valueOf(0L)) != 0) {
/* 112 */       range_numElementsTodo_0++;
/* 113 */     }
/* 114 */   }
/* 115 */
/* 116 */   private void agg_doConsume_0(long agg_expr_0_0) throws java.io.IOException {
/* 117 */     // do aggregate
/* 118 */     // common sub-expressions
/* 119 */
/* 120 */     // evaluate aggregate functions and update aggregation buffers
/* 121 */
/* 122 */     agg_agg_isNull_2_0 = true;
/* 123 */     long agg_value_2 = -1L;
/* 124 */
/* 125 */     if (!agg_bufIsNull_0 && (agg_agg_isNull_2_0 ||
/* 126 */         agg_value_2 > agg_bufValue_0)) {
/* 127 */       agg_agg_isNull_2_0 = false;
/* 128 */       agg_value_2 = agg_bufValue_0;
/* 129 */     }
/* 130 */
/* 131 */     if (!false && (agg_agg_isNull_2_0 ||
/* 132 */         agg_value_2 > agg_expr_0_0)) {
/* 133 */       agg_agg_isNull_2_0 = false;
/* 134 */       agg_value_2 = agg_expr_0_0;
/* 135 */     }
/* 136 */
/* 137 */     agg_bufIsNull_0 = agg_agg_isNull_2_0;
/* 138 */     agg_bufValue_0 = agg_value_2;
/* 139 */
/* 140 */   }
/* 141 */
/* 142 */   protected void processNext() throws java.io.IOException {
/* 143 */     while (!agg_initAgg_0) {
/* 144 */       agg_initAgg_0 = true;
/* 145 */       long agg_beforeAgg_0 = System.nanoTime();
/* 146 */       agg_doAggregateWithoutKey_0();
/* 147 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[2] /* aggTime */).add((System.nanoTime() - agg_beforeAgg_0) / 1000000);
/* 148 */
/* 149 */       // output the result
/* 150 */
/* 151 */       ((org.apache.spark.sql.execution.metric.SQLMetric) references[1] /* numOutputRows */).add(1);
/* 152 */       range_mutableStateArray_0[2].reset();
/* 153 */
/* 154 */       range_mutableStateArray_0[2].zeroOutNullBytes();
/* 155 */
/* 156 */       if (agg_bufIsNull_0) {
/* 157 */         range_mutableStateArray_0[2].setNullAt(0);
/* 158 */       } else {
/* 159 */         range_mutableStateArray_0[2].write(0, agg_bufValue_0);
/* 160 */       }
/* 161 */       append((range_mutableStateArray_0[2].getRow()));
/* 162 */     }
/* 163 */   }
/* 164 */
/* 165 */ }
```

### Why are the changes needed?

For better debuggability.

### Does this PR introduce _any_ user-facing change?

Yes. After this change, users can see subquery code by `EXPLAIN CODEGEN`.

### How was this patch tested?

New test.

Closes #30859 from sarutak/explain-codegen-subqueries.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-12-21 03:29:00 -08:00
Max Gekk b313a1e9e6 [SPARK-33849][SQL][TESTS] Unify v1 and v2 DROP TABLE tests
### What changes were proposed in this pull request?
1. Move the `DROP TABLE` parsing tests to `DropTableParserSuite`
2. Place the v1 tests for `DROP TABLE` from `DDLSuite` and v2 tests from `DataSourceV2SQLSuite` to the common trait `DropTableSuiteBase`, so, the tests will run for V1, Hive V1 and V2 DS.

### Why are the changes needed?
- The unification will allow running common `DROP TABLE` tests for DSv1, Hive DSv1, and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableParserSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableSuite"
```

Closes #30854 from MaxGekk/unify-drop-table-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-21 08:34:12 +00:00
Terry Kim 1c7b79c057 [SPARK-33856][SQL] Migrate ALTER TABLE ... RENAME TO PARTITION to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER TABLE ... RENAME TO PARTITION` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ALTER TABLE ... RENAME TO PARTITION` is not supported for v2 tables.

### Why are the changes needed?

The PR makes the resolution behavior consistent. For example,
```
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)") // works fine assuming id=1 exists.
```
, but after this PR:
```
sql("ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2)")
org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RENAME TO PARTITION' expects a table; line 1 pos 0
```
, which is consistent with the behavior of other commands.

### Does this PR introduce _any_ user-facing change?

After this PR, `ALTER TABLE` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`.

### How was this patch tested?

Updated existing tests.

Closes #30862 from imback82/alter_table_rename_partition_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-21 04:58:56 +00:00
Kousuke Saruta 3c8be3983c [SPARK-33850][SQL][FOLLOWUP] Improve and cleanup the test code
### What changes were proposed in this pull request?

This PR mainly improves and cleans up the test code introduced in #30855 based on the comment.
The test code was actually taken from another test, `explain formatted - check presence of subquery in case of DPP`, so this PR also cleans up that code (removing an unnecessary `withTable`).

### Why are the changes needed?

To keep the test code clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`ExplainSuite` passes.

Closes #30861 from sarutak/followup-SPARK-33850.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-12-21 09:40:42 +09:00
Terry Kim df2314b63a [SPARK-33852][SQL][TESTS] Use assertAnalysisError in HiveDDLSuite.scala
### What changes were proposed in this pull request?

`HiveDDLSuite` has many occurrences of the following pattern:
```scala
val e = intercept[AnalysisException] {
  sql(sqlString)
}
assert(e.message.contains(exceptionMessage))
```

However, there already exists an `assertAnalysisError` helper function that does exactly the same thing.
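
A hedged sketch of the simplified pattern (the helper's exact signature is assumed here to take the SQL text and the expected message fragment):
```scala
// Hedged sketch: the intercept/assert pair above collapses into a single call.
// The helper's signature is an assumption; see HiveDDLSuite for the real one.
assertAnalysisError(sqlString, exceptionMessage)
```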

### Why are the changes needed?

To simplify the test code through refactoring.

### Does this PR introduce _any_ user-facing change?

No, just refactoring the test code.

### How was this patch tested?

Existing tests

Closes #30857 from imback82/hive_ddl_suite_use_assertAnalysisError.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-19 14:37:15 -08:00
Kousuke Saruta 70da86a085 [SPARK-33850][SQL] EXPLAIN FORMATTED doesn't show the plan for subqueries if AQE is enabled
### What changes were proposed in this pull request?

This PR fixes an issue that when AQE is enabled, EXPLAIN FORMATTED doesn't show the plan for subqueries.

```scala
val df = spark.range(1, 100)
df.createTempView("df")
spark.sql("SELECT (SELECT min(id) AS v FROM df)").explain("FORMATTED")

== Physical Plan ==
AdaptiveSparkPlan (3)
+- Project (2)
 +- Scan OneRowRelation (1)

(1) Scan OneRowRelation
Output: []
Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0)

(2) Project
Output [1]: [Subquery subquery#3, [id=#20] AS scalarsubquery()#5L]
Input: []

(3) AdaptiveSparkPlan
Output [1]: [scalarsubquery()#5L]
Arguments: isFinalPlan=false
```

After this change, the plan for the subquery is shown.
```scala
== Physical Plan ==
* Project (2)
+- * Scan OneRowRelation (1)

(1) Scan OneRowRelation [codegen id : 1]
Output: []
Arguments: ParallelCollectionRDD[0] at explain at <console>:24, OneRowRelation, UnknownPartitioning(0)

(2) Project [codegen id : 1]
Output [1]: [Subquery scalar-subquery#3, [id=#24] AS scalarsubquery()#5L]
Input: []

===== Subqueries =====

Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#3, [id=#24]
* HashAggregate (6)
+- Exchange (5)
   +- * HashAggregate (4)
      +- * Range (3)

(3) Range [codegen id : 1]
Output [1]: [id#0L]
Arguments: Range (1, 100, step=1, splits=Some(12))

(4) HashAggregate [codegen id : 1]
Input [1]: [id#0L]
Keys: []
Functions [1]: [partial_min(id#0L)]
Aggregate Attributes [1]: [min#7L]
Results [1]: [min#8L]

(5) Exchange
Input [1]: [min#8L]
Arguments: SinglePartition, ENSURE_REQUIREMENTS, [id=#20]

(6) HashAggregate [codegen id : 2]
Input [1]: [min#8L]
Keys: []
Functions [1]: [min(id#0L)]
Aggregate Attributes [1]: [min(id#0L)#4L]
Results [1]: [min(id#0L)#4L AS v#2L]
```

### Why are the changes needed?

For better debuggability.

### Does this PR introduce _any_ user-facing change?

Yes. Users can see the formatted plan for subqueries.

### How was this patch tested?

New test.

Closes #30855 from sarutak/fix-aqe-explain.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-19 14:10:20 -08:00
Ammar Al-Batool 37c4cd8f05 [MINOR][DOCS] Fix typos in ScalaDocs for DataStreamWriter#foreachBatch
The title is pretty self-explanatory.

### What changes were proposed in this pull request?

Fixing typos in the docs for `foreachBatch` functions.

### Why are the changes needed?

To fix typos in JavaDoc/ScalaDoc.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A (documentation-only change).

Closes #30782 from ammar1x/patch-1.

Lead-authored-by: Ammar Al-Batool <ammar.albatool@gmail.com>
Co-authored-by: Ammar Al-Batool <ammar.al-batool@disneystreaming.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-12-19 14:53:40 -06:00
Terry Kim 06075d849e [SPARK-33829][SQL] Renaming v2 tables should recreate the cache
### What changes were proposed in this pull request?

Currently, renaming v2 tables does not invalidate/recreate the cache, leading to an incorrect behavior (cache not being used) when v2 tables are renamed. This PR fixes the behavior.
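
A hedged sketch of the fixed behavior (catalog and table names are hypothetical; `testcat` is assumed to be a registered v2 catalog):
```scala
sql("CACHE TABLE testcat.ns.old_tbl")
sql("ALTER TABLE testcat.ns.old_tbl RENAME TO ns.new_tbl")
// Before the fix, the cache was invalidated but not recreated for the new name;
// after the fix, queries over the renamed table are served from the cache again.
sql("SELECT * FROM testcat.ns.new_tbl").collect()
```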

### Why are the changes needed?

Fixing a bug since the cache associated with the renamed table is not being cleaned up/recreated.

### Does this PR introduce _any_ user-facing change?

Yes, now when a v2 table is renamed, cache is correctly updated.

### How was this patch tested?

Added a new test

Closes #30825 from imback82/rename_recreate_cache_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-19 08:32:58 -08:00
Kent Yao dd44ba5460 [SPARK-32976][SQL][FOLLOWUP] SET and RESTORE hive.exec.dynamic.partition.mode for HiveSQLInsertTestSuite to avoid flakiness
### What changes were proposed in this pull request?

As https://github.com/apache/spark/pull/29893#discussion_r545303780 mentioned:

> We need to set spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict") before executing this suite; otherwise, test("insert with column list - follow table output order + partitioned table") will fail.
> The reason it does not fail is that some test cases [running before this suite] do not change the default value of hive.exec.dynamic.partition.mode back to strict. However, the order of test suite execution is not deterministic.
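
A minimal sketch of the SET-and-RESTORE pattern described here (only the config key comes from the PR; the surrounding ScalaTest suite, hooks, and field are assumptions, not the PR's actual diff):
```scala
// Sketch only: save the current value, pin it for this suite, and restore it
// afterwards so the execution order of other suites no longer matters.
private var originalMode: Option[String] = None

override protected def beforeAll(): Unit = {
  super.beforeAll()
  originalMode = spark.conf.getOption("hive.exec.dynamic.partition.mode")
  spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
}

override protected def afterAll(): Unit = {
  try {
    originalMode match {
      case Some(v) => spark.conf.set("hive.exec.dynamic.partition.mode", v)
      case None => spark.conf.unset("hive.exec.dynamic.partition.mode")
    }
  } finally {
    super.afterAll()
  }
}
```
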
### Why are the changes needed?

avoid flakiness in tests

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

existing tests

Closes #30843 from yaooqinn/SPARK-32976-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-19 08:00:09 -08:00
Wenchen Fan de234eec8f [SPARK-33812][SQL] Split the histogram column stats when saving to hive metastore as table property
### What changes were proposed in this pull request?

Hive metastore has a limitation on table property length. To work around it, Spark splits the schema JSON string into several parts when saving it to the Hive metastore as table properties. We need to do the same for histogram column stats as they can become very large.

This PR refactors the table property splitting code, so that we can share it between the schema json string and histogram column stats.
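
A hedged sketch of the splitting idea (the method and property-key names below are hypothetical, not necessarily the helper this PR factors out):
```scala
// Hypothetical helper: cut a long value into chunks that each fit the
// metastore's property-length limit, keyed as <key>.numParts / <key>.part.N.
def splitLargeTableProp(key: String, value: String, threshold: Int): Map[String, String] = {
  if (value.length <= threshold) {
    Map(key -> value)
  } else {
    val parts = value.grouped(threshold).toSeq
    Map(s"$key.numParts" -> parts.size.toString) ++
      parts.zipWithIndex.map { case (part, i) => s"$key.part.$i" -> part }
  }
}
```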

### Why are the changes needed?

To be able to analyze table when histogram data is big.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test and new tests

Closes #30809 from cloud-fan/cbo.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-19 14:35:28 +09:00
Kent Yao c17c76dd16 [SPARK-33599][SQL][FOLLOWUP] FIX Github Action with unidoc
### What changes were proposed in this pull request?

FIX Github Action with unidoc

### Why are the changes needed?

FIX Github Action with unidoc
### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Pass GA

Closes #30846 from yaooqinn/SPARK-33599.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-18 11:23:38 -08:00
gengjiaan 6dca2e5d35 [SPARK-33599][SQL] Group exception messages in catalyst/analysis
### What changes were proposed in this pull request?
This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis`.
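
A hedged sketch of the grouping pattern (the package, object, and method names below are illustrative assumptions, not necessarily the exact ones introduced by this PR):
```scala
package org.apache.spark.sql.errors // hypothetical location, within the sql package

import org.apache.spark.sql.AnalysisException

// Illustrative only: error construction moves from scattered call sites into
// one central object so wording is standardized and maintained in one place.
object QueryCompilationErrorsSketch {
  def unresolvedColumnError(name: String): AnalysisException =
    new AnalysisException(s"cannot resolve column: $name")
}
```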

### Why are the changes needed?
It will largely help with the standardization of error messages and their maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #30717 from beliefer/SPARK-33599.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-18 14:12:35 +00:00
gengjiaan f239128802 [SPARK-33597][SQL] Support REGEXP_LIKE for consistent with mainstream databases
### What changes were proposed in this pull request?
Many mainstream databases support the regex function `REGEXP_LIKE`.
Currently, Spark supports `RLike`, and we just need to add a new alias `REGEXP_LIKE` for it.
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-D2124F3A-C6E4-4CCA-A40E-2FFCABFD8E19
**Presto**
https://prestodb.io/docs/current/functions/regexp.html
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_LIKE.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CRegular%20Expression%20Functions%7C_____5
**Snowflake**
https://docs.snowflake.com/en/sql-reference/functions/regexp_like.html
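
A hedged usage sketch of the new alias (literal values are arbitrary):
```scala
// regexp_like(str, pattern) behaves the same as str RLIKE pattern.
spark.sql("SELECT regexp_like('Spark SQL 3.1', 'Spark.*SQL')").show() // true
spark.sql("SELECT 'Spark SQL 3.1' RLIKE 'Spark.*SQL'").show()         // same result
```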

**Additional modifications**

1. The test case named `check outputs of expression examples` in `ExpressionInfoSuite` executes the example SQL of built-in functions, so the SQL below would be executed:
`SELECT '%SystemDrive%\Users\John' regexp_like '%SystemDrive%\\Users.*'`
But Spark SQL does not support this syntax yet.
2. Another reason: `SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';` is SQL syntax, not a use case for the function `RLike`.
For the above reasons, this PR changes the example SQL of `RLike`.

### Why are the changes needed?
To make the behavior of Spark SQL consistent with mainstream databases.

### Does this PR introduce _any_ user-facing change?
Yes, users gain a new `REGEXP_LIKE` alias for `RLIKE`.

### How was this patch tested?
Jenkins test

Closes #30543 from beliefer/SPARK-33597.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-18 13:47:31 +00:00
Yuming Wang 06b1bbbbab [SPARK-33798][SQL] Add new rule to push down the foldable expressions through CaseWhen/If
### What changes were proposed in this pull request?

This pr add a new rule(`PushFoldableIntoBranches`) to push down the foldable expressions through `CaseWhen/If`. This is a real case from production:
```sql
create table t1 using parquet as select * from range(100);
create table t2 using parquet as select * from range(200);

create temp view v1 as
select 'a' as event_type, * from t1
union all
select CASE WHEN id = 1 THEN 'b' WHEN id = 3 THEN 'c' end as event_type, * from t2

explain select * from v1 where event_type = 'a';
```

Before this PR:
```
== Physical Plan ==
Union
:- *(1) Project [a AS event_type#30533, id#30535L]
:  +- *(1) ColumnarToRow
:     +- FileScan parquet default.t1[id#30535L] Batched: true, DataFilters: [], Format: Parquet
+- *(2) Project [CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END AS event_type#30534, id#30536L]
   +- *(2) Filter (CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.t2[id#30536L] Batched: true, DataFilters: [(CASE WHEN (id#30536L = 1) THEN b WHEN (id#30536L = 3) THEN c END = a)], Format: Parquet
```

After this PR:
```
== Physical Plan ==
*(1) Project [a AS event_type#8, id#4L]
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#4L] Batched: true, DataFilters: [], Format: Parquet
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30790 from wangyum/SPARK-33798.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-18 13:20:58 +00:00
angerszhu 0603913c66 [SPARK-33593][SQL] Vector reader got incorrect data with binary partition value
### What changes were proposed in this pull request?

Currently, when the Parquet vectorized reader is enabled, using a binary type as a partition column returns an incorrect value, as the UT below shows:
```scala
test("Parquet vector reader incorrect with binary partition value") {
  Seq(false, true).foreach(tag => {
    withSQLConf("spark.sql.parquet.enableVectorizedReader" -> tag.toString) {
      withTable("t1") {
        sql(
          """CREATE TABLE t1(name STRING, id BINARY, part BINARY)
            | USING PARQUET PARTITIONED BY (part)""".stripMargin)
        sql(s"INSERT INTO t1 PARTITION(part = 'Spark SQL') VALUES('a', X'537061726B2053514C')")
        if (tag) {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", ""))
        } else {
          checkAnswer(sql("SELECT name, cast(id as string), cast(part as string) FROM t1"),
            Row("a", "Spark SQL", "Spark SQL"))
        }
      }
    }
  })
}
```

### Why are the changes needed?
Fix a data correctness issue

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #30824 from AngersZhuuuu/SPARK-33593.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-18 00:01:13 -08:00
Terry Kim 0f1a18370a [SPARK-33817][SQL] CACHE TABLE uses a logical plan when caching a query to avoid creating a dataframe
### What changes were proposed in this pull request?

This PR proposes to update `CACHE TABLE` to use a `LogicalPlan` when caching a query to avoid creating a `DataFrame` as suggested here: https://github.com/apache/spark/pull/30743#discussion_r543123190

For reference, `UNCACHE TABLE` also uses `LogicalPlan`: 0c12900120/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/CacheTableExec.scala (L91-L98)

### Why are the changes needed?

To avoid creating an unnecessary dataframe and make it consistent with `uncacheQuery` used in `UNCACHE TABLE`.

### Does this PR introduce _any_ user-facing change?

No, just internal changes.

### How was this patch tested?

Existing tests since this is an internal refactoring change.

Closes #30815 from imback82/cache_with_logical_plan.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-18 04:30:15 +00:00
Takeshi Yamamuro 51ef4430dc [SPARK-33822][SQL] Use the CastSupport.cast method in HashJoin
### What changes were proposed in this pull request?

This PR intends to fix a bug that throws an `UnsupportedOperationException` when running [the TPCDS q5](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q5.sql) with AQE enabled ([this option is enabled by default now via SPARK-33679](031c5ef280)):
```
java.lang.UnsupportedOperationException: BroadcastExchange does not support the execute() code path.
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecute(BroadcastExchangeExec.scala:189)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.exchange.ReusedExchangeExec.doExecute(Exchange.scala:60)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.adaptive.QueryStageExec.doExecute(QueryStageExec.scala:115)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
  at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
  at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
  at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
  at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
  at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:321)
  at org.apache.spark.sql.execution.SparkPlan.executeCollectIterator(SparkPlan.scala:397)
  at org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.$anonfun$relationFuture$1(BroadcastExchangeExec.scala:118)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withThreadLocalCaptured$1(SQLExecution.scala:185)
  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  ...
```

I've checked the AQE code and found that `EnsureRequirements` wrongly puts `BroadcastExchange` on top of `BroadcastQueryStage` in the `reOptimize` phase, as follows:
```
+- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#2183]
  +- BroadcastQueryStage 2
    +- ReusedExchange [d_date_sk#1086], BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint)),false), [id=#1963]
```
The root cause is that the `Cast` in a required child distribution does not have a `timeZoneId` set (`timeZoneId=None`), while the `Cast` in `child.outputPartitioning` does. This difference makes the distribution requirement check fail in `EnsureRequirements`:
1e85707738/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/EnsureRequirements.scala (L47-L50)

The `Cast` class that does not have a `timeZoneId` field is generated in the `HashJoin` object. To fix this issue, this PR proposes to use the `CastSupport.cast` method there.
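
A hedged illustration of the mismatch using catalyst's internal expression API (for illustration only; "UTC" stands in for the session time zone, and the real fix lives in `HashJoin`):
```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.LongType

// Two casts that differ only in timeZoneId do not compare equal, which is
// enough to make the distribution requirement check in EnsureRequirements fail.
val withoutTz = Cast(Literal(1), LongType)              // what HashJoin used to build
val withTz    = Cast(Literal(1), LongType, Some("UTC")) // what CastSupport.cast attaches
assert(withoutTz != withTz)
```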

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked that q5 passed.

Closes #30818 from maropu/BugfixInAQE.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-17 16:16:05 -08:00
allisonwang-db 1e85707738 [SPARK-33697][SQL] RemoveRedundantProjects should require column ordering by default
### What changes were proposed in this pull request?
This PR changes the rule `RemoveRedundantProjects`: instead of passing column-ordering requirements down from parent nodes by default, it now always requires column ordering unless otherwise specified. More specifically, instead of excluding a few nodes like GenerateExec and UnionExec that are known to require their children's columns to be ordered, the rule now includes a whitelist of nodes that are allowed to pass the ordering requirements through from their parents.

### Why are the changes needed?
Currently, this rule passes through ordering requirements from parents directly to children except for a few excluded nodes. This incorrectly removes the necessary project nodes below a UnionExec since it is not excluded. An earlier PR also fixed a similar issue for GenerateExec (SPARK-32861). In order to prevent similar issues, the rule should be changed to always require column ordering except for a few specific nodes that we know for sure can pass through the requirements.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests

Closes #30659 from allisonwang-db/spark-33697-remove-project-union.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-17 05:47:44 +00:00
Terry Kim 0c19497222 [SPARK-33815][SQL] Migrate ALTER TABLE ... SET [SERDE|SERDEPROPERTIES] to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]` is not supported for v2 tables.

### Why are the changes needed?

The PR makes the resolution behavior consistent. For example,
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("ALTER TABLE t SET SERDE 'serdename'") // works fine
```
, but after this PR:
```
sql("ALTER TABLE t SET SERDE 'serdename'")
org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... SET [SERDE|SERDEPROPERTIES]' expects a table; line 1 pos 0
```
, which is consistent with the behavior of other commands.

### Does this PR introduce _any_ user-facing change?

After this PR, `t` in the above example is resolved to a temp view first instead of `spark_catalog.test.t`.

### How was this patch tested?

Updated existing tests.

Closes #30813 from imback82/alter_table_serde_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-17 05:25:51 +00:00
Terry Kim e7e29fd0af [SPARK-33514][SQL][FOLLOW-UP] Remove unused TruncateTableStatement case class
### What changes were proposed in this pull request?

This PR removes unused `TruncateTableStatement`: https://github.com/apache/spark/pull/30457#discussion_r544433820

### Why are the changes needed?

To remove unused `TruncateTableStatement` from #30457.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed.

Closes #30811 from imback82/remove_truncate_table_stmt.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-16 14:13:02 -08:00
Kent Yao 728a1298af [SPARK-33806][SQL] limit partition num to 1 when distributing by foldable expressions
### What changes were proposed in this pull request?

It seems to be a very popular pattern for people to use a DISTRIBUTE BY clause with a literal to coalesce partitions in pure SQL data processing.

For example
```
insert into table src select * from values (1), (2), (3) t(a) distribute by 1
```

Users may want the final output to be one single data file, but the reality is not always so. Spark will always create a file for partition 0 whether it contains data or not, so when all the data goes to a partition with index > 0, there will always be 2 files and part-00000 will be empty. In addition, a lot of unnecessary empty tasks will be launched.

When users repeat the insert statement daily, hourly, or minutely, it causes small file issues.

```
spark-sql> set spark.sql.shuffle.partitions=3;drop table if exists test2;create table test2 using parquet as select * from values (1), (2), (3) t(a) distribute by 1;

 kentyaohulk  ~/spark   SPARK-33806  tree /Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/ -s
/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20201202/spark-warehouse/test2/
├── [          0]  _SUCCESS
├── [        298]  part-00000-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
└── [        426]  part-00001-5dc19733-9405-414b-9681-d25c4d3e9ee6-c000.snappy.parquet
```

To avoid this, there are some options you can take.

1. use `distribute by null`, let the data go to the partition 0
2. set spark.sql.adaptive.enabled to true for Spark to automatically coalesce
3. using hints instead of `distribute by`
4. set spark.sql.shuffle.partitions to 1

In this PR, we set the partition number to 1 in this particular case.

### Why are the changes needed?

1. avoid small file issues
2. avoid unnecessary empty tasks when no adaptive execution

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #30800 from yaooqinn/SPARK-33806.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-16 14:09:28 -08:00
Terry Kim 8666d1c39c [SPARK-33800][SQL] Remove command name in AnalysisException message when a relation is not resolved
### What changes were proposed in this pull request?

Based on the discussion https://github.com/apache/spark/pull/30743#discussion_r543124594, this PR proposes to remove the command name in AnalysisException message when a relation is not resolved.

For some of the commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier, when the identifier cannot be resolved, the exception will be something like `Table or view not found for 'SHOW TBLPROPERTIES': badtable`. The command name (`SHOW TBLPROPERTIES` in this case) should be dropped to be consistent with other existing commands.

### Why are the changes needed?

To make the exception message consistent.

### Does this PR introduce _any_ user-facing change?

Yes, the exception message will be changed from
```
Table or view not found for 'SHOW TBLPROPERTIES': badtable
```
to
```
Table or view not found: badtable
```
for commands that use `UnresolvedTable`, `UnresolvedView`, and `UnresolvedTableOrView` to resolve an identifier.

### How was this patch tested?

Updated existing tests.

Closes #30794 from imback82/remove_cmd_from_exception_msg.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-16 15:56:50 +00:00
Kent Yao 205d8e40bc [SPARK-32991][SQL] [FOLLOWUP] Reset command relies on session initials first
### What changes were proposed in this pull request?

As a follow-up of https://github.com/apache/spark/pull/30045, we modify the RESET command here to respect the session's initial configs first and then fall back to the `SharedState` conf, which lets each session maintain its own copy of initial configs for resetting.
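
A hedged sketch of the intended semantics (assuming `spark.sql.shuffle.partitions=100` was supplied as a session initial config, e.g. via `--conf` when launching spark-sql):
```scala
spark.sql("SET spark.sql.shuffle.partitions=1")
spark.sql("RESET spark.sql.shuffle.partitions")
// Before this PR: RESET always fell back to the SharedState/system default.
// After this PR: RESET restores the session's initial value (100 here).
spark.sql("SET spark.sql.shuffle.partitions").show(truncate = false)
```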

### Why are the changes needed?

To make the RESET command saner.
### Does this PR introduce _any_ user-facing change?

Yes, RESET will respect session initial configs first rather than always going to the system defaults.

### How was this patch tested?

add new tests

Closes #30642 from yaooqinn/SPARK-32991-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-16 14:36:38 +00:00
Max Gekk 9d9d4a8e12 [SPARK-33789][SQL][TESTS] Refactor unified V1 and V2 datasource tests
### What changes were proposed in this pull request?
1. Move common utility functions such as `test()`, `withNsTable()` and `checkPartitions()` to `DDLCommandTestUtils`.
2. Place common settings such as `version`, `catalog`, `defaultUsing`, `sparkConf` to `CommandSuiteBase`.

### Why are the changes needed?
To improve code maintenance of the unified tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite"
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30779 from MaxGekk/refactor-unified-tests.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-16 13:49:49 +00:00
HyukjinKwon 7845865b8d [SPARK-33803][SQL] Sort table properties by key in DESCRIBE TABLE command
### What changes were proposed in this pull request?

This PR proposes to sort table properties in DESCRIBE TABLE command. This is consistent with DSv2 command as well:
e3058ba17c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/DescribeTableExec.scala (L63)

This PR also fixes the test case in the Scala 2.13 build, where the table properties have a different order in the map.
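
A hedged illustration of the sorting (the map literal is arbitrary; the actual change is in the DESCRIBE command's property rendering):
```scala
// Sorting by key makes the rendered property list deterministic across
// Scala versions, whose Map ordering differs.
val props = Map(
  "view.query.out.col.2" -> "c",
  "view.catalogAndNamespace.part.1" -> "default")
val rendered = props.toSeq.sortBy(_._1)
  .map { case (k, v) => s"$k=$v" }
  .mkString("[", ", ", "]")
// rendered == "[view.catalogAndNamespace.part.1=default, view.query.out.col.2=c]"
```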

### Why are the changes needed?

To keep the output deterministic and readable, and to fix the tests in the Scala 2.13 build.
See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-scala-2.13/49/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/describe_sql/

```
describe.sql
Expected "...spark_catalog, view.[query.out.col.2=c, view.referredTempFunctionsNames=[], view.catalogAndNamespace.part.1=default]]", but got "...spark_catalog, view.[catalogAndNamespace.part.1=default, view.query.out.col.2=c, view.referredTempFunctionsNames=[]]]" Result did not match for query #29
DESC FORMATTED v
```

### Does this PR introduce _any_ user-facing change?

Yes, it will change the text output from `DESCRIBE [EXTENDED|FORMATTED] table_name`.
Now the table properties are sorted by its key.

### How was this patch tested?

Related unittests were fixed accordingly.

Closes #30799 from HyukjinKwon/SPARK-33803.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-16 13:42:30 +00:00
Terry Kim ef7f6903b4 [SPARK-33786][SQL] The storage level for a cache should be respected when a table name is altered
### What changes were proposed in this pull request?

This PR proposes to retain the cache's storage level when a table name is altered by `ALTER TABLE ... RENAME TO ...`.

### Why are the changes needed?

Currently, when a table name is altered, the table's cache is refreshed (if exists), but the storage level is not retained. For example:
```scala
        def getStorageLevel(tableName: String): StorageLevel = {
          val table = spark.table(tableName)
          val cachedData = spark.sharedState.cacheManager.lookupCachedData(table).get
          cachedData.cachedRepresentation.cacheBuilder.storageLevel
        }

        Seq(1 -> "a").toDF("i", "j").write.parquet(path.getCanonicalPath)
        sql(s"CREATE TABLE old USING parquet LOCATION '${path.toURI}'")
        sql("CACHE TABLE old OPTIONS('storageLevel' 'MEMORY_ONLY')")
        val oldStorageLevel = getStorageLevel("old")

        sql("ALTER TABLE old RENAME TO new")
        val newStorageLevel = getStorageLevel("new")
```
`oldStorageLevel` will be `StorageLevel(memory, deserialized, 1 replicas)` whereas `newStorageLevel` will be `StorageLevel(disk, memory, deserialized, 1 replicas)`, which is the default storage level.

### Does this PR introduce _any_ user-facing change?

Yes, now the storage level for the cache will be retained.

### How was this patch tested?

Added a unit test.

Closes #30774 from imback82/alter_table_rename_cache_fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-16 05:45:44 +00:00
Terry Kim 62be2483d7 [SPARK-33765][SQL] Migrate UNCACHE TABLE to use UnresolvedRelation to resolve identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `UNCACHE TABLE` to use `UnresolvedRelation` to resolve the table/view identifier in Analyzer as discussed https://github.com/apache/spark/pull/30403/files#r532360022.

### Why are the changes needed?

To resolve the table/view in the analyzer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Updated existing tests

Closes #30743 from imback82/uncache_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-16 05:37:56 +00:00
Max Gekk 3dfdcf4f92 [SPARK-33788][SQL] Throw NoSuchPartitionsException from HiveExternalCatalog.dropPartitions()
### What changes were proposed in this pull request?
Throw `NoSuchPartitionsException` from `ALTER TABLE .. DROP PARTITION` for non-existing partitions of a table in the V1 Hive external catalog.
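
A hedged sketch of the new behavior (table and partition names are hypothetical; ScalaTest's `intercept` is assumed to be in scope):
```scala
import org.apache.spark.sql.catalyst.analysis.NoSuchPartitionsException

// Previously the Hive catalog path raised a generic AnalysisException.
intercept[NoSuchPartitionsException] {
  sql("ALTER TABLE tbl DROP PARTITION (p = 'does_not_exist')")
}
```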

### Why are the changes needed?
The behaviour of Hive external catalog deviates from V1/V2 in-memory catalogs that throw `NoSuchPartitionsException`. To improve user experience with Spark SQL, it would be better to throw the same exception.

### Does this PR introduce _any_ user-facing change?
Yes, the command throws `NoSuchPartitionsException` instead of the general exception `AnalysisException`.

### How was this patch tested?
By running tests for `ALTER TABLE .. DROP PARTITION`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30778 from MaxGekk/hive-drop-partition-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-16 10:03:48 +09:00
Anton Okolnychyi 4d56d43838 [SPARK-33735][SQL] Handle UPDATE in ReplaceNullWithFalseInPredicate
### What changes were proposed in this pull request?

This PR adds `UpdateTable` to supported plans in `ReplaceNullWithFalseInPredicate`.

### Why are the changes needed?

This change allows Spark to optimize update conditions like we optimize filters.
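
A hedged illustration of the rewrite on a filter (the table is just `range`; with this PR the same rule also applies to conditions of `UpdateTable` plans, assuming a DSv2 source that supports UPDATE):
```scala
// IF(id > 5, null, false) can only evaluate to null or false, so the rule
// rewrites the null branch to false and the whole predicate folds away.
spark.range(10).where("IF(id > 5, null, false)").explain(true)
```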

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This PR extends the existing test cases to also cover `UpdateTable`.

Closes #30787 from aokolnychyi/spark-33735.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-15 13:50:58 -08:00
Wenchen Fan 40c37d69fd [SPARK-33617][SQL][FOLLOWUP] refine the default parallelism SQL config
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30559 . The default parallelism config in Spark core is not good, as it's unclear where it applies. To not inherit this problem in Spark SQL, this PR refines the default parallelism SQL config, to make it clear that it only applies to leaf nodes.

### Why are the changes needed?

Make the config clearer.

### Does this PR introduce _any_ user-facing change?

It changes an unreleased config.

### How was this patch tested?

existing tests

Closes #30736 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-15 14:16:43 +00:00
Prakhar Jain 23083aa594 [SPARK-33758][SQL] Prune unrequired partitionings from AliasAwareOutputPartitionings when some columns are dropped from projection
### What changes were proposed in this pull request?
This PR tries to prune the unrequired output partitionings in cases when the columns are dropped from Project/Aggregates etc.

### Why are the changes needed?
Consider this query:
    select t1.id from t1 JOIN t2 on t1.id = t2.id

This query will have top level Project node which will just project t1.id. But the outputPartitioning of this project node will be: PartitioningCollection(HashPartitioning(t1.id), HashPartitioning(t2.id)).

But since we are not propagating t2.id column, so we can drop HashPartitioning(t2.id) from the output partitioning of Project node.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UTs.

Closes #30762 from prakharjain09/SPARK-33758-prune-partitioning.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-15 13:46:58 +00:00
gengjiaan 58cb2bae74 [SPARK-33752][SQL] Avoid the getSimpleMessage of AnalysisException adds semicolon repeatedly
### What changes were proposed in this pull request?
The current `getSimpleMessage` of `AnalysisException` may add a semicolon repeatedly. An example is shown below:
`select decode()`

The output will be:
```
org.apache.spark.sql.AnalysisException
Invalid number of arguments for function decode. Expected: 2; Found: 0;; line 1 pos 7
```

### Why are the changes needed?
Fix a bug where a semicolon is added repeatedly.

### Does this PR introduce _any_ user-facing change?
Yes, the message of AnalysisException will be correct.

### How was this patch tested?
Jenkins test.

Closes #30724 from beliefer/SPARK-33752.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-15 19:20:01 +09:00
Chongguang LIU 20f6d63bc1 [SPARK-33769][SQL] Improve the next-day function of the sql component to deal with Column type
### What changes were proposed in this pull request?

The proposition of this pull request is described in this JIRA ticket: https://issues.apache.org/jira/browse/SPARK-33769

It proposes to improve the `next_day` function of the SQL component to accept a Column type for the parameter dayOfWeek.

### Why are the changes needed?

It makes this functionality easier to use.
Actually the signature of this function is:
> def next_day(date: Column, dayOfWeek: String): Column.

It accepts the dayOfWeek parameter as a String. However, in some cases the dayOfWeek is in a Column, i.e. a different value for each row of the dataframe.
A current workaround is to use the NextDay function like this:
> NextDay(dateCol.expr, dayOfWeekCol.expr).

The proposition is to add another signature for this function:
> def next_day(date: Column, dayOfWeek: Column): Column

In fact this is already the case for some other functions in this Scala object, for example:
> def date_sub(start: Column, days: Int): Column = date_sub(start, lit(days))
> def date_sub(start: Column, days: Column): Column = withExpr \{ DateSub(start.expr, days.expr) }

or

> def add_months(startDate: Column, numMonths: Int): Column = add_months(startDate, lit(numMonths))
> def add_months(startDate: Column, numMonths: Column): Column = withExpr {
>  AddMonths(startDate.expr, numMonths.expr)
>  }

This pull request is the same idea for the function next_day.
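
A hedged usage sketch of the new overload (column names and data are arbitrary):
```scala
import org.apache.spark.sql.functions._

val df = spark.createDataFrame(Seq(
  ("2020-12-14", "Mon"),
  ("2020-12-14", "Fri"))).toDF("d", "dow")

// dayOfWeek comes from a column rather than a fixed string.
df.select(next_day(to_date(col("d")), col("dow"))).show()
```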

### Does this PR introduce _any_ user-facing change?

Yes
With this pull request, users of Spark will have a new signature of the function:
> def next_day(date: Column, dayOfWeek: Column): Column

But the existing function signature should still work:
> def next_day(date: Column, dayOfWeek: String): Column

So this change should be backward compatible.

### How was this patch tested?

The unit tests of the next_day function have been enhanced.
They test the dayOfWeek parameter both as a String and as a Column.
I also added a test case for the existing signature where the dayOfWeek is an invalid String; this should return null.

Closes #30761 from chongguang/SPARK-33769.

Authored-by: Chongguang LIU <chongguang.liu@laposte.fr>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-15 18:55:48 +09:00
Wenchen Fan 03042529e3 [SPARK-33273][SQL] Fix a race condition in subquery execution
### What changes were proposed in this pull request?

If we call `SubqueryExec.executeTake`, it will call `SubqueryExec.execute` which will trigger the codegen of the query plan and create an RDD. However, `SubqueryExec` already has a thread (`SubqueryExec.relationFuture`) to execute the query plan, which means we have 2 threads triggering codegen of the same query plan at the same time.

Spark codegen is not thread-safe, as we have places like `HashAggregateExec.bufferVars` that is a shared variable. The bug in `SubqueryExec` may lead to correctness bugs.

Since https://issues.apache.org/jira/browse/SPARK-33119, `ScalarSubquery` will call `SubqueryExec.executeTake`, so flaky tests start to appear.

This PR fixes the bug by reimplementing https://github.com/apache/spark/pull/30016 . We should pass the number of rows we want to collect to `SubqueryExec` at planning time, so that we can use `executeTake` inside `SubqueryExec.relationFuture`, and the caller side should always call `SubqueryExec.executeCollect`. This PR also adds checks so that we can make sure only `SubqueryExec.executeCollect` is called.

### Why are the changes needed?

fix correctness bug.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

run `build/sbt "sql/testOnly *SQLQueryTestSuite  -- -z scalar-subquery-select"` more than 10 times. Previously it fails, now it passes.

Closes #30765 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-15 18:29:28 +09:00
Max Gekk 141e26d65b [SPARK-33767][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. DROP PARTITION tests
### What changes were proposed in this pull request?
1. Move the `ALTER TABLE .. DROP PARTITION` parsing tests to `AlterTableDropPartitionParserSuite`
2. Place v1 tests for `ALTER TABLE .. DROP PARTITION` from `DDLSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` to the common trait `AlterTableDropPartitionSuiteBase`, so, the tests will run for V1, Hive V1 and V2 DS.

### Why are the changes needed?
- The unification will allow running common `ALTER TABLE .. DROP PARTITION` tests for DSv1, Hive DSv1, and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *AlterTableDropPartitionParserSuite"
$ build/sbt -Phive -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #30747 from MaxGekk/unify-alter-table-drop-partition-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-15 05:36:57 +00:00
Terry Kim 366beda54a [SPARK-33785][SQL] Migrate ALTER TABLE ... RECOVER PARTITIONS to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ALTER TABLE ... RECOVER PARTITIONS` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ALTER TABLE ... RECOVER PARTITIONS` is not supported for v2 tables.

### Why are the changes needed?

The PR makes the resolution behavior consistent. For example,
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("ALTER TABLE t RECOVER PARTITIONS") // works fine
```
, but after this PR:
```
sql("ALTER TABLE t RECOVER PARTITIONS")
org.apache.spark.sql.AnalysisException: t is a temp view. 'ALTER TABLE ... RECOVER PARTITIONS' expects a table; line 1 pos 0
```
, which is consistent with the behavior of other commands.

### Does this PR introduce _any_ user-facing change?

After this PR, `ALTER TABLE t RECOVER PARTITIONS` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`.

### How was this patch tested?

Updated existing tests.

Closes #30773 from imback82/alter_table_recover_part_v2.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-15 05:23:39 +00:00
Chao Sun 49d3256497 [SPARK-33653][SQL] DSv2: REFRESH TABLE should recache the table itself
### What changes were proposed in this pull request?

This changes DSv2 refresh table semantics to also recache the target table itself.

### Why are the changes needed?

Currently "REFRESH TABLE" in DSv2 only invalidate all caches referencing the table. With #30403 merged which adds support for caching a DSv2 table, we should also recache the target table itself to make the behavior consistent with DSv1.

### Does this PR introduce _any_ user-facing change?

Yes, now refreshing a table in DSv2 also recaches the target table itself.
### How was this patch tested?

Added coverage of this new behavior in the existing UT for v2 refresh table command

Closes #30742 from sunchao/SPARK-33653.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-14 15:18:50 -08:00
Max Gekk f156718587 [SPARK-33777][SQL] Sort output of V2 SHOW PARTITIONS
### What changes were proposed in this pull request?
List partitions returned by the V2 `SHOW PARTITIONS` command in alphabetical order.

### Why are the changes needed?
To have the same behavior as:
1. V1 in-memory catalog, see a28ed86a38/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/InMemoryCatalog.scala (L546)
2. V1 Hive catalogs, see fab2995972/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveClientImpl.scala (L715)

### Does this PR introduce _any_ user-facing change?
Yes, after the changes, V2 SHOW PARTITIONS sorts its output.

### How was this patch tested?
Added new UT to the base trait `ShowPartitionsSuiteBase` which contains tests for V1 and V2.

Closes #30764 from MaxGekk/sort-show-partitions.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-14 14:28:47 -08:00
Yuming Wang 412d86e711 [SPARK-33771][SQL][TESTS] Fix Invalid value for HourOfAmPm when testing on JDK 14
### What changes were proposed in this pull request?

This PR fixes an invalid value for HourOfAmPm when testing on JDK 14.

### Why are the changes needed?

Run test on JDK 14.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #30754 from wangyum/SPARK-33771.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-14 13:34:23 -08:00
Anton Okolnychyi bb60fb1bbd [SPARK-33779][SQL][FOLLOW-UP] Fix Java Linter error
### What changes were proposed in this pull request?

This PR removes unused imports.

### Why are the changes needed?

These changes are required to fix the build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Via `dev/lint-java`.

Closes #30767 from aokolnychyi/fix-linter.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-14 11:39:42 -08:00
Anton Okolnychyi 82aca7eb8f [SPARK-33779][SQL] DataSource V2: API to request distribution and ordering on write
### What changes were proposed in this pull request?

This PR adds connector interfaces proposed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889.

**Note**: This PR contains a subset of changes discussed in PR #29066.

### Why are the changes needed?

Data sources should be able to request a specific distribution and ordering of data on write. In particular, these scenarios are considered useful:
- global sort
- cluster data and sort within partitions
- local sort within partitions
- no sort

Please see the design doc above for a more detailed explanation of requirements.

### Does this PR introduce _any_ user-facing change?

This PR introduces public changes to the DS V2 by adding a logical write abstraction as we have on the read path as well as additional interfaces to represent distribution and ordering of data (please see the doc for more info).

The existing `Distribution` interface in the `read` package is read-specific and not flexible enough, as discussed in the design doc. The current proposal is to evolve these interfaces separately until they converge.

### How was this patch tested?

This patch adds only interfaces.

Closes #30706 from aokolnychyi/spark-23889-interfaces.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Ryan Blue <blue@apache.org>
2020-12-14 10:54:18 -08:00
ulysses-you 839d6899ad [SPARK-33733][SQL] PullOutNondeterministic should check and collect deterministic field
### What changes were proposed in this pull request?

The deterministic field is wider than `NonDeterministic`; we should keep the same range between pull-out and check analysis.

### Why are the changes needed?

For example
```
select * from values(1), (4) as t(c1) order by java_method('java.lang.Math', 'abs', c1)
```

We will get an exception since `java_method`'s deterministic field is false even though it is not a `NonDeterministic` expression:
```
Exception in thread "main" org.apache.spark.sql.AnalysisException: nondeterministic expressions are only allowed in
Project, Filter, Aggregate or Window, found:
 java_method('java.lang.Math', 'abs', t.`c1`) ASC NULLS FIRST
in operator Sort [java_method(java.lang.Math, abs, c1#1) ASC NULLS FIRST], true
               ;;
```

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Add test.

Closes #30703 from ulysses-you/SPARK-33733.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-14 14:35:24 +00:00
angerszhu 5f9a7fea06 [SPARK-33428][SQL] Conv UDF use BigInt to avoid Long value overflow
### What changes were proposed in this pull request?
Using a Long value to store the encoded value can overflow and return an unexpected result; use BigInt to replace the Long value and make the logic simpler.

### Why are the changes needed?
Fix a value overflow issue

### Does this PR introduce _any_ user-facing change?
People can use the `conv` function to convert values bigger than Long.MAX_VALUE.

### How was this patch tested?
Added UT

#### BenchMark
```
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *    http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.spark.sql.execution.benchmark

import scala.util.Random

import org.apache.spark.benchmark.Benchmark
import org.apache.spark.sql.functions._
object ConvFuncBenchMark extends SqlBasedBenchmark {

  val charset =
    Array[String]("0", "1", "2", "3", "4", "5", "6", "7", "8", "9",
      "A", "B", "C", "D", "E", "F", "G",
      "H", "I", "J", "K", "L", "M", "N",
      "O", "P", "Q", "R", "S", "T",
      "U", "V", "W", "X", "Y", "Z")

  def constructString(from: Int, length: Int): String = {
    val chars = charset.slice(0, from)
    (0 to length).map(x => {
      val v = Random.nextInt(from)
      chars(v)
    }).mkString("")
  }

  private def doBenchmark(cardinality: Long, length: Int, from: Int, toBase: Int): Unit = {
    spark.range(cardinality)
      .withColumn("str", lit(constructString(from, length)))
      .select(conv(col("str"), from, toBase))
      .noop()
  }

  /**
   * Main process of the whole benchmark.
   * Implementations of this method are supposed to use the wrapper method `runBenchmark`
   * for each benchmark scenario.
   */
  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val N = 1000000L
    val benchmark = new Benchmark("conv", N, output = output)
    benchmark.addCase("length 10 from 2 to 16") { _ =>
      doBenchmark(N, 10, 2, 16)
    }

    benchmark.addCase("length 10 from 2 to 10") { _ =>
      doBenchmark(N, 10, 2, 10)
    }

    benchmark.addCase("length 10 from 10 to 16") { _ =>
      doBenchmark(N, 10, 10, 16)
    }

    benchmark.addCase("length 10 from 10 to 36") { _ =>
      doBenchmark(N, 10, 10, 36)
    }

    benchmark.addCase("length 10 from 16 to 10") { _ =>
      doBenchmark(N, 10, 10, 10)
    }

    benchmark.addCase("length 10 from 16 to 36") { _ =>
      doBenchmark(N, 10, 16, 36)
    }

    benchmark.addCase("length 10 from 36 to 10") { _ =>
      doBenchmark(N, 10, 36, 10)
    }

    benchmark.addCase("length 10 from 36 to 16") { _ =>
      doBenchmark(N, 10, 36, 16)
    }

    //
    benchmark.addCase("length 20 from 10 to 16") { _ =>
      doBenchmark(N, 20, 10, 16)
    }

    benchmark.addCase("length 20 from 10 to 36") { _ =>
      doBenchmark(N, 20, 10, 36)
    }

    benchmark.addCase("length 30 from 10 to 16") { _ =>
      doBenchmark(N, 30, 10, 16)
    }

    benchmark.addCase("length 30 from 10 to 36") { _ =>
      doBenchmark(N, 30, 10, 36)
    }

    //
    benchmark.addCase("length 20 from 16 to 10") { _ =>
      doBenchmark(N, 20, 16, 10)
    }

    benchmark.addCase("length 20 from 16 to 36") { _ =>
      doBenchmark(N, 20, 16, 36)
    }

    benchmark.addCase("length 30 from 16 to 10") { _ =>
      doBenchmark(N, 30, 16, 10)
    }

    benchmark.addCase("length 30 from 16 to 36") { _ =>
      doBenchmark(N, 30, 16, 36)
    }

    benchmark.run()
  }

}
```

Result with patch :
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.6
Intel(R) Core(TM) i5-8259U CPU  2.30GHz
conv:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
length 10 from 2 to 16                               54             73          18         18.7          53.6       1.0X
length 10 from 2 to 10                               43             47           5         23.5          42.5       1.3X
length 10 from 10 to 16                              39             47          12         25.5          39.2       1.4X
length 10 from 10 to 36                              38             42           3         26.5          37.7       1.4X
length 10 from 16 to 10                              39             41           3         25.7          38.9       1.4X
length 10 from 16 to 36                              36             41           4         27.6          36.3       1.5X
length 10 from 36 to 10                              38             40           2         26.3          38.0       1.4X
length 10 from 36 to 16                              37             39           2         26.8          37.2       1.4X
length 20 from 10 to 16                              36             39           2         27.4          36.5       1.5X
length 20 from 10 to 36                              37             39           2         27.2          36.8       1.5X
length 30 from 10 to 16                              37             39           2         27.0          37.0       1.4X
length 30 from 10 to 36                              36             38           2         27.5          36.3       1.5X
length 20 from 16 to 10                              35             38           2         28.3          35.4       1.5X
length 20 from 16 to 36                              34             38           3         29.2          34.3       1.6X
length 30 from 16 to 10                              38             40           2         26.3          38.1       1.4X
length 30 from 16 to 36                              37             38           1         27.2          36.8       1.5X
```
Result without patch:
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.14.6
Intel(R) Core(TM) i5-8259U CPU  2.30GHz
conv:                                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
length 10 from 2 to 16                               66            101          29         15.1          66.1       1.0X
length 10 from 2 to 10                               50             55           5         20.2          49.5       1.3X
length 10 from 10 to 16                              46             51           5         21.8          45.9       1.4X
length 10 from 10 to 36                              43             48           4         23.4          42.7       1.5X
length 10 from 16 to 10                              44             47           4         22.9          43.7       1.5X
length 10 from 16 to 36                              40             44           2         24.7          40.5       1.6X
length 10 from 36 to 10                              40             44           4         25.0          40.1       1.6X
length 10 from 36 to 16                              41             43           2         24.3          41.2       1.6X
length 20 from 10 to 16                              39             41           2         25.7          38.9       1.7X
length 20 from 10 to 36                              40             42           2         24.9          40.2       1.6X
length 30 from 10 to 16                              39             40           1         25.9          38.6       1.7X
length 30 from 10 to 36                              40             41           1         25.0          40.0       1.7X
length 20 from 16 to 10                              40             41           1         25.1          39.8       1.7X
length 20 from 16 to 36                              40             42           2         25.2          39.7       1.7X
length 30 from 16 to 10                              39             42           2         25.6          39.0       1.7X
length 30 from 16 to 36                              39             40           2         25.7          38.8       1.7X
```

Closes #30350 from AngersZhuuuu/SPARK-33428.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-14 14:32:08 +00:00
yangjie01 cd0356df9e [SPARK-33673][SQL] Avoid push down partition filters to ParquetScan for DataSourceV2
### What changes were proposed in this pull request?
As described in SPARK-33673, some test suites in `ParquetV2SchemaPruningSuite` fail when `parquet.version` is set to 1.11.1, because Parquet returns empty results for non-existent columns since PARQUET-1765.

This PR changes `ParquetScanBuilder` to build `pushedParquetFilters` from `readDataSchema()` instead of `schema`, so that partition filters are not pushed down to `ParquetScan` for `DataSourceV2`.
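A minimal, hypothetical sketch of the idea (the helper name and wiring are illustrative, not Spark's internal API): only filters whose referenced columns exist in the read/data schema are eligible for Parquet pushdown, so partition-only columns never reach the Parquet reader.
```scala
import org.apache.spark.sql.sources.Filter
import org.apache.spark.sql.types.StructType

// Keep only filters whose referenced columns are present in the data (read) schema;
// partition columns are not stored in the Parquet files, so filters on them
// must not be handed to the Parquet reader.
def pushableParquetFilters(filters: Seq[Filter], readDataSchema: StructType): Seq[Filter] = {
  val dataColumns = readDataSchema.fieldNames.toSet
  filters.filter(_.references.forall(dataColumns.contains))
}
```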

### Why are the changes needed?
Prepare for upgrading to Parquet 1.11.1.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- Pass the Jenkins or GitHub Action

- Manual test as follows:

```
mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.execution.datasources.parquet.ParquetV2SchemaPruningSuite -Dparquet.version=1.11.1 test -pl sql/core -am
```

**Before**

```
Run completed in 3 minutes, 13 seconds.
Total number of tests run: 134
Suites: completed 2, aborted 0
Tests: succeeded 120, failed 14, canceled 0, ignored 0, pending 0
*** 14 TESTS FAILED ***
```

**After**

```
Run completed in 3 minutes, 46 seconds.
Total number of tests run: 134
Suites: completed 2, aborted 0
Tests: succeeded 134, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #30652 from LuciferYang/SPARK-33673.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2020-12-14 17:51:40 +08:00
Terry Kim a84c8d842c [SPARK-33751][SQL] Migrate ALTER VIEW ... AS command to use UnresolvedView to resolve the identifier
### What changes were proposed in this pull request?

This PR migrates `ALTER VIEW ... AS` to use `UnresolvedView` to resolve the view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

The `TempViewOrV1Table` extractor in `ResolveSessionCatalog.scala` can now be removed as well.

### Why are the changes needed?

To use `UnresolvedView` for view resolution.

### Does this PR introduce _any_ user-facing change?

The exception message changes if a table is found instead of view:
```
// OLD
`tab1` is not a view
```
```
// NEW
"tab1 is a table. 'ALTER VIEW ... AS' expects a view."
```
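A hedged reproduction sketch (table name and schema are illustrative, not taken from the PR's tests):
```scala
// Running ALTER VIEW ... AS against a table rather than a view now fails with
// the clearer message quoted above.
spark.sql("CREATE TABLE tab1 (id INT) USING parquet")
spark.sql("ALTER VIEW tab1 AS SELECT 1 AS id")
// Expected: "tab1 is a table. 'ALTER VIEW ... AS' expects a view."
```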

### How was this patch tested?

Updated existing tests.

Closes #30723 from imback82/alter_view_as_statement.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-14 08:39:01 +00:00
Linhong Liu b7c8210135 [SPARK-33142][SPARK-33647][SQL][FOLLOW-UP] Add docs and test cases
### What changes were proposed in this pull request?
Addressed comments in PR #30567, including:
1. add test case for SPARK-33647 and SPARK-33142
2. add migration guide
3. add `getRawTempView` and `getRawGlobalTempView` to return the raw view info (i.e. TemporaryViewRelation)
4. other minor code cleanup

### Why are the changes needed?
Code cleanup and more test cases

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing and newly added test cases

Closes #30666 from linhongliu-db/SPARK-33142-followup.

Lead-authored-by: Linhong Liu <linhong.liu@databricks.com>
Co-authored-by: Linhong Liu <67896261+linhongliu-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-14 08:31:50 +00:00
xuewei.linxuewei e7fe92f129 [SPARK-33546][SQL] Enable row format file format validation in CREATE TABLE LIKE
### What changes were proposed in this pull request?

[SPARK-33546] states that there are three inconsistent behaviors in CREATE TABLE LIKE.

1. CREATE TABLE LIKE does not validate the user-specified hive serde. e.g., STORED AS PARQUET can't be used with ROW FORMAT SERDE.
2. CREATE TABLE LIKE requires STORED AS and ROW FORMAT SERDE to be specified together, which is not necessary.
3. CREATE TABLE LIKE does not respect the default hive serde.

This PR fixes No.1; after investigation, No.2 and No.3 turn out not to be issues.
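For the No.1 validation specifically, a hedged illustration (the DDL below and the exact error are assumptions for illustration only):
```scala
// Combining a user-specified ROW FORMAT SERDE with STORED AS PARQUET is now
// rejected when copying a table definition, instead of being silently accepted.
spark.sql(
  """CREATE TABLE dst LIKE src
    |ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
    |STORED AS PARQUET
    |""".stripMargin)
// Expected after this change: an AnalysisException about the invalid serde/format combination.
```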

Within Hive:

CREATE TABLE abc ... ROW FORMAT SERDE 'xxx.xxx.SerdeClass' (without STORED AS) produces the
following result: the user-specified SerDe class is used, and the default input/output formats are taken from the default textfile format.

```
SerDe Library:          xxx.xxx.SerdeClass
InputFormat:            org.apache.hadoop.mapred.TextInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
```

But
CREATE TABLE dst LIKE src ROW FORMAT SERDE 'xxx.xxx.SerdeClass' (without STORED AS) simply ignores the user-specified SerDe class and uses (input, output, serdeClass) from the src table.

It's better to throw an exception on such ambiguous behavior, so No.2 is not an issue; this PR just adds some comments.

For No.3, CreateTableLikeCommand already uses the following logic to fall back to the src table's storageFormat when the current fileFormat.inputFormat is empty:

```
val newStorage = if (fileFormat.inputFormat.isDefined) {
      fileFormat
    } else {
      sourceTableDesc.storage.copy(locationUri = fileFormat.locationUri)
    }
```

If we filled the new target table with HiveSerDe.getDefaultStorage when the file format and row format are not explicitly specified, it would break the CREATE TABLE LIKE semantics.

### Why are the changes needed?

Bug Fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added UT and Existing UT.

Closes #30705 from leanken/leanken-SPARK-33546.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-14 08:27:18 +00:00
Max Gekk 817f58ddcb [SPARK-33768][SQL] Remove retainData from AlterTableDropPartition
### What changes were proposed in this pull request?
Remove the `retainData` parameter from the logical node `AlterTableDropPartition`.

### Why are the changes needed?
The `AlterTableDropPartition` command reflects the sql statement (see SqlBase.g4):
```
    | ALTER (TABLE | VIEW) multipartIdentifier
        DROP (IF EXISTS)? partitionSpec (',' partitionSpec)* PURGE?    #dropTablePartitions
```
but Spark doesn't allow specifying data retention. So, the parameter can be removed to improve code maintainability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test suite `DDLParserSuite`.

Closes #30748 from MaxGekk/remove-retainData.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-14 08:16:33 +00:00
Max Gekk 9160d59ae3 [SPARK-33770][SQL][TESTS] Fix the ALTER TABLE .. DROP PARTITION tests that delete files out of partition path
### What changes were proposed in this pull request?
Modify the tests that add partitions with `LOCATION` where the number of nested folders in `LOCATION` doesn't match the number of partitioned columns. In that case, `ALTER TABLE .. DROP PARTITION` tries to access (delete) a folder outside the "base" path in `LOCATION`.

The problem belongs to Hive's MetaStore method `drop_partition_common`:
8696c82d07/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java (L4876)
which tries to delete empty partition sub-folders recursively, starting from the deepest partition sub-folder up to the base folder. When the number of sub-folders is not equal to the number of partitioned columns `part_vals.size()`, the method will try to list and delete folders outside the base path.

### Why are the changes needed?
To fix test failures like https://github.com/apache/spark/pull/30643#issuecomment-743774733:
```
org.apache.spark.sql.hive.execution.command.AlterTableAddPartitionSuite.ALTER TABLE .. ADD PARTITION Hive V1: SPARK-33521: universal type conversions of partition values
sbt.ForkMain$ForkError: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist;
	at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:112)
	at org.apache.spark.sql.hive.HiveExternalCatalog.dropPartitions(HiveExternalCatalog.scala:1014)
...
Caused by: sbt.ForkMain$ForkError: org.apache.hadoop.hive.metastore.api.MetaException: File file:/home/jenkins/workspace/SparkPullRequestBuilder/target/tmp/spark-832cb19c-65fd-41f3-ae0b-937d76c07897 does not exist
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.drop_partition_with_environment_context(HiveMetaStore.java:3381)
	at sun.reflect.GeneratedMethodAccessor304.invoke(Unknown Source)
```

The issue can be reproduced by the following steps:
1. Create a base folder, for example: `/Users/maximgekk/tmp/part-location`
2. Create a sub-folder in the base folder and drop permissions for it:
```
$ mkdir /Users/maximgekk/tmp/part-location/aaa
$ chmod a-rwx /Users/maximgekk/tmp/part-location/aaa
$ ls -al /Users/maximgekk/tmp/part-location
total 0
drwxr-xr-x   3 maximgekk  staff    96 Dec 13 18:42 .
drwxr-xr-x  33 maximgekk  staff  1056 Dec 13 18:32 ..
d---------   2 maximgekk  staff    64 Dec 13 18:42 aaa
```
3. Create a table with a partition folder in the base folder:
```sql
spark-sql> create table tbl (id int) partitioned by (part0 int, part1 int);
spark-sql> alter table tbl add partition (part0=1,part1=2) location '/Users/maximgekk/tmp/part-location/tbl';
```
4. Try to drop this partition:
```
spark-sql> alter table tbl drop partition (part0=1,part1=2);
20/12/13 18:46:07 ERROR HiveClientImpl:
======================
Attempt to drop the partition specs in table 'tbl' database 'default':
Map(part0 -> 1, part1 -> 2)
In this attempt, the following partitions have been dropped successfully:

The remaining partitions have not been dropped:
[1, 2]
======================

Error in query: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Error accessing file:/Users/maximgekk/tmp/part-location/aaa;
```
The command fails because it tries to access the sub-folder `aaa`, which is outside the partition path `/Users/maximgekk/tmp/part-location/tbl`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected tests from local IDEA which does not have access to folders out of partition paths.

Closes #30752 from MaxGekk/fix-drop-partition-location.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-14 15:56:46 +09:00
Kent Yao 4d47ac4b4b [SPARK-33705][SQL][TEST] Fix HiveThriftHttpServerSuite flakiness
### What changes were proposed in this pull request?
To fix the flaky tests:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/132345/testReport/
```
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.JDBC query execution
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.Checks Hive version
org.apache.spark.sql.hive.thriftserver.HiveThriftHttpServerSuite.SPARK-24829 Checks cast as float
```

The root cause here is a jar conflict issue.
`NewCookie.isHttpOnly` is not defined in the `jsr311-api.jar`, which causes the conflict.
The transitive artifact `jsr311-api.jar` of `hadoop-client` is excluded on the Maven side. See https://issues.apache.org/jira/browse/SPARK-27179.

The Jenkins PR builder and GitHub Actions use `SBT` as the build tool.

First, the exclusion rule from Maven is not followed by SBT, so I was able to see `jsr311-api.jar` from the Maven cache being added to the classpath directly. **This seems to be a bug of the `sbt-pom-reader` plugin, but I'm not sure.**

Then I added an `ExcludeRule` for the `hive-thriftserver` module at the SBT side and did see the `jsr311-api.jar` gone, but the CI jobs still failed with the same error.

I added a trace log in ThriftHttpServlet

```s
ERROR ThriftHttpServlet: !!!!!!!!! Suspect???????? --->
file:/home/jenkins/workspace/SparkPullRequestBuilder/assembly/target/scala-2.12/jars/jsr311-api-1.1.1.jar
```
And the log pointed out that the assembly phase copied it to `assembly/target/scala-2.12/jars/`, which is added to the classpath too. With the help of the SBT `dependencyTree` tool, I saw `jsr311-api` again as a transitive dependency of `jersey-core` from the `yarn` module with `test` scope. So **this seems to be another bug on the SBT side, in the `sbt-assembly` plugin.** It copied a test-scope transitive artifact to the assembly output.

In this PR, I defined some rules in SparkBuild.scala to bypass the potential bugs from the SBT side.

First, exclude `jsr311` from the whole project and then add it back separately to the YARN module for SBT, as sketched below.
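A minimal sketch of that approach, assuming SBT's standard exclusion settings (the real rules live in SparkBuild.scala and may differ):
```scala
// Exclude jsr311-api everywhere so it never leaks onto the test or assembly classpath.
excludeDependencies += ExclusionRule("javax.ws.rs", "jsr311-api")

// Add it back explicitly only for the module that needs it (the version matches
// the jar observed in the trace log above; wiring it to the YARN module only is assumed).
libraryDependencies += "javax.ws.rs" % "jsr311-api" % "1.1.1"
```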

Additionally, the HiveThriftServerSuites were also refactored to reduce flakiness, though that is not related to the bugs I have found so far.

### Why are the changes needed?

Fix the flaky tests described above.

### Does this PR introduce _any_ user-facing change?

NO
### How was this patch tested?

Passing Jenkins and GitHub Actions.

Closes #30643 from yaooqinn/HiveThriftHttpServerSuite.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-12-14 05:14:38 +00:00
Gengliang Wang 6e862792fb [SPARK-33723][SQL] ANSI mode: Casting String to Date should throw exception on parse error
### What changes were proposed in this pull request?

Currently, when casting a string as timestamp type in ANSI mode, Spark throws a runtime exception on parsing error.
However, when casting a string to the date type fails to parse, the result is simply null. We should throw an exception on the parsing error as well.

### Why are the changes needed?

Add missing feature for ANSI mode

### Does this PR introduce _any_ user-facing change?

Yes. In ANSI mode, casting a string to date will throw an exception on a parsing error.
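A hedged example of the change (the config key is the standard ANSI flag; the exact exception type and message are assumptions):
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Previously this returned NULL; with ANSI mode it now throws a runtime
// exception because the string cannot be parsed as a date.
spark.sql("SELECT CAST('invalid_date' AS DATE)").show()
```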

### How was this patch tested?

Unit test

Closes #30687 from gengliangwang/castDate.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-14 10:22:37 +09:00
Takeshi Yamamuro 8197ee3b15 [SPARK-33690][SQL] Escape meta-characters in showString
### What changes were proposed in this pull request?

This PR intends to escape meta-characters (e.g., \n and \t) in `Dataset.showString`.
Before this PR:
```
scala> Seq("aaa\nbbb\t\tccccc").toDF("value").show()
+--------------+
|         value|
+--------------+
|aaa
bbb		ccccc|
+--------------+
```
After this PR:
```
+-----------------+
|            value|
+-----------------+
|aaa\nbbb\t\tccccc|
+-----------------+
```

### Why are the changes needed?

For better output.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test.

Closes #30647 from maropu/EscapeMetaInShow.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-13 15:04:23 -08:00