ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Chao Sun	62d82b5b27	[SPARK-34076][SQL] SQLContext.dropTempTable fails if cache is non-empty ### What changes were proposed in this pull request? This changes `CatalogImpl.dropTempView` and `CatalogImpl.dropGlobalTempView` use analyzed logical plan instead of `viewDef` which is unresolved. ### Why are the changes needed? Currently, `CatalogImpl.dropTempView` is implemented as following: ```scala override def dropTempView(viewName: String): Boolean = { sparkSession.sessionState.catalog.getTempView(viewName).exists { viewDef => sparkSession.sharedState.cacheManager.uncacheQuery( sparkSession, viewDef, cascade = false) sessionCatalog.dropTempView(viewName) } } ``` Here, the logical plan `viewDef` is not resolved, and when passing to `uncacheQuery`, it could fail at `sameResult` call, where canonicalized plan is compared. The error message looks like: ``` Invalid call to qualifier on unresolved object, tree: 'key ``` This can be reproduced via: ```scala sql(s"CREATE TEMPORARY VIEW $v AS SELECT key FROM src LIMIT 10") sql(s"CREATE TABLE $t AS SELECT * FROM src") sql(s"CACHE TABLE $t") dropTempTable(v) ``` ### Does this PR introduce _any_ user-facing change? The only user-facing change is that, previously `SQLContext.dropTempTable` may fail in the above scenario but will work with this fix. ### How was this patch tested? Added new unit tests. Closes #31136 from sunchao/SPARK-34076. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-13 13:22:21 +00:00
LantaoJin	f1b21ba505	[SPARK-34064][SQL] Cancel the running broadcast sub-jobs when SQL statement is cancelled ### What changes were proposed in this pull request? #24595 introduced `private val runId: UUID = UUID.randomUUID` in `BroadcastExchangeExec` to cancel the broadcast execution in the Future when timeout happens. Since the runId is a random UUID instead of inheriting the job group id, when a SQL statement is cancelled, these broadcast sub-jobs are still executing. This PR uses the job group id of the outside thread as its `runId` to abort these broadcast sub-jobs when the SQL statement is cancelled. ### Why are the changes needed? When broadcasting a table takes too long and the SQL statement is cancelled. However, the background Spark job is still running and it wastes resources. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually test. Since broadcasting a table is too fast to cancel in UT, but it is very easy to verify manually: 1. Start a Spark thrift-server with less resource in YARN. 2. When the driver is running but no executors are launched, submit a SQL which will broadcast tables from beeline. 3. Cancel the SQL in beeline Without the patch, broadcast sub-jobs won't be cancelled. ![Screen Shot 2021-01-11 at 12 03 13 PM](https://user-images.githubusercontent.com/1853780/104150975-ab024b00-5416-11eb-8bf9-b5167bdad80a.png) With this patch, broadcast sub-jobs will be cancelled. ![Screen Shot 2021-01-11 at 11 43 40 AM](https://user-images.githubusercontent.com/1853780/104150994-be151b00-5416-11eb-80ff-313d423c8a2e.png) Closes #31119 from LantaoJin/SPARK-34064. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-13 12:58:27 +00:00
Kent Yao	04f031acb3	[SPARK-34086][SQL] RaiseError generates too much code and may fails codegen in length check for char varchar ### What changes were proposed in this pull request? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ We can reduce more than 8000 bytes by removing the unnecessary CONCAT expression. W/ this fix, for q41 in TPCDS with [Using TPCDS original definitions for char/varchar columns](https://github.com/apache/spark/pull/31012) applied, we can reduce the stage code-gen size from 22523 to 14369 ``` 14369 - 22523 = - 8154 ``` ### Why are the changes needed? fix the perf regression(we need other improvements for q41 works), there will be a huge performance regression if codegen fails ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? modified uts Closes #31150 from yaooqinn/SPARK-34086. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-13 09:52:36 +00:00
Max Gekk	861f8bb5fb	[SPARK-34071][SQL][TESTS] Check stats of cached v1 tables after altering ### What changes were proposed in this pull request? Port the test added by https://github.com/apache/spark/pull/31112 to: 1. v1 In-Memory catalog for `ALTER TABLE .. DROP PARTITION` 2. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. ADD PARTITION` 3. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. RENAME PARTITION` ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly .AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly .AlterTableDropPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite" ``` Closes #31131 from MaxGekk/cache-stats-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-13 04:58:01 +00:00
Takuya UESHIN	ad8e40e2ab	[SPARK-32338][SQL][PYSPARK][FOLLOW-UP][TEST] Add more tests for slice function ### What changes were proposed in this pull request? This PR is a follow-up of #29138 and #29195 to add more tests for `slice` function. ### Why are the changes needed? The original PRs are missing tests with column-based arguments instead of literals. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests and existing tests. Closes #31159 from ueshin/issues/SPARK-32338/slice_tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-13 09:56:38 +09:00
yi.wu	0099715aae	[SPARK-34091][SQL] Shuffle batch fetch should be able to disable after it's been enabled ### What changes were proposed in this pull request? Fix the setting issue of shuffle batch fetch in `ShuffledRowRDD`. ### Why are the changes needed? Currently, we can not disable the shuffle batch fetch mode once the batch fetch mode has been enabled. This PR fixes the issue to make `ShuffledRowRDD` respects the `spark.sql.adaptive.fetchShuffleBlocksInBatch` at runtime. ### Does this PR introduce _any_ user-facing change? Yes. Before this PR, users can not disable batch fetch if they enabled first. After this PR, they can. ### How was this patch tested? Added unit test. Closes #31155 from Ngone51/fix-batchfetch-set-issue. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-12 15:45:15 +00:00
Max Gekk	6c047958f9	[SPARK-34084][SQL] Fix auto updating of table stats in `ALTER TABLE .. ADD PARTITION` ### What changes were proposed in this pull request? Fix an issue in `ALTER TABLE .. ADD PARTITION` which happens when: - A table doesn't have stats - `spark.sql.statistics.size.autoUpdate.enabled` is `true` In that case, `ALTER TABLE .. ADD PARTITION` does not update table stats automatically. ### Why are the changes needed? The changes fix the issue demonstrated by the example: ```sql spark-sql> create table tbl (col0 int, part int) partitioned by (part); spark-sql> insert into tbl partition (part = 0) select 0; spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; spark-sql> alter table tbl add partition (part = 1); ``` the `add partition` command should update table stats but it does not. There is no stats in the output of: ``` spark-sql> describe table extended tbl; ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes, `ALTER TABLE .. ADD PARTITION` updates stats even when a table does have them before the command: ```sql spark-sql> alter table tbl add partition (part = 1); spark-sql> describe table extended tbl; col0 int NULL part int NULL # Partition Information # col_name data_type comment part int NULL # Detailed Table Information ... Statistics 2 bytes ``` ### How was this patch tested? By running new UT and existing test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite" ``` Closes #31149 from MaxGekk/fix-stats-in-add-partition. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-12 14:34:17 +00:00
Kent Yao	99f84892a5	[SPARK-34003][SQL][FOLLOWUP] Avoid pushing modified Char/Varchar sort attributes into aggregate for existing ones ### What changes were proposed in this pull request? In `0f8e5dd445`, we partially fix the rule conflicts between `PaddingAndLengthCheckForCharVarchar` and `ResolveAggregateFunctions`, as error still exists in sql like ```SELECT substr(v, 1, 2), sum(i) FROM t GROUP BY v ORDER BY substr(v, 1, 2)``` ```sql [info] Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 'spark_catalog.default.t.`v`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; [info] Project [substr(v, 1, 2)#100, sum(i)#101L] [info] +- Sort [aggOrder#102 ASC NULLS FIRST], true [info] +- !Aggregate [v#106], [substr(v#106, 1, 2) AS substr(v, 1, 2)#100, sum(cast(i#98 as bigint)) AS sum(i)#101L, substr(v#103, 1, 2) AS aggOrder#102 [info] +- SubqueryAlias spark_catalog.default.t [info] +- Project [if ((length(v#97) <= 3)) v#97 else if ((length(rtrim(v#97, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#97) as string), exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#97, None), 3, ) AS v#106, i#98] [info] +- Relation[v#97,i#98] parquet [info] [info] Project [substr(v, 1, 2)#100, sum(i)#101L] [info] +- Sort [aggOrder#102 ASC NULLS FIRST], true [info] +- !Aggregate [v#106], [substr(v#106, 1, 2) AS substr(v, 1, 2)#100, sum(cast(i#98 as bigint)) AS sum(i)#101L, substr(v#103, 1, 2) AS aggOrder#102 [info] +- SubqueryAlias spark_catalog.default.t [info] +- Project [if ((length(v#97) <= 3)) v#97 else if ((length(rtrim(v#97, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#97) as string), exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#97, None), 3, ) AS v#106, i#98] [info] +- Relation[v#97,i#98] parquet ``` We need to look recursively into children to find char/varchars. In this PR, we try to resolve the full attributes including the original `Aggregate` expressions and the candidates in `SortOrder` together, then use the new re-resolved `Aggregate` expressions to determine which candidate in the `SortOrder` shall be pushed. This can avoid mismatch for the same attributes w/o this change, as the expressions returned by `executeSameContext` will change when `PaddingAndLengthCheckForCharVarchar` takes effects. W/ this change, the expressions can be matched correctly. For those unmatched, w need to look recursively into children to find char/varchars instead of the expression itself only. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? add new tests Closes #31129 from yaooqinn/SPARK-34003-F. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-12 08:20:39 +00:00
Gengliang Wang	02a17e92f1	[SPARK-28646][SQL][FOLLOWUP] Add legacy config for allowing parameterless count ### What changes were proposed in this pull request? Add a legacy configuration `spark.sql.legacy.allowParameterlessCount` in case users need the parameterless count. This is a follow-up for https://github.com/apache/spark/pull/30541. ### Why are the changes needed? There can be some users depends on the legacy behavior. We need a legacy flag for it. ### Does this PR introduce _any_ user-facing change? Yes, adding a legacy flag `spark.sql.legacy.allowParameterlessCount`. ### How was this patch tested? Unit tests Closes #31143 from gengliangwang/countLegacy. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-12 16:31:22 +09:00
Max Gekk	f7cbeec487	[SPARK-34074][SQL] Update stats only when table size changes ### What changes were proposed in this pull request? Do not alter table stats if they are the same as in the catalog (at least since the recent retrieve). ### Why are the changes needed? The changes reduce the number of calls to Hive external catalog. ### Does this PR introduce _any_ user-facing change? Should not. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableDropPartitionSuite" ``` Closes #31135 from MaxGekk/optimize-updateTableStats. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-12 03:28:28 +00:00
Liang-Chi Hsieh	0bcbafb4b8	[SPARK-34002][SQL] Fix the usage of encoder in ScalaUDF ### What changes were proposed in this pull request? This patch fixes few issues when using encoders to serialize input/output in `ScalaUDF`. ### Why are the changes needed? This fixes a bug when using encoders in Scala UDF. First, the output data type should be corrected to the corresponding data type of the object serializer. Second, `catalystConverter` should not serialize `Option[_]` as the ordinary row because in `ScalaUDF` case it is serialized to a column, not the top-level row. Otherwise, there will be a redundant `value` struct wrapping the serialized `Option[_]` object. ### Does this PR introduce _any_ user-facing change? Yes, fixing a bug of `ScalaUDF`. ### How was this patch tested? Unit test. Closes #31103 from viirya/SPARK-34002. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-11 11:31:35 -08:00
yi.wu	4afca0f706	[SPARK-31952][SQL] Fix incorrect memory spill metric when doing Aggregate ### What changes were proposed in this pull request? This PR takes over https://github.com/apache/spark/pull/28780. 1. Counted the spilled memory size when creating the `UnsafeExternalSorter` with the existing `InMemorySorter` 2. Accumulate the `totalSpillBytes` when merging two `UnsafeExternalSorter` ### Why are the changes needed? As mentioned in https://github.com/apache/spark/pull/28780: > It happends when hash aggregate downgrades to sort based aggregate. `UnsafeExternalSorter.createWithExistingInMemorySorter` calls spill on an `InMemorySorter` immediately, but the memory pointed by `InMemorySorter` is acquired by outside `BytesToBytesMap`, instead the allocatedPages in `UnsafeExternalSorter`. So the memory spill bytes metric is always 0, but disk bytes spill metric is right. Besides, this PR also fixes the `UnsafeExternalSorter.merge` by accumulating the `totalSpillBytes` of two sorters. Thus, we can report the correct spilled size in `HashAggregateExec.finishAggregate`. Issues can be reproduced by the following step by checking the SQL metrics in UI: ``` bin/spark-shell --driver-memory 512m --executor-memory 512m --executor-cores 1 --conf "spark.default.parallelism=1" scala> sql("select id, count(1) from range(10000000) group by id").write.csv("/tmp/result.json") ``` Before: <img width="200" alt="WeChatfe5146180d91015e03b9a27852e9a443" src="https://user-images.githubusercontent.com/16397174/103625414-e6fc6280-4f75-11eb-8b93-c55095bdb5b8.png"> After: <img width="200" alt="WeChat42ab0e73c5fbc3b14c12ab85d232071d" src="https://user-images.githubusercontent.com/16397174/103625420-e8c62600-4f75-11eb-8e1f-6f5e8ab561b9.png"> ### Does this PR introduce _any_ user-facing change? Yes, users can see the correct spill metrics after this PR. ### How was this patch tested? Tested manually and added UTs. Closes #31035 from Ngone51/SPARK-31952. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-11 07:15:28 +00:00
Max Gekk	d97e99157e	[SPARK-34060][SQL] Fix Hive table caching while updating stats by `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? Fix canonicalisation of `HiveTableRelation` by normalisation of `CatalogTable`, and exclude table stats and temporary fields from the canonicalized plan. ### Why are the changes needed? This fixes the issue demonstrated by the example below: ```scala scala> spark.conf.set("spark.sql.statistics.size.autoUpdate.enabled", true) scala> sql(s"CREATE TABLE tbl (id int, part int) USING hive PARTITIONED BY (part)") scala> sql("INSERT INTO tbl PARTITION (part=0) SELECT 0") scala> sql("INSERT INTO tbl PARTITION (part=1) SELECT 1") scala> sql("CACHE TABLE tbl") scala> sql("SELECT * FROM tbl").show(false) +---+----+ \|id \|part\| +---+----+ \|0 \|0 \| \|1 \|1 \| +---+----+ scala> spark.catalog.isCached("tbl") scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)") scala> spark.catalog.isCached("tbl") res19: Boolean = false ``` `ALTER TABLE .. DROP PARTITION` must keep the table in the cache. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the drop partition command keeps the table in the cache while updating table stats: ```scala scala> sql("ALTER TABLE tbl DROP PARTITION (part=0)") scala> spark.catalog.isCached("tbl") res19: Boolean = true ``` ### How was this patch tested? By running new UT in `AlterTableDropPartitionSuite`. Closes #31112 from MaxGekk/fix-caching-hive-table-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-11 07:03:44 +00:00
Max Gekk	664ef184c1	[SPARK-34055][SQL][TESTS][FOLLOWUP] Check partition adding to cached Hive table ### What changes were proposed in this pull request? Replace `USING parquet` by `$defaultUsing` which is `USING parquet` for v1 In-Memory catalog and `USING hive` for v1 Hive external catalog. ### Why are the changes needed? The PR https://github.com/apache/spark/pull/31101 added UT test but it checks only v1 In-Memory catalog. This PR runs this test for Hive external catalog as well to improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite" ``` Closes #31117 from MaxGekk/add-partition-refresh-cache-2-followup-2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-11 07:02:49 +00:00
Terry Kim	8391a4a687	[SPARK-34057][SQL] UnresolvedTableOrView should retain SQL text position for DDL commands ### What changes were proposed in this pull request? Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect: ``` scala> sql("DROP TABLE unknown") org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 0; ``` , whereas the `pos` should be `11`. This PR proposes to fix this issue for commands using `UnresolvedTableOrView`: ``` DROP TABLE unknown DESCRIBE TABLE unknown ANALYZE TABLE unknown COMPUTE STATISTICS ANALYZE TABLE unknown COMPUTE STATISTICS FOR COLUMNS col ANALYZE TABLE unknown COMPUTE STATISTICS FOR ALL COLUMNS SHOW CREATE TABLE unknown REFRESH TABLE unknown SHOW COLUMNS FROM unknown SHOW COLUMNS FROM unknown IN db ALTER TABLE unknown RENAME TO t ALTER VIEW unknown RENAME TO v ``` ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, now the above example will print the following: ``` org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 11; ``` ### How was this patch tested? Add a new test. Closes #31106 from imback82/unresolved_table_or_view_message. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-11 04:28:39 +00:00
HyukjinKwon	830249284d	[SPARK-34059][SQL][CORE] Use for/foreach rather than map to make sure execute it eagerly ### What changes were proposed in this pull request? This PR is basically a followup of https://github.com/apache/spark/pull/14332. Calling `map` alone might leave it not executed due to lazy evaluation, e.g.) ``` scala> val foo = Seq(1,2,3) foo: Seq[Int] = List(1, 2, 3) scala> foo.map(println) 1 2 3 res0: Seq[Unit] = List((), (), ()) scala> foo.view.map(println) res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...) scala> foo.view.foreach(println) 1 2 3 ``` We should better use `foreach` to make sure it's executed where the output is unused or `Unit`. ### Why are the changes needed? To prevent the potential issues by not executing `map`. ### Does this PR introduce _any_ user-facing change? No, the current codes look not causing any problem for now. ### How was this patch tested? I found these item by running IntelliJ inspection, double checked one by one, and fixed them. These should be all instances across the codebase ideally. Closes #31110 from HyukjinKwon/SPARK-34059. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-01-10 15:22:24 -08:00
Max Gekk	e0e06c18fd	[SPARK-34055][SQL] Refresh cache in `ALTER TABLE .. ADD PARTITION` ### What changes were proposed in this pull request? Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. ADD PARTITION`. ### Why are the changes needed? This fixes the issues portrayed by the example: ```sql spark-sql> create table tbl (col int, part int) using parquet partitioned by (part); spark-sql> insert into tbl partition (part=0) select 0; spark-sql> cache table tbl; spark-sql> select * from tbl; 0 0 spark-sql> show table extended like 'tbl' partition(part=0); default tbl false Partition Values: [part=0] Location: file:/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 ... ``` Create new partition by copying the existing one: ``` $ cp -r /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1 ``` ```sql spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1'; spark-sql> select * from tbl; 0 0 ``` The last query must return `0 1` since it has been added by `ALTER TABLE .. ADD PARTITION`. ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```sql ... spark-sql> alter table tbl add partition (part=1) location '/Users/maximgekk/proj/add-partition-refresh-cache-2/spark-warehouse/tbl/part=1'; spark-sql> select * from tbl; 0 0 0 1 ``` ### How was this patch tested? By running the affected test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableAddPartitionSuite" ``` Closes #31101 from MaxGekk/add-partition-refresh-cache-2. Lead-authored-by: Max Gekk <max.gekk@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-10 14:06:17 +09:00
HyukjinKwon	105ba6e5f0	Revert "[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE" This reverts commit `d36cdd5541`.	2021-01-10 13:52:48 +09:00
ulysses-you	48cd11c483	[SPARK-34030][SQL] Fold RepartitionByExpression num partition should at Optimizer ### What changes were proposed in this pull request? Move `RepartitionByExpression` fold partition number code to a new rule at `Optimizer`. ### Why are the changes needed? We meet some ploblem when backport SPARK-33806. It is because the UnresolvedFunction.foldable will throw a exception. It's ok with master branch, but it's better to do it at Optimizer. Some reason: 1. It's not always safe to call Expression.foldable before analysis. 2. fold num partition to 1 more like a optimize behavior. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #31077 from ulysses-you/SPARK-34030. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-10 13:00:40 +09:00
Anton Okolnychyi	6b34745cb9	[SPARK-34049][SS] DataSource V2: Use Write abstraction in StreamExecution ### What changes were proposed in this pull request? This PR makes `StreamExecution` use the `Write` abstraction introduced in SPARK-33779. Note: we will need separate plans for streaming writes in order to support the required distribution and ordering in SS. This change only migrates to the `Write` abstraction. ### Why are the changes needed? These changes prevent exceptions from data sources that implement only the `build` method in `WriteBuilder`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #31093 from aokolnychyi/spark-34049. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-08 20:37:35 -08:00
Max Gekk	157b72ac9f	[SPARK-33591][SQL] Recognize `null` in partition spec values ### What changes were proposed in this pull request? 1. Recognize `null` while parsing partition specs, and put `null` instead of `"null"` as partition values. 2. For V1 catalog: replace `null` by `__HIVE_DEFAULT_PARTITION__`. 3. For V2 catalogs: pass `null` AS IS, and let catalog implementations to decide how to handle `null`s as partition values in spec. ### Why are the changes needed? Currently, `null` in partition specs is recognized as the `"null"` string which could lead to incorrect results, for example: ```sql spark-sql> CREATE TABLE tbl5 (col1 INT, p1 STRING) USING PARQUET PARTITIONED BY (p1); spark-sql> INSERT INTO TABLE tbl5 PARTITION (p1 = null) SELECT 0; spark-sql> SELECT isnull(p1) FROM tbl5; false ``` Even we inserted a row to the partition with the `null` value, the resulted table doesn't contain `null`. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the example above works as expected: ```sql spark-sql> SELECT isnull(p1) FROM tbl5; true ``` ### How was this patch tested? 1. By running the affected test suites `SQLQuerySuite`, `AlterTablePartitionV2SQLSuite` and `v1/ShowPartitionsSuite`. 2. Compiling by Scala 2.13: ``` $ ./dev/change-scala-version.sh 2.13 $ ./build/sbt -Pscala-2.13 compile ``` Closes #30538 from MaxGekk/partition-spec-value-null. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-08 14:14:27 +00:00
Kent Yao	0f8e5dd445	[SPARK-34003][SQL] Fix Rule conflicts between PaddingAndLengthCheckForCharVarchar and ResolveAggregateFunctions ### What changes were proposed in this pull request? ResolveAggregateFunctions is a hacky rule and it calls `executeSameContext` to generate a `resolved agg` to determine which unresolved sort attribute should be pushed into the agg. However, after we add the PaddingAndLengthCheckForCharVarchar rule which will rewrite the query output, thus, the `resolved agg` cannot match original attributes anymore. It causes some dissociative sort attribute to be pushed in and fails the query ``` logtalk [info] Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 'testcat.t1.`v`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.; [info] Project [v#14, sum(i)#11L] [info] +- Sort [aggOrder#12 ASC NULLS FIRST], true [info] +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12] [info] +- SubqueryAlias testcat.t1 [info] +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string), exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3, ) AS v#14, i#7] [info] +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1 [info] [info] Project [v#14, sum(i)#11L] [info] +- Sort [aggOrder#12 ASC NULLS FIRST], true [info] +- !Aggregate [v#14], [v#14, sum(cast(i#7 as bigint)) AS sum(i)#11L, v#13 AS aggOrder#12] [info] +- SubqueryAlias testcat.t1 [info] +- Project [if ((length(v#6) <= 3)) v#6 else if ((length(rtrim(v#6, None)) > 3)) cast(raise_error(concat(input string of length , cast(length(v#6) as string), exceeds varchar type length limitation: 3)) as string) else rpad(rtrim(v#6, None), 3, ) AS v#14, i#7] [info] +- RelationV2[v#6, i#7, index#15, _partition#16] testcat.t1 ``` ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #31027 from yaooqinn/SPARK-34003. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-08 09:05:22 +00:00
Gengliang Wang	b95a847ce1	[SPARK-34046][SQL][TESTS] Use join hint for constructing joins in JoinSuite and WholeStageCodegenSuite ### What changes were proposed in this pull request? There are some existing test cases that constructing various joins by tuning the SQL configuration AUTO_BROADCASTJOIN_THRESHOLD, PREFER_SORTMERGEJOIN,SHUFFLE_PARTITIONS, etc. This can be tricky and not straight-forward. In the future development we might have to tweak the configurations again . This PR is to construct specific joins by using join hint in test cases. ### Why are the changes needed? Make test cases for join simpler and more robust. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #31087 from gengliangwang/joinhintInTest. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-08 07:52:39 +00:00
Chao Sun	0de7f2ff1e	[SPARK-34039][SQL] ReplaceTable should invalidate cache ### What changes were proposed in this pull request? This changes `ReplaceTableExec`/`AtomicReplaceTableExec`, and uncaches the target table before it is dropped. In addition, this includes some refactoring by moving the `uncacheTable` method to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec. ### Why are the changes needed? Similar to SPARK-33492 (#30429). When a table is refreshed, the associated cache should be invalidated to avoid potential incorrect results. ### Does this PR introduce _any_ user-facing change? Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced. ### How was this patch tested? Added a new unit test. Closes #31081 from sunchao/SPARK-34039. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-07 21:13:22 -08:00
Yu Zhong	d36cdd5541	[SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE ### What changes were proposed in this pull request? In AdaptiveSparkPlanExec.getFinalPhysicalPlan, when newStages are generated, sort the new stages by class type to make sure BroadcastQueryState precede others. It can make sure the broadcast job are submitted before map jobs to avoid waiting for job schedule and cause broadcast timeout. ### Why are the changes needed? When enable AQE, in getFinalPhysicalPlan, spark traversal the physical plan bottom up and create query stage for materialized part by createQueryStages and materialize those new created query stages to submit map stages or broadcasting. When ShuffleQueryStage are materializing before BroadcastQueryStage, the map job and broadcast job are submitted almost at the same time, but map job will hold all the computing resources. If the map job runs slow (when lots of data needs to process and the resource is limited), the broadcast job cannot be started(and finished) before spark.sql.broadcastTimeout, thus cause whole job failed (introduced in SPARK-31475). The workaround to increase spark.sql.broadcastTimeout doesn't make sense and graceful, because the data to broadcast is very small. ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? 1. Add UT 2. Test the code using dev environment in https://issues.apache.org/jira/browse/SPARK-33933 Closes #30998 from zhongyu09/aqe-broadcast. Authored-by: Yu Zhong <yzhong@freewheel.tv> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-07 08:59:26 +00:00
Dongjoon Hyun	194edc86a2	Revert "[SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider" This reverts commit `8bb70bf0d6`.	2021-01-06 23:41:27 -08:00
Yuming Wang	aa509c1eee	[SPARK-34031][SQL] Union operator missing rowCount when CBO enabled ### What changes were proposed in this pull request? This pr add row count to `Union` operator when CBO enabled. ```scala spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)") spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)") spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS") spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS") spark.sql("set spark.sql.cbo.enabled=true") spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost") ``` Before this pr: ``` == Optimized Logical Plan == Union false, false, Statistics(sizeInBytes=320.0 B) :- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10) +- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10) ``` After this pr: ``` == Optimized Logical Plan == Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20) :- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10) +- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10) ``` ### Why are the changes needed? Improve query performance, [`JoinEstimation.estimateInnerOuterJoin`](`d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)`) need the row count. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #31068 from wangyum/SPARK-34031. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-07 14:41:10 +09:00
ulysses-you	f9daf035f4	[SPARK-33806][SQL][FOLLOWUP] Fold RepartitionExpression num partition should check if partition expression is empty ### What changes were proposed in this pull request? Add check partition expressions is empty. ### Why are the changes needed? We should keep `spark.range(1).hint("REPARTITION_BY_RANGE")` has default shuffle number instead of 1. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Add test. Closes #31074 from ulysses-you/SPARK-33806-FOLLOWUP. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-06 17:22:14 -08:00
Dongjoon Hyun	8bb70bf0d6	[SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider ### What changes were proposed in this pull request? This PR aims to add a basis for columnar encryption test framework by add `OrcEncryptionSuite` and `FakeKeyProvider`. Please note that we will improve more in both Apache Spark and Apache ORC in Apache Spark 3.2.0 timeframe. ### Why are the changes needed? Apache ORC 1.6 supports columnar encryption. ### Does this PR introduce _any_ user-facing change? No. This is for a test case. ### How was this patch tested? Pass the newly added test suite. Closes #31065 from dongjoon-hyun/SPARK-34029. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-06 12:59:47 -08:00
angerszhu	c0d0dbabdb	[SPARK-33934][SQL][FOLLOW-UP] Use SubProcessor's exit code as assert condition to fix flaky test ### What changes were proposed in this pull request? Follow comment and fix. flaky test https://github.com/apache/spark/pull/30973#issuecomment-754852130. This flaky test is similar as https://github.com/apache/spark/pull/30896 Some task's failed with root cause but in driver may return error without root cause , change. UT to check with status exit code since different root cause's exit code is not same. ### Why are the changes needed? Fix flaky test ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existed UT Closes #31046 from AngersZhuuuu/SPARK-33934-FOLLOW-UP. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-05 22:33:15 -08:00
Max Gekk	b77d11dfd9	[SPARK-34011][SQL] Refresh cache in `ALTER TABLE .. RENAME TO PARTITION` ### What changes were proposed in this pull request? 1. Invoke `refreshTable()` from `AlterTableRenamePartitionCommand.run()` after partitions renaming. In particular, this re-creates the cache associated with the modified table. 2. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. RENAME TO PARTITION` command. ### Why are the changes needed? This fixes the issues portrayed by the example: ```sql spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0); spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0; spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1; spark-sql> CACHE TABLE tbl1; spark-sql> SELECT * FROM tbl1; 0 0 1 1 spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part=2); spark-sql> SELECT * FROM tbl1; 0 0 1 1 ``` The last query must not return `0 2` since `0 0` was renamed by previous command. ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```sql ... spark-sql> ALTER TABLE tbl1 PARTITION (part=0) RENAME TO PARTITION (part=2); spark-sql> SELECT * FROM tbl1; 0 2 1 1 ``` ### How was this patch tested? By running the affected test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite" ``` Closes #31044 from MaxGekk/rename-partition-refresh-cache. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-06 11:19:44 +09:00
angerszhu	e279ed3044	[SPARK-34012][SQL] Keep behavior consistent when conf `spark.sql.legacy.parser.havingWithoutGroupByAsWhere` is true with migration guide ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/22696 we support HAVING without GROUP BY means global aggregate But since we treat having as Filter before, in this way will cause a lot of analyze error, after https://github.com/apache/spark/pull/28294 we use `UnresolvedHaving` to instead `Filter` to solve such problem, but break origin logical about treat `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true` . This PR fix this issue and add UT. ### Why are the changes needed? Keep consistent behavior of migration guide. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added UT Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-01-06 08:48:24 +09:00
Max Gekk	122f8f0fdb	[SPARK-33919][SQL][TESTS] Unify v1 and v2 SHOW NAMESPACES tests ### What changes were proposed in this pull request? 1. Port DS V2 tests from `DataSourceV2SQLSuite` to the base test suite `ShowNamespacesSuiteBase` to run those tests for v1 catalogs. 2. Port DS v1 tests from `DDLSuite` to `ShowNamespacesSuiteBase` to run the tests for v2 catalogs too. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowNamespacesSuite" ``` Closes #30937 from MaxGekk/unify-show-namespaces-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-05 07:30:59 +00:00
tanel.kiis@gmail.com	f252a9334e	[SPARK-33935][SQL] Fix CBO cost function ### What changes were proposed in this pull request? Changed the cost function in CBO to match documentation. ### Why are the changes needed? The parameter `spark.sql.cbo.joinReorder.card.weight` is documented as: ``` The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight). ``` The implementation in `JoinReorderDP.betterThan` does not match this documentaiton: ``` def betterThan(other: JoinPlan, conf: SQLConf): Boolean = { if (other.planCost.card == 0 \|\| other.planCost.size == 0) { false } else { val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card) val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size) relativeRows * conf.joinReorderCardWeight + relativeSize * (1 - conf.joinReorderCardWeight) < 1 } } ``` This different implementation has an unfortunate consequence: given two plans A and B, both A betterThan B and B betterThan A might give the same results. This happes when one has many rows with small sizes and other has few rows with large sizes. A example values, that have this fenomen with the default weight value (0.7): A.card = 500, B.card = 300 A.size = 30, B.size = 80 Both A betterThan B and B betterThan A would have score above 1 and would return false. This happens with several of the TPCDS queries. The new implementation does not have this behavior. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New and existing UTs Closes #30965 from tanelk/SPARK-33935_cbo_cost_function. Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-01-05 16:00:24 +09:00
Kent Yao	f0ffe0cd65	[SPARK-33992][SQL] override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer ### What changes were proposed in this pull request? In https://github.com/apache/spark/pull/29643, we move the plan rewriting methods to QueryPlan. we need to override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer because it and resolveOperatorsUpWithNewOutput are called in the analyzer. For example, PaddingAndLengthCheckForCharVarchar could fail query when resolveOperatorsUpWithNewOutput with ```logtalk [info] - char/varchar resolution in sub query * FAILED * (367 milliseconds) [info] java.lang.RuntimeException: This method should not be called in the analyzer [info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule(AnalysisHelper.scala:150) [info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule$(AnalysisHelper.scala:146) [info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.assertNotAnalysisRule(LogicalPlan.scala:29) [info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:161) [info] at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:160) [info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) [info] at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29) [info] at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$updateOuterReferencesInSubquery(QueryPlan.scala:267) ``` ### Why are the changes needed? trivial bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #31013 from yaooqinn/SPARK-33992. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-05 05:34:11 +00:00
Terry Kim	15a863fd54	[SPARK-34001][SQL][TESTS] Remove unused runShowTablesSql() in DataSourceV2SQLSuite.scala ### What changes were proposed in this pull request? After #30287, `runShowTablesSql()` in `DataSourceV2SQLSuite.scala` is no longer used. This PR removes the unused method. ### Why are the changes needed? To remove unused method. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test. Closes #31022 from imback82/33382-followup. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 21:32:49 -08:00
Terry Kim	6b00fdc756	[SPARK-33998][SQL] Provide an API to create an InternalRow in V2CommandExec ### What changes were proposed in this pull request? There are many v2 commands such as `SHOW TABLES`, `DESCRIBE TABLE`, etc. that require creating `InternalRow`s. Currently, the code to create `InternalRow`s are duplicated across many commands and it can be moved into `V2CommandExec` to remove duplicate code. ### Why are the changes needed? To clean up duplicate code. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test since this is just refactoring. Closes #31020 from imback82/refactor_v2_command. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-05 05:32:36 +00:00
Chongguang LIU	976e97a80d	[SPARK-33794][SQL] NextDay expression throw runtime IllegalArgumentException when receiving invalid input under ANSI mode ### What changes were proposed in this pull request? Instead of returning NULL, the next_day function throws runtime IllegalArgumentException when ansiMode is enable and receiving invalid input of the dayOfWeek parameter. ### Why are the changes needed? For ansiMode. ### Does this PR introduce _any_ user-facing change? Yes. When spark.sql.ansi.enabled = true, the next_day function will throw IllegalArgumentException when receiving invalid input of the dayOfWeek parameter. When spark.sql.ansi.enabled = false, same behaviour as before. ### How was this patch tested? Ansi mode is tested with existing tests. End-to-end tests have been added. Closes #30807 from chongguang/SPARK-33794. Authored-by: Chongguang LIU <chongguang.liu@laposte.fr> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-05 05:20:16 +00:00
tanel.kiis@gmail.com	bb6d6b5602	[SPARK-33964][SQL] Combine distinct unions in more cases ### What changes were proposed in this pull request? Added the `RemoveNoopOperators` rule to optimization batch `Union`. Also made sure that the `RemoveNoopOperators` would be idempotent. ### Why are the changes needed? In several TPCDS queries the `CombineUnions` rule does not manage to combine unions, because they have noop `Project`s between them. The `Project`s will be removed by `RemoveNoopOperators`, but by then `ReplaceDistinctWithAggregate` has been applied and there are aggregates between the unions. Adding a copy of `RemoveNoopOperators` earlier in the optimization chain allows `CombineUnions` to work on more queries. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UTs and the output of `PlanStabilitySuite` Closes #30996 from tanelk/SPARK-33964_combine_unions. Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-05 11:01:31 +09:00
Max Gekk	84c1f43669	[SPARK-33987][SQL] Refresh cache in v2 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? 1. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. DROP PARTITION` command. 2. Port the test for v1 catalogs to the base suite to run it for v2 table catalog. ### Why are the changes needed? The changes fix incorrect query results from cached V2 table altered by `ALTER TABLE .. DROP PARTITION`, see the added test and SPARK-33987. ### Does this PR introduce _any_ user-facing change? Yes, it could if users have v2 table catalogs. ### How was this patch tested? By running unified tests for `ALTER TABLE .. DROP PARTITION`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #31017 from MaxGekk/drop-partition-refresh-cache-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 15:00:48 -08:00
Kent Yao	ac4651a7d1	[SPARK-33980][SS] Invalidate char/varchar in spark.readStream.schema ### What changes were proposed in this pull request? invalidate char/varchar in `spark.readStream.schema` just like what we've done for `spark.read.schema` in `da72b87374` ### Why are the changes needed? bugfix, char/varchar is only for table schema while `spark.sql.legacy.charVarcharAsString=false` ### Does this PR introduce _any_ user-facing change? yes, char/varchar will fail to define ss readers when `spark.sql.legacy.charVarcharAsString=false` ### How was this patch tested? new tests Closes #31003 from yaooqinn/SPARK-33980. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 12:59:45 -08:00
Takeshi Yamamuro	414d323d6c	[SPARK-33988][SQL][TEST] Add an option to enable CBO in TPCDSQueryBenchmark ### What changes were proposed in this pull request? This PR intends to add a new option `--cbo` to enable CBO in TPCDSQueryBenchmark. I think this option is useful so as to monitor performance changes with CBO enabled. ### Why are the changes needed? To monitor performance chaneges with CBO enabled. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually checked. Closes #31011 from maropu/AddOptionForCBOInTPCDSBenchmark. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 10:31:20 -08:00
Max Gekk	fc3f22645e	[SPARK-33990][SQL][TESTS] Remove partition data by v2 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? Remove partition data by `ALTER TABLE .. DROP PARTITION` in V2 table catalog used in tests. ### Why are the changes needed? This is a bug fix. Before the fix, `ALTER TABLE .. DROP PARTITION` does not remove the data belongs to the dropped partition. As a consequence of that, the `select` query returns removed data. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running tests suites for v1 and v2 catalogs: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #31014 from MaxGekk/fix-drop-partition-v2. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 10:26:39 -08:00
Terry Kim	ddc0d5148a	[SPARK-33875][SQL] Implement DESCRIBE COLUMN for v2 tables ### What changes were proposed in this pull request? This PR proposes to implement `DESCRIBE COLUMN` for v2 tables. Note that `isExnteded` option is not implemented in this PR. ### Why are the changes needed? Parity with v1 tables. ### Does this PR introduce _any_ user-facing change? Yes, now, `DESCRIBE COLUMN` works for v2 tables. ```scala sql("CREATE TABLE testcat.tbl (id bigint, data string COMMENT 'hello') USING foo") sql("DESCRIBE testcat.tbl data").show ``` ``` +---------+----------+ \|info_name\|info_value\| +---------+----------+ \| col_name\| data\| \|data_type\| string\| \| comment\| hello\| +---------+----------+ ``` Before this PR, the command would fail with: `Describing columns is not supported for v2 tables.` ### How was this patch tested? Added new test. Closes #30881 from imback82/describe_col_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 16:14:33 +00:00
Dongjoon Hyun	271c4f6e00	[SPARK-33978][SQL] Support ZSTD compression in ORC data source ### What changes were proposed in this pull request? This PR aims to support ZSTD compression in ORC data source. ### Why are the changes needed? Apache ORC 1.6 supports ZSTD compression to generate more compact files and save the storage cost. - https://issues.apache.org/jira/browse/ORC-363 BEFORE ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") java.lang.IllegalArgumentException: Codec [zstd] is not available. Available codecs are uncompressed, lzo, snappy, zlib, none. ``` AFTER ```scala scala> spark.range(10).write.option("compression", "zstd").orc("/tmp/zstd") ``` ```bash $ orc-tools meta /tmp/zstd Processing data file file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc [length: 230] Structure for file:/tmp/zstd/part-00011-a63d9a17-456f-42d3-87a1-d922112ed28c-c000.orc File Version: 0.12 with ORC_14 Rows: 1 Compression: ZSTD Compression size: 262144 Calendar: Julian/Gregorian Type: struct<id:bigint> Stripe Statistics: Stripe 1: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 File Statistics: Column 0: count: 1 hasNull: false Column 1: count: 1 hasNull: false bytesOnDisk: 6 min: 9 max: 9 sum: 9 Stripes: Stripe: offset: 3 data: 6 rows: 1 tail: 35 index: 35 Stream: column 0 section ROW_INDEX start: 3 length 11 Stream: column 1 section ROW_INDEX start: 14 length 24 Stream: column 1 section DATA start: 38 length 6 Encoding column 0: DIRECT Encoding column 1: DIRECT_V2 File length: 230 bytes Padding length: 0 bytes Padding ratio: 0% User Metadata: org.apache.spark.version=3.2.0 ``` ### Does this PR introduce _any_ user-facing change? Yes, this is a new feature. ### How was this patch tested? Pass the newly added test case. Closes #31002 from dongjoon-hyun/SPARK-33978. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-04 00:54:47 -08:00
Hoa	0b647fe69c	[SPARK-33888][SQL] JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis ### What changes were proposed in this pull request? JDBC SQL TIME type represents incorrectly as TimestampType, we change it to be physical Int in millis for now. ### Why are the changes needed? Currently, for JDBC, SQL TIME type represents incorrectly as Spark TimestampType. This should be represent as physical int in millis Represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one millisecond. It stores the number of milliseconds after midnight, 00:00:00.000. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Close #30902 Closes #30902 from saikocat/SPARK-33888. Lead-authored-by: Hoa <hoameomu@gmail.com> Co-authored-by: Hoa <saikocatz@gmail.com> Co-authored-by: Duc Hoa, Nguyen <hoa.nd@teko.vn> Co-authored-by: Duc Hoa, Nguyen <hoameomu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 06:53:12 +00:00
angerszhu	adac633f93	[SPARK-33934][SQL] Add SparkFile's root dir to env property PATH ### What changes were proposed in this pull request? In hive we always use ``` add file /path/to/script.py; select transform(col1, col2, ..) using 'script.py' as (col1, col2, ...) from ... ``` Since in spark we wrapper script command with `/bash/bin -c`, in this case we will throw `script.py command not found`. This pr add a SparkFile's root dir path to execution env property `PATH`, then sub-processor will find `scrip.py` as program under `PATH`. ### Why are the changes needed? Support SQL migration form Hive to Spark. ### Does this PR introduce _any_ user-facing change? User can direct use script file name as program in script transform SQL. ``` add file /path/to/script.py; select transform(col1, col2, ..) using 'script.py' as (col1, col2, ...) from ... ``` ### How was this patch tested? UT Closes #30973 from AngersZhuuuu/SPARK-33934. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-04 15:46:49 +09:00
Max Gekk	67195d0d97	[SPARK-33950][SQL] Refresh cache in v1 `ALTER TABLE .. DROP PARTITION` ### What changes were proposed in this pull request? Invoke `refreshTable()` from `AlterTableDropPartitionCommand.run()` after partitions dropping. In particular, this invalidates the cache associated with the modified table. ### Why are the changes needed? This fixes the issues portrayed by the example: ```sql spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0); spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0; spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1; spark-sql> CACHE TABLE tbl1; spark-sql> SELECT * FROM tbl1; 0 0 1 1 spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0); spark-sql> SELECT * FROM tbl1; 0 0 1 1 ``` The last query must not return `0 0` since it was deleted by previous command. ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```sql ... spark-sql> ALTER TABLE tbl1 DROP PARTITION (part0=0); spark-sql> SELECT * FROM tbl1; 1 1 ``` ### How was this patch tested? By running the affected test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30983 from MaxGekk/drop-partition-refresh-cache. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-04 04:11:39 +00:00
Liang-Chi Hsieh	963c60fe49	[SPARK-33955][SS] Add latest offsets to source progress ### What changes were proposed in this pull request? This patch proposes to add latest offset to source progress for streaming queries. ### Why are the changes needed? Currently we record start and end offsets per source in streaming process. Latest offset is an important information for streaming process but the progress lacks of this info. We can use it to track the process lag and adjust streaming queries. We should add latest offset to source progress. ### Does this PR introduce _any_ user-facing change? Yes, for new metric about latest source offset in source progress. ### How was this patch tested? Unit test. Manually test in Spark cluster: ``` "description" : "KafkaV2[Subscribe[page_view_events]]", "startOffset" : { "page_view_events" : { "2" : 582370921, "4" : 391910836, "1" : 631009201, "3" : 406601346, "0" : 195799112 } }, "endOffset" : { "page_view_events" : { "2" : 583764414, "4" : 392338002, "1" : 632183480, "3" : 407101489, "0" : 197304028 } }, "latestOffset" : { "page_view_events" : { "2" : 589852545, "4" : 394204277, "1" : 637313869, "3" : 409286602, "0" : 203878962 } }, "numInputRows" : 4999997, "inputRowsPerSecond" : 29287.70501405811, ``` Closes #30988 from viirya/latest-offset. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-03 01:31:38 -08:00
Kent Yao	ed9f728801	[SPARK-33944][SQL] Incorrect logging for warehouse keys in SharedState options ### What changes were proposed in this pull request? While using SparkSession's initial options to generate the sharable Spark conf and Hadoop conf in ShardState, we shall put the log in the codeblock that the warehouse keys being handled. ### Why are the changes needed? bugfix, rm ambiguous log when setting spark.sql.warehouse.dir in SparkSession.builder.config, but only warn setting hive.metastore.warehouse.dir ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30978 from yaooqinn/SPARK-33944. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-31 13:20:31 -08:00

1 2 3 4 5 ...

7548 commits