ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
ulysses-you	da4b50f8e2	[SPARK-33901][SQL][FOLLOWUP] Add drop table in charvarchar test ### What changes were proposed in this pull request? Add `drop table` to the charvarchar SQL test. ### Why are the changes needed? 1. `drop table` is also a test case, for better coverage. 2. It is clearer to drop the tables that were created in the current test. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added a test. Closes #31277 from ulysses-you/SPARK-33901-FOLLOWUP. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-21 12:41:52 +00:00
Max Gekk	31d828379c	[SPARK-34099][SQL][TESTS] Check re-caching of v2 table dependents after table altering ### What changes were proposed in this pull request? Add tests to check that v2 table dependents are re-cached after table altering via the commands: - `ALTER TABLE .. ADD PARTITION` - `ALTER TABLE .. DROP PARTITION` - `ALTER TABLE .. RENAME PARTITION` ### Why are the changes needed? To improve test coverage and prevent regressions. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite" ``` Closes #31250 from MaxGekk/check-v2-dependents-recached. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-21 08:42:17 +00:00
beliefer	140538ea5b	[SPARK-34096][SQL] Improve performance for nth_value ignore nulls over offset window ### What changes were proposed in this pull request? The current implementations of `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame` only support `nth_value` that respects nulls. So `nth_value` with ignore nulls executes with `UnboundedWindowFunctionFrame` and `UnboundedPrecedingWindowFunctionFrame`, which call `updateExpressions` of `nth_value` multiple times. ### Why are the changes needed? Improve performance for `nth_value` with ignore nulls over an offset window. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Jenkins test Closes #31178 from beliefer/SPARK-34096. Authored-by: beliefer <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-21 07:31:36 +00:00
Kent Yao	d640631e36	[SPARK-34164][SQL] Improve write side varchar check to visit only last few tailing spaces ### What changes were proposed in this pull request? For varchar(N), we currently trim all trailing spaces first to check whether the remaining length exceeds N. It is not necessary to visit them all; at most, only the characters after position N need to be visited. ### Why are the changes needed? Improve varchar performance on the write side. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Benchmark and existing unit tests. Closes #31253 from yaooqinn/SPARK-34164. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-21 05:30:57 +00:00
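The idea in SPARK-34164 can be illustrated with a minimal sketch. This is not Spark's code: `VarcharCheckSketch` and `varcharWriteCheck` are hypothetical names, and the sketch counts UTF-16 code units of a plain `String` rather than the characters of Spark's internal string type. It only shows why scanning the tail beyond position N is enough.

```scala
object VarcharCheckSketch {
  /**
   * Hypothetical write-side check for varchar(limit).
   * Returns the value to store, or throws if non-space characters
   * extend past the limit.
   */
  def varcharWriteCheck(value: String, limit: Int): String = {
    if (value.length <= limit) {
      // Short enough: nothing to trim and nothing to scan.
      value
    } else {
      // Only characters at index >= limit can make the value too long,
      // so walk backwards over trailing spaces and stop at the limit.
      var i = value.length - 1
      while (i >= limit && value.charAt(i) == ' ') {
        i -= 1
      }
      if (i >= limit) {
        // A non-space character sits beyond the limit.
        throw new IllegalArgumentException(
          s"Exceeds varchar type length limitation: $limit")
      }
      // Everything past the limit was spaces; drop it.
      value.substring(0, limit)
    }
  }
}
```

The loop visits at most `value.length - limit` characters, regardless of how many trailing spaces sit inside the first `limit` positions.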
Ismaël Mejía	e9e81f798f	[SPARK-27733][CORE] Upgrade Avro to version 1.10.1 ### What changes were proposed in this pull request? Update the Avro dependency to version 1.10.1. ### Why are the changes needed? To catch up with multiple improvements in Avro as well as fix security issues in transitive dependencies. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Since no API changes were required, we just ran the tests. Closes #31232 from iemejia/SPARK-27733-avro-upgrade. Authored-by: Ismaël Mejía <iemejia@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-20 15:42:27 -08:00
yangjie01	d68612a008	[SPARK-34176][BUILD] Restore the independent mvn test ability of sql/hive module in Scala 2.13 ### What changes were proposed in this pull request? There is one Java UT error when testing sql/hive module independently in Scala 2.13 after SPARK-33212, the error message as follow: ``` [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite [ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF Time elapsed: 18.548 s <<< ERROR! java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41) at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92) Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41) at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92) ``` This pr add a Scala-2.13 profile with dependency of `scala-parallel-collections_` to `sql/hive` module to fix the Java UT in Scala 2.13. ### Why are the changes needed? Recover the independent mvn test ability of sql/hive module in Scala 2.13. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test ``` dev/change-scala-version.sh 2.13 mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive -am -DskipTests mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive ``` Before ``` [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 18.725 s <<< FAILURE! 
- in org.apache.spark.sql.hive.JavaDataFrameSuite [ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF Time elapsed: 16.853 s <<< ERROR! java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41) at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92) Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41) at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92) [INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite 16:15:36.186 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. 16:15:36.288 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 16:15:36.396 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.481 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite [INFO] [INFO] Results: [INFO] [ERROR] Errors: [ERROR] JavaDataFrameSuite.testUDAF:92->checkAnswer:41 » NoClassDefFound scala/collect... 
[INFO] [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0 ``` After ``` [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.287 s - in org.apache.spark.sql.hive.JavaDataFrameSuite [INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite 16:12:16.697 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. 16:12:17.540 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 16:12:17.654 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.58 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite [INFO] [INFO] Results: [INFO] [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0 ``` Closes #31259 from LuciferYang/SPARK-34176. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-20 15:33:31 -08:00
yi.wu	f498977222	[SPARK-34178][SQL] Copy tags for the new node created by MultiInstanceRelation.newInstance ### What changes were proposed in this pull request? Call `copyTagsFrom` for the new node created by `MultiInstanceRelation.newInstance()`. ### Why are the changes needed? ```scala val df = spark.range(2) df.join(df, df("id") <=> df("id")).show() ``` This query is supposed to be treated as a non-ambiguous join by the rule `DetectAmbiguousSelfJoin` because the same attribute reference appears on both sides of the condition: `537a49fc09/sql/core/src/main/scala/org/apache/spark/sql/execution/analysis/DetectAmbiguousSelfJoin.scala (L125)` However, `DetectAmbiguousSelfJoin` cannot apply this prediction because the right side plan doesn't contain the dataset_id TreeNodeTag, which is missing after `MultiInstanceRelation.newInstance`. That's why we should preserve the tag info for the copied node. Fortunately, the query is still considered a non-ambiguous join because `DetectAmbiguousSelfJoin` only checks the left side plan and the reference is the same as the left side plan. However, this is not the expected behavior but only a coincidence. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated a unit test. Closes #31260 from Ngone51/fix-missing-tags. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-20 13:36:14 +00:00
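The tag-loss problem described in SPARK-34178 can be sketched without Spark. Everything below is hypothetical (`TagCopySketch`, `Node`, the string-keyed tag map); it only mimics the shape of the issue: a fresh copy of a node starts with no tags unless they are copied explicitly.

```scala
import scala.collection.mutable

object TagCopySketch {
  /** Hypothetical stand-in for a plan node that carries side-channel tags. */
  final class Node(val name: String) {
    private val tags = mutable.Map.empty[String, Any]

    def setTag(key: String, value: Any): Unit = tags(key) = value
    def getTag(key: String): Option[Any] = tags.get(key)

    /** Copy tags from another node, but only into a node that has none yet. */
    def copyTagsFrom(other: Node): Unit = {
      if (tags.isEmpty) tags ++= other.tags
    }

    /** Behaviour before the fix (as described): the copy loses all tags. */
    def newInstanceWithoutTags(): Node = new Node(name)

    /** Behaviour after the fix (as described): the copy keeps the tags. */
    def newInstance(): Node = {
      val copy = new Node(name)
      copy.copyTagsFrom(this)
      copy
    }
  }
}
```

With a node tagged `dataset_id`, `newInstanceWithoutTags()` returns a node for which `getTag("dataset_id")` is `None`, while `newInstance()` preserves it. That is the difference a rule relying on the tag would observe.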
Chao Sun	902a08b9e6	[SPARK-34052][SQL] store SQL text for a temp view created using "CACHE TABLE .. AS SELECT" ### What changes were proposed in this pull request? This passes original SQL text to `CacheTableAsSelect` command in DSv1 and v2 so that it will be stored instead of the analyzed logical plan, similar to `CREATE VIEW` command. In addition, this changes the behavior of dropping temporary view to also invalidate dependent caches in a cascade, when the config `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is false (which is the default value). ### Why are the changes needed? Currently, after creating a temporary view with `CACHE TABLE ... AS SELECT` command, the view can still be queried even after the source table is dropped or replaced (in v2). This can cause correctness issue. For instance, in the following: ```sql > CREATE TABLE t ...; > CACHE TABLE v AS SELECT * FROM t; > DROP TABLE t; > SELECT * FROM v; ``` The last select query still returns the old (and stale) result instead of fail. Note that the cache is already invalidated as part of dropping table `t`, but the temporary view `v` still exist. On the other hand, the following: ```sql > CREATE TABLE t ...; > CREATE TEMPORARY VIEW v AS SELECT * FROM t; > CACHE TABLE v; > DROP TABLE t; > SELECT * FROM v; ``` will throw "Table or view not found" error in the last select query. This is related to #30567 which aligns the behavior of temporary view and global view by storing the original SQL text for temporary view, as opposed to the analyzed logical plan. However, the PR only handles `CreateView` case but not the `CacheTableAsSelect` case. This also changes uncache logic and use cascade invalidation for temporary views created above. This is to align its behavior to how a permanent view is handled as of today, and also to avoid potential issues where a dependent view becomes invalid while its data is still kept in cache. ### Does this PR introduce _any_ user-facing change? 
Yes, now when `SQLConf.STORE_ANALYZED_PLAN_FOR_VIEW` is set to false (the default value), whenever a table/permanent view/temp view that a cached view depends on is dropped, the cached view itself will become invalid during analysis, i.e., user will get "Table or view not found" error. In addition, when the dependent is a temp view in the previous case, the cache itself will also be invalidated. ### How was this patch tested? Modified/Enhanced some existing tests. Closes #31107 from sunchao/SPARK-34052. Lead-authored-by: Chao Sun <sunchao@apple.com> Co-authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-20 02:09:39 +00:00
Max Gekk	00b444d5ed	[SPARK-34056][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RECOVER PARTITIONS tests ### What changes were proposed in this pull request? 1. Port DS V2 tests from `AlterTablePartitionV2SQLSuite` to the test suite `v2.AlterTableRecoverPartitionsSuite`. 2. Port DS v1 tests from `DDLSuite` to `v1.AlterTableRecoverPartitionsSuiteBase`. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsParserSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRecoverPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite" ``` Closes #31105 from MaxGekk/unify-recover-partitions-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-20 01:49:31 +00:00
Angerszhuuuu	f6338a3e0b	[SPARK-34121][SQL] Intersect operator missing rowCount when CBO enabled ### What changes were proposed in this pull request? This PR adds the row count to the `Intersect` operator when CBO is enabled. ### Why are the changes needed? Improve query performance: [JoinEstimation.estimateInnerOuterJoin](`d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)`) needs the row count. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added Closes #31240 from AngersZhuuuu/SPARK-34121. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-20 10:00:44 +09:00
Max Gekk	32dad1d5a6	[SPARK-34149][SQL] Refresh cache in v2 `ALTER TABLE .. ADD PARTITION` ### What changes were proposed in this pull request? Clear table cache after adding partitions to v2 table in `AlterTableAddPartitionExec`. ### Why are the changes needed? This PR fixes correctness issue. Without the fix, queries on cached tables modified via `ALTER TABLE .. ADD PARTITION` return incorrect results. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Added new UT to `org.apache.spark.sql.execution.command.v2.AlterTableAddPartitionSuite`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite" ``` Closes #31229 from MaxGekk/v2-add-partition-recache. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-19 09:42:07 +00:00
ulysses-you	addbbe8339	[SPARK-33939][SQL] Make Column.named UnresolvedExtractValue use UnresolvedAlias to assign name ### What changes were proposed in this pull request? Change `Column.named` to check whether the expression contains an `UnresolvedExtractValue` and, if so, use `UnresolvedAlias` to assign the name. ### Why are the changes needed? It's more reasonable to treat a user-specified expression as an unresolved expression, so the name should be assigned after analysis. Let's say we have this code ``` spark.range(1).selectExpr("id as id1", "id as id2") .selectExpr("cast(struct(id1, id2).id1 as int)") ``` Before this PR, the field name is `CAST(struct(id1, id2)[id1] AS INT)`. After, the field name is `CAST(struct(id1, id2).id1 AS INT)`. ### Does this PR introduce _any_ user-facing change? Yes, the default field name may change. ### How was this patch tested? Added a test. Closes #30974 from ulysses-you/SPARK-33939-0. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-19 09:35:56 +00:00
Kent Yao	6fa2fb9eb5	[SPARK-34130][SQL] Impove preformace for char varchar padding and length check with StaticInvoke ### What changes were proposed in this pull request? This could reduce the `generate.java` size to prevent codegen fallback which causes performance regression. here is a case from tpcds that could be fixed by this improvement https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133964/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ The original case generate 20K bytes, we are trying to reduce it to less than 8k ### Why are the changes needed? performance improvement as in the PR benchmark test, the performance w/ codegen is 2~3x better than w/o codegen. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? yes, it's a code reflect so the existing ut should be enough cross-check with https://github.com/apache/spark/pull/31012 where the tpcds shall all pass benchmark compared with master ```logtalk ================================================================================================ Char Varchar Read Side Perf ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 20, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 20 1571 1667 83 63.6 15.7 1.0X read char with length 20 1710 1764 58 58.5 17.1 0.9X read varchar with length 20 1774 1792 16 56.4 17.7 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 40, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative 
------------------------------------------------------------------------------------------------------------------------ read string with length 40 1824 1927 91 54.8 18.2 1.0X read char with length 40 1788 1928 137 55.9 17.9 1.0X read varchar with length 40 1676 1700 40 59.7 16.8 1.1X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 60, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 60 1727 1762 30 57.9 17.3 1.0X read char with length 60 1628 1674 43 61.4 16.3 1.1X read varchar with length 60 1651 1665 13 60.6 16.5 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 80, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 80 1748 1778 28 57.2 17.5 1.0X read char with length 80 1673 1678 9 59.8 16.7 1.0X read varchar with length 80 1667 1684 27 60.0 16.7 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 100, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 100 1709 1743 48 58.5 17.1 1.0X read char with length 100 1610 1664 67 62.1 16.1 1.1X read varchar with length 100 1614 1673 53 61.9 16.1 1.1X ================================================================================================ Char Varchar Write Side Perf 
================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 20, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 20 2277 2327 67 4.4 227.7 1.0X write char with length 20 2421 2443 19 4.1 242.1 0.9X write varchar with length 20 2393 2419 27 4.2 239.3 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 40, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 40 2249 2290 38 4.4 224.9 1.0X write char with length 40 2386 2444 57 4.2 238.6 0.9X write varchar with length 40 2397 2405 12 4.2 239.7 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 60, hasSpaces: false: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 60 2326 2367 41 4.3 232.6 1.0X write char with length 60 2478 2501 37 4.0 247.8 0.9X write varchar with length 60 2475 2503 24 4.0 247.5 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 80, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 80 9367 9773 354 1.1 936.7 1.0X write 
char with length 80 10454 10621 238 1.0 1045.4 0.9X write varchar with length 80 18943 19503 571 0.5 1894.3 0.5X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Write with length 100, hasSpaces: true: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ write string with length 100 11055 11104 59 0.9 1105.5 1.0X write char with length 100 12204 12275 63 0.8 1220.4 0.9X write varchar with length 100 21737 22275 574 0.5 2173.7 0.5X ``` Closes #31199 from yaooqinn/SPARK-34130. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-19 09:03:06 +00:00
Yuming Wang	030639f456	[SPARK-34119][SQL] Keep necessary stats after partition pruning ### What changes were proposed in this pull request? This PR keeps necessary stats after partition pruning. ### Why are the changes needed? Improve query performance. Since SPARK-34081, the aggregate is pushed down because the join can be planned as BroadcastHashJoin. But column statistics are lost after [`PruneFileSourcePartitions`](`d0c83f372b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala (L102-L103)`). Therefore, it will eventually be planned as SortMergeJoin. Please see the log: ``` join.right.stats: org.apache.spark.sql.catalyst.optimizer.PushDownPredicates: Statistics(sizeInBytes=348.8 KiB, rowCount=1.79E+4) join.right.stats: org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions: Statistics(sizeInBytes=1414.2 EiB) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test and benchmark test SQL | Before this PR (Seconds) | After this PR (Seconds) -- | -- | -- q14a | 594 | 384 q14b | 600 | 402 This change will not affect the results of `PlanStabilitySuite`, because it does not have a partition column. Closes #31205 from wangyum/SPARK-34119. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-19 06:09:16 +00:00
Max Gekk	a98e77c113	[SPARK-34143][SQL][TESTS] Fix adding partitions to fully partitioned v2 tables ### What changes were proposed in this pull request? While adding a new partition to v2 `InMemoryAtomicPartitionTable`/`InMemoryPartitionTable`, add a single row to the table content when the table is fully partitioned. ### Why are the changes needed? The `ALTER TABLE .. ADD PARTITION` command does not change the content of a fully partitioned v2 table. For instance, `INSERT INTO` changes table content: ```scala sql(s"CREATE TABLE t (p0 INT, p1 STRING) USING _ PARTITIONED BY (p0, p1)") sql(s"INSERT INTO t SELECT 1, 'def'") sql(s"SELECT * FROM t").show(false) +---+---+ |p0 |p1 | +---+---+ |1 |def| +---+---+ ``` but `ALTER TABLE .. ADD PARTITION` doesn't change v2 table content: ```scala sql(s"ALTER TABLE t ADD PARTITION (p0 = 0, p1 = 'abc')") sql(s"SELECT * FROM t").show(false) +---+---+ |p0 |p1 | +---+---+ +---+---+ ``` ### Does this PR introduce _any_ user-facing change? No, the changes only affect tests. For the example above, the tests now see: ```scala sql(s"ALTER TABLE t ADD PARTITION (p0 = 0, p1 = 'abc')") sql(s"SELECT * FROM t").show(false) +---+---+ |p0 |p1 | +---+---+ |0 |abc| +---+---+ ``` ### How was this patch tested? By running the unified tests for `ALTER TABLE .. ADD PARTITION`. Closes #31216 from MaxGekk/add-partition-by-all-columns. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-19 05:40:15 +00:00
ulysses-you	055124a048	[SPARK-34150][SQL] Strip Null literal.sql in resolve alias ### What changes were proposed in this pull request? Change null Literal to PrettyAttribute during ResolveAlias. ### Why are the changes needed? We convert `Literal(null)` to the target data type during analysis. The generated alias name then includes something like `CAST(NULL AS String)` instead of `NULL`. ``` spark.sql("SELECT RAND(null)").columns -- before rand(CAST(NULL AS INT)) -- after rand(NULL) ``` ### Does this PR introduce _any_ user-facing change? Yes, the default column name may change. ### How was this patch tested? Added a test and passed the existing tests. Closes #31233 from ulysses-you/SPARK-34150. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-19 03:35:08 +00:00
Max Gekk	bea10a6274	[SPARK-34153][SQL] Remove unused `getRawTable()` from `HiveExternalCatalog.alterPartitions()` ### What changes were proposed in this pull request? Remove unused call of `getRawTable()` from `HiveExternalCatalog.alterPartitions()`. ### Why are the changes needed? It reduces the number of calls to Hive External catalog. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite" ``` Closes #31234 from MaxGekk/remove-getRawTable-from-alterPartitions. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-19 11:42:33 +09:00
Max Gekk	514172aae7	[SPARK-34074][SQL][TESTS][FOLLOWUP] Fix table size parsing from statistics ### What changes were proposed in this pull request? Fix table size parsing from the `Statistics` field which is formed at: `c3d81fbe79/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala (L573)` . Before the fix, `getTableSize()` returns only the last digit. This works for the Hive table in the tests because its size is < 10 bytes, and accidentally works for the V1 In-Memory catalog table in the tests. ### Why are the changes needed? This makes tests more reliable. For example, the parsing does not work in `AlterTableDropPartitionSuite`, where the table size before partition dropping is: ``` +---------+ |data_type| +---------+ |878 bytes| +---------+ ``` After: ``` +---------+ |data_type| +---------+ |439 bytes| +---------+ ``` at: ```scala val onePartSize = getTableSize(t) assert(0 < onePartSize && onePartSize < twoPartSize) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite" ``` Closes #31237 from MaxGekk/optimize-updateTableStats-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-19 09:51:20 +09:00
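The "only the last digit" symptom in SPARK-34074 is typical of a greedy regular expression. The sketch below is an assumption about how such a bug can arise, not the actual test helper (whose code is not shown here); `TableSizeParsingSketch`, `parseLastDigitOnly` and `parseWholeNumber` are hypothetical names.

```scala
object TableSizeParsingSketch {
  // Greedy ".*" swallows every digit except the last one before " bytes".
  private val lastDigitOnly = """.*(\d) bytes.*""".r

  // "\d+" captures the whole number; unanchored, so surrounding text is ignored.
  private val wholeNumber = """(\d+) bytes""".r.unanchored

  /** Reproduces the symptom: "878 bytes" parses as 8. */
  def parseLastDigitOnly(stats: String): Option[Long] = stats match {
    case lastDigitOnly(digit) => Some(digit.toLong)
    case _ => None
  }

  /** Parses the full size: "878 bytes" parses as 878. */
  def parseWholeNumber(stats: String): Option[Long] = stats match {
    case wholeNumber(size) => Some(size.toLong)
    case _ => None
  }
}
```

Under this assumption, sizes below 10 bytes parse identically with both patterns, which matches the observation that the old parsing happened to work for the small Hive table in the tests.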
Liang-Chi Hsieh	bba830f523	[SPARK-34148][TEST][SS] Move general StateStore tests to StateStoreSuiteBase ### What changes were proposed in this pull request? This patch moves general StateStore-related tests into `StateStoreSuiteBase`. ### Why are the changes needed? There are some general StateStore tests in `StateStoreSuite`, which is a `HDFSBackedStateStoreProvider`-specific test suite. We should move general tests into `StateStoreSuiteBase`, so it is easier to incorporate other StateStores. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #31219 from viirya/SPARK-34148. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-18 13:03:25 -08:00
Max Gekk	dee596e3ef	[SPARK-34027][SQL] Refresh cache in `ALTER TABLE .. RECOVER PARTITIONS` ### What changes were proposed in this pull request? Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. RECOVER PARTITIONS`. ### Why are the changes needed? This fixes the issues portrayed by the example: ```sql spark-sql> create table tbl (col int, part int) using parquet partitioned by (part); spark-sql> insert into tbl partition (part=0) select 0; spark-sql> cache table tbl; spark-sql> select * from tbl; 0 0 spark-sql> show table extended like 'tbl' partition(part=0); default tbl false Partition Values: [part=0] Location: file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 ... ``` Create new partition by copying the existing one: ``` $ cp -r /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1 ``` ```sql spark-sql> alter table tbl recover partitions; spark-sql> select * from tbl; 0 0 ``` The last query must return `0 1` since it has been recovered by `ALTER TABLE .. RECOVER PARTITIONS`. ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```sql ... spark-sql> alter table tbl recover partitions; spark-sql> select * from tbl; 0 0 0 1 ``` ### How was this patch tested? By running the affected test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #31066 from MaxGekk/recover-partitions-refresh-cache. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-18 13:52:39 +00:00
yangjie01	163afa6fcf	[SPARK-34151][SQL] Replaces `java.io.File.toURL` with `java.io.File.toURI.toURL` ### What changes were proposed in this pull request? The `java.io.File.toURL` method does not automatically escape characters that are illegal in URLs. The Javadoc recommends that new code convert an abstract pathname into a URL by first converting it into a URI, via the `toURI` method, and then converting the URI into a URL via the `URI.toURL` method. So this PR cleans up the relevant cases in Spark code. ### Why are the changes needed? Cleaning up `Deprecated` Java API usage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31230 from LuciferYang/SPARK-34151. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-18 21:39:00 +09:00
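The difference between the two conversions in SPARK-34151 shows up as soon as a path contains a character that is illegal in a URL, such as a space. The following is a small sketch using only `java.io.File` and `java.net.URL`; the object name, method names and the example path are made up for illustration.

```scala
import java.io.File
import java.net.URL

object FileUrlEscapingSketch {
  /** Deprecated route: the path is copied into the URL without escaping. */
  def direct(file: File): URL = file.toURL // the compiler reports a deprecation warning here

  /** Recommended route: File -> URI (escapes illegal characters) -> URL. */
  def viaUri(file: File): URL = file.toURI.toURL

  def main(args: Array[String]): Unit = {
    // Hypothetical path; the file does not need to exist for the conversion.
    val jar = new File("/tmp/my dir/app.jar")
    println(direct(jar)) // the space is kept as-is
    println(viaUri(jar)) // the space is percent-encoded as %20
  }
}
```

An unescaped space makes the resulting URL string malformed for consumers that parse it strictly, which is why the URI route is the one the Javadoc points to.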
Terry Kim	78893b8dc9	[SPARK-34139][SQL] UnresolvedRelation should retain SQL text position for DDL commands ### What changes were proposed in this pull request? Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect: ``` scala> sql("CACHE TABLE unknown") org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 0; ``` , whereas the `pos` should be `12`. This PR proposes to fix this issue for commands using `UnresolvedRelation`: ``` CACHE TABLE unknown UNCACHE TABLE unknown DELETE FROM unknown UPDATE unknown SET name='abc' MERGE INTO unknown1 AS target USING unknown2 AS source ON target.col = source.col WHEN MATCHED THEN DELETE INSERT INTO TABLE unknown SELECT 1 INSERT OVERWRITE TABLE unknown VALUES (1, 'a') ``` ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, now the above example will print the following: ``` org.apache.spark.sql.AnalysisException: Table or view not found: unknown; line 1 pos 12; ``` ### How was this patch tested? Add a new test. Closes #31209 from imback82/unresolved_relation_message. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-18 05:57:25 +00:00
Yuming Wang	c87b0085c9	[SPARK-33696][BUILD][SQL] Upgrade built-in Hive to 2.3.8 ### What changes were proposed in this pull request? Hive 2.3.8 changes: HIVE-19662: Upgrade Avro to 1.8.2 HIVE-24324: Remove deprecated API usage from Avro HIVE-23980: Shade Guava from hive-exec in Hive 2.3 HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue HIVE-24512: Exclude calcite in packaging. HIVE-22708: Fix for HttpTransport to replace String.equals HIVE-24551: Hive should include transitive dependencies from calcite after shading it HIVE-24553: Exclude calcite from test-jar dependency of hive-exec ### Why are the changes needed? Upgrade Avro and Parquet to latest version. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test add test try to upgrade Parquet to 1.11.1 and Avro to 1.10.1: https://github.com/apache/spark/pull/30517 Closes #30657 from wangyum/SPARK-33696. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-17 21:54:35 -08:00
Terry Kim	0540392464	[SPARK-34140][SQL] Move QueryCompilationErrors.scala and QueryExecutionErrors.scala to org/apache/spark/sql/errors ### What changes were proposed in this pull request? `QueryCompilationErrors.scala` and `QueryExecutionErrors.scala` use the `org.apache.spark.sql.errors` package, but these files reside in the `org/apache/spark/sql` directory. This PR proposes to move these files to `org/apache/spark/sql/errors`. ### Why are the changes needed? To match the package name with the directory structure. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests Closes #31211 from imback82/error_package. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-16 13:40:21 -08:00
Max Gekk	c3d81fbe79	[SPARK-34060][SQL][FOLLOWUP] Preserve serializability of canonicalized CatalogTable ### What changes were proposed in this pull request? Replace `toMap` by `map(identity).toMap` while getting canonicalized representation of `CatalogTable`. `CatalogTable` became not serializable after https://github.com/apache/spark/pull/31112 due to usage of `filterKeys`. The workaround was taken from https://github.com/scala/bug/issues/7005. ### Why are the changes needed? This prevents the errors like: ``` [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1 [info] Cause: java.io.NotSerializableException: scala.collection.immutable.MapLike$$anon$1 ``` ### Does this PR introduce _any_ user-facing change? Should not. ### How was this patch tested? By running the test suite affected by https://github.com/apache/spark/pull/31112: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #31197 from MaxGekk/fix-caching-hive-table-2-followup. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-15 17:02:29 -08:00
Chao Sun	b6f46ca297	[SPARK-33212][BUILD] Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile ### What changes were proposed in this pull request? This: 1. switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. 2. upgrades the built-in version for Hadoop 3.x to Hadoop 3.2.2 Note that for Hadoop 2.7, we'll still use the same modules such as hadoop-client. In order to keep the default Hadoop profile as hadoop-3.2, this defines the following Maven properties: ``` hadoop-client-api.artifact hadoop-client-runtime.artifact hadoop-client-minicluster.artifact ``` which default to: ``` hadoop-client-api hadoop-client-runtime hadoop-client-minicluster ``` but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable Maven enforcer `banDuplicatePomDependencyVersions`. Besides the above, there are the following changes: - explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars. - removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API. - modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests). ### Why are the changes needed? Hadoop 3.2.2 is released with new features and bug fixes, so it's good for the Spark community to adopt it. However, the latest Hadoop versions starting from Hadoop 3.2.1 have upgraded to use Guava 27+. In order to resolve Guava conflicts, this takes the approach of switching to the shaded client jars provided by Hadoop. This also has the benefit of avoiding pulling in other third-party dependencies from the Hadoop side, so as to avoid more potential future conflicts. 
### Does this PR introduce _any_ user-facing change? When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts. ### How was this patch tested? Relying on existing tests. Closes #30701 from sunchao/test-hadoop-3.2.2. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-15 14:06:50 -08:00
Kent Yao	a235c3b254	[SPARK-34037][SQL] Remove unnecessary upcasting for Avg & Sum which handle by themself internally ### What changes were proposed in this pull request? The type-coercion for numeric types of average and sum is not necessary at all, as the resultType and sumType can prevent the overflow. ### Why are the changes needed? rm unnecessary logic which may cause potential performance regressions ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? tpcds tests for plan Closes #31079 from yaooqinn/SPARK-34037. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-01-15 10:18:58 -08:00
yangjie01	9e33d49b5b	[SPARK-33346][CORE][SQL][MLLIB][DSTREAM][K8S] Change the never changed 'var' to 'val' ### What changes were proposed in this pull request? Some local variables are declared as `var` but are never reassigned, so they should be declared as `val`. This PR changes them from `var` to `val`, except for `mockito`-related cases. ### Why are the changes needed? Use `val` instead of `var` when possible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31142 from LuciferYang/SPARK-33346. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-01-15 08:47:02 -06:00
Wenchen Fan	6cd0092150	Revert "[SPARK-34064][SQL] Cancel the running broadcast sub-jobs when SQL statement is cancelled" This reverts commit `f1b21ba505`.	2021-01-15 21:45:17 +08:00
Peter Toth	00d43b1f82	[SPARK-32864][SQL] Support ORC forced positional evolution ### What changes were proposed in this pull request? Add support for the `orc.force.positional.evolution` config, which forces ORC top-level column matching by position rather than by name. This does work in Hive: ``` > set orc.force.positional.evolution; +--------------------------------------+ | set | +--------------------------------------+ | orc.force.positional.evolution=true | +--------------------------------------+ > create table t (c1 string, c2 string) stored as orc; > insert into t values ('foo', 'bar'); > alter table t change c1 c3 string; ``` The ORC file in this case contains the original `c1` and `c2` columns, which don't match the metadata in HMS. But due to the positional evolution setting, Hive is able to return all the data: ``` > select * from t; +--------+--------+ | t.c3 | t.c2 | +--------+--------+ | foo | bar | +--------+--------+ ``` Without this PR Spark returns `null`s for the renamed `c3` column. After this PR Spark returns the data in `c3` column. ### Why are the changes needed? Hive/ORC does support it. ### Does this PR introduce _any_ user-facing change? Yes, we will support `orc.force.positional.evolution`. ### How was this patch tested? New UT. Closes #29737 from peter-toth/SPARK-32864-support-orc-forced-positional-evolution. Lead-authored-by: Peter Toth <peter.toth@gmail.com> Co-authored-by: Peter Toth <ptoth@cloudera.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-14 21:27:25 -08:00
Kent Yao	acd6c1271b	[SPARK-34114][SQL] should not trim right for read-side char length check and padding ### What changes were proposed in this pull request? On the read-side, we should respect the original data instead of trimming it first. It brings extra overhead on the code-gen code side, trimming and padding for the same field, and it's also unnecessary and a bug ### Why are the changes needed? bugfix and perf regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #31181 from yaooqinn/SPARK-34114. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-15 04:30:23 +00:00
Kousuke Saruta	bec80d7eec	[SPARK-34101][SQL] Make spark-sql CLI configurable for the behavior of printing header by SET command ### What changes were proposed in this pull request? This PR introduces a new property `spark.sql.cli.print.header` to let users change the behavior of printing header for spark-sql CLI by SET command. ### Why are the changes needed? Like Hive CLI, spark-sql CLI accepts `hive.cli.print.header` property and we can change the behavior of printing header. But spark-sql CLI doesn't allow users to change Hive specific configurations dynamically by SET command. So, it's better to support the way to change the behavior by SET command. ### Does this PR introduce _any_ user-facing change? Yes. Users can dynamically change the behavior by SET command. ### How was this patch tested? I confirmed with the following commands/queries. ``` spark-sql> select (1) as a, (2) as b, (3) as c, (4) as d; 1 2 3 4 Time taken: 3.218 seconds, Fetched 1 row(s) spark-sql> set spark.sql.cli.print.header=true; key value spark.sql.cli.print.header true Time taken: 1.506 seconds, Fetched 1 row(s) spark-sql> select (1) as a, (2) as b, (3) as c, (4) as d; a b c d 1 2 3 4 Time taken: 0.79 seconds, Fetched 1 row(s) ``` Closes #31173 from sarutak/spark-sql-print-header. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-15 13:15:48 +09:00
Max Gekk	adba2ec8f2	[SPARK-34099][SQL] Keep dependents of v2 tables cached during cache refreshing in DS v2 commands ### What changes were proposed in this pull request? This PR changes cache refreshing of v2 tables in v2 commands. In particular, v2 table dependents are not removed from the cache after this PR. Comparing to current implementation, we just clear cached data of all dependents and keep them in the cache. So, the next actions will fill in the cached data of the original v2 table and its dependents. In more details: 1. Add new method `recacheTable()` to `DataSourceV2Strategy` and pass it the exec node where need to recache table. New method uses `recacheByPlan` to refresh data cache of v2 tables, and keeps table dependents still cached while clearing their caches. 2. Simplify `invalidateCache` (and rename it `invalidateTableCache`) by retargeting it for only table cache invalidation. 3. Modify a test for `REFRESH TABLE` and check that v2 table dependent is still cached after refreshing the base table. ### Why are the changes needed? 1. This should improve user experience with table/view caching. For example, let's imagine that an user has cached v2 table and cached view based on the table. And the user passed the table to external library which drops/renames/adds partitions in the v2 table. Unfortunately, the user gets the view uncached after that even he/she hasn't uncached the view explicitly. 2. Improve code maintenance. 3. Reduce the number of calls to the Cache Manager when need to recache a table. Before the changes, `invalidateCache()` invokes the Cache Manager 3 times: `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`. 4. Also this should speed up table recaching. ### Does this PR introduce _any_ user-facing change? From the view of the correctness of query results, there are no behavior changes but the changes might influence on consuming memory and query execution time. ### How was this patch tested? 
By running the existing test suites for v2 the add/drop/rename partition commands. Closes #31172 from MaxGekk/dsv2-recache-table. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-15 03:32:49 +00:00
yangjie01	8ed23ed499	[SPARK-34118][CORE][SQL] Replaces filter and check for emptiness with exists or forall ### What changes were proposed in this pull request? This PR uses `exists` or `forall` to simplify `filter + emptiness check`; the result is semantically equivalent but simpler. The rules are as follows: - `seq.filter(p).size == 0` -> `!seq.exists(p)` - `seq.filter(p).length > 0` -> `seq.exists(p)` - `seq.filterNot(p).isEmpty` -> `seq.forall(p)` - `seq.filterNot(p).nonEmpty` -> `!seq.forall(p)` ### Why are the changes needed? Code simplification. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31184 from LuciferYang/SPARK-34118. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-15 12:12:33 +09:00
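The four rewrite rules listed in SPARK-34118 can be checked on plain Scala collections. The sketch below is illustrative only: `RewriteRules`, `rulesHold`, `isEven`, and the sample data are hypothetical names, not code from the patch.

```scala
object RewriteRules {
  // Returns true when every "filter + emptiness check" form agrees with its
  // exists/forall replacement for the given sequence and predicate.
  def rulesHold[A](seq: Seq[A])(p: A => Boolean): Boolean =
    ((seq.filter(p).size == 0) == !seq.exists(p)) &&
    ((seq.filter(p).length > 0) == seq.exists(p)) &&
    (seq.filterNot(p).isEmpty == seq.forall(p)) &&
    (seq.filterNot(p).nonEmpty == !seq.forall(p))

  def main(args: Array[String]): Unit = {
    val isEven: Int => Boolean = _ % 2 == 0
    // Mixed, all-matching, none-matching, and empty inputs.
    val samples = Seq(Seq(1, 3, 4, 7), Seq(2, 4), Seq(1, 3), Seq.empty[Int])
    samples.foreach(s => assert(rulesHold(s)(isEven)))
    println("all rewrite rules hold on the samples")
  }
}
```

Besides being shorter, `exists` and `forall` stop at the first deciding element and do not build an intermediate collection, which `filter` and `filterNot` do.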
Liang-Chi Hsieh	0e64a22b28	[SPARK-34116][SS][TEST] Separate state store numKeys metric test and validate metrics after committing ### What changes were proposed in this pull request? This patch proposes to pull the test of `numKeys` metric into a separate test in `StateStoreSuite`. ### Why are the changes needed? Right now in `StateStoreSuite`, the tests of get/put/remove/commit are mixed with `numKeys` metric test. I found it is flaky when I was testing with other `StateStore` implementation. Current test logic is tightly bound to the in-memory map behavior of `HDFSBackedStateStore`. For example, put can immediately show up in the `numKeys` metric. But for a `StateStore` implementation relying on external storage, e.g. RocksDB, the metric might be updated once the data is actually committed. And `StateStoreSuite` should be a common test suite for all kinds of StateStore implementations. Specifically, we also are able to check these metrics after state store is updated (committed). So I think we can refactor the test a little bit to make it easier to incorporate other `StateStore` externally. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #31183 from viirya/SPARK-34116. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-14 15:03:22 -08:00
ulysses-you	92e5cfd58d	[SPARK-33989][SQL] Strip auto-generated cast when using Cast.sql ### What changes were proposed in this pull request? This PR aims to strip auto-generated cast. The main logic is: 1. Add tag if Cast is specified by user. 2. Wrap `PrettyAttribute` in usePrettyExpression. ### Why are the changes needed? Make sql consistent with dsl. Here is an inconsistent example before this PR: ``` -- output field name: FLOOR(1) spark.emptyDataFrame.select(floor(lit(1))) -- output field name: FLOOR(CAST(1 AS DOUBLE)) spark.sql("select floor(1)") ``` Note that, we don't remove the `Cast` so the auto-generated `Cast` can still work. The only changed place is `usePrettyExpression`, we use `PrettyAttribute` replace `Cast` to give a better sql string. ### Does this PR introduce _any_ user-facing change? Yes, the default field name may change. ### How was this patch tested? Add test and pass exists test. Closes #31034 from ulysses-you/SPARK-33989. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-14 15:27:14 +00:00
Dereck Li	1288ad814e	[SPARK-34067][SQL] PartitionPruning push down pruningHasBenefit function into insertPredicate function to decrease calculate time ### What changes were proposed in this pull request? PartitionPruning push down pruningHasBenefit function into insertPredicate function to decrease calculate time ### Why are the changes needed? to accelerate PartitionPruning prune calculate ### Does this PR introduce _any_ user-facing change? NO ### How was this patch tested? existed unit test Closes #31122 from monkeyboy123/optimize-dynamic-pruning. Authored-by: Dereck Li <monkeyboy.ljh@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-01-14 16:28:06 +08:00
Yuming Wang	d3ea308c8f	[SPARK-34081][SQL] Only pushdown LeftSemi/LeftAnti over Aggregate if join can be planned as broadcast join ### What changes were proposed in this pull request? Should not pushdown LeftSemi/LeftAnti over Aggregate for some cases. ```scala spark.range(50000000L).selectExpr("id % 10000 as a", "id % 10000 as b").write.saveAsTable("t1") spark.range(40000000L).selectExpr("id % 8000 as c", "id % 8000 as d").write.saveAsTable("t2") spark.sql("SELECT distinct a, b FROM t1 INTERSECT SELECT distinct c, d FROM t2").explain ``` Before this pr: ``` == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[a#16L, b#17L], functions=[]) +- HashAggregate(keys=[a#16L, b#17L], functions=[]) +- HashAggregate(keys=[a#16L, b#17L], functions=[]) +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#72] +- HashAggregate(keys=[a#16L, b#17L], functions=[]) +- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L)], LeftSemi :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) ASC NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, [id=#65] : +- FileScan parquet default.t1[a#16L,b#17L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint> +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) ASC NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), ENSURE_REQUIREMENTS, [id=#66] +- HashAggregate(keys=[c#18L, d#19L], functions=[]) +- Exchange 
hashpartitioning(c#18L, d#19L, 5), ENSURE_REQUIREMENTS, [id=#61] +- HashAggregate(keys=[c#18L, d#19L], functions=[]) +- FileScan parquet default.t2[c#18L,d#19L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:bigint,d:bigint> ``` After this pr: ``` == Physical Plan == AdaptiveSparkPlan isFinalPlan=false +- HashAggregate(keys=[a#16L, b#17L], functions=[]) +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#74] +- HashAggregate(keys=[a#16L, b#17L], functions=[]) +- SortMergeJoin [coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L)], [coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L)], LeftSemi :- Sort [coalesce(a#16L, 0) ASC NULLS FIRST, isnull(a#16L) ASC NULLS FIRST, coalesce(b#17L, 0) ASC NULLS FIRST, isnull(b#17L) ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(coalesce(a#16L, 0), isnull(a#16L), coalesce(b#17L, 0), isnull(b#17L), 5), ENSURE_REQUIREMENTS, [id=#67] : +- HashAggregate(keys=[a#16L, b#17L], functions=[]) : +- Exchange hashpartitioning(a#16L, b#17L, 5), ENSURE_REQUIREMENTS, [id=#61] : +- HashAggregate(keys=[a#16L, b#17L], functions=[]) : +- FileScan parquet default.t1[a#16L,b#17L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:bigint,b:bigint> +- Sort [coalesce(c#18L, 0) ASC NULLS FIRST, isnull(c#18L) ASC NULLS FIRST, coalesce(d#19L, 0) ASC NULLS FIRST, isnull(d#19L) ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(coalesce(c#18L, 0), isnull(c#18L), coalesce(d#19L, 0), isnull(d#19L), 5), ENSURE_REQUIREMENTS, [id=#68] +- HashAggregate(keys=[c#18L, d#19L], functions=[]) +- Exchange hashpartitioning(c#18L, d#19L, 5), ENSURE_REQUIREMENTS, [id=#63] +- 
HashAggregate(keys=[c#18L, d#19L], functions=[]) +- FileScan parquet default.t2[c#18L,d#19L] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/spark/spark-warehouse/org.apache.spark.sql.Data..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c:bigint,d:bigint> ``` ### Why are the changes needed? 1. Pushdown LeftSemi/LeftAnti over Aggregate will affect performance. 2. It will remove user added DISTINCT operator, e.g.: [q38](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q38.sql), [q87](https://github.com/apache/spark/blob/master/sql/core/src/test/resources/tpcds/q87.sql). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test and benchmark test. SQL \| Before this PR(Seconds) \| After this PR(Seconds) -- \| -- \| -- q14a \| 660 \| 594 q14b \| 660 \| 600 q38 \| 55 \| 29 q87 \| 66 \| 35 Before this pr: ![image](https://user-images.githubusercontent.com/5399861/104452849-8789fc80-55de-11eb-88da-44059899f9a9.png) After this pr: ![image](https://user-images.githubusercontent.com/5399861/104452899-9a043600-55de-11eb-9286-d8f3a23ca3b8.png) Closes #31145 from wangyum/SPARK-34081. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-14 04:37:55 +00:00
Gengliang Wang	467d758973	[SPARK-34075][SQL][CORE] Hidden directories are being listed for partition inference ### What changes were proposed in this pull request? Fix a regression from https://github.com/apache/spark/pull/29959. In Spark, the following file paths are considered hidden paths and are ignored on file reads: 1. starts with "_" and doesn't contain "=" 2. starts with "." However, after the refactoring PR https://github.com/apache/spark/pull/29959, the hidden paths are not filtered out on partition inference: https://github.com/apache/spark/pull/29959/files#r556432426 This PR fixes the bug. To achieve this, the method `InMemoryFileIndex.shouldFilterOut` is refactored as `HadoopFSUtils.shouldFilterOutPathName` ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? Yes, it fixes a bug for reading file paths with partitions. ### How was this patch tested? Unit test Closes #31169 from gengliangwang/fileListingBug. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-14 09:39:38 +09:00
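The two hidden-path rules stated in SPARK-34075 can be written as a small predicate. This is a simplified sketch of the rule as described in the commit message, not the body of `HadoopFSUtils.shouldFilterOutPathName`, which may handle additional cases; `HiddenPathExample` and `isHiddenPathName` are hypothetical names.

```scala
object HiddenPathExample {
  // A path name is treated as hidden when it
  //   1. starts with "_" and does not contain "=", or
  //   2. starts with ".".
  // The "=" exception keeps partition directories such as "_col=1" visible.
  def isHiddenPathName(name: String): Boolean =
    (name.startsWith("_") && !name.contains("=")) || name.startsWith(".")

  def main(args: Array[String]): Unit = {
    Seq("_SUCCESS", "_col=1", ".hidden", "part=0").foreach { name =>
      println(s"$name -> ${isHiddenPathName(name)}")
    }
  }
}
```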
Kousuke Saruta	b7da108cae	[SPARK-33690][SQL][FOLLOWUP] Escape further meta-characters in showString ### What changes were proposed in this pull request? This is a followup PR for SPARK-33690 (#30647). In addition to the original PR, this PR intends to escape the following meta-characters in `Dataset#showString`. * `\r` (carriage return) * `\f` (form feed) * `\b` (backspace) * `\u000B` (vertical tab) * `\u0007` (bell) ### Why are the changes needed? To avoid breaking the layout of `Dataset#showString`. `\u0007` does not break the layout of `Dataset#showString`, but it is noisy (it beeps for each row), so it should also be escaped. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified the existing tests. I also built the documents and checked the generated html for `sql-migration-guide.md`. Closes #31144 from sarutak/escape-metacharacters-in-getRows. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-01-13 18:13:01 -06:00
Kousuke Saruta	62d8466c74	[SPARK-34051][SQL] Support 32-bit unicode escape in string literals ### What changes were proposed in this pull request? This PR adds a feature which supports 32-bit unicode escape in string literals like PostgreSQL or some modern programming languages do (e.g., Python3, C++11 and Rust). In addition to the feature which supports 16-bit unicode escape like `"\u0041"`, users can express unicode characters like `"\U00020BB7"` with this change. ### Why are the changes needed? Users can express unicode characters directly, without surrogate pairs. ### Does this PR introduce _any_ user-facing change? Yes. Users can express all unicode characters directly. ### How was this patch tested? Added new assertions to the existing test case. Closes #31096 from sarutak/32-bit-unicode-escape. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-01-13 18:10:03 -06:00
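To illustrate why a 32-bit escape helps, the sketch below shows how the code point U+20BB7 from SPARK-34051 is represented on the JVM. It uses only JDK calls and is not the Spark parser code; `UnicodeEscapeExample` and `fromCodePoint` are hypothetical names.

```scala
object UnicodeEscapeExample {
  // Expand a Unicode code point into its UTF-16 string representation.
  def fromCodePoint(codePoint: Int): String =
    new String(Character.toChars(codePoint))

  def main(args: Array[String]): Unit = {
    val s = fromCodePoint(0x20BB7)
    // A code point above U+FFFF needs two UTF-16 code units (a surrogate
    // pair), so a 16-bit escape would have to be written twice.
    println(s.length)                       // number of UTF-16 code units
    println(s.codePointCount(0, s.length))  // number of code points
    println(s.map(c => "%04X".format(c.toInt)).mkString(" "))
  }
}
```

With only 16-bit escapes, the same character has to be spelled as the surrogate pair `\uD842\uDFB7`; the 32-bit form `\U00020BB7` names the code point directly.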
yangjie01	8b1ba233f1	[SPARK-34068][CORE][SQL][MLLIB][GRAPHX] Remove redundant collection conversion ### What changes were proposed in this pull request? There are some redundant collection conversion can be removed, for version compatibility, clean up these with Scala-2.13 profile. ### Why are the changes needed? Remove redundant collection conversion ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`,`kafka-0-10` in Scala 2.13 passed Closes #31125 from LuciferYang/SPARK-34068. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-01-13 18:07:02 -06:00
yangjie01	8c5fecda73	[SPARK-34070][CORE][SQL] Replaces find and emptiness check with exists ### What changes were proposed in this pull request? This PR uses `exists` to simplify `find + emptiness check`; the result is semantically equivalent but simpler. Before ``` seq.find(p).isDefined or seq.find(p).isEmpty ``` After ``` seq.exists(p) or !seq.exists(p) ``` ### Why are the changes needed? Code simplification. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31130 from LuciferYang/SPARK-34070. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-01-13 10:42:24 -06:00
Chao Sun	62d82b5b27	[SPARK-34076][SQL] SQLContext.dropTempTable fails if cache is non-empty ### What changes were proposed in this pull request? This changes `CatalogImpl.dropTempView` and `CatalogImpl.dropGlobalTempView` use analyzed logical plan instead of `viewDef` which is unresolved. ### Why are the changes needed? Currently, `CatalogImpl.dropTempView` is implemented as following: ```scala override def dropTempView(viewName: String): Boolean = { sparkSession.sessionState.catalog.getTempView(viewName).exists { viewDef => sparkSession.sharedState.cacheManager.uncacheQuery( sparkSession, viewDef, cascade = false) sessionCatalog.dropTempView(viewName) } } ``` Here, the logical plan `viewDef` is not resolved, and when passing to `uncacheQuery`, it could fail at `sameResult` call, where canonicalized plan is compared. The error message looks like: ``` Invalid call to qualifier on unresolved object, tree: 'key ``` This can be reproduced via: ```scala sql(s"CREATE TEMPORARY VIEW $v AS SELECT key FROM src LIMIT 10") sql(s"CREATE TABLE $t AS SELECT * FROM src") sql(s"CACHE TABLE $t") dropTempTable(v) ``` ### Does this PR introduce _any_ user-facing change? The only user-facing change is that, previously `SQLContext.dropTempTable` may fail in the above scenario but will work with this fix. ### How was this patch tested? Added new unit tests. Closes #31136 from sunchao/SPARK-34076. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-13 13:22:21 +00:00
LantaoJin	f1b21ba505	[SPARK-34064][SQL] Cancel the running broadcast sub-jobs when SQL statement is cancelled ### What changes were proposed in this pull request? #24595 introduced `private val runId: UUID = UUID.randomUUID` in `BroadcastExchangeExec` to cancel the broadcast execution in the Future when timeout happens. Since the runId is a random UUID instead of inheriting the job group id, when a SQL statement is cancelled, these broadcast sub-jobs are still executing. This PR uses the job group id of the outside thread as its `runId` to abort these broadcast sub-jobs when the SQL statement is cancelled. ### Why are the changes needed? When broadcasting a table takes too long and the SQL statement is cancelled. However, the background Spark job is still running and it wastes resources. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually test. Since broadcasting a table is too fast to cancel in UT, but it is very easy to verify manually: 1. Start a Spark thrift-server with less resource in YARN. 2. When the driver is running but no executors are launched, submit a SQL which will broadcast tables from beeline. 3. Cancel the SQL in beeline Without the patch, broadcast sub-jobs won't be cancelled. ![Screen Shot 2021-01-11 at 12 03 13 PM](https://user-images.githubusercontent.com/1853780/104150975-ab024b00-5416-11eb-8bf9-b5167bdad80a.png) With this patch, broadcast sub-jobs will be cancelled. ![Screen Shot 2021-01-11 at 11 43 40 AM](https://user-images.githubusercontent.com/1853780/104150994-be151b00-5416-11eb-80ff-313d423c8a2e.png) Closes #31119 from LantaoJin/SPARK-34064. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-13 12:58:27 +00:00
Kent Yao	04f031acb3	[SPARK-34086][SQL] RaiseError generates too much code and may fails codegen in length check for char varchar ### What changes were proposed in this pull request? https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/133928/testReport/org.apache.spark.sql.execution/LogicalPlanTagInSparkPlanSuite/q41/ We can reduce the generated code by more than 8000 bytes by removing the unnecessary CONCAT expression. With this fix, for q41 in TPCDS with [Using TPCDS original definitions for char/varchar columns](https://github.com/apache/spark/pull/31012) applied, the stage code-gen size drops from 22523 to 14369 ``` 14369 - 22523 = - 8154 ``` ### Why are the changes needed? To fix the perf regression (other improvements are still needed for q41 to work); there is a huge performance regression if codegen fails. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Modified UTs. Closes #31150 from yaooqinn/SPARK-34086. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-13 09:52:36 +00:00
Max Gekk	861f8bb5fb	[SPARK-34071][SQL][TESTS] Check stats of cached v1 tables after altering ### What changes were proposed in this pull request? Port the test added by https://github.com/apache/spark/pull/31112 to: 1. v1 In-Memory catalog for `ALTER TABLE .. DROP PARTITION` 2. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. ADD PARTITION` 3. v1 In-Memory and Hive external catalogs for `ALTER TABLE .. RENAME PARTITION` ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableRenamePartitionSuite" ``` Closes #31131 from MaxGekk/cache-stats-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-13 04:58:01 +00:00
Takuya UESHIN	ad8e40e2ab	[SPARK-32338][SQL][PYSPARK][FOLLOW-UP][TEST] Add more tests for slice function ### What changes were proposed in this pull request? This PR is a follow-up of #29138 and #29195 to add more tests for `slice` function. ### Why are the changes needed? The original PRs are missing tests with column-based arguments instead of literals. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests and existing tests. Closes #31159 from ueshin/issues/SPARK-32338/slice_tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-13 09:56:38 +09:00
yi.wu	0099715aae	[SPARK-34091][SQL] Shuffle batch fetch should be able to disable after it's been enabled ### What changes were proposed in this pull request? Fix the setting issue of shuffle batch fetch in `ShuffledRowRDD`. ### Why are the changes needed? Currently, shuffle batch fetch mode cannot be disabled once it has been enabled. This PR fixes the issue by making `ShuffledRowRDD` respect `spark.sql.adaptive.fetchShuffleBlocksInBatch` at runtime. ### Does this PR introduce _any_ user-facing change? Yes. Before this PR, users could not disable batch fetch once they had enabled it. After this PR, they can. ### How was this patch tested? Added unit test. Closes #31155 from Ngone51/fix-batchfetch-set-issue. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-12 15:45:15 +00:00
Max Gekk	6c047958f9	[SPARK-34084][SQL] Fix auto updating of table stats in `ALTER TABLE .. ADD PARTITION` ### What changes were proposed in this pull request? Fix an issue in `ALTER TABLE .. ADD PARTITION` which happens when: - A table doesn't have stats - `spark.sql.statistics.size.autoUpdate.enabled` is `true` In that case, `ALTER TABLE .. ADD PARTITION` does not update table stats automatically. ### Why are the changes needed? The changes fix the issue demonstrated by the example: ```sql spark-sql> create table tbl (col0 int, part int) partitioned by (part); spark-sql> insert into tbl partition (part = 0) select 0; spark-sql> set spark.sql.statistics.size.autoUpdate.enabled=true; spark-sql> alter table tbl add partition (part = 1); ``` The `add partition` command should update table stats, but it does not. There are no stats in the output of: ``` spark-sql> describe table extended tbl; ``` ### Does this PR introduce _any_ user-facing change? Yes. After the changes, `ALTER TABLE .. ADD PARTITION` updates stats even when a table does not have them before the command: ```sql spark-sql> alter table tbl add partition (part = 1); spark-sql> describe table extended tbl; col0 int NULL part int NULL # Partition Information # col_name data_type comment part int NULL # Detailed Table Information ... Statistics 2 bytes ``` ### How was this patch tested? By running new UT and existing test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableAddPartitionSuite" ``` Closes #31149 from MaxGekk/fix-stats-in-add-partition. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-12 14:34:17 +00:00

10590 commits