ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Max Gekk	b2dfeae18b	[SPARK-33911][SQL][DOCS] Update the SQL migration guide about changes in `HiveClientImpl` ### What changes were proposed in this pull request? Update the SQL migration guide about the changes made by: - https://github.com/apache/spark/pull/30778 - https://github.com/apache/spark/pull/30711 - https://github.com/apache/spark/pull/30866 ### Why are the changes needed? To inform users about the recent changes in the upcoming releases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30925 from MaxGekk/sql-migr-guide-hiveclientimpl. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-27 17:57:42 +09:00
yangjie01	37ae0a6086	[SPARK-33560][TEST-MAVEN][BUILD] Add "unused-import" check to Maven compilation process ### What changes were proposed in this pull request? Similar to SPARK-33441, this pr add `unused-import` check to Maven compilation process. After this pr `unused-import` will trigger Maven compilation error. For Scala 2.13 profile, this pr also left TODO(SPARK-33499) similar to SPARK-33441 because `scala.language.higherKinds` no longer needs to be imported explicitly since Scala 2.13.1 ### Why are the changes needed? Let Maven build also check for unused imports as compilation error. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Local manual test：add an unused import intentionally to trigger maven compilation error. Closes #30784 from LuciferYang/SPARK-33560. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-26 17:40:19 -06:00
kozakana	2553d53dc8	[SPARK-33897][SQL] Can't set option 'cross' in join method ### What changes were proposed in this pull request? [The PySpark documentation](https://spark.apache.org/docs/3.0.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.join) says "Must be one of: inner, cross, outer, full, fullouter, full_outer, left, leftouter, left_outer, right, rightouter, right_outer, semi, leftsemi, left_semi, anti, leftanti and left_anti." However, I get the following error when I set the cross option. ``` scala> val df1 = spark.createDataFrame(Seq((1,"a"),(2,"b"))) df1: org.apache.spark.sql.DataFrame = [_1: int, _2: string] scala> val df2 = spark.createDataFrame(Seq((1,"A"),(2,"B"), (3, "C"))) df2: org.apache.spark.sql.DataFrame = [_1: int, _2: string] scala> df1.join(right = df2, usingColumns = Seq("_1"), joinType = "cross").show() java.lang.IllegalArgumentException: requirement failed: Unsupported using join type Cross at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.plans.UsingJoin.<init>(joinTypes.scala:106) at org.apache.spark.sql.Dataset.join(Dataset.scala:1025) ... 53 elided ``` ### Why are the changes needed? The documentation says cross option can be set, but when I try to set it, I get an java.lang.IllegalArgumentException. ### Does this PR introduce _any_ user-facing change? Accepting this PR fix will behave the same as the documentation. ### How was this patch tested? There is already a test for [JoinTypes](`1b9fd67904/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/plans/JoinTypesTest.scala`), but I can't find a test for the join option itself. Closes #30803 from kozakana/allow_cross_option. Authored-by: kozakana <goki727@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-26 16:30:50 +09:00
angerszhu	10b6466e91	[SPARK-33084][CORE][SQL] Add jar support ivy path ### What changes were proposed in this pull request? Support add jar with ivy path ### Why are the changes needed? Since submit app can support ivy, add jar we can also support ivy now. ### Does this PR introduce _any_ user-facing change? User can add jar with sql like ``` add jar ivy:://group:artifict:version?exclude=xxx,xxx&transitive=true add jar ivy:://group:artifict:version?exclude=xxx,xxx&transitive=false ``` core api ``` sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=true") sparkContext.addJar("ivy:://group:artifict:version?exclude=xxx,xxx&transitive=false") ``` #### Doc Update snapshot ![image](https://user-images.githubusercontent.com/46485123/101227738-de451200-36d3-11eb-813d-78a8b879da4f.png) ### How was this patch tested? Added UT Closes #29966 from AngersZhuuuu/support-add-jar-ivy. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-25 09:07:48 +09:00
Takeshi Yamamuro	65a9ac2ff4	[SPARK-30027][SQL] Support codegen for aggregate filters in HashAggregateExec ### What changes were proposed in this pull request? This pr intends to support code generation for `HashAggregateExec` with filters. Quick benchmark results: ``` $ ./bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v scala> spark.range(100000000).selectExpr("id % 3 as k1", "id % 5 as k2", "rand() as v1", "rand() as v2").write.saveAsTable("t") scala> sql("SELECT k1, k2, AVG(v1) FILTER (WHERE v2 > 0.5) FROM t GROUP BY k1, k2").write.format("noop").mode("overwrite").save() >> Before this PR Elapsed time: 16.170697619s >> After this PR Elapsed time: 6.7825313s ``` The query above is compiled into code below; ``` ... /* 285 / private void agg_doAggregate_avg_0(boolean agg_exprIsNull_2_0, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_0, double agg_expr_2_0) throws java.io.IOException { / 286 / // evaluate aggregate function for avg / 287 / boolean agg_isNull_10 = true; / 288 / double agg_value_12 = -1.0; / 289 / boolean agg_isNull_11 = agg_unsafeRowAggBuffer_0.isNullAt(0); / 290 / double agg_value_13 = agg_isNull_11 ? / 291 / -1.0 : (agg_unsafeRowAggBuffer_0.getDouble(0)); / 292 / if (!agg_isNull_11) { / 293 / agg_agg_isNull_12_0 = true; / 294 / double agg_value_14 = -1.0; / 295 / do { / 296 / if (!agg_exprIsNull_2_0) { / 297 / agg_agg_isNull_12_0 = false; / 298 / agg_value_14 = agg_expr_2_0; / 299 / continue; / 300 / } / 301 / / 302 / if (!false) { / 303 / agg_agg_isNull_12_0 = false; / 304 / agg_value_14 = 0.0D; / 305 / continue; / 306 / } / 307 / / 308 / } while (false); / 309 / / 310 / agg_isNull_10 = false; // resultCode could change nullability. / 311 / / 312 / agg_value_12 = agg_value_13 + agg_value_14; / 313 / / 314 / } / 315 / boolean agg_isNull_15 = false; / 316 / long agg_value_17 = -1L; / 317 / if (!false && agg_exprIsNull_2_0) { / 318 / boolean agg_isNull_18 = agg_unsafeRowAggBuffer_0.isNullAt(1); / 319 / long agg_value_20 = agg_isNull_18 ? / 320 / -1L : (agg_unsafeRowAggBuffer_0.getLong(1)); / 321 / agg_isNull_15 = agg_isNull_18; / 322 / agg_value_17 = agg_value_20; / 323 / } else { / 324 / boolean agg_isNull_19 = true; / 325 / long agg_value_21 = -1L; / 326 / boolean agg_isNull_20 = agg_unsafeRowAggBuffer_0.isNullAt(1); / 327 / long agg_value_22 = agg_isNull_20 ? / 328 / -1L : (agg_unsafeRowAggBuffer_0.getLong(1)); / 329 / if (!agg_isNull_20) { / 330 / agg_isNull_19 = false; // resultCode could change nullability. / 331 / / 332 / agg_value_21 = agg_value_22 + 1L; / 333 / / 334 / } / 335 / agg_isNull_15 = agg_isNull_19; / 336 / agg_value_17 = agg_value_21; / 337 / } / 338 / // update unsafe row buffer / 339 / if (!agg_isNull_10) { / 340 / agg_unsafeRowAggBuffer_0.setDouble(0, agg_value_12); / 341 / } else { / 342 / agg_unsafeRowAggBuffer_0.setNullAt(0); / 343 / } / 344 / / 345 / if (!agg_isNull_15) { / 346 / agg_unsafeRowAggBuffer_0.setLong(1, agg_value_17); / 347 / } else { / 348 / agg_unsafeRowAggBuffer_0.setNullAt(1); / 349 / } / 350 */ } ... ``` ### Why are the changes needed? For high performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #27019 from maropu/AggregateFilterCodegen. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-24 14:44:16 -08:00
ulysses-you	9c30116fb4	[SPARK-33857][SQL] Unify the default seed of random functions ### What changes were proposed in this pull request? Unify the seed of random functions 1. Add a hold place expression `UnresolvedSeed ` as the defualt seed. 2. Change `Rand`,`Randn`,`Uuid`,`Shuffle` default seed to `UnresolvedSeed `. 3. Replace `UnresolvedSeed ` to real seed at `ResolveRandomSeed` rule. ### Why are the changes needed? `Uuid` and `Shuffle` use the `ResolveRandomSeed` rule to set the seed if user doesn't give a seed value. `Rand` and `Randn` do this at constructing. It's better to unify the default seed at Analyzer side since we have used `ExpressionWithRandomSeed` at streaming query. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass exists test and add test. Closes #30864 from ulysses-you/SPARK-33857. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-24 14:30:34 -08:00
sychen	700f5ab65c	[SPARK-33900][WEBUI] Show shuffle read size / records correctly when only remotebytesread is available ### What changes were proposed in this pull request? Shuffle Read Size / Records can also be displayed in remoteBytesRead>0 localBytesRead=0. current: ![image](https://user-images.githubusercontent.com/3898450/103079421-c4ca2280-460e-11eb-9e2f-49d35b5d324d.png) fix: ![image](https://user-images.githubusercontent.com/3898450/103079439-cc89c700-460e-11eb-9a41-6b2882980d11.png) ### Why are the changes needed? At present, the page only displays the data of Shuffle Read Size / Records when localBytesRead>0. When there is only remote reading, metrics cannot be seen on the stage page. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? manual test Closes #30916 from cxzl25/SPARK-33900. Authored-by: sychen <sychen@ctrip.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2020-12-25 00:54:26 +09:00
Kent Yao	29cca68e9e	[SPARK-33892][SQL] Display char/varchar in DESC and SHOW CREATE TABLE ### What changes were proposed in this pull request? Display char/varchar in - DESC table - DESC column - SHOW CREATE TABLE ### Why are the changes needed? show the correct definition for users ### Does this PR introduce _any_ user-facing change? yes, char/varchar column's will print char/varchar instead of string ### How was this patch tested? new tests Closes #30908 from yaooqinn/SPARK-33892. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:56:02 +00:00
Max Gekk	54a67842e6	[SPARK-33881][SQL][TESTS] Check null and empty string as partition values in DS v1 and v2 tests ### What changes were proposed in this pull request? Add tests to check handling `null` and `''` (empty string) as partition values in commands `SHOW PARTITIONS`, `ALTER TABLE .. ADD PARTITION`, `ALTER TABLE .. DROP PARTITION`. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly .ShowPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly .AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *.AlterTableDropPartitionSuite" ``` Closes #30893 from MaxGekk/partition-value-empty-string. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:54:53 +00:00
gengjiaan	3e9821edfd	[SPARK-33443][SQL] LEAD/LAG should support [ IGNORE NULLS \| RESPECT NULLS ] ### What changes were proposed in this pull request? The mainstream database support `[ IGNORE NULLS \| RESPECT NULLS ]` for `LEAD`/`LAG`/`NTH_VALUE`/`FIRST_VALUE`/`LAST_VALUE`. But the current implement of `LEAD`/`LAG` don't support this syntax. Oracle https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/LEAD.html#GUID-0A0481F1-E98F-4535-A739-FCCA8D1B5B77 Presto https://prestodb.io/docs/current/functions/window.html Redshift https://docs.aws.amazon.com/redshift/latest/dg/r_WF_LEAD.html DB2 https://www.ibm.com/support/knowledgecenter/SSGU8G_14.1.0/com.ibm.sqls.doc/ids_sqs_1513.htm Teradata https://docs.teradata.com/r/756LNiPSFdY~4JcCCcR5Cw/GjCT6l7trjkIEjt~7Dhx4w Snowflake https://docs.snowflake.com/en/sql-reference/functions/lead.html https://docs.snowflake.com/en/sql-reference/functions/lag.html ### Why are the changes needed? Support `[ IGNORE NULLS \| RESPECT NULLS ]` for `LEAD`/`LAG` is very useful. ### Does this PR introduce _any_ user-facing change? 'Yes'. ### How was this patch tested? Jenkins test. Closes #30387 from beliefer/SPARK-33443. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:13:48 +00:00
Yuming Wang	32d4a2b062	[SPARK-33861][SQL] Simplify conditional in predicate ### What changes were proposed in this pull request? This pr simplify conditional in predicate, after this change we can push down the filter to datasource: Expression \| After simplify -- \| -- IF(cond, trueVal, false) \| AND(cond, trueVal) IF(cond, trueVal, true) \| OR(NOT(cond), trueVal) IF(cond, false, falseVal) \| AND(NOT(cond), elseVal) IF(cond, true, falseVal) \| OR(cond, elseVal) CASE WHEN cond THEN trueVal ELSE false END \| AND(cond, trueVal) CASE WHEN cond THEN trueVal END \| AND(cond, trueVal) CASE WHEN cond THEN trueVal ELSE null END \| AND(cond, trueVal) CASE WHEN cond THEN trueVal ELSE true END \| OR(NOT(cond), trueVal) CASE WHEN cond THEN false ELSE elseVal END \| AND(NOT(cond), elseVal) CASE WHEN cond THEN false END \| false CASE WHEN cond THEN true ELSE elseVal END \| OR(cond, elseVal) CASE WHEN cond THEN true END \| cond ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30865 from wangyum/SPARK-33861. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 08:10:28 +00:00
Kent Yao	d7dc42d5f6	[SPARK-33895][SQL] Char and Varchar fail in MetaOperation of ThriftServer ### What changes were proposed in this pull request? ``` Caused by: java.lang.IllegalArgumentException: Unrecognized type name: CHAR(10) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:187) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:203) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:195) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4(SparkGetColumnsOperation.scala:99) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$4$adapted(SparkGetColumnsOperation.scala:98) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) ``` meta operation is targeting raw table schema, we need to handle these types there. ### Why are the changes needed? bugfix, see the above case ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests locally ![image](https://user-images.githubusercontent.com/8326978/103069196-cdfcc480-45f9-11eb-9c6a-d4c42123c6e3.png) Closes #30914 from yaooqinn/SPARK-33895. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 07:40:38 +00:00
Terry Kim	f1d3797291	[SPARK-33886][SQL] UnresolvedTable should retain SQL text position for DDL commands ### What changes were proposed in this pull request? Currently, there are many DDL commands where the position of the unresolved identifiers are incorrect: ``` scala> sql("MSCK REPAIR TABLE unknown") org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 0; ``` , whereas the `pos` should be 18. This PR proposes to fix this issue for commands using `UnresolvedTable`: ``` MSCK REPAIR TABLE t LOAD DATA LOCAL INPATH 'filepath' INTO TABLE t TRUNCATE TABLE t SHOW PARTITIONS t ALTER TABLE t RECOVER PARTITIONS ALTER TABLE t ADD PARTITION (p=1) ALTER TABLE t PARTITION (p=1) RENAME TO PARTITION (p=2) ALTER TABLE t DROP PARTITION (p=1) ALTER TABLE t SET SERDEPROPERTIES ('a'='b') COMMENT ON TABLE t IS 'hello'" ``` ### Why are the changes needed? To fix a bug. ### Does this PR introduce _any_ user-facing change? Yes, now the above example will print the following: ``` org.apache.spark.sql.AnalysisException: Table not found: unknown; line 1 pos 18; ``` ### How was this patch tested? Add a new suite of tests. Closes #30900 from imback82/position_Fix. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-24 05:21:39 +00:00
Yuanjian Li	86c1cfc579	[SPARK-33659][SS] Document the current behavior for DataStreamWriter.toTable API ### What changes were proposed in this pull request? Follow up work for #30521, document the following behaviors in the API doc: - Figure out the effects when configurations are (provider/partitionBy) conflicting with the existing table. - Document the lack of functionality on creating a v2 table, and guide that the users should ensure a table is created in prior to avoid the behavior unintended/insufficient table is being created. ### Why are the changes needed? We didn't have full support for the V2 table created in the API now. (TODO SPARK-33638) ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Document only. Closes #30885 from xuanyuanking/SPARK-33659. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-24 12:44:37 +09:00
offthewall123	61881bb698	[SPARK-33835][CORE] Refector AbstractCommandBuilder.buildJavaCommand: use firstNonEmpty ### What changes were proposed in this pull request? refector AbstractCommandBuilder.buildJavaCommand: use firstNonEmpty ### Why are the changes needed? For better code understanding, and firstNonEmpty can detect javaHome = " ", an empty string. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? End to End. Closes #30831 from offthewall123/refector_AbstractCommandBuilder. Authored-by: offthewall123 <dingyu.xu@intel.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-23 20:01:53 -06:00
Kent Yao	368a2c341d	[SPARK-33877][SQL][FOLLOWUP] SQL reference documents for INSERT w/ a column list ### What changes were proposed in this pull request? followup of `a3dd8dacee` via suggestion https://github.com/apache/spark/pull/30888#discussion_r547822642 ### Why are the changes needed? doc improvement ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? passing GA doc Closes #30909 from yaooqinn/SPARK-33877-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-23 15:38:32 -08:00
Dongjoon Hyun	d467d81726	[SPARK-33893][CORE] Exclude fallback block manager from executorList ### What changes were proposed in this pull request? This PR aims to exclude fallback block manager from `executorList` function. ### Why are the changes needed? When a fallback storage is used, the executors UI tab hangs because the executor list REST API result doesn't have `peakMemoryMetrics` of `ExecutorMetrics`. The root cause is that the block manager id used by fallback storage is included in the API result and it doesn't have `peakMemoryMetrics` because it's populated during HeartBeat reporting. We should hide it. ### Does this PR introduce _any_ user-facing change? No. This is a bug fix on UI. ### How was this patch tested? Manual. Run the following and visit Spark `executors` tab UI with browser. ``` bin/spark-shell -c spark.storage.decommission.fallbackStorage.path=file:///tmp/spark-storage/ ``` Closes #30911 from dongjoon-hyun/SPARK-33893. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-23 15:31:56 -08:00
Takuya UESHIN	5c9b421c37	[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends ### What changes were proposed in this pull request? This is a retry of #30177. This is not a complete fix, but it would take long time to complete (#30242). As discussed offline, at least using `ContextAwareIterator` should be helpful enough for many cases. As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends. ### Why are the changes needed? Python/Pandas UDF right after off-heap vectorized reader could cause executor crash. E.g.,: ```py spark.range(0, 100000, 1, 1).write.parquet(path) spark.conf.set("spark.sql.columnVector.offheap.enabled", True) def f(x): return 0 fUdf = udf(f, LongType()) spark.read.parquet(path).select(fUdf('id')).head() ``` This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests, and manually. Closes #30899 from ueshin/issues/SPARK-33277/context_aware_iterator. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-23 14:48:01 -08:00
Chandni Singh	0677c39009	[SPARK-32916][SHUFFLE][TEST-MAVEN][TEST-HADOOP2.7] Ensure the number of chunks in meta file and index file are equal ### What changes were proposed in this pull request? 1. Fixes for bugs in `RemoteBlockPushResolver` where the number of chunks in meta file and index file are inconsistent due to exceptions while writing to either index file or meta file. This java class was introduced in https://github.com/apache/spark/pull/30062. - If the writing to index file fails, the position of meta file is not reset. This means that the number of chunks in meta file is inconsistent with index file. - During the exception handling while writing to index/meta file, we just set the pointer to the start position. If the files are closed just after this then it doesn't get rid of any the extra bytes written to it. 2. Adds an IOException threshold. If the `RemoteBlockPushResolver` encounters IOExceptions greater than this threshold while updating data/meta/index file of a shuffle partition, then it responds to the client with exception- `IOExceptions exceeded the threshold` so that client can stop pushing data for this shuffle partition. 3. When the update to metadata fails, exception is not propagated back to the client. This results in the increased size of the current chunk. However, with (2) in place, the current chunk will still be of a manageable size. ### Why are the changes needed? This fix is needed for the bugs mentioned above. 1. Moved writing to meta file after index file. This fixes the issue because if there is an exception writing to meta file, then the index file position is not updated. With this change, if there is an exception writing to index file, then none of the files are effectively updated and the same is true vice-versa. 2. Truncating the lengths of data/index/meta files when the partition is finalized. 3. When the IOExceptions have reached the threshold, it is most likely that future blocks will also face the issue. So, it is better to let the clients know so that they can stop pushing the blocks for that partition. 4. When just the meta update fails, client retries pushing the block which was successfully merged to data file. This can be avoided by letting the chunk grow slightly. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests for all the bugs and threshold. Closes #30433 from otterc/SPARK-32916-followup. Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>	2020-12-23 12:42:18 -06:00
Dongjoon Hyun	47d1aa4e93	[SPARK-33891][DOCS][CORE] Update dynamic allocation related documents ### What changes were proposed in this pull request? This PR aims to update the followings. - Remove the outdated requirement for `spark.shuffle.service.enabled` in `configuration.md` - Dynamic allocation section in `job-scheduling.md` ### Why are the changes needed? To make the document up-to-date. ### Does this PR introduce _any_ user-facing change? No, it's a documentation update. ### How was this patch tested? Manual. BEFORE ![Screen Shot 2020-12-23 at 2 22 04 AM](https://user-images.githubusercontent.com/9700541/102986441-ae647f80-44c5-11eb-97a3-87c2d368952a.png) ![Screen Shot 2020-12-23 at 2 22 34 AM](https://user-images.githubusercontent.com/9700541/102986473-bcb29b80-44c5-11eb-8eae-6802001c6dfa.png) AFTER ![Screen Shot 2020-12-23 at 2 25 36 AM](https://user-images.githubusercontent.com/9700541/102986767-2df24e80-44c6-11eb-8540-e74856a4c313.png) ![Screen Shot 2020-12-23 at 2 21 13 AM](https://user-images.githubusercontent.com/9700541/102986366-8e34c080-44c5-11eb-8054-1efd07c9458c.png) Closes #30906 from dongjoon-hyun/SPARK-33891. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 23:43:21 +09:00
Yuming Wang	7ffcfcf7db	[SPARK-33847][SQL] Simplify CaseWhen if elseValue is None ### What changes were proposed in this pull request? 1. Enhance `ReplaceNullWithFalseInPredicate` to replace None of elseValue inside `CaseWhen` with `FalseLiteral` if all branches are `FalseLiteral` . The use case is: ```sql create table t1 using parquet as select id from range(10); explain select id from t1 where (CASE WHEN id = 1 THEN 'a' WHEN id = 3 THEN 'b' end) = 'c'; ``` Before this pr: ``` == Physical Plan == (1) Filter CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [CASE WHEN (id#1L = 1) THEN false WHEN (id#1L = 3) THEN false END], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this pr: ``` == Physical Plan == LocalTableScan <empty>, [id#1L] ``` 2. Enhance `SimplifyConditionals` if elseValue is None and all outputs are null. ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30852 from wangyum/SPARK-33847. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 14:35:46 +00:00
Max Gekk	303df64b46	[SPARK-33889][SQL] Fix NPE from `SHOW PARTITIONS` on V2 tables ### What changes were proposed in this pull request? At `ShowPartitionsExec.run()`, check that a row returned by `listPartitionIdentifiers()` contains a `null` field, and convert it to `"null"`. ### Why are the changes needed? Because `SHOW PARTITIONS` throws NPE on V2 table with `null` partition values. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? Added new UT to `v2.ShowPartitionsSuite`. Closes #30904 from MaxGekk/fix-npe-show-partitions. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 14:34:01 +00:00
Max Gekk	cc23581e26	[SPARK-33858][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RENAME PARTITION tests ### What changes were proposed in this pull request? 1. Move the `ALTER TABLE .. RENAME PARTITION` parsing tests to `AlterTableRenamePartitionParserSuite` 2. Place the v1 tests for `ALTER TABLE .. RENAME PARTITION` from `DDLSuite` to `v1.AlterTableRenamePartitionSuite` and v2 tests from `AlterTablePartitionV2SQLSuite` to `v2.AlterTableRenamePartitionSuite`, so, the tests will run for V1, Hive V1 and V2 DS. ### Why are the changes needed? - The unification will allow to run common `ALTER TABLE .. RENAME PARTITION` tests for both DSv1 and Hive DSv1, DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRenamePartitionParserSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRenamePartitionSuite" ``` Closes #30863 from MaxGekk/unify-rename-partition-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 12:19:07 +00:00
ulysses-you	f421c172d9	[SPARK-33497][SQL] Override maxRows in some LogicalPlan ### What changes were proposed in this pull request? This PR aims to override maxRows method in these follow `LogicalPlan`: * `ReturnAnswer` * `Join` * `Range` * `Sample` * `RepartitionOperation` * `Deduplicate` * `LocalRelation` * `Window` ### Why are the changes needed? 1. Logically, we know the max rows info with these `LogicalPlan`. 2. Before this PR, we already have some max rows with `LogicalPlan`, so we can eliminate limit with more case if we expand more. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #30443 from ulysses-you/SPARK-33497. Lead-authored-by: ulysses-you <youxiduo@weidian.com> Co-authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 09:20:49 +00:00
Max Gekk	34bfb3a31d	[SPARK-33787][SQL] Allow partition purge for v2 tables ### What changes were proposed in this pull request? 1. Add new methods `purgePartition()`/`purgePartitions()` to the interfaces `SupportsPartitionManagement`/`SupportsAtomicPartitionManagement`. 2. Default implementation of new methods throw the exception `UnsupportedOperationException`. 3. Add tests for new methods to `SupportsPartitionManagementSuite`/`SupportsAtomicPartitionManagementSuite`. 4. Add `ALTER TABLE .. DROP PARTITION` tests for DS v1 and v2. Closes #30776 Closes #30821 ### Why are the changes needed? Currently, the `PURGE` option that user can set in `ALTER TABLE .. DROP PARTITION` is completely ignored. We should pass this flag to the catalog implementation, so, the catalog should decide how to handle the flag. ### Does this PR introduce _any_ user-facing change? The changes can impact on behavior of `ALTER TABLE .. DROP PARTITION` for v2 tables. ### How was this patch tested? By running the affected test suites, for instance: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite" ``` Closes #30886 from MaxGekk/purge-partition. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-23 09:09:48 +00:00
HyukjinKwon	d98c216e19	[SPARK-31960][YARN][DOCS][FOLLOW-UP] Document the behaviour change of Hadoop's classpath propagation in migration guide ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/28788, and proposes to update migration guide. ### Why are the changes needed? To tell users about the behaviour change. ### Does this PR introduce _any_ user-facing change? Yes, it updates migration guides for users. ### How was this patch tested? GitHub Actions' documentation build should test it. Closes #30903 from HyukjinKwon/SPARK-31960-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 18:04:28 +09:00
Kent Yao	2287f56a3e	[SPARK-33879][SQL] Char Varchar values fails w/ match error as partition columns ### What changes were proposed in this pull request? ```sql spark-sql> select * from t10 where c0='abcd'; 20/12/22 15:43:38 ERROR SparkSQLDriver: Failed in [select * from t10 where c0='abcd'] scala.MatchError: CharType(10) (of class org.apache.spark.sql.types.CharType) at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:815) at org.apache.spark.sql.catalyst.expressions.CastBase.cast$lzycompute(Cast.scala:842) at org.apache.spark.sql.catalyst.expressions.CastBase.cast(Cast.scala:842) at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:844) at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:476) at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.$anonfun$toRow$2(interface.scala:164) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at org.apache.spark.sql.types.StructType.map(StructType.scala:102) at org.apache.spark.sql.catalyst.catalog.CatalogTablePartition.toRow(interface.scala:158) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3(ExternalCatalogUtils.scala:157) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogUtils$.$anonfun$prunePartitionsByFilter$3$adapted(ExternalCatalogUtils.scala:156) ``` c0 is a partition column, it fails in the partition pruning rule In this PR, we relace char/varchar w/ string type before the CAST happends ### Why are the changes needed? bugfix, see the case above ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? yes, new tests Closes #30887 from yaooqinn/SPARK-33879. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 16:14:27 +09:00
ulysses-you	e853f068f6	[SPARK-33526][SQL][FOLLOWUP] Fix flaky test due to timeout and fix docs ### What changes were proposed in this pull request? Make test stable and fix docs. ### Why are the changes needed? Query timeout sometime since we set an another config after set query timeout. ``` sbt.ForkMain$ForkError: java.sql.SQLTimeoutException: Query timed out after 0 seconds at org.apache.hive.jdbc.HiveStatement.waitForOperationToComplete(HiveStatement.java:381) at org.apache.hive.jdbc.HiveStatement.execute(HiveStatement.java:254) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$13(ThriftServerWithSparkContextSuite.scala:107) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$13$adapted(ThriftServerWithSparkContextSuite.scala:106) at scala.collection.immutable.List.foreach(List.scala:392) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$12(ThriftServerWithSparkContextSuite.scala:106) at org.apache.spark.sql.hive.thriftserver.ThriftServerWithSparkContextSuite.$anonfun$$init$$12$adapted(ThriftServerWithSparkContextSuite.scala:89) at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$withJdbcStatement$4(SharedThriftServer.scala:95) at org.apache.spark.sql.hive.thriftserver.SharedThriftServer.$anonfun$withJdbcStatement$4$adapted(SharedThriftServer.scala:95) ``` The reason is: 1. we execute `set spark.sql.thriftServer.queryTimeout = 1`, then all the option will be limited in 1s. 2. we execute `set spark.sql.thriftServer.interruptOnCancel = false/true`. This sql will get timeout exception if there is something hung within 1s. It's not our expected. Reset the timeout before we do the step2 can avoid this problem. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Fix test. Closes #30897 from ulysses-you/SPARK-33526-followup. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-22 22:43:03 -08:00
Dongjoon Hyun	90d6f86001	[SPARK-33870][CORE] Enable spark.storage.replication.proactive by default ### What changes were proposed in this pull request? This PR aims to enable `spark.storage.replication.proactive` by default for Apache Spark 3.2.0. ### Why are the changes needed? `spark.storage.replication.proactive` is added by SPARK-15355 at Apache Spark 2.2.0 and has been helpful when the block manager loss occurs frequently like K8s environment. ### Does this PR introduce _any_ user-facing change? Yes, this will make the Spark jobs more robust. ### How was this patch tested? Pass the existing UTs. Closes #30876 from dongjoon-hyun/SPARK-33870. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-22 21:59:53 -08:00
Takeshi Yamamuro	ea37717f7c	[SPARK-32106][SQL][FOLLOWUP] Fix flaky tests in transform.sql ### What changes were proposed in this pull request? This PR intends to fix flaky GitHub Actions (GA) tests below in `transform.sql` (this flakiness does not seem to happen in the Jenkins tests): - https://github.com/apache/spark/runs/1592987501 - https://github.com/apache/spark/runs/1593196242 - https://github.com/apache/spark/runs/1595496305 - https://github.com/apache/spark/runs/1596309555 This is because the error message is different between test runs in GA (the error message seems to be truncated indeterministically) ,e.g., ``` # https://github.com/apache/spark/runs/1592987501 Expected "...h status 127. Error:[ /bin/bash: some_non_existent_command: command not found]", but got "...h status 127. Error:[]" Result did not match for query #2 # https://github.com/apache/spark/runs/1593196242 Expected "...istent_command: comm[and not found]", but got "...istent_command: comm[]" Result did not match for query #2 ``` The root cause of this indeterministic behaviour happening only in GA is not clear though, this test throws SparkException consistently even in GA. So, this PR proposes to make the test just check if it will be thrown when running it. This PR comes from the dongjoon-hyun comment: https://github.com/apache/spark/pull/29414/files#r547414513 ### Why are the changes needed? Bugfix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added tests. Closes #30896 from maropu/SPARK-32106-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 13:50:05 +09:00
Kent Yao	a3dd8dacee	[SPARK-33877][SQL] SQL reference documents for INSERT w/ a column list We support a column list of INSERT for Spark v3.1.0 (See: SPARK-32976 (https://github.com/apache/spark/pull/29893)). So, this PR targets at documenting it in the SQL documents. ### What changes were proposed in this pull request? improve doc ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? doc ### How was this patch tested? passing GA doc gen. ![image](https://user-images.githubusercontent.com/8326978/102954876-8994fa00-450f-11eb-81f9-931af6d1f69b.png) ![image](https://user-images.githubusercontent.com/8326978/102954900-99acd980-450f-11eb-9733-115ad37d2319.png) ![image](https://user-images.githubusercontent.com/8326978/102954935-af220380-450f-11eb-9aaa-fdae0725d41e.png) ![image](https://user-images.githubusercontent.com/8326978/102954949-bc3ef280-450f-11eb-8a0d-d7b688efa7bb.png) Closes #30888 from yaooqinn/SPARK-33877. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-22 19:46:37 -08:00
Wenchen Fan	ec1560af25	[SPARK-33364][SQL][FOLLOWUP] Refine the catalog v2 API to purge a table ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30267 Inspired by https://github.com/apache/spark/pull/30886, it's better to have 2 methods `def dropTable` and `def purgeTable`, than `def dropTable(ident)` and `def dropTable(ident, purge)`. ### Why are the changes needed? 1. make the APIs orthogonal. Previously, `def dropTable(ident, purge)` calls `def dropTable(ident)` and is a superset. 2. simplifies the catalog implementation a little bit. Now the `if (purge) ... else ...` check is done at the Spark side. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? existing tests Closes #30890 from cloud-fan/purgeTable. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 11:47:13 +09:00
Erik Krogen	303b8c8773	[SPARK-23862][SQL] Support Java enums from Scala Dataset API ### What changes were proposed in this pull request? Add support for Java Enums (`java.lang.Enum`) from the Scala typed Dataset APIs. This involves adding an implicit for `Encoder` creation in `SQLImplicits`, and updating `ScalaReflection` to handle Java Enums on the serialization and deserialization pathways. Enums are mapped to a `StringType` which is just the name of the Enum value. ### Why are the changes needed? In [SPARK-21255](https://issues.apache.org/jira/browse/SPARK-21255), support for (de)serialization of Java Enums was added, but only when called from Java code. It is common for Scala code to rely on Java libraries that are out of control of the Scala developer. Today, if there is a dependency on some Java code which defines an Enum, it would be necessary to define a corresponding Scala class. This change brings closer feature parity between Scala and Java APIs. ### Does this PR introduce _any_ user-facing change? Yes, previously something like: ``` val ds = Seq(MyJavaEnum.VALUE1, MyJavaEnum.VALUE2).toDS // or val ds = Seq(CaseClass(MyJavaEnum.VALUE1), CaseClass(MyJavaEnum.VALUE2)).toDS ``` would fail. Now, it will succeed. ### How was this patch tested? Additional unit tests are added in `DatasetSuite`. Tests include validating top-level enums, enums inside of case classes, enums inside of arrays, and validating that the Enum is stored as the expected string. Closes #30877 from xkrogen/xkrogen-SPARK-23862-scalareflection-java-enums. Lead-authored-by: Erik Krogen <xkrogen@apache.org> Co-authored-by: Fangshi Li <fli@linkedin.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-22 09:55:33 -08:00
Enrico Minack	1d450250eb	[BUILD][MINOR] Do not publish snapshots from forks ### What changes were proposed in this pull request? The GitHub workflow `Publish Snapshot` publishes master and 3.1 branch via Nexus. For this, the workflow uses `secrets.NEXUS_USER` and `secrets.NEXUS_PW` secrets. These are not available in forks where this workflow fails every day: - https://github.com/G-Research/spark/actions/runs/431626797 - https://github.com/G-Research/spark/actions/runs/433153049 - https://github.com/G-Research/spark/actions/runs/434680048 - https://github.com/G-Research/spark/actions/runs/436958780 ### Why are the changes needed? Avoid attempting to publish snapshots from forked repositories. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Code review only. Closes #30884 from EnricoMi/branch-do-not-publish-snapshots-from-forks. Authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-23 00:22:42 +09:00
Kent Yao	6da5cdf1db	[SPARK-33876][SQL] Add length-check for reading char/varchar from tables w/ a external location ### What changes were proposed in this pull request? This PR adds the length check to the existing ApplyCharPadding rule. Tables will have external locations when users execute SET LOCATION or CREATE TABLE ... LOCATION. If the location contains over length values we should FAIL ON READ. ### Why are the changes needed? ```sql spark-sql> INSERT INTO t2 VALUES ('1', 'b12345'); Time taken: 0.141 seconds spark-sql> alter table t set location '/tmp/hive_one/t2'; Time taken: 0.095 seconds spark-sql> select * from t; 1 b1234 ``` the above case should fail rather than implicitly applying truncation ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #30882 from yaooqinn/SPARK-33876. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 14:24:12 +00:00
Max Gekk	84bf07bbd7	[SPARK-33878][SQL][TESTS] Fix resolving of `spark_catalog` in v1 Hive catalog tests ### What changes were proposed in this pull request? 1. Recognize `spark_catalog` as the default session catalog in the checks of `TestHiveQueryExecution`. 2. Move v2 and v1 in-memory catalog test `"SPARK-33305: DROP TABLE should also invalidate cache"` to the common trait `command/DropTableSuiteBase`, and run it with v1 Hive external catalog. ### Why are the changes needed? To run In-memory catalog tests in Hive catalog. ### Does this PR introduce _any_ user-facing change? No, the changes influence only on tests. ### How was this patch tested? By running the affected test suites for `DROP TABLE`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *DropTableSuite" ``` Closes #30883 from MaxGekk/fix-spark_catalog-hive-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 12:37:16 +00:00
Jacob Kim	43a562035c	[SPARK-33846][SQL] Include Comments for a nested schema in StructType.toDDL ### What changes were proposed in this pull request? ```scala val nestedStruct = new StructType() .add(StructField("b", StringType).withComment("Nested comment")) val struct = new StructType() .add(StructField("a", nestedStruct).withComment("comment")) struct.toDDL ``` Currently, returns: ``` `a` STRUCT<`b`: STRING> COMMENT 'comment'` ``` With this PR, the code above returns: ``` `a` STRUCT<`b`: STRING COMMENT 'Nested comment'> COMMENT 'comment'` ``` ### Why are the changes needed? My team is using nested columns as first citizens, and I thought it would be nice to have comments for nested columns. ### Does this PR introduce _any_ user-facing change? Now, when users call something like this, ```scala spark.table("foo.bar").schema.fields.map(_.toDDL).mkString(", ") ``` they will get comments for the nested columns. ### How was this patch tested? I added unit tests under `org.apache.spark.sql.types.StructTypeSuite`. They test if nested StructType's comment is included in the DDL string. Closes #30851 from jacobhjkim/structtype-toddl. Authored-by: Jacob Kim <me@jacobkim.io> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-22 17:55:16 +09:00
Anton Okolnychyi	7bbcbb84c2	[SPARK-33784][SQL] Rename dataSourceRewriteRules batch ### What changes were proposed in this pull request? This PR tries to rename `dataSourceRewriteRules` into something more generic. ### Why are the changes needed? These changes are needed to address the post-review discussion [here](https://github.com/apache/spark/pull/30558#discussion_r533885837). ### Does this PR introduce _any_ user-facing change? Yes but the changes haven't been released yet. ### How was this patch tested? Existing tests. Closes #30808 from aokolnychyi/spark-33784. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 08:29:22 +00:00
Anton Okolnychyi	2562183987	[SPARK-33808][SQL] DataSource V2: Build logical writes in the optimizer ### What changes were proposed in this pull request? This PR adds logic to build logical writes introduced in SPARK-33779. Note: This PR contains a subset of changes discussed in PR #29066. ### Why are the changes needed? These changes are the next step as discussed in the [design doc](https://docs.google.com/document/d/1X0NsQSryvNmXBY9kcvfINeYyKC-AahZarUqg3nS1GQs/edit#) for SPARK-23889. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30806 from aokolnychyi/spark-33808. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 08:23:56 +00:00
ulysses-you	1dd63dccd8	[SPARK-33860][SQL] Make CatalystTypeConverters.convertToCatalyst match special Array value ### What changes were proposed in this pull request? Add some case to match Array whose element type is primitive. ### Why are the changes needed? We will get exception when use `Literal.create(Array(1, 2, 3), ArrayType(IntegerType))` . ``` Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Literal must have a corresponding value to array<int>, but class int[] found. at scala.Predef$.require(Predef.scala:281) at org.apache.spark.sql.catalyst.expressions.Literal$.validateLiteralValue(literals.scala:215) at org.apache.spark.sql.catalyst.expressions.Literal.<init>(literals.scala:292) at org.apache.spark.sql.catalyst.expressions.Literal$.create(literals.scala:140) ``` And same problem with other array whose element is primitive. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? Add test. Closes #30868 from ulysses-you/SPARK-33860. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-22 15:10:46 +09:00
yangjie01	b88745565b	[SPARK-33700][SQL] Avoid file meta reading when enableFilterPushDown is true and filters is empty for Orc ### What changes were proposed in this pull request? Orc support filter push down optimization, but this optimization will read file meta from external storage even if filters is empty. This pr add a extra `filters.nonEmpty` when `spark.sql.orc.filterPushdown` is true ### Why are the changes needed? Orc filters push down operation should only triggered when `filters.nonEmpty` is true ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30663 from LuciferYang/pushdownfilter-when-filter-nonempty. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 20:24:23 -08:00
Dongjoon Hyun	16ae3a5c12	[MINOR][CORE] Remove unused variable CompressionCodec.DEFAULT_COMPRESSION_CODEC ### What changes were proposed in this pull request? This PR removed an unused variable `CompressionCodec.DEFAULT_COMPRESSION_CODEC`. ### Why are the changes needed? Apache Spark 3.0.0 centralized this default value to `IO_COMPRESSION_CODEC.defaultValue` via [SPARK-26462](https://github.com/apache/spark/pull/23447). We had better remove this variable to avoid any potential confusion in the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI compilation. Closes #30880 from dongjoon-hyun/minor. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 19:48:58 -08:00
Kent Yao	f5fd10b1bc	[SPARK-33834][SQL] Verify ALTER TABLE CHANGE COLUMN with Char and Varchar ### What changes were proposed in this pull request? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected change For v1 table, changing type is not allowed, we fix a regression that uses the replaced string instead of the original char/varchar type when altering char/varchar columns For v2 table, char/varchar to string, char(x) to char(x), char(x)/varchar(x) to varchar(y) if x <=y are valid cases, other changes are invalid ### Why are the changes needed? Verify ALTER TABLE CHANGE COLUMN with Char and Varchar and avoid unexpected change ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes #30833 from yaooqinn/SPARK-33834. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-22 03:07:26 +00:00
angerszhu	7466031632	[SPARK-32106][SQL] Implement script transform in sql/core ### What changes were proposed in this pull request? * Implement `SparkScriptTransformationExec` based on `BaseScriptTransformationExec` * Implement `SparkScriptTransformationWriterThread` based on `BaseScriptTransformationWriterThread` of writing data * Add rule `SparkScripts` to support convert script LogicalPlan to SparkPlan in Spark SQL (without hive mode) * Add `SparkScriptTransformationSuite` test spark spec case * add test in `SQLQueryTestSuite` And we will close #29085 . ### Why are the changes needed? Support user use Script Transform without Hive ### Does this PR introduce _any_ user-facing change? User can use Script Transformation without hive in no serde mode. Such as : default no serde ``` SELECT TRANSFORM(a, b, c) USING 'cat' AS (a int, b string, c long) FROM testData ``` no serde with spec ROW FORMAT DELIMITED ``` SELECT TRANSFORM(a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\u0002' MAP KEYS TERMINATED BY '\u0003' LINES TERMINATED BY '\n' NULL DEFINED AS 'null' USING 'cat' AS (a, b, c) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' COLLECTION ITEMS TERMINATED BY '\u0004' MAP KEYS TERMINATED BY '\u0005' LINES TERMINATED BY '\n' NULL DEFINED AS 'NULL' FROM testData ``` ### How was this patch tested? Added UT Closes #29414 from AngersZhuuuu/SPARK-32106-MINOR. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-22 11:37:59 +09:00
Dongjoon Hyun	f62e957b31	[SPARK-33873][CORE][TESTS] Test all compression codecs with encrypted spilling ### What changes were proposed in this pull request? This PR aims to test all compression codecs for encrypted spilling. ### Why are the changes needed? To improve test coverage. Currently, only `CompressionCodec.DEFAULT_COMPRESSION_CODEC` is under testing. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs with the updated test cases. Closes #30879 from dongjoon-hyun/SPARK-33873. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 16:35:04 -08:00
Kyle Krueger	0bf3828ac4	[MINOR] update dstream.py with more accurate exceptions ### What changes were proposed in this pull request? Reopened from https://github.com/apache/spark/pull/27525. The exception messages for dstream.py when using windows were improved to be specific about what sliding duration is important. ### Why are the changes needed? The batch interval of dstreams are improperly named as sliding windows. The term sliding window is also used to reference the new window of a dstream collected over a window of rdds in a parent dstream. We should probably fix the naming convention of sliding window used in the dstream class, but for now more this more explicit exception message may reduce confusion. ### Does this PR introduce any user-facing change? No ### How was this patch tested? It wasn't since this is only a change of the exception message Closes #30871 from kykrueger/kykrueger-patch-1. Authored-by: Kyle Krueger <kyle.s.krueger@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 14:17:09 -08:00
HyukjinKwon	4106731fdd	[SPARK-33836][SS][PYTHON][FOLLOW-UP] Use test utils and clean up doctests in table and toTable ### What changes were proposed in this pull request? This PR proposes to: - Make doctests simpler to show the usage (since we're not running them now). - Use the test utils to drop the tables if exists. ### Why are the changes needed? Better docs and code readability. ### Does this PR introduce _any_ user-facing change? No, dev-only. It includes some doc changes in unreleased branches. ### How was this patch tested? Manually tested. ```bash cd python ./run-tests --python-executable=python3.9,python3.8 --testnames "pyspark.sql.tests.test_streaming StreamingTests" ``` Closes #30873 from HyukjinKwon/SPARK-33836. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>	2020-12-22 06:27:27 +09:00
HyukjinKwon	38bbccab75	[SPARK-33869][PYTHON][SQL][TESTS] Have a separate metastore directory for each PySpark test job ### What changes were proposed in this pull request? This PR proposes to have its own metastore directory to avoid potential conflict in catalog operations. ### Why are the changes needed? To make PySpark tests less flaky. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? Manually tested by trying some sleeps in https://github.com/apache/spark/pull/30873. Closes #30875 from HyukjinKwon/SPARK-33869. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 11:11:25 -08:00
Yuming Wang	1c77605682	[SPARK-33848][SQL] Push the UnaryExpression into (if / case) branches ### What changes were proposed in this pull request? This pr push the `UnaryExpression` into (if / case) branches. The use case is: ```sql create table t1 using parquet as select id from range(10); explain select id from t1 where (CASE WHEN id = 1 THEN '1' WHEN id = 3 THEN '2' end) > 3; ``` Before this pr: ``` == Physical Plan == (1) Filter (cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3) +- (1) ColumnarToRow +- FileScan parquet default.t1[id#1L] Batched: true, DataFilters: [(cast(CASE WHEN (id#1L = 1) THEN 1 WHEN (id#1L = 3) THEN 2 END as int) > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/yumwang/opensource/spark/spark-warehouse/org.apache.spark.sql.DataF..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<id:bigint> ``` After this pr: ``` == Physical Plan == LocalTableScan <empty>, [id#1L] ``` This change can also improve this case: `a78d6ce376/sql/core/src/test/resources/tpcds/q62.sql (L5-L22)` ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30853 from wangyum/SPARK-33848. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-12-21 10:25:23 -08:00
Max Gekk	661ac10901	[SPARK-33838][SQL][DOCS] Comment the `PURGE` option in the DropTable and in AlterTableDropPartition commands ### What changes were proposed in this pull request? Add comments for the `PURGE` option to the logical nodes `DropTable` and `AlterTableDropPartition`. ### Why are the changes needed? To improve code maintenance. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle` Closes #30837 from MaxGekk/comment-purge-logical-node. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-21 14:06:31 +00:00

... 4 5 6 7 8 ...

29158 commits