ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
angerszhu	55ce49ed28	[SPARK-32400][SQL][TEST][FOLLOWUP][TEST-MAVEN] Fix resource loading error in HiveScripTransformationSuite ### What changes were proposed in this pull request? #29401 move `test_script.py` from sql/hive module to sql/core module, cause HiveScripTransformationSuite load resource issue. ### Why are the changes needed? This issue cause jenkins test failed in mvn spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/ spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/ spark-master-test-maven-hadoop-3.2-hive-2.3: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3/ ![image](https://user-images.githubusercontent.com/46485123/91681585-71285a80-eb81-11ea-8519-99fc9783d6b9.png) ![image](https://user-images.githubusercontent.com/46485123/91681010-aaf86180-eb7f-11ea-8dbb-61365a3b0ab4.png) Error as below: ``` Exception thrown while executing Spark plan: HiveScriptTransformation [a#349299, b#349300, c#349301, d#349302, e#349303], python /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py, [a#349309, b#349310, c#349311, d#349312, e#349313], ScriptTransformationIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false) +- Project [_1#349288 AS a#349299, _2#349289 AS b#349300, _3#349290 AS c#349301, _4#349291 AS d#349302, 
_5#349292 AS e#349303] +- LocalTableScan [_1#349288, _2#349289, _3#349290, _4#349291, _5#349292] == Exception == org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18021.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18021.0 (TID 37324) (192.168.10.31 executor driver): org.apache.spark.SparkException: Subprocess exited with status 2. Error: python: can't open file '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py': [Errno 2] No such file or directory at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate(BaseScriptTransformationExec.scala:180) at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate$(BaseScriptTransformationExec.scala:157) at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec.checkFailureAndPropagate(HiveScriptTransformationExec.scala:49) at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec$$anon$1.hasNext(HiveScriptTransformationExec.scala:110) at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458) at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898) at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373) at org.apache.spark.rdd.RDD.iterator(RDD.scala:337) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426) at o 
``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing UTs. Closes #29588 from AngersZhuuuu/SPARK-32400-FOLLOWUP. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-02 18:27:29 +09:00
liwensun	f0851e95c6	[SPARK-32776][SS] Limit in streaming should not be optimized away by PropagateEmptyRelation ### What changes were proposed in this pull request? PropagateEmptyRelation will not be applied to LIMIT operators in streaming queries. ### Why are the changes needed? Right now, the limit operator in a streaming query may get optimized away when the relation is empty. This can be problematic for stateful streaming, as this empty batch will not write any state store files, and the next batch will fail when trying to read these state store files and throw a file not found error. We should not let PropagateEmptyRelation optimize away the Limit operator for streaming queries. This PR is intended as a small and safe fix for PropagateEmptyRelation. A fundamental fix that can prevent this from happening again in the future and in other optimizer rules is more desirable, but that's a much larger task. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? unit tests. Closes #29623 from liwensun/spark-32776. Authored-by: liwensun <liwen.sun@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-02 18:05:06 +09:00
Yuming Wang	54348dbd21	[SPARK-32767][SQL] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number ### What changes were proposed in this pull request? Bucket join should work if `spark.sql.shuffle.partitions` larger than bucket number, such as: ```scala spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1") spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2") sql("set spark.sql.shuffle.partitions=600") sql("set spark.sql.autoBroadcastJoinThreshold=-1") sql("select * from t1 join t2 on t1.id = t2.id").explain() ``` Before this pr: ``` == Physical Plan == (5) SortMergeJoin [id#26L], [id#27L], Inner :- (2) Sort [id#26L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#26L, 600), true : +- (1) Filter isnotnull(id#26L) : +- (1) ColumnarToRow : +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432 +- (4) Sort [id#27L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#27L, 600), true +- (3) Filter isnotnull(id#27L) +- (3) ColumnarToRow +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34 ``` After this pr: ``` == Physical Plan == (4) SortMergeJoin [id#26L], [id#27L], Inner :- (1) Sort [id#26L ASC NULLS FIRST], false, 0 : +- (1) Filter isnotnull(id#26L) : +- (1) ColumnarToRow : +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432 +- (3) Sort [id#27L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#27L, 432), true +- (2) Filter isnotnull(id#27L) +- (2) ColumnarToRow +- FileScan parquet 
default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34 ``` ### Why are the changes needed? Spark 2.4 supports this. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #29612 from wangyum/SPARK-32767. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-02 04:16:20 +00:00
Kousuke Saruta	812d0918a8	[SPARK-32771][DOCS] The example of expressions.Aggregator in Javadoc / Scaladoc is wrong ### What changes were proposed in this pull request? This PR modifies an example for `expressions.Aggregator` in Javadoc and Scaladoc. Definitions of `bufferEncoder` and `outputEncoder` are added. ### Why are the changes needed? To correct the example. The current example is wrong and doesn't work because `bufferEncoder` and `outputEncoder` are not defined. ### Does this PR introduce _any_ user-facing change? Yes. Before this change, the scaladoc and javadoc look as follows. ![wrong-example-java](https://user-images.githubusercontent.com/4736016/91897528-5ebf3580-ecd5-11ea-8d7b-e846b776ebbb.png) ![wrong-example](https://user-images.githubusercontent.com/4736016/91897509-58c95480-ecd5-11ea-81a3-98774083b689.png) After this change, the docs look as follows. ![fixed-example-java](https://user-images.githubusercontent.com/4736016/91897592-78607d00-ecd5-11ea-9e55-03fd9c9c6b54.png) ![fixed-example](https://user-images.githubusercontent.com/4736016/91897609-7c8c9a80-ecd5-11ea-837e-9dbcada6cd53.png) ### How was this patch tested? Built with `build/sbt unidoc`, checked the generated javadoc/scaladoc, and took the screenshots above. Closes #29617 from sarutak/fix-aggregator-doc. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-02 10:03:07 +09:00
Kousuke Saruta	31be672c91	[SPARK-32774][BUILD] Don't track docs/.jekyll-cache ### What changes were proposed in this pull request? This PR changes .gitignore not to track docs/.jekyll-cache. ### Why are the changes needed? When I build docs, docs/.jekyll-cache can be created and it should not be tracked. ``` $ git status On branch master Your branch is up to date with 'origin/master'. Untracked files: (use "git add <file>..." to include in what will be committed) docs/.jekyll-cache/ ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Applied the change and confirmed the result of `git status` ``` $ git status On branch untrack-jekyll-cache nothing to commit, working tree clean ``` Closes #29622 from sarutak/untrack-jekyll-cache. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-02 09:43:32 +09:00
Zhenhua Wang	2a88a20271	[SPARK-32754][SQL][TEST] Unify to `assertEqualJoinPlans` for join reorder suites ### What changes were proposed in this pull request? Now three join reorder suites(`JoinReorderSuite`, `StarJoinReorderSuite`, `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method and the logic is almost the same. We can extract the method to a single place for code simplicity. ### Why are the changes needed? To reduce code redundancy. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Covered by existing tests. Closes #29594 from wzhfy/unify_assertEqualPlans_joinReorder. Authored-by: Zhenhua Wang <wzh_zju@163.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-09-01 09:08:35 -07:00
Linhong Liu	a410658c9b	[SPARK-32761][SQL] Allow aggregating multiple foldable distinct expressions ### What changes were proposed in this pull request? For queries with multiple foldable distinct columns, since they will be eliminated during execution, it's not mandatory to let `RewriteDistinctAggregates` handle this case. And in the current code, `RewriteDistinctAggregates` does miss some "aggregating with multiple foldable distinct expressions" cases. For example: `select count(distinct 2), count(distinct 2, 3)` will be missed. But in the planner, this will trigger an error that "multiple distinct expressions" are not allowed. As the foldable distinct columns are eventually eliminated, we can allow this in the aggregation planner check. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? added test case Closes #29607 from linhongliu-db/SPARK-32761. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 13:04:24 +00:00
Wenchen Fan	fea9360ae7	[SPARK-32757][SQL][FOLLOW-UP] Use child's output for canonicalization in SubqueryBroadcastExec ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/29601 , to fix a small mistake in `SubqueryBroadcastExec`. `SubqueryBroadcastExec.doCanonicalize` should canonicalize the build keys with the query output, not the `SubqueryBroadcastExec.output`. ### Why are the changes needed? fix mistake ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing test Closes #29610 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 12:54:40 +00:00
Huaxin Gao	e1dbc85c72	[SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder ### What changes were proposed in this pull request? Add JDBCScan, JDBCScanBuilder, JDBCWriteBuilder in Datasource V2 JDBC ### Why are the changes needed? Complete Datasource V2 JDBC implementation ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? new tests Closes #29396 from huaxingao/v2jdbc. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 07:23:20 +00:00
Wenchen Fan	d2a5dad97c	[SPARK-32757][SQL] Physical InSubqueryExec should be consistent with logical InSubquery ### What changes were proposed in this pull request? `InSubquery` can be in either single-column mode or multi-column mode, depending on the output length of the subquery. For multi-column mode, the length of input `values` must match the subquery output length. However, `InSubqueryExec` doesn't follow this and is always executed in single-column mode. That's OK because it's only used by DPP, which looks up one key in one `InSubqueryExec`, so the multi-column mode is not needed. But it's better to make the physical and logical nodes consistent. This PR updates `InSubqueryExec` to support multi-column mode, and also fixes `SubqueryBroadcastExec` to report output correctly. ### Why are the changes needed? Fix a potential bug. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests Closes #29601 from cloud-fan/follow. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 07:19:43 +00:00
Kris Mok	6e5bc39e17	[SPARK-32624][SQL][FOLLOWUP] Fix regression in CodegenContext.addReferenceObj on nested Scala types ### What changes were proposed in this pull request? Use `CodeGenerator.typeName()` instead of `Class.getCanonicalName()` in `CodegenContext.addReferenceObj()` for getting the runtime class name for an object. ### Why are the changes needed? https://github.com/apache/spark/pull/29439 fixed a bug in `CodegenContext.addReferenceObj()` for `Array[Byte]` (i.e. Spark SQL's `BinaryType`) objects, but unfortunately it introduced a regression for some nested Scala types. For example, for `implicitly[Ordering[UTF8String]]`, after that PR `CodegenContext.addReferenceObj()` would return `((null) references[0] /* ... */)`. The actual type for `implicitly[Ordering[UTF8String]]` is `scala.math.LowPriorityOrderingImplicits$$anon$3` in Scala 2.12.10, and `Class.getCanonicalName()` returns `null` for that class. On the other hand, `Class.getName()` is safe to use for all non-array types, and Janino will happily accept the type name returned from `Class.getName()` for nested types. `CodeGenerator.typeName()` happens to do the right thing by correctly handling arrays and otherwise use `Class.getName()`. So it's a better alternative than `Class.getCanonicalName()`. Side note: rule of thumb for using Java reflection in Spark: it may be tempting to use `Class.getCanonicalName()`, but for functions that may need to handle Scala types, please avoid it due to potential issues with nested Scala types. Instead, use `Class.getName()` or utility functions in `org.apache.spark.util.Utils` (e.g. `Utils.getSimpleName()` or `Utils.getFormattedClassName()` etc). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added new unit test case for the regression case in `CodeGenerationSuite`. Closes #29602 from rednaxelafx/spark-32624-followup. 
Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-01 15:15:11 +09:00
HyukjinKwon	d80c85c2e3	[SPARK-32191][FOLLOW-UP][PYTHON][DOCS] Indent the table and reword the main page in migration guide ### What changes were proposed in this pull request? This PR is a minor followup to fix: 1. Slightly reword the wording in the main page. 2. The indentation in the table at the migration guide; from ![Screen Shot 2020-09-01 at 1 53 40 PM](https://user-images.githubusercontent.com/6477701/91796204-91781800-ec5a-11ea-9f57-d7a9f4207ba0.png) to ![Screen Shot 2020-09-01 at 1 53 26 PM](https://user-images.githubusercontent.com/6477701/91796202-9046eb00-ec5a-11ea-9db2-815139ddfdb9.png) ### Why are the changes needed? In order to show the migration guide pretty. ### Does this PR introduce _any_ user-facing change? Yes, this is a change to user-facing documentation. ### How was this patch tested? Manually built the documentation. Closes #29606 from HyukjinKwon/SPARK-32191. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-01 15:08:03 +09:00
Chao Sun	94d313b061	[SPARK-32721][SQL][FOLLOWUP] Simplify if clauses with null and boolean ### What changes were proposed in this pull request? This is a follow-up on SPARK-32721 and PR #29567. In the previous PR we missed two more cases that can be optimized: ``` if(p, false, null) ==> and(not(p), null) if(p, true, null) ==> or(p, null) ``` ### Why are the changes needed? By transforming if to boolean conjunctions or disjunctions, we can enable more filter pushdown to datasources. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added unit tests. Closes #29603 from sunchao/SPARK-32721-2. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-09-01 06:06:25 +00:00
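The two follow-up rewrites above can be checked by brute force over SQL's three truth values. The following is a minimal sketch, not Spark code: it assumes Kleene three-valued logic encoded as FALSE < NULL < TRUE (so AND is `min`, OR is `max`), and assumes `if` takes the else branch whenever the predicate is not TRUE. All names (`k_and`, `k_if`, `rewrites`, `agreement`) are hypothetical.

```python
# Kleene three-valued logic encoded as ordered integers: FALSE < NULL < TRUE.
F, N, T = 0, 1, 2


def k_and(a, b):
    return min(a, b)  # FALSE dominates AND, NULL beats TRUE


def k_or(a, b):
    return max(a, b)  # TRUE dominates OR, NULL beats FALSE


def k_not(a):
    return 2 - a  # swaps TRUE and FALSE, leaves NULL unchanged


def k_if(p, then_value, else_value):
    # Assumption of this sketch: a NULL predicate is "not true",
    # so the else branch is taken.
    return then_value if p == T else else_value


# Each entry maps a rewrite to (original expression, rewritten expression).
rewrites = {
    "if(p, false, null) ==> and(not(p), null)": (
        lambda p: k_if(p, F, N),
        lambda p: k_and(k_not(p), N),
    ),
    "if(p, true, null) ==> or(p, null)": (
        lambda p: k_if(p, T, N),
        lambda p: k_or(p, N),
    ),
}

# True when both sides agree for p in {FALSE, NULL, TRUE}.
agreement = {
    name: all(lhs(p) == rhs(p) for p in (F, N, T))
    for name, (lhs, rhs) in rewrites.items()
}

for name, ok in agreement.items():
    print(name, "holds for every p" if ok else "fails for some p")
```

Under these assumptions both rewrites agree with the original expression for every value of `p`, including NULL.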
Yuming Wang	a701bc79e3	[SPARK-32659][SQL][FOLLOWUP] Improve test for pruning DPP on non-atomic type ### What changes were proposed in this pull request? Improve test for pruning DPP on non-atomic type: - Avoid creating new partition tables, which may take 30 seconds. - Add a test for the `array` type. ### Why are the changes needed? Improve test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #29595 from wangyum/SPARK-32659-test. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-09-01 05:51:04 +00:00
HyukjinKwon	86ca90ccd7	[SPARK-32190][PYTHON][DOCS] Development - Contribution Guide in PySpark ### What changes were proposed in this pull request? This PR proposes to document PySpark specific contribution guides at "Development" section. Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/development/contributing.html ### Why are the changes needed? To have a single place for PySpark users, and better documentation. ### Does this PR introduce _any_ user-facing change? Yes, it is a new documentation. See the demo linked above. ### How was this patch tested? ```bash cd docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` and ```bash cd python/docs make clean html ``` Closes #29596 from HyukjinKwon/SPARK-32190. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-01 14:20:07 +09:00
Lu WANG	701e593414	[MINOR][R] Fix a R style in try and finally at DataFrame.R Fix an R style issue that is not caught by the R style checker. Got error: ``` R/DataFrame.R:1244:17: style: Closing curly-braces should always be on their own line, unless it's followed by an else. }, finally = { ^ lintr checks failed. ``` Closes #29574 from lu-wang-dl/fix-r-style. Lead-authored-by: Lu WANG <lu.wang@databricks.com> Co-authored-by: Lu Wang <38018689+lu-wang-dl@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-09-01 10:07:34 +09:00
Chao Sun	1453a09a63	[SPARK-32721][SQL] Simplify if clauses with null and boolean ### What changes were proposed in this pull request? The following if clause: ```sql if(p, null, false) ``` can be simplified to: ```sql and(p, null) ``` Similarly, the clause: ```sql if(p, null, true) ``` can be simplified to ```sql or(not(p), null) ``` iff the predicate `p` is non-nullable, i.e., can be evaluated to either true or false, but not null. ### Why are the changes needed? Converting if to or/and clauses can better push filters down. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #29567 from sunchao/SPARK-32721. Authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: DB Tsai <d_tsai@apple.com>	2020-08-31 20:59:54 +00:00
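The non-nullability condition on `p` can be seen by enumerating SQL's three truth values. The following is a minimal sketch, not Spark code: NULL is modeled as Python `None`, and `sql_if` is assumed to take the else branch when the predicate is NULL. All function and variable names are hypothetical.

```python
# SQL three-valued logic modeled with Python values: True, False, None (NULL).


def sql_not(a):
    return None if a is None else (not a)


def sql_and(a, b):
    if a is False or b is False:
        return False  # FALSE AND anything is FALSE
    if a is None or b is None:
        return None  # otherwise NULL propagates
    return True


def sql_or(a, b):
    if a is True or b is True:
        return True  # TRUE OR anything is TRUE
    if a is None or b is None:
        return None  # otherwise NULL propagates
    return False


def sql_if(p, then_value, else_value):
    # Assumption of this sketch: a NULL predicate is "not true",
    # so the else branch is taken.
    return then_value if p is True else else_value


# Each entry maps a rewrite to (original expression, rewritten expression).
REWRITES = {
    "if(p, null, false) -> and(p, null)": (
        lambda p: sql_if(p, None, False),
        lambda p: sql_and(p, None),
    ),
    "if(p, null, true) -> or(not(p), null)": (
        lambda p: sql_if(p, None, True),
        lambda p: sql_or(sql_not(p), None),
    ),
}


def mismatches(original, rewritten):
    """Return the values of p for which the two expressions disagree."""
    return [p for p in (True, False, None) if original(p) != rewritten(p)]


for name, (original, rewritten) in REWRITES.items():
    print(name, "disagrees for p in", mismatches(original, rewritten))
```

Under these assumptions each rewrite disagrees with its original only when `p` is NULL, which is why the simplification is restricted to a non-nullable predicate.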
Huaxin Gao	806140de40	[SPARK-32592][SQL] Make DataFrameReader.table take the specified options ### What changes were proposed in this pull request? pass specified options in DataFrameReader.table to JDBCTableCatalog.loadTable ### Why are the changes needed? Currently, `DataFrameReader.table` ignores the specified options. The options specified like the following are lost. ``` val df = spark.read .option("partitionColumn", "id") .option("lowerBound", "0") .option("upperBound", "3") .option("numPartitions", "2") .table("h2.test.people") ``` We need to make `DataFrameReader.table` take the specified options. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually test for now. Will add a test after V2 JDBC read is implemented. Closes #29535 from huaxingao/table_options. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-31 13:21:15 +00:00
HyukjinKwon	eaaf783148	[MINOR][DOCS] Fix the Binder link to point the quickstart notebook correctly ### What changes were proposed in this pull request? This PR fixes the link of Binder in Quickstart notebook and documentation. From: https://mybinder.org/v2/gh/databricks/apache/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb To: https://mybinder.org/v2/gh/apache/spark/master?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb This link is the same as the one in RST files: `b54103016a/python/docs/source/conf.py (L57)` ### Why are the changes needed? The link was wrong and pointed to a non-existent file and repo. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the link so users can correctly try Binder. ### How was this patch tested? Manually tested by building the documentation. Closes #29597 from HyukjinKwon/minor-link-quickstart. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 22:00:56 +09:00
HyukjinKwon	2491cf1ae1	[SPARK-32747][R][TESTS] Deduplicate configuration set/unset in test_sparkSQL_arrow.R ### What changes were proposed in this pull request? This PR proposes to deduplicate configuration set/unset in `test_sparkSQL_arrow.R`. Setting `spark.sql.execution.arrow.sparkr.enabled` can be globally done instead of doing it in each test case. ### Why are the changes needed? To deduplicate the code. ### Does this PR introduce _any_ user-facing change? No, dev-only ### How was this patch tested? Manually ran the tests. Closes #29592 from HyukjinKwon/SPARK-32747. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 17:39:12 +09:00
zero323	5574734093	[SPARK-32138][FOLLOW-UP] Drop obsolete StringIO import branching ### What changes were proposed in this pull request? Removal of branched `StringIO` import. ### Why are the changes needed? Top level `StringIO` is no longer present in Python 3.x. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #29590 from zero323/SPARK-32138-FOLLOW-UP. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 16:56:50 +09:00
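For illustration, the kind of branched import this commit removes usually looks like the commented-out snippet below. The exact code that was removed is not quoted in the commit message, so the commented form is an assumption. On Python 3 the single import from `io` is enough.

```python
# Branched form, shown only as a comment (assumed shape, not quoted from the commit):
#
#   try:
#       from StringIO import StringIO   # Python 2 top-level module
#   except ImportError:
#       from io import StringIO         # Python 3 location
#
# On Python 3 there is no top-level StringIO module, so one import suffices.
from io import StringIO

# Small usage example: build a text buffer in memory and read it back.
buffer = StringIO()
buffer.write("a,b\n")
buffer.write("1,2\n")
text = buffer.getvalue()
print(text, end="")
```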
Cheng Su	ce473b223a	[SPARK-32740][SQL] Refactor common partitioning/distribution logic to BaseAggregateExec ### What changes were proposed in this pull request? All three aggregate physical operators (`HashAggregateExec`, `ObjectHashAggregateExec` and `SortAggregateExec`) have the same `outputPartitioning` and `requiredChildDistribution` logic. This PR refactors the shared logic into their superclass `BaseAggregateExec` to avoid code duplication and future bugs (similar to `HashJoin` and `ShuffledJoin`). ### Why are the changes needed? Reduce duplicated code across classes and prevent future bugs if we only update one class but forget another. We already did similar refactoring for join (`HashJoin` and `ShuffledJoin`). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests, as this is a pure refactoring with no new logic added. Closes #29583 from c21/aggregate-refactor. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-08-31 15:43:13 +09:00
Fokko Driesprong	a1e459ed9f	[SPARK-32719][PYTHON] Add Flake8 check missing imports https://issues.apache.org/jira/browse/SPARK-32719 ### What changes were proposed in this pull request? Add a check to detect missing imports. This makes sure that if we use a specific class, it should be explicitly imported (not using a wildcard). ### Why are the changes needed? To make sure that the quality of the Python code is up to standard. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit-tests and Flake8 static analysis Closes #29563 from Fokko/fd-add-check-missing-imports. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 11:23:31 +09:00
Kent Yao	6dacba7fa0	[SPARK-32733][SQL] Add extended information - arguments/examples/since/notes of expressions to the remarks field of GetFunctionsOperation ### What changes were proposed in this pull request? This PR adds extended information of a function including arguments, examples, notes and the since field to the SparkGetFunctionOperation ### Why are the changes needed? better user experience, it will help JDBC users to have a better understanding of our builtin functions ### Does this PR introduce _any_ user-facing change? Yes, BI tools and JDBC users will get full information on a spark function instead of only fragmentary usage info. e.g. date_part #### before ``` date_part(field, source) - Extracts a part of the date/timestamp or interval source. ``` #### after ``` Usage: date_part(field, source) - Extracts a part of the date/timestamp or interval source. Arguments: * field - selects which part of the source should be extracted, and supported string values are as same as the fields of the equivalent function `EXTRACT`. * source - a date/timestamp or interval column from where `field` should be extracted Examples: > SELECT date_part('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456'); 2019 > SELECT date_part('week', timestamp'2019-08-12 01:00:00.123456'); 33 > SELECT date_part('doy', DATE'2019-08-12'); 224 > SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001'); 1.000001 > SELECT date_part('days', interval 1 year 10 months 5 days); 5 > SELECT date_part('seconds', interval 5 hours 30 seconds 1 milliseconds 1 microseconds); 30.001001 Note: The date_part function is equivalent to the SQL-standard function `EXTRACT(field FROM source)` Since: 3.0.0 ``` ### How was this patch tested? New tests Closes #29577 from yaooqinn/SPARK-32733. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-31 11:03:01 +09:00
Udbhav30	065f17386d	[SPARK-32481][CORE][SQL] Support truncate table to move data to trash ### What changes were proposed in this pull request? Instead of deleting the data, we can move the data to trash. Based on the configuration provided by the user it will be deleted permanently from the trash. ### Why are the changes needed? Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently. ### Does this PR introduce _any_ user-facing change? Yes. After truncating a table, the data is no longer deleted permanently right away. It is first moved to the trash and deleted permanently after the configured time. ### How was this patch tested? new UTs added Closes #29552 from Udbhav30/truncate. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-30 10:25:32 -07:00
Cheng Su	cfe012a431	[SPARK-32629][SQL] Track metrics of BitSet/OpenHashSet in full outer SHJ ### What changes were proposed in this pull request? This is a followup to https://github.com/apache/spark/pull/29342 and does two things: * Per https://github.com/apache/spark/pull/29342#discussion_r470153323, change from java `HashSet` to spark in-house `OpenHashSet` to track matched rows for non-unique join keys. I checked the `OpenHashSet` implementation, which is built from a key index (`OpenHashSet._bitset` as `BitSet`) and key array (`OpenHashSet._data` as `Array`). Java `HashSet` is built from `HashMap`, which stores values in a `Node` linked list and in theory should take more memory than `OpenHashSet`. Reran the same benchmark query used in https://github.com/apache/spark/pull/29342, and verified the query has similar performance here between `HashSet` and `OpenHashSet`. * Track metrics of the extra data structure `BitSet`/`OpenHashSet` for full outer SHJ. This depends on the change above, because there seems to be no easy way to get java `HashSet` memory size. ### Why are the changes needed? To surface the memory usage of full outer SHJ more accurately. This can help users/developers to debug/improve full outer SHJ. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit test in `SQLMetricsSuite.scala` . Closes #29566 from c21/add-metrics. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-08-30 07:01:33 +09:00
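To make the "key index plus key array" layout concrete, here is a toy open-addressing hash set. It is only a sketch of the general idea and not Spark's `OpenHashSet`: the class name, the linear probing, the 0.7 load factor, and the use of a Python integer as the occupancy bitset are all assumptions made for illustration.

```python
class OpenAddressingSet:
    """Toy hash set: an occupancy bitset plus a flat key array (no per-entry nodes)."""

    def __init__(self, capacity=8):
        # Capacity must be a power of two so `& mask` acts as a cheap modulo.
        assert capacity > 0 and capacity & (capacity - 1) == 0
        self._capacity = capacity
        self._bitset = 0                # bit i set => slot i holds a key
        self._data = [None] * capacity  # keys stored inline in one array
        self._size = 0

    def _occupied(self, pos):
        return (self._bitset >> pos) & 1 == 1

    def _find_slot(self, key):
        mask = self._capacity - 1
        pos = hash(key) & mask
        # Linear probing: step forward until the key or a free slot is found.
        # The load factor stays below 1, so a free slot always exists.
        while self._occupied(pos) and self._data[pos] != key:
            pos = (pos + 1) & mask
        return pos

    def add(self, key):
        """Insert key; return False if it was already present."""
        pos = self._find_slot(key)
        if self._occupied(pos):
            return False
        self._bitset |= 1 << pos
        self._data[pos] = key
        self._size += 1
        if self._size * 10 > self._capacity * 7:  # load factor above 0.7
            self._grow()
        return True

    def __contains__(self, key):
        return self._occupied(self._find_slot(key))

    def __len__(self):
        return self._size

    def _grow(self):
        # Double the table and re-insert every stored key.
        old_keys = [k for i, k in enumerate(self._data) if self._occupied(i)]
        self._capacity *= 2
        self._bitset = 0
        self._data = [None] * self._capacity
        self._size = 0
        for k in old_keys:
            self.add(k)


# Usage: remember which build-side row indices have been matched so far.
matched_rows = OpenAddressingSet()
for row_index in [3, 11, 19, 3, 27, 35, 43, 51, 11]:
    matched_rows.add(row_index)
print(len(matched_rows), 51 in matched_rows, 59 in matched_rows)
```

Because keys live directly in one array and occupancy is a bitmap, the memory footprint is easy to estimate from the array length, which is the property the commit relies on for metrics.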
Wenchen Fan	ccc0250a08	[SPARK-32718][SQL] Remove unnecessary keywords for interval units ### What changes were proposed in this pull request? Remove the YEAR, MONTH, DAY, HOUR, MINUTE, SECOND keywords. They are not useful in the parser, as we need to support plural like YEARS, so the parser has to accept the general identifier as interval unit anyway. ### Why are the changes needed? These keywords are reserved in ANSI. If Spark has these keywords, then they become reserved under ANSI mode. This makes Spark not able to run TPCDS queries as they use YEAR as alias name. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added `TPCDSQueryANSISuite`, to make sure Spark with ANSI mode can run TPCDS queries. Closes #29560 from cloud-fan/keyword. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-29 14:06:01 -07:00
Louiszr	a0bd273bb0	[SPARK-32092][ML][PYSPARK][FOLLOWUP] Fixed CrossValidatorModel.copy() to copy models instead of list ### What changes were proposed in this pull request? Fixed `CrossValidatorModel.copy()` so that it correctly calls `.copy()` on the models instead of lists of models. ### Why are the changes needed? `copy()` was first changed in #29445 . The issue was found in CI of #29524 and fixed. This PR introduces the exact same change so that `CrossValidatorModel.copy()` and its related tests are aligned in branch `master` and branch `branch-3.0`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated `test_copy` to make sure `copy()` is called on models instead of lists of models. Closes #29553 from Louiszr/fix-cv-copy. Authored-by: Louiszr <zxhst14@gmail.com> Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>	2020-08-28 10:15:16 -07:00
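The difference between copying lists of models and copying the models themselves can be shown without PySpark. The sketch below uses hypothetical stand-in classes (`ToyModel`, `ToyCrossValidatorModel`), and the nested list-of-lists layout for sub-models is an assumption of the sketch, not a quote of the actual fix.

```python
class ToyModel:
    """Hypothetical stand-in for a fitted model with its own parameters."""

    def __init__(self, params):
        self.params = dict(params)

    def copy(self):
        return ToyModel(self.params)


class ToyCrossValidatorModel:
    """Hypothetical stand-in; sub_models is a list (per fold) of lists of models."""

    def __init__(self, best_model, sub_models):
        self.best_model = best_model
        self.sub_models = sub_models

    def copy_sharing_models(self):
        # Problematic variant: list.copy() is shallow, so the new inner lists
        # still reference the very same model objects.
        return ToyCrossValidatorModel(
            self.best_model.copy(),
            [fold.copy() for fold in self.sub_models],
        )

    def copy(self):
        # Intended behavior: call copy() on every model, not on the lists.
        return ToyCrossValidatorModel(
            self.best_model.copy(),
            [[model.copy() for model in fold] for fold in self.sub_models],
        )


original = ToyCrossValidatorModel(
    ToyModel({"reg": 0.1}),
    [[ToyModel({"reg": 0.1}), ToyModel({"reg": 1.0})]],
)

independent = original.copy()
independent.sub_models[0][0].params["reg"] = 5.0
value_after_model_copy = original.sub_models[0][0].params["reg"]  # unchanged

shared = original.copy_sharing_models()
shared.sub_models[0][0].params["reg"] = 99.0
value_after_list_copy = original.sub_models[0][0].params["reg"]  # changed too

print(value_after_model_copy, value_after_list_copy)
```

A test in the spirit of the updated `test_copy` would assert that mutating a copied sub-model leaves the original untouched.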
Chen Zhang	58f87b3178	[SPARK-32639][SQL] Support GroupType parquet mapkey field ### What changes were proposed in this pull request? Remove the assertion in ParquetSchemaConverter that the parquet mapKey field must be PrimitiveType. ### Why are the changes needed? There is a parquet file in the attachment of [SPARK-32639](https://issues.apache.org/jira/browse/SPARK-32639), and the MessageType recorded in the file is: ``` message parquet_schema { optional group value (MAP) { repeated group key_value { required group key { optional binary first (UTF8); optional binary middle (UTF8); optional binary last (UTF8); } optional binary value (UTF8); } } } ``` Use `spark.read.parquet("000.snappy.parquet")` to read the file. Spark will throw an exception when converting Parquet MessageType to Spark SQL StructType: > AssertionError(Map key type is expected to be a primitive type, but found...) Use `spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")` to read the file, spark returns the correct result . According to the parquet project document (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), the mapKey in the parquet format does not need to be a primitive type. Note: This parquet file is not written by spark, because spark will write additional sparkSchema string information in the parquet file. When Spark reads, it will directly use the additional sparkSchema information in the file instead of converting Parquet MessageType to Spark SQL StructType. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added a unit test case Closes #29451 from izchen/SPARK-32639. Authored-by: Chen Zhang <izchen@126.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-28 16:51:00 +00:00
Takeshi Yamamuro	0cb91b8c18	[SPARK-32704][SQL] Logging plan changes for execution ### What changes were proposed in this pull request? Since we only log plan changes for analyzer/optimizer now, this PR intends to add code to log plan changes in the preparation phase in `QueryExecution` for execution. ``` scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN") scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan ... 20/08/26 09:32:36 WARN PlanChangeLogger: === Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages === !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) (1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L]) +- (1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L]) ! +- Range (0, 10, step=1, splits=4) +- (1) Range (0, 10, step=1, splits=4) 20/08/26 09:32:36 WARN PlanChangeLogger: === Result of Batch Preparations === !HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) (1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L]) !+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L]) +- (1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L]) ! +- Range (0, 10, step=1, splits=4) +- (1) Range (0, 10, step=1, splits=4) ``` ### Why are the changes needed? Easy debugging for executed plans ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests. Closes #29544 from maropu/PlanLoggingInPreparations. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-28 16:35:47 +00:00
Kent Yao	0626901bcb	[SPARK-32729][SQL][DOCS] Add missing since version for math functions ### What changes were proposed in this pull request? Add missing since version for math functions, including SPARK-8223 shiftright/shiftleft SPARK-8215 pi SPARK-8212 e SPARK-6829 sin/asin/sinh/cos/acos/cosh/tan/atan/tanh/ceil/floor/rint/cbrt/signum/isignum/Fsignum/Lsignum/degrees/radians/log/log10/log1p/exp/expm1/pow/hypot/atan2 SPARK-8209 conv SPARK-8213 factorial SPARK-20751 cot SPARK-2813 sqrt SPARK-8227 unhex SPARK-8218 log(a,b) SPARK-8207 bin SPARK-8214 hex SPARK-8206 round SPARK-14614 bround ### Why are the changes needed? fix SQL docs ### Does this PR introduce _any_ user-facing change? yes, doc updated ### How was this patch tested? passing doc generation. Closes #29571 from yaooqinn/minor. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-29 00:30:31 +09:00
yi.wu	c3b9404253	[SPARK-32717][SQL] Add a AQEOptimizer for AdaptiveSparkPlanExec ### What changes were proposed in this pull request? This PR proposes to add a specific `AQEOptimizer` for the `AdaptiveSparkPlanExec` instead of implementing an anonymous `RuleExecutor`. At the same time, this PR also adds the configuration `spark.sql.adaptive.optimizer.excludedRules`, which follows the same pattern of `Optimizer`, to make the `AQEOptimizer` more flexible for users and developers. ### Why are the changes needed? Currently, `AdaptiveSparkPlanExec` has implemented an anonymous `RuleExecutor` to apply the AQE optimize rules on the plan. However, the anonymous class usually could be inconvenient to maintain and extend for the long term. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? It's a pure refactor so pass existing tests should be ok. Closes #29559 from Ngone51/impro-aqe-optimizer. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 21:23:53 +09:00
HyukjinKwon	5775073a01	[SPARK-32722][PYTHON][DOCS] Update document type conversion for Pandas UDFs (pyarrow 1.0.1, pandas 1.1.1, Python 3.7) ### What changes were proposed in this pull request? This PR updates the chart generated at SPARK-25666. We bumped up the minimal PyArrow version. It's better to use PyArrow 0.15.1+ ### Why are the changes needed? To track the changes in type coercion of PySpark <> PyArrow <> pandas. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Use this code to generate the chart: ```python from pyspark.sql.types import * from pyspark.sql.functions import pandas_udf columns = [ ('none', 'object(NoneType)'), ('bool', 'bool'), ('int8', 'int8'), ('int16', 'int16'), ('int32', 'int32'), ('int64', 'int64'), ('uint8', 'uint8'), ('uint16', 'uint16'), ('uint32', 'uint32'), ('uint64', 'uint64'), ('float64', 'float16'), ('float64', 'float32'), ('float64', 'float64'), ('date', 'datetime64[ns]'), ('tz_aware_dates', 'datetime64[ns, US/Eastern]'), ('string', 'object(string)'), ('decimal', 'object(Decimal)'), ('array', 'object(array[int32])'), ('float128', 'float128'), ('complex64', 'complex64'), ('complex128', 'complex128'), ('category', 'category'), ('tdeltas', 'timedelta64[ns]'), ] def create_dataframe(): import pandas as pd import numpy as np import decimal pdf = pd.DataFrame({ 'none': [None, None], 'bool': [True, False], 'int8': np.arange(1, 3).astype('int8'), 'int16': np.arange(1, 3).astype('int16'), 'int32': np.arange(1, 3).astype('int32'), 'int64': np.arange(1, 3).astype('int64'), 'uint8': np.arange(1, 3).astype('uint8'), 'uint16': np.arange(1, 3).astype('uint16'), 'uint32': np.arange(1, 3).astype('uint32'), 'uint64': np.arange(1, 3).astype('uint64'), 'float16': np.arange(1, 3).astype('float16'), 'float32': np.arange(1, 3).astype('float32'), 'float64': np.arange(1, 3).astype('float64'), 'float128': np.arange(1, 3).astype('float128'), 'complex64': np.arange(1, 3).astype('complex64'), 'complex128': np.arange(1, 
3).astype('complex128'), 'string': list('ab'), 'array': pd.Series([np.array([1, 2, 3], dtype=np.int32), np.array([1, 2, 3], dtype=np.int32)]), 'decimal': pd.Series([decimal.Decimal('1'), decimal.Decimal('2')]), 'date': pd.date_range('19700101', periods=2).values, 'category': pd.Series(list("AB")).astype('category')}) pdf['tdeltas'] = [pdf.date.diff()[1], pdf.date.diff()[0]] pdf['tz_aware_dates'] = pd.date_range('19700101', periods=2, tz='US/Eastern') return pdf types = [ BooleanType(), ByteType(), ShortType(), IntegerType(), LongType(), FloatType(), DoubleType(), DateType(), TimestampType(), StringType(), DecimalType(10, 0), ArrayType(IntegerType()), MapType(StringType(), IntegerType()), StructType([StructField("_1", IntegerType())]), BinaryType(), ] df = spark.range(2).repartition(1) results = [] count = 0 total = len(types) * len(columns) values = [] spark.sparkContext.setLogLevel("FATAL") for t in types: result = [] for column, pandas_t in columns: v = create_dataframe()[column][0] values.append(v) try: row = df.select(pandas_udf(lambda _: create_dataframe()[column], t)(df.id)).first() ret_str = repr(row[0]) except Exception: ret_str = "X" result.append(ret_str) progress = "SQL Type: [%s]\n Pandas Value(Type): %s(%s)]\n Result Python Value: [%s]" % ( t.simpleString(), v, pandas_t, ret_str) count += 1 print("%s/%s:\n %s" % (count, total, progress)) results.append([t.simpleString()] + list(map(str, result))) schema = ["SQL Type \\ Pandas Value(Type)"] + list(map(lambda values_column: "%s(%s)" % (values_column[0], values_column[1][1]), zip(values, columns))) strings = spark.createDataFrame(results, schema=schema)._jdf.showString(20, 20, False) print("\n".join(map(lambda line: " # %s # noqa" % line, strings.strip().split("\n")))) ``` Closes #29569 from HyukjinKwon/SPARK-32722. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 15:38:39 +09:00
Jungtaek Lim (HeartSaVioR)	73bfed3633	[SPARK-28612][SQL][FOLLOWUP] Correct method doc of DataFrameWriterV2.replace() ### What changes were proposed in this pull request? This patch corrects the method doc of DataFrameWriterV2.replace(), in which the condition for the exception was described the wrong way round. ### Why are the changes needed? The method doc is incorrect. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Only doc change. Closes #29568 from HeartSaVioR/SPARK-28612-FOLLOWUP-fix-doc-nit. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 15:14:57 +09:00
HyukjinKwon	c154629171	[SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow ### What changes were proposed in this pull request? This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide"). Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html ### Why are the changes needed? To have a single place for PySpark users, and better documentation. ### Does this PR introduce _any_ user-facing change? Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation. ### How was this patch tested? ```bash cd docs SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch ``` and ```bash cd python/docs make clean html ``` Closes #29548 from HyukjinKwon/SPARK-32183. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-28 15:09:06 +09:00
Liang-Chi Hsieh	d6c095c92c	[SPARK-32693][SQL] Compare two dataframes with same schema except nullable property ### What changes were proposed in this pull request? This PR changes the key data type check in `HashJoin` to use `sameType`. ### Why are the changes needed? The resolution condition of `SetOperation` only requires that each left data type be `sameType` as the corresponding right one. Logically, the `EqualTo` expression in an equi-join also only requires the left data type to be `sameType` as the right data type. Requiring in `HashJoin` that the left key data types be exactly the same as the right key data types therefore does not look reasonable. It produces inconsistent results when doing `except` between two dataframes. If the two dataframes don't have nested fields, `HashJoin` passes the key type check even when the nullable property of their fields differs, because it checks each field individually and so the nullable property is ignored. If the two dataframes have nested fields such as structs, `HashJoin` fails the key type check because it now compares two struct types, and the nullable property affects that comparison. ### Does this PR introduce _any_ user-facing change? Yes. It makes the `except` operation between dataframes consistent. ### How was this patch tested? Unit test. Closes #29555 from viirya/SPARK-32693. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-08-28 10:32:23 +09:00
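The difference between an exact type comparison and a `sameType`-style comparison described in SPARK-32693 can be sketched with a toy type model in plain Python. Everything here (`Field`, `Struct`, `same_type`) is a hypothetical stand-in, not Spark's `DataType` API; the sketch assumes only that the relaxed comparison ignores the nullable flag at every nesting level.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Toy stand-ins for a schema; these are NOT Spark classes.
@dataclass(frozen=True)
class Field:
    name: str
    dtype: "DataType"
    nullable: bool = True


@dataclass(frozen=True)
class Struct:
    fields: Tuple[Field, ...]


DataType = Union[str, Struct]  # atomic types are modeled as plain strings


def same_type(left: DataType, right: DataType) -> bool:
    """Compare two types while ignoring nullability at every level."""
    if isinstance(left, Struct) and isinstance(right, Struct):
        return len(left.fields) == len(right.fields) and all(
            l.name == r.name and same_type(l.dtype, r.dtype)
            for l, r in zip(left.fields, right.fields)
        )
    return left == right


left_key = Struct((Field("k", "bigint", nullable=False),))
right_key = Struct((Field("k", "bigint", nullable=True),))

print(left_key == right_key)           # exact equality sees the nullable flag
print(same_type(left_key, right_key))  # the relaxed check does not
```

In this model the two struct keys differ under exact equality but match under the relaxed check, which mirrors the nested-field case the commit message describes.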
Dongjoon Hyun	182727d90f	[SPARK-32713][K8S] Support execId placeholder in executor PVC conf ### What changes were proposed in this pull request? This PR aims to support executor id placeholder in `spark.kubernetes.executor.volumes.persistentVolumeClaim.myname.options.claimName` configuration like the following. ``` --conf spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=pvc-spark-SPARK_EXECUTOR_ID \ ``` ### Why are the changes needed? This is a convenient way to mount corresponding PV to the executor. ### Does this PR introduce _any_ user-facing change? Yes, but this is a new feature and there is no regression because users don't use `SPARK_EXECUTOR_ID` in PVC claim name. ### How was this patch tested? Pass the newly added test case. Closes #29557 from dongjoon-hyun/SPARK-PVC. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-27 09:49:21 -07:00
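The effect of the `SPARK_EXECUTOR_ID` placeholder from SPARK-32713 can be pictured as a plain string substitution performed once per executor. The sketch below is a plain-Python illustration under that assumption; `resolve_claim_name` is a hypothetical helper, not part of Spark's Kubernetes code.

```python
# Illustrative model only; this is NOT Spark's Kubernetes code.
PLACEHOLDER = "SPARK_EXECUTOR_ID"


def resolve_claim_name(template: str, executor_id: str) -> str:
    """Replace every occurrence of the placeholder with the executor id."""
    return template.replace(PLACEHOLDER, executor_id)


# The template matches the claimName value from the commit message example.
template = "pvc-spark-SPARK_EXECUTOR_ID"
claims = [resolve_claim_name(template, str(i)) for i in (1, 2)]
print(claims)
```

Under this model each executor resolves to a distinct claim name, which is what lets a pre-created PVC be matched to a specific executor.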
waleedfateem	8749b2b6fa	[SPARK-32701][CORE][DOCS] mapreduce.fileoutputcommitter.algorithm.version default value The current documentation states that the default value of spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version is 1 which is not entirely true since this configuration isn't set anywhere in Spark but rather inherited from the Hadoop FileOutputCommitter class. ### What changes were proposed in this pull request? I'm submitting this change, to clarify that the default value will entirely depend on the Hadoop version of the runtime environment. ### Why are the changes needed? An application would end up using algorithm version 1 on certain environments but without any changes the same exact application will use version 2 on environments running Hadoop 3.0 and later. This can have pretty bad consequences in certain scenarios, for example, two tasks can partially overwrite their output if speculation is enabled. Also, please refer to the following JIRA: https://issues.apache.org/jira/browse/MAPREDUCE-7282 ### Does this PR introduce _any_ user-facing change? Yes. Configuration page content was modified where previously we explicitly highlighted that the default version for the FileOutputCommitter algorithm was v1, this now has changed to "Dependent on environment" with additional information in the description column to elaborate. ### How was this patch tested? Checked changes locally in browser Closes #29541 from waleedfateem/SPARK-32701. Authored-by: waleedfateem <waleed.fateem@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-08-27 09:05:50 -05:00
Dale Clarke	ed51a7f083	[SPARK-30654] Bootstrap4 docs upgrade ### What changes were proposed in this pull request? We are using an older version of Bootstrap (v. 2.1.0) for the online documentation site. Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to EOL in July 2019 (https://github.com/twbs/release). Older versions of Bootstrap are also getting flagged in security scans for various CVEs: https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889 https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700 https://snyk.io/vuln/npm:bootstrap:20180529 https://snyk.io/vuln/npm:bootstrap:20160627 I haven't validated each CVE, but it would probably be good practice to resolve any potential issues and get on a supported release. The bad news is that there have been quite a few changes between Bootstrap 2 and Bootstrap 4. I've tried updating the library, refactoring/tweaking the CSS and JS to maintain a similar appearance and functionality, and testing the documentation. This is a fairly large change so I'm sure additional testing and fixes will be needed. ### How was this patch tested? This has been manually tested, but as there is a lot of documentation it is possible issues were missed. Additional testing and feedback is welcomed. If it appears a whole section was missed let me know and I'll take a pass at addressing that section. Closes #27369 from clarkead/bootstrap4-docs-upgrade. Authored-by: Dale Clarke <a.dale.clarke@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-08-27 09:03:39 -05:00
Kent Yao	f14f3742e0	[SPARK-32696][SQL][TEST-HIVE1.2][TEST-HADOOP2.7] Get columns operation should handle interval column properly ### What changes were proposed in this pull request? This PR let JDBC clients identify spark interval columns properly. ### Why are the changes needed? JDBC users can query interval values through thrift server, create views with interval columns, e.g. ```sql CREATE global temp view view1 as select interval 1 day as i; ``` but when they want to get the details of the columns of view1, the will fail with `Unrecognized type name: INTERVAL` ``` Caused by: java.lang.IllegalArgumentException: Unrecognized type name: INTERVAL at org.apache.hadoop.hive.serde2.thrift.Type.getType(Type.java:170) at org.apache.spark.sql.hive.thriftserver.ThriftserverShimUtils$.toJavaSQLType(ThriftserverShimUtils.scala:53) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:157) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:102) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:149) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6(SparkGetColumnsOperation.scala:113) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$6$adapted(SparkGetColumnsOperation.scala:112) at scala.Option.foreach(Option.scala:407) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:112) at 
org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:111) at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62) at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55) at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49) at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:111) ... 34 more ``` ### Does this PR introduce _any_ user-facing change? YES, #### before ![image](https://user-images.githubusercontent.com/8326978/91162239-6cd1ec80-e6fe-11ea-8c2c-914ddb325c4e.png) #### after ![image](https://user-images.githubusercontent.com/8326978/91162025-1a90cb80-e6fe-11ea-94c4-03a6f2ec296b.png) ### How was this patch tested? new tests Closes #29539 from yaooqinn/SPARK-32696. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-27 06:52:34 +00:00
xuewei.linxuewei	eb379766f4	[SPARK-32705][SQL] Fix serialization issue for EmptyHashedRelation ### What changes were proposed in this pull request? Currently, EmptyHashedRelation and HashedRelationWithAllNullKeys is an object, and it will cause JavaDeserialization Exception as following ``` 20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost task 34.0 in stage 57.0 (TID 18076, emr-worker-5.cluster-183257, executor 18): java.io.InvalidClassException: org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid constructor at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169) at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328) ``` This PR includes * Using case object instead to fix serialization issue. * Also change EmptyHashedRelation not to extend NullAwareHashedRelation since it's already being used in other non-NAAJ joins. ### Why are the changes needed? It will cause BHJ failed when buildSide is Empty and BHJ(NAAJ) failed when buildSide with null partition keys. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? * Existing UT. * Run entire TPCDS for E2E coverage. Closes #29547 from leanken/leanken-SPARK-32705. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-27 06:24:42 +00:00
Terry Kim	baaa756dee	[SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path parameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start() ### What changes were proposed in this pull request? This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`. ### Why are the changes needed? The current behavior silently overwrites the `path` option if path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`. For example, ``` Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2") ``` will write the result to `/tmp/path2`. ### Does this PR introduce _any_ user-facing change? Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`: ``` scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2") org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.; ``` The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`. ### How was this patch tested? Added new tests. Closes #29543 from imback82/path_option. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-27 06:21:04 +00:00
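The rule introduced by SPARK-32516 (a `path` option and a path parameter must not both be given, unless the legacy flag is on) can be modeled in a few lines of plain Python. This is a sketch of the decision logic only: `resolve_save_path` and `PathConflictError` are hypothetical names, and the `legacy` argument stands in for `spark.sql.legacy.pathOptionBehavior.enabled`.

```python
from typing import Dict, Optional


class PathConflictError(ValueError):
    """Hypothetical stand-in for the AnalysisException in the commit message."""


def resolve_save_path(
    options: Dict[str, str],
    path: Optional[str] = None,
    legacy: bool = False,
) -> Optional[str]:
    """Pick the output path, rejecting an ambiguous option/parameter pair."""
    if path is not None and "path" in options and not legacy:
        raise PathConflictError(
            "There is a 'path' option set and save() is called with a path parameter."
        )
    # Legacy behavior: the parameter silently wins over the option.
    return path if path is not None else options.get("path")


print(resolve_save_path({"path": "/tmp/path1"}))
print(resolve_save_path({"path": "/tmp/path1"}, "/tmp/path2", legacy=True))
```

With `legacy=True` the second call returns `/tmp/path2`, matching the old silent-overwrite behavior; with the default it raises instead.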
Devesh Agrawal	b786f31a42	[SPARK-32643][CORE][K8S] Consolidate state decommissioning in the TaskSchedulerImpl realm ### What changes were proposed in this pull request? The decommissioning state is a bit fragmented across two places in the TaskSchedulerImpl: https://github.com/apache/spark/pull/29014/ stored the incoming decommission info messages in TaskSchedulerImpl.executorsPendingDecommission, while https://github.com/apache/spark/pull/28619/ stored just the executor end time in the map TaskSetManager.tidToExecutorKillTimeMapping (which in turn is contained in TaskSchedulerImpl). While the two states are not really overlapping, it's a bit of a code hygiene concern to save this state in two places. With https://github.com/apache/spark/pull/29422, TaskSchedulerImpl is emerging as the place where all decommissioning bookkeeping is kept within the driver. So consolidate the information in _tidToExecutorKillTimeMapping_ into _executorsPendingDecommission_. However, in order to do so, we need to walk away from keeping the raw ExecutorDecommissionInfo messages and instead keep another class, ExecutorDecommissionState. This decoupling will allow the RPC message class ExecutorDecommissionInfo to evolve independently from the bookkeeping class ExecutorDecommissionState. ### Why are the changes needed? This is just a code cleanup. These two features were added independently and it's time to consolidate their state for good hygiene. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #29452 from agrawaldevesh/consolidate_decom_state. Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2020-08-26 15:16:47 -07:00
Dongjoon Hyun	2dee4352a0	Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to trash" This reverts commit `5c077f0580`.	2020-08-26 11:24:35 -07:00
unirt	d3304268d3	[MINOR][PYTHON] Fix typo in a docstring of RDD.toDF ### What changes were proposed in this pull request? Fixes a typo in the docstring of `toDF`. ### Why are the changes needed? The third argument of `toDF` is actually `sampleRatio`. Related discussion: https://github.com/apache/spark/pull/12746#discussion-diff-62704834 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? This patch doesn't affect any logic, so existing tests should cover it. Closes #29551 from unirt/minor_fix_docs. Authored-by: unirt <lunirtc@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-26 10:34:49 -07:00
Yuming Wang	a8b568800e	[SPARK-32659][SQL] Fix the data issue when pruning DPP on non-atomic type ### What changes were proposed in this pull request? Use `InSet` expression to fix data issue when pruning DPP on non-atomic type. for example: ```scala spark.range(1000) .select(col("id"), col("id").as("k")) .write .partitionBy("k") .format("parquet") .mode("overwrite") .saveAsTable("df1"); spark.range(100) .select(col("id"), col("id").as("k")) .write .partitionBy("k") .format("parquet") .mode("overwrite") .saveAsTable("df2") spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2") spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false") spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show ``` It should return two records, but it returns empty. ### Why are the changes needed? Fix data issue ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add new unit test. Closes #29475 from wangyum/SPARK-32659. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-08-26 06:57:43 +00:00
Udbhav30	5c077f0580	[SPARK-32481][CORE][SQL] Support truncate table to move data to trash ### What changes were proposed in this pull request? Instead of deleting the data, we can move the data to trash. Based on the configuration provided by the user it will be deleted permanently from the trash. ### Why are the changes needed? Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently. ### Does this PR introduce _any_ user-facing change? Yes, After truncate table the data is not permanently deleted now. It is first moved to the trash and then after the given time deleted permanently; ### How was this patch tested? new UTs added Closes #29387 from Udbhav30/tuncateTrash. Authored-by: Udbhav30 <u.agrawal30@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-08-25 23:38:43 -07:00
yi.wu	f510d21e93	[SPARK-32466][FOLLOW-UP][TEST][SQL] Regenerate the golden explain file for PlanStabilitySuite ### What changes were proposed in this pull request? This PR regenerates the golden explain file based on the fix: https://github.com/apache/spark/pull/29537 ### Why are the changes needed? Eliminates the personal related information (e.g., local directories) in the explain plan. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Checked manually. Closes #29546 from Ngone51/follow-up-gen-golden-file. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-26 14:46:14 +09:00
HyukjinKwon	b07e7429a6	[SPARK-32695][INFRA] Explicitly cache and hash 'build' directly in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to explicitly cache and hash the files/directories under 'build' for SBT and Zinc at GitHub Actions. Otherwise, it can end up with overwriting `build` directory. See also https://github.com/apache/spark/pull/29286#issuecomment-679368436 Previously, other files like `build/mvn` and `build/sbt` are also cached and overwritten. So, when you have some changes there, they are ignored. ### Why are the changes needed? To make GitHub Actions build stable. ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? The builds in this PR test it out. Closes #29536 from HyukjinKwon/SPARK-32695. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-26 12:25:59 +09:00
HyukjinKwon	b54103016a	[SPARK-32204][SPARK-32182][DOCS] Add a quickstart page with Binder integration in PySpark documentation ### What changes were proposed in this pull request? This PR proposes to: - add a notebook with a Binder integration which allows users to try PySpark in a live notebook. Please [try this here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb). - reuse this notebook as a quickstart guide in PySpark documentation. Note that Binder turns a Git repo into a collection of interactive notebooks. It works based on Docker image. Once somebody builds, other people can reuse the image against a specific commit. Therefore, if we run Binder with the images based on released tags in Spark, virtually all users can instantly launch the Jupyter notebooks. <br/> I made a simple demo to make it easier to review. Please see: - [Main page](https://hyukjin-spark.readthedocs.io/en/stable/). Note that the link ("Live Notebook") in the main page wouldn't work since this PR is not merged yet. - [Quickstart page](https://hyukjin-spark.readthedocs.io/en/stable/getting_started/quickstart.html) <br/> When reviewing the notebook file itself, please give my direct feedback which I will appreciate and address. Another way might be: - open [here](https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-32204?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb). - edit / change / update the notebook. Please feel free to change as whatever you want. I can apply as are or slightly update more when I apply to this PR. - download it as a `.ipynb` file: ![Screen Shot 2020-08-20 at 10 12 19 PM](https://user-images.githubusercontent.com/6477701/90774311-3e38c800-e332-11ea-8476-699a653984db.png) - upload the `.ipynb` file here in a GitHub comment. Then, I will push a commit with that file with crediting correctly, of course. 
- alternatively, push a commit into this PR right away if that's easier for you (if you're a committer). References: - https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html - https://databricks.com/jp/blog/2020/03/31/10-minutes-from-pandas-to-koalas-on-apache-spark.html - my own blog post .. :-) and https://koalas.readthedocs.io/en/latest/getting_started/10min.html ### Why are the changes needed? To improve PySpark's usability. The current quickstart for Python users is not very friendly. ### Does this PR introduce _any_ user-facing change? Yes, it will add a documentation page, and expose a live notebook to PySpark users. ### How was this patch tested? Manually tested, and GitHub Actions builds will test. Closes #29491 from HyukjinKwon/SPARK-32204. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-08-26 12:23:24 +09:00


28215 commits