Commit graph

29174 commits

Chao Sun 0de7f2ff1e [SPARK-34039][SQL] ReplaceTable should invalidate cache
### What changes were proposed in this pull request?

This changes `ReplaceTableExec`/`AtomicReplaceTableExec` to uncache the target table before it is dropped. In addition, this includes some refactoring: the `uncacheTable` method is moved to `DataSourceV2Strategy` so that we don't need to pass a Spark session to the v2 exec.

### Why are the changes needed?

Similar to SPARK-33492 (#30429): when a table is replaced, the associated cache should be invalidated to avoid potentially incorrect results.

### Does this PR introduce _any_ user-facing change?

Yes. Now, when a data source v2 table is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.
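
A hedged illustration of the new behavior (the `testcat` catalog and the `foo` provider below are made-up names, not part of this PR):

```scala
// Replacing a cached v2 table no longer serves stale cached data.
spark.sql("CREATE TABLE testcat.ns.t USING foo AS SELECT 1 AS id")
spark.sql("CACHE TABLE testcat.ns.t")
spark.sql("REPLACE TABLE testcat.ns.t USING foo AS SELECT 2 AS id")
spark.sql("SELECT * FROM testcat.ns.t").show()  // reads the replaced data, not the stale cache
```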

### How was this patch tested?

Added a new unit test.

Closes #31081 from sunchao/SPARK-34039.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-07 21:13:22 -08:00
angerszhu 9b54da490d [SPARK-33818][SQL][DOC] Add descriptions about spark.sql.parser.quotedRegexColumnNames in the SQL documents
### What changes were proposed in this pull request?
Following https://github.com/apache/spark/pull/30805#issuecomment-747179899,
document `spark.sql.parser.quotedRegexColumnNames` in the SQL documents, since users should know about this configuration and it is useful.

![image](https://user-images.githubusercontent.com/46485123/103656543-afa4aa80-4fa3-11eb-8cd3-a9d1b87a3489.png)
![image](https://user-images.githubusercontent.com/46485123/103656551-b2070480-4fa3-11eb-9ce7-95cc424242a6.png)
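
For context, a hedged usage sketch of the configuration being documented (`testTable` and its columns are hypothetical):

```scala
// When enabled, quoted identifiers in SELECT are interpreted as regular
// expressions over column names (Hive-style regex column specification).
spark.sql("SET spark.sql.parser.quotedRegexColumnNames=true")
spark.sql("SELECT `(id)?+.+` FROM testTable")  // selects all columns except `id`
```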

### Why are the changes needed?
Complete the documentation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #30816 from AngersZhuuuu/SPARK-33818.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-07 18:55:27 -08:00
Holden Karau 8e11ce5378 [SPARK-34018][K8S] NPE in ExecutorPodsSnapshot
### What changes were proposed in this pull request?

Label both of the container statuses and ensure `ExecutorPodsSnapshot` starts with the default config to match.

### Why are the changes needed?

The current test depends on the order rather than testing the desired property.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Labeled the container statuses, observed failures, added the default label as the initialization point, and the tests passed again.

Built Spark, ran on K8s cluster verified no NPE in driver log.

Closes #31071 from holdenk/SPARK-34018-finishedExecutorWithRunningSidecar-doesnt-correctly-constructt-the-test-case.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-07 16:47:37 -08:00
Dongjoon Hyun 5b16d70d6a [SPARK-34044][DOCS] Add spark.sql.hive.metastore.jars.path to sql-data-sources-hive-tables.md
### What changes were proposed in this pull request?

This PR adds the new configuration to `sql-data-sources-hive-tables.md`.

### Why are the changes needed?

SPARK-32852 added a new configuration, `spark.sql.hive.metastore.jars.path`.
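
A hedged sketch of how the newly documented configuration is used (the metastore version and jar paths below are illustrative assumptions, not prescriptive values):

```scala
import org.apache.spark.sql.SparkSession

// `spark.sql.hive.metastore.jars=path` tells Spark to load Hive metastore jars
// from the paths given in `spark.sql.hive.metastore.jars.path`.
val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.3.7")
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path",
    "/opt/hive/lib/hive-metastore-2.3.7.jar,/opt/hive/lib/hive-exec-2.3.7.jar")
  .enableHiveSupport()
  .getOrCreate()
```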

### Does this PR introduce _any_ user-facing change?

Yes, but it is a documentation-only change.

### How was this patch tested?

**BEFORE**
![Screen Shot 2021-01-07 at 2 57 57 PM](https://user-images.githubusercontent.com/9700541/103954318-cc9ec200-50f8-11eb-86d3-cd89b07fcd21.png)

**AFTER**
![Screen Shot 2021-01-07 at 2 56 34 PM](https://user-images.githubusercontent.com/9700541/103954221-9d885080-50f8-11eb-8938-fb91394a33cb.png)

Closes #31085 from dongjoon-hyun/SPARK-34044.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-08 09:34:40 +09:00
HyukjinKwon aa388cf3d0 [SPARK-34041][PYTHON][DOCS] Miscellaneous cleanup for new PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to:
- Add a link of quick start in PySpark docs into "Programming Guides" in Spark main docs
- `ML` / `MLlib` -> `MLlib (DataFrame-based)` / `MLlib (RDD-based)` in API reference page
- Mention other user guides as well, such as the [ML](http://spark.apache.org/docs/latest/ml-guide.html) and [SQL](http://spark.apache.org/docs/latest/sql-programming-guide.html) guides.
- Mention other migration guides as well because PySpark can be affected by them.

### Why are the changes needed?

For better documentation.

### Does this PR introduce _any_ user-facing change?

It fixes user-facing docs. However, they have not been released yet.

### How was this patch tested?

Manually tested by running:

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

Closes #31082 from HyukjinKwon/SPARK-34041.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-08 09:28:31 +09:00
fwang12 7b06acc28b [SPARK-33100][SQL][FOLLOWUP] Find correct bound of bracketed comment in spark-sql
### What changes were proposed in this pull request?

This PR helps find the correct bound of a bracketed comment in spark-sql.

Here is the log for UT of SPARK-33100 in CliSuite before:
```
2021-01-05 13:22:34.768 - stdout> spark-sql> /* SELECT 'test';*/ SELECT 'test';
2021-01-05 13:22:41.523 - stderr> Time taken: 6.716 seconds, Fetched 1 row(s)
2021-01-05 13:22:41.599 - stdout> test
2021-01-05 13:22:41.6 - stdout> spark-sql> ;;/* SELECT 'test';*/ SELECT 'test';
2021-01-05 13:22:41.709 - stdout> test
2021-01-05 13:22:41.709 - stdout> spark-sql> /* SELECT 'test';*/;; SELECT 'test';
2021-01-05 13:22:41.902 - stdout> spark-sql> SELECT 'test'; -- SELECT 'test';
2021-01-05 13:22:41.902 - stderr> Time taken: 0.129 seconds, Fetched 1 row(s)
2021-01-05 13:22:41.902 - stderr> Error in query:
2021-01-05 13:22:41.902 - stderr> mismatched input '<EOF>' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 19)
2021-01-05 13:22:42.006 - stderr>
2021-01-05 13:22:42.006 - stderr> == SQL ==
2021-01-05 13:22:42.006 - stderr> /* SELECT 'test';*/
2021-01-05 13:22:42.006 - stderr> -------------------^^^
2021-01-05 13:22:42.006 - stderr>
2021-01-05 13:22:42.006 - stderr> Time taken: 0.226 seconds, Fetched 1 row(s)
2021-01-05 13:22:42.006 - stdout> test
```
The root cause is that `insideBracketedComment` is not accurate.

For `/* comment */`, the last character `/` is not considered to be inside the bracketed comment, so it is treated as the beginning of a new statement.

This PR fixes that issue.
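
For readers unfamiliar with the splitting logic, here is a much-simplified sketch of the idea (not the actual `SparkSQLCLIDriver` code; quoted strings and other comment forms are ignored):

```scala
// Split a command string on ';' while treating bracketed comments as opaque.
// The key point of the fix: the closing "*/" is consumed as part of the
// comment, so its trailing '/' is never mistaken for the start of a statement.
def split(line: String): Seq[String] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[String]
  val cur = new StringBuilder
  var insideBracketedComment = false
  var i = 0
  while (i < line.length) {
    if (!insideBracketedComment && line.startsWith("/*", i)) {
      insideBracketedComment = true; cur.append("/*"); i += 2
    } else if (insideBracketedComment && line.startsWith("*/", i)) {
      insideBracketedComment = false; cur.append("*/"); i += 2
    } else if (!insideBracketedComment && line.charAt(i) == ';') {
      if (cur.toString.trim.nonEmpty) out += cur.toString.trim
      cur.clear()
      i += 1
    } else {
      cur.append(line.charAt(i)); i += 1
    }
  }
  if (cur.toString.trim.nonEmpty) out += cur.toString.trim
  out.toSeq
}

split("/* SELECT 'test';*/ SELECT 'test';")  // Seq("/* SELECT 'test';*/ SELECT 'test'")
```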

### Why are the changes needed?
To fix the issue described above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #31054 from turboFei/SPARK-33100-followup.

Authored-by: fwang12 <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-07 20:49:37 +09:00
Yu Zhong d36cdd5541 [SPARK-33933][SQL] Materialize BroadcastQueryStage first to avoid broadcast timeout in AQE
### What changes were proposed in this pull request?
In `AdaptiveSparkPlanExec.getFinalPhysicalPlan`, when new stages are generated, sort them by class type to make sure `BroadcastQueryStage` precedes the others.
This makes sure the broadcast jobs are submitted before the map jobs, avoiding a wait on job scheduling that can cause broadcast timeouts.
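
A hedged sketch of the ordering idea (the helper below is illustrative and not the actual `AdaptiveSparkPlanExec` code; the returned futures are ignored for brevity):

```scala
import org.apache.spark.sql.execution.adaptive.{BroadcastQueryStageExec, QueryStageExec}

// Materialize broadcast stages first so the small broadcast job is submitted
// before the map jobs occupy the cluster's resources.
def materializeInOrder(newStages: Seq[QueryStageExec]): Unit = {
  val (broadcasts, others) =
    newStages.partition(_.isInstanceOf[BroadcastQueryStageExec])
  (broadcasts ++ others).foreach(_.materialize())
}
```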

### Why are the changes needed?
When AQE is enabled, `getFinalPhysicalPlan` traverses the physical plan bottom-up, creates query stages for the materialized parts via `createQueryStages`, and materializes those newly created query stages to submit map stages or broadcasts. When a `ShuffleQueryStage` is materialized before a `BroadcastQueryStage`, the map job and the broadcast job are submitted almost at the same time, but the map job will hold all of the computing resources. If the map job runs slowly (when there is a lot of data to process and resources are limited), the broadcast job cannot be started (and finished) before `spark.sql.broadcastTimeout`, which causes the whole job to fail (introduced in SPARK-31475).
The workaround of increasing `spark.sql.broadcastTimeout` is neither reasonable nor graceful, because the data to broadcast is very small.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
1. Add UT
2. Test the code using dev environment in https://issues.apache.org/jira/browse/SPARK-33933

Closes #30998 from zhongyu09/aqe-broadcast.

Authored-by: Yu Zhong <yzhong@freewheel.tv>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-07 08:59:26 +00:00
Dongjoon Hyun 194edc86a2 Revert "[SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider"
This reverts commit 8bb70bf0d6.
2021-01-06 23:41:27 -08:00
Yuming Wang aa509c1eee [SPARK-34031][SQL] Union operator missing rowCount when CBO enabled
### What changes were proposed in this pull request?

This PR adds the row count to the `Union` operator when CBO is enabled.
```scala
spark.sql("CREATE TABLE t1 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("CREATE TABLE t2 USING parquet AS SELECT id FROM RANGE(10)")
spark.sql("ANALYZE TABLE t1 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("ANALYZE TABLE t2 COMPUTE STATISTICS FOR ALL COLUMNS")
spark.sql("set spark.sql.cbo.enabled=true")
spark.sql("SELECT * FROM t1 UNION ALL SELECT * FROM t2").explain("cost")
```

Before this pr:
```
== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B)
:- Relation[id#5880L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#5881L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
```

After this pr:
```
== Optimized Logical Plan ==
Union false, false, Statistics(sizeInBytes=320.0 B, rowCount=20)
:- Relation[id#2138L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
+- Relation[id#2139L] parquet, Statistics(sizeInBytes=160.0 B, rowCount=10)
```
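
Conceptually (this is a hedged sketch, not the actual stats-visitor code), the `Union` row count is the sum of its children's row counts when they are all known:

```scala
// Children's row counts from the plans above (both 10); the Union's row count
// is their sum only when every child's row count is defined.
val childRowCounts: Seq[Option[BigInt]] = Seq(Some(BigInt(10)), Some(BigInt(10)))
val unionRowCount: Option[BigInt] =
  if (childRowCounts.forall(_.isDefined)) Some(childRowCounts.flatten.sum) else None
// unionRowCount == Some(20), matching rowCount=20 in the optimized plan above
```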

### Why are the changes needed?

Improves query performance: [`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31068 from wangyum/SPARK-34031.

Lead-authored-by: Yuming Wang <yumwang@ebay.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 14:41:10 +09:00
Yuming Wang 3aa4e113c5 [SPARK-33861][SQL][FOLLOWUP] Simplify conditional in predicate should consider deterministic
### What changes were proposed in this pull request?

This PR addresses https://github.com/apache/spark/pull/30865#pullrequestreview-562344089: simplifying a conditional in a predicate should take determinism into account.

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #31067 from wangyum/SPARK-33861-2.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 14:28:30 +09:00
yangjie01 26b603992c [SPARK-34028][SQL] Cleanup "unreachable code" compilation warning
### What changes were proposed in this pull request?
There is one compilation warning, as follows:

```
[WARNING] [Warn] /spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1555: [other-match-analysis  org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction.catalogFunction] unreachable code
```

This compilation warning occurs because `NoSuchPermanentFunctionException` is a subclass of `AnalysisException`; if a `NoSuchPermanentFunctionException` is thrown, it will be caught by `case _: AnalysisException => failFunctionLookup(name)`, so `case _: NoSuchPermanentFunctionException => failFunctionLookup(name)` is `unreachable code`.

This PR removes `case _: NoSuchPermanentFunctionException => failFunctionLookup(name)` directly, because both branches handle the exception in the same way: `failFunctionLookup(name)`.
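
A self-contained illustration of why the compiler flags this (here `IllegalArgumentException` plays the role of `AnalysisException` and its subclass `NumberFormatException` plays the role of `NoSuchPermanentFunctionException`):

```scala
def classify(e: Throwable): String = e match {
  case _: IllegalArgumentException => "handled here"
  case _: NumberFormatException    => "never reached"  // unreachable: the superclass case above already matches
  case _                           => "other"
}
```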

### Why are the changes needed?
Cleanup "unreachable code" compilation warnings.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31064 from LuciferYang/SPARK-34028.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 14:26:04 +09:00
HyukjinKwon 0ba3ab4c23 [SPARK-34021][R] Fix hyper links in SparkR documentation for CRAN submission
### What changes were proposed in this pull request?

The 3.0.1 CRAN submission failed for the reason below:

```
   Found the following (possibly) invalid URLs:
     URL: http://jsonlines.org/ (moved to https://jsonlines.org/)
       From: man/read.json.Rd
             man/write.json.Rd
       Status: 200
       Message: OK
     URL: https://dl.acm.org/citation.cfm?id=1608614 (moved to
https://dl.acm.org/doi/10.1109/MC.2009.263)
       From: inst/doc/sparkr-vignettes.html
       Status: 200
       Message: OK
 ```

The links are being redirected now. This PR checked all hyperlinks in the docs, such as `href{...}` and `url{...}`, and fixed them all in SparkR:

- Fix the two problems above.
- Change http to https.
- Fix `https://www.apache.org/ https://spark.apache.org/` -> `https://www.apache.org https://spark.apache.org`.

### Why are the changes needed?

For CRAN submission.

### Does this PR introduce _any_ user-facing change?

Virtually no because it's just cleanup that CRAN requires.

### How was this patch tested?

Manually tested by clicking the links

Closes #31058 from HyukjinKwon/SPARK-34021.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-07 13:58:13 +09:00
Dongjoon Hyun 9b5df2afaa [SPARK-34036][DOCS] Update ORC data source documentation
### What changes were proposed in this pull request?

This PR aims to update SQL documentation about ORC data sources.

New structure looks like the following.
- ORC Implementation
- Vectorized Reader
- Schema Merging
- Zstandard
- Bloom Filters
- Columnar Encryption
- Hive metastore ORC table conversion
- Configuration

### Why are the changes needed?

This document is not up-to-date. Apache Spark 3.2.0 can utilize new improvements from Apache ORC 1.6.6.

### Does this PR introduce _any_ user-facing change?

No, this is documentation only.

### How was this patch tested?

Manual.
```
SKIP_API=1 jekyll build
```

---
**BEFORE**
![Screen Shot 2021-01-06 at 5 08 19 PM](https://user-images.githubusercontent.com/9700541/103838399-d0bbd880-5041-11eb-8757-297728d2793f.png)

---
**AFTER**
![Screen Shot 2021-01-06 at 7 03 38 PM](https://user-images.githubusercontent.com/9700541/103845972-0963ae00-5052-11eb-905e-8e8b335c760a.png)
![Screen Shot 2021-01-06 at 7 03 49 PM](https://user-images.githubusercontent.com/9700541/103845971-08cb1780-5052-11eb-9b2a-d3acfa4b9278.png)
![Screen Shot 2021-01-06 at 7 03 59 PM](https://user-images.githubusercontent.com/9700541/103845970-08328100-5052-11eb-8982-7079fd7b0efc.png)
![Screen Shot 2021-01-06 at 7 04 10 PM](https://user-images.githubusercontent.com/9700541/103845968-08328100-5052-11eb-9ef5-db99c7cc64d3.png)
![Screen Shot 2021-01-06 at 7 04 16 PM](https://user-images.githubusercontent.com/9700541/103845963-07015400-5052-11eb-955f-8126d417e8aa.png)

Closes #31075 from dongjoon-hyun/SPARK-34036.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-06 20:19:16 -08:00
ulysses-you f9daf035f4 [SPARK-33806][SQL][FOLLOWUP] Fold RepartitionExpression num partition should check if partition expression is empty
### What changes were proposed in this pull request?

Add a check for whether the partition expressions are empty.

### Why are the changes needed?

We should keep `spark.range(1).hint("REPARTITION_BY_RANGE")` at the default shuffle partition number instead of folding it to 1.
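
A hedged illustration of the intended behavior (the expected number assumes the default `spark.sql.shuffle.partitions` of 200):

```scala
// With no partition expressions, the hint should keep the default shuffle
// partition number in the plan rather than being folded to a single partition.
spark.range(1).hint("REPARTITION_BY_RANGE").explain()
// expected: a repartition/exchange using spark.sql.shuffle.partitions (200 by default), not 1
```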

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Add test.

Closes #31074 from ulysses-you/SPARK-33806-FOLLOWUP.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-06 17:22:14 -08:00
Dongjoon Hyun 8bb70bf0d6 [SPARK-34029][SQL][TESTS] Add OrcEncryptionSuite and FakeKeyProvider
### What changes were proposed in this pull request?

This PR aims to add a basis for a columnar encryption test framework by adding `OrcEncryptionSuite` and `FakeKeyProvider`.

Please note that we will improve more in both Apache Spark and Apache ORC in Apache Spark 3.2.0 timeframe.

### Why are the changes needed?

Apache ORC 1.6 supports columnar encryption.

### Does this PR introduce _any_ user-facing change?

No. This is for a test case.

### How was this patch tested?

Pass the newly added test suite.

Closes #31065 from dongjoon-hyun/SPARK-34029.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-06 12:59:47 -08:00
Kazuaki Ishizaki a0269bb419 [SPARK-34022][DOCS][FOLLOW-UP] Fix typo in SQL built-in function docs
### What changes were proposed in this pull request?

This PR is a follow-up of #31061. It fixes a typo in a document: `Finctions` -> `Functions`

### Why are the changes needed?

Make the change better documented.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #31069 from kiszk/SPARK-34022-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-06 09:28:22 -08:00
angerszhu 3cdc4ef5b4 [SPARK-32685][SQL][FOLLOW-UP] Update migration guide about changing default field.delim to '\t' when user specifies serde
### What changes were proposed in this pull request?
Update migration guide according to https://github.com/apache/spark/pull/30942#issuecomment-755054562

### Why are the changes needed?
update migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #31051 from AngersZhuuuu/SPARK-32685-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-06 13:45:48 +00:00
gengjiaan 6788304240 [SPARK-33977][SQL][DOCS] Add doc for "'like any' and 'like all' operators"
### What changes were proposed in this pull request?
Add doc for the 'like any' and 'like all' operators in sql-ref-syntax-qry-select-like.md
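
A hedged usage sketch of the operators being documented (the `persons` table and its data are hypothetical):

```scala
// LIKE ANY: at least one pattern must match; LIKE ALL: every pattern must match.
spark.sql("SELECT name FROM persons WHERE name LIKE ANY ('tom%', '%an')")
spark.sql("SELECT name FROM persons WHERE name LIKE ALL ('%an%', '%n')")
```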

### Why are the changes needed?
make the usage of 'like any' and 'like all' known to more users

### Does this PR introduce _any_ user-facing change?
Yes.

<img width="687" alt="Screen Shot 2021-01-06 at 21 10 38" src="https://user-images.githubusercontent.com/692303/103767385-dc1ffb80-5063-11eb-9529-89479531425f.png">
<img width="495" alt="Screen Shot 2021-01-06 at 21 11 06" src="https://user-images.githubusercontent.com/692303/103767391-dde9bf00-5063-11eb-82ce-63bdd11593a1.png">
<img width="406" alt="Screen Shot 2021-01-06 at 21 11 20" src="https://user-images.githubusercontent.com/692303/103767396-df1aec00-5063-11eb-8e81-a192e6c72431.png">

### How was this patch tested?
No tests

Closes #31008 from beliefer/SPARK-33977.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-06 21:14:45 +09:00
HyukjinKwon 0d86a02ffb [SPARK-34022][DOCS] Support latest mkdocs in SQL built-in function docs
### What changes were proposed in this pull request?

This PR adds support for the latest mkdocs and makes the sidebar show properly. It works with lower versions too.

Before:

![Screen Shot 2021-01-06 at 5 11 56 PM](https://user-images.githubusercontent.com/6477701/103745131-4e7fe400-5042-11eb-9c09-84f9f95e9fb9.png)

After:

![Screen Shot 2021-01-06 at 5 10 53 PM](https://user-images.githubusercontent.com/6477701/103745139-5049a780-5042-11eb-8ded-30b6f7ef48aa.png)

### Why are the changes needed?

This is a regression in the documentation.

### Does this PR introduce _any_ user-facing change?

Technically no, it's not released yet. It fixes the sidebar list so that it appears properly.

### How was this patch tested?

Manually built the docs via `./sql/create-docs.sh` and `open ./sql/site/index.html`

Closes #31061 from HyukjinKwon/SPARK-34022.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 20:31:27 +09:00
HyukjinKwon ff284fb6ac [SPARK-30681][PYTHON][FOLLOW-UP] Keep the name similar with Scala side in higher order functions
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/27406. It fixes the naming to match with Scala side.

Note that there is already a bit of inconsistency, e.g. `col`, `e`, `expr` and `column`. I did not change that part, but other differences such as `zero` vs `initialValue` or `col1`/`col2` vs `left`/`right` look unnecessary.

### Why are the changes needed?

To make the usage similar with Scala side, and for consistency.

### Does this PR introduce _any_ user-facing change?

No, this is not released yet.

### How was this patch tested?

GitHub Actions and Jenkins build will test it out.

Closes #31062 from HyukjinKwon/SPARK-30681.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 18:46:20 +09:00
Prashant Sharma f64dfa8727 [SPARK-32221][K8S] Avoid possible errors due to incorrect file size or type supplied in spark conf
### What changes were proposed in this pull request?

Skip files if they are binary or too large to fit within the configMap's max size.

### Why are the changes needed?

A configMap cannot hold binary files and there is also a limit on how much data a configMap can hold.
This limit can be configured by the k8s cluster admin. This PR skips such files (with a warning) instead of failing with weird runtime errors.
If such files are not skipped, it would result in mount errors or encoding errors (if binary files are submitted).

### Does this PR introduce _any_ user-facing change?

Yes; in simple words, it avoids possible errors due to negligence (for example, placing a large file or a binary file in SPARK_CONF_DIR) and thus improves the user experience.

### How was this patch tested?

Added relevant tests and improved existing tests.

Closes #30472 from ScrapCodes/SPARK-32221/avoid-conf-propagate-errors.

Lead-authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Co-authored-by: Prashant Sharma <prashant@apache.org>
Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com>
2021-01-06 14:55:40 +05:30
gengjiaan 26d8df300a [SPARK-33938][SQL] Optimize Like Any/All by LikeSimplification
### What changes were proposed in this pull request?
We should optimize Like Any/All by LikeSimplification to improve performance.

### Why are the changes needed?
Optimize Like Any/All

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #30975 from beliefer/SPARK-33938.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-06 08:25:34 +00:00
yangjie01 45a4ff8e54 [SPARK-33948][SQL] Fix CodeGen error of MapObjects.doGenCode method in Scala 2.13
### What changes were proposed in this pull request?
The `MapObjects.doGenCode` method generates wrong code when `inputDataType` is `ArrayBuffer`.

For example, for `encode/decode for Tuple2: (ArrayBuffer[(String, String)],ArrayBuffer((a,b))) (codegen path)` in `ExpressionEncoderSuite`, the erroneous part of the generated code is as follows:

```
/* 126 */   private scala.collection.mutable.ArrayBuffer MapObjects_0(InternalRow i) {
/* 127 */     boolean isNull_4 = i.isNullAt(1);
/* 128 */     ArrayData value_4 = isNull_4 ?
/* 129 */     null : (i.getArray(1));
/* 130 */     scala.collection.mutable.ArrayBuffer value_3 = null;
/* 131 */
/* 132 */     if (!isNull_4) {
/* 133 */
/* 134 */       int dataLength_0 = value_4.numElements();
/* 135 */
/* 136 */       scala.Tuple2[] convertedArray_0 = null;
/* 137 */       convertedArray_0 = new scala.Tuple2[dataLength_0];
/* 138 */
/* 139 */
/* 140 */       int loopIndex_0 = 0;
/* 141 */
/* 142 */       while (loopIndex_0 < dataLength_0) {
/* 143 */         value_MapObject_lambda_variable_1 = (InternalRow) (value_4.getStruct(loopIndex_0, 2));
/* 144 */         isNull_MapObject_lambda_variable_1 = value_4.isNullAt(loopIndex_0);
/* 145 */
/* 146 */         boolean isNull_5 = false;
/* 147 */         scala.Tuple2 value_5 = null;
/* 148 */         if (!false && isNull_MapObject_lambda_variable_1) {
/* 149 */
/* 150 */           isNull_5 = true;
/* 151 */           value_5 = ((scala.Tuple2)null);
/* 152 */         } else {
/* 153 */           scala.Tuple2 value_13 = NewInstance_0(i);
/* 154 */           isNull_5 = false;
/* 155 */           value_5 = value_13;
/* 156 */         }
/* 157 */         if (isNull_5) {
/* 158 */           convertedArray_0[loopIndex_0] = null;
/* 159 */         } else {
/* 160 */           convertedArray_0[loopIndex_0] = value_5;
/* 161 */         }
/* 162 */
/* 163 */         loopIndex_0 += 1;
/* 164 */       }
/* 165 */
/* 166 */       value_3 = new org.apache.spark.sql.catalyst.util.GenericArrayData(convertedArray_0);
/* 167 */     }
/* 168 */     globalIsNull_0 = isNull_4;
/* 169 */     return value_3;
/* 170 */   }

```

Line 166 in the generated code tries to assign a `GenericArrayData` to `value_3` (an `ArrayBuffer`), because the `ArrayBuffer` type can no longer match the `s.c.i.Seq` branch of `MapObjects.doGenCode` in Scala 2.13.

So this PR changes the code to use `s.c.Seq` instead of the `Seq` alias, so that the `ArrayBuffer` type can enter the same branch as in Scala 2.12.

After this PR, the generated code when `inputDataType` is `ArrayBuffer` is as follows:

```
/* 126 */   private scala.collection.mutable.ArrayBuffer MapObjects_0(InternalRow i) {
/* 127 */     boolean isNull_4 = i.isNullAt(1);
/* 128 */     ArrayData value_4 = isNull_4 ?
/* 129 */     null : (i.getArray(1));
/* 130 */     scala.collection.mutable.ArrayBuffer value_3 = null;
/* 131 */
/* 132 */     if (!isNull_4) {
/* 133 */
/* 134 */       int dataLength_0 = value_4.numElements();
/* 135 */
/* 136 */       scala.collection.mutable.Builder collectionBuilder_0 = scala.collection.mutable.ArrayBuffer$.MODULE$.newBuilder();
/* 137 */       collectionBuilder_0.sizeHint(dataLength_0);
/* 138 */
/* 139 */
/* 140 */       int loopIndex_0 = 0;
/* 141 */
/* 142 */       while (loopIndex_0 < dataLength_0) {
/* 143 */         value_MapObject_lambda_variable_1 = (InternalRow) (value_4.getStruct(loopIndex_0, 2));
/* 144 */         isNull_MapObject_lambda_variable_1 = value_4.isNullAt(loopIndex_0);
/* 145 */
/* 146 */         boolean isNull_5 = false;
/* 147 */         scala.Tuple2 value_5 = null;
/* 148 */         if (!false && isNull_MapObject_lambda_variable_1) {
/* 149 */
/* 150 */           isNull_5 = true;
/* 151 */           value_5 = ((scala.Tuple2)null);
/* 152 */         } else {
/* 153 */           scala.Tuple2 value_13 = NewInstance_0(i);
/* 154 */           isNull_5 = false;
/* 155 */           value_5 = value_13;
/* 156 */         }
/* 157 */         if (isNull_5) {
/* 158 */           collectionBuilder_0.$plus$eq(null);
/* 159 */         } else {
/* 160 */           collectionBuilder_0.$plus$eq(value_5);
/* 161 */         }
/* 162 */
/* 163 */         loopIndex_0 += 1;
/* 164 */       }
/* 165 */
/* 166 */       value_3 = (scala.collection.mutable.ArrayBuffer) collectionBuilder_0.result();
/* 167 */     }
/* 168 */     globalIsNull_0 = isNull_4;
/* 169 */     return value_3;
/* 170 */   }
```

### Why are the changes needed?
Bug fix in Scala 2.13

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Manual test `sql/catalyst` and `sql/core` in Scala 2.13 passed

```
mvn clean test -pl sql/catalyst -Pscala-2.13

Run completed in 11 minutes, 23 seconds.
Total number of tests run: 4711
Suites: completed 261, aborted 0
Tests: succeeded 4711, failed 0, canceled 0, ignored 5, pending 0
All tests passed.
```

- Manually cherry-picked this PR to branch-3.1 and tested `sql/catalyst` in Scala 2.13; the tests passed

```
mvn clean test -pl sql/catalyst -Pscala-2.13

Run completed in 11 minutes, 18 seconds.
Total number of tests run: 4655
Suites: completed 256, aborted 0
Tests: succeeded 4655, failed 0, canceled 0, ignored 5, pending 0
```

Closes #31055 from LuciferYang/SPARK-33948.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 23:11:23 -08:00
angerszhu c0d0dbabdb [SPARK-33934][SQL][FOLLOW-UP] Use SubProcessor's exit code as assert condition to fix flaky test
### What changes were proposed in this pull request?
Follow the review comment and fix the flaky test: https://github.com/apache/spark/pull/30973#issuecomment-754852130.
This flaky test is similar to https://github.com/apache/spark/pull/30896.

Some tasks fail with a root cause, but the driver may return an error without the root cause. Change the UT to check the status exit code, since the exit codes for different root causes are not the same.

### Why are the changes needed?
Fix flaky test

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #31046 from AngersZhuuuu/SPARK-33934-FOLLOW-UP.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 22:33:15 -08:00
Jungtaek Lim (HeartSaVioR) fa9309001a [SPARK-33635][SS] Adjust the order of check in KafkaTokenUtil.needTokenUpdate to remedy perf regression
### What changes were proposed in this pull request?

This PR proposes to adjust the order of checks in `KafkaTokenUtil.needTokenUpdate`, so that short-circuiting applies to the non-delegation-token cases (insecure, or secured without a delegation token) and largely remedies the performance regression.
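
A hedged sketch of the short-circuit idea (this is not the actual `KafkaTokenUtil` code; the parameter names are made up):

```scala
// Evaluate the cheap "are delegation tokens in use at all?" check first, so
// insecure or non-delegation-token setups never pay for the expensive check.
def needTokenUpdate(usesDelegationTokens: Boolean)(expensiveTokenChanged: => Boolean): Boolean =
  usesDelegationTokens && expensiveTokenChanged
```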

### Why are the changes needed?

There is a serious performance regression between Spark 2.4 and Spark 3.0 on the read path of the Kafka data source.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually ran a reproducer (https://github.com/codegorillauk/spark-kafka-read with modification to just count instead of writing to Kafka topic) with measuring the time.

> the branch applying the change with adding measurement

https://github.com/HeartSaVioR/spark/commits/debug-SPARK-33635-v3.0.1

> the branch only adding measurement

https://github.com/HeartSaVioR/spark/commits/debug-original-ver-SPARK-33635-v3.0.1

> the result (before the fix)

count: 10280000
Took 41.634007047 secs

21/01/06 13:16:07 INFO KafkaDataConsumer: debug ver. 17-original
21/01/06 13:16:07 INFO KafkaDataConsumer: Total time taken to retrieve: 82118 ms

> the result (after the fix)

count: 10280000
Took 7.964058475 secs

21/01/06 13:08:22 INFO KafkaDataConsumer: debug ver. 17
21/01/06 13:08:22 INFO KafkaDataConsumer: Total time taken to retrieve: 987 ms

Closes #31056 from HeartSaVioR/SPARK-33635.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 21:59:49 -08:00
Kousuke Saruta b1c4fc7fc7 [SPARK-34008][BUILD] Upgrade derby to 10.14.2.0
### What changes were proposed in this pull request?

This PR upgrades `derby` to `10.14.2.0`.

You can check the major changes from the following URLs.

* 10.13.1.1 http://svn.apache.org/repos/asf/db/derby/code/tags/10.13.1.1/RELEASE-NOTES.html
* 10.14.1.0 http://svn.apache.org/repos/asf/db/derby/code/tags/10.14.1.0/RELEASE-NOTES.html
* 10.14.2.0 http://svn.apache.org/repos/asf/db/derby/code/tags/10.14.2.0/RELEASE-NOTES.html

### Why are the changes needed?

It seems to be the final release which supports `JDK8` as the minimum required version.
After `10.15.1.3`, the minimum required version is `JDK9`.
https://db.apache.org/derby/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #31032 from sarutak/upgrade-derby.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 21:50:16 -08:00
gengjiaan 2ab77d634f [SPARK-34004][SQL] Change FrameLessOffsetWindowFunction as sealed abstract class
### What changes were proposed in this pull request?
Change `FrameLessOffsetWindowFunction` to a sealed abstract class in order to simplify pattern matching.

### Why are the changes needed?
Simplify pattern match

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Jenkins test

Closes #31026 from beliefer/SPARK-30789-followup.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 20:45:19 -08:00
Baohe Zhang 29510821a0 [SPARK-33029][CORE][WEBUI] Fix the UI executor page incorrectly marking the driver as excluded
### What changes were proposed in this pull request?
Filter out the driver entity when updating the exclusion status of live executors (including the driver), so the driver won't be marked as excluded in the UI even if the node that hosts the driver has been marked as excluded.
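
A hedged sketch of the filtering idea (not the actual `AppStatusListener` code; `liveExecutors` here is a stand-in map of executor id to hostname):

```scala
import org.apache.spark.SparkContext

// Only executors on the excluded host are marked; the driver entry is skipped
// because the exclusion feature applies to executors only.
def executorsToMark(liveExecutors: Map[String, String], excludedHost: String): Set[String] =
  liveExecutors.collect {
    case (execId, host)
        if host == excludedHost && execId != SparkContext.DRIVER_IDENTIFIER =>
      execId
  }.toSet
```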

### Why are the changes needed?
Before this change, if we run Spark in standalone mode with spark.blacklist.enabled=true, the driver will be marked as excluded when the host that hosts the driver has been marked as excluded. This is incorrect because the exclude-list feature excludes executors only, and the driver remains active.
![image](https://user-images.githubusercontent.com/26694233/103238740-35c05180-4911-11eb-99a2-c87c059ba0cf.png)
After the fix, the driver won't be marked as excluded.
![image](https://user-images.githubusercontent.com/26694233/103238806-6f915800-4911-11eb-80d5-3c99266cfd0a.png)

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Manual test. Reopen the UI and see the driver is no longer marked as excluded.

Closes #30954 from baohe-zhang/SPARK-33029.

Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 19:16:40 -08:00
Tom.Howland 3d8ee492d6 [SPARK-34015][R] Fixing input timing in gapply
### What changes were proposed in this pull request?

When SparkR is run at log level INFO, a summary of how the worker spent its time processing the partition is printed. There is a logic error where it over-reports the time spent reading input rows.

In detail: the variable inputElap in a wider context is used to mark the end of reading rows, but in the part changed here it was used as a local variable for measuring the beginning of compute time in a loop over the groups in the partition. Thus, the error is not observable if there is only one group per partition, which is what you get in unit tests.

For our application, here's what a log entry looks like before these changes were applied:

`20/10/09 04:08:58 INFO RRunner: Times: boot = 0.013 s, init = 0.005 s, broadcast = 0.000 s, read-input = 529.471 s, compute = 492.037 s, write-output = 0.020 s, total = 1021.546 s`

this indicates that we're spending more time reading rows than operating on the rows.

After these changes, it looks like this:

`20/12/15 06:43:29 INFO RRunner: Times: boot = 0.013 s, init = 0.010 s, broadcast = 0.000 s, read-input = 120.275 s, compute = 1680.161 s, write-output = 0.045 s, total = 1812.553 s
`
### Why are the changes needed?

Metrics shouldn't mislead?

### Does this PR introduce _any_ user-facing change?

Aside from no longer misleading, no

### How was this patch tested?

unit tests passed. Field test results seem plausible

Closes #31021 from WamBamBoozle/input_timing.

Authored-by: Tom.Howland <Tom.Howland@target.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 11:40:02 +09:00
Max Gekk b77d11dfd9 [SPARK-34011][SQL] Refresh cache in ALTER TABLE .. RENAME TO PARTITION
### What changes were proposed in this pull request?
1. Invoke `refreshTable()` from `AlterTableRenamePartitionCommand.run()` after partitions renaming. In particular, this re-creates the cache associated with the modified table.
2. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. RENAME TO PARTITION` command.

### Why are the changes needed?
This fixes the issues portrayed by the example:
```sql
spark-sql> CREATE TABLE tbl1 (col0 int, part0 int) USING parquet PARTITIONED BY (part0);
spark-sql> INSERT INTO tbl1 PARTITION (part0=0) SELECT 0;
spark-sql> INSERT INTO tbl1 PARTITION (part0=1) SELECT 1;
spark-sql> CACHE TABLE tbl1;
spark-sql> SELECT * FROM tbl1;
0	0
1	1
spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
spark-sql> SELECT * FROM tbl1;
0	0
1	1
```
The last query should return `0  2` rather than the stale `0  0`, since that partition was renamed by the previous command.

### Does this PR introduce _any_ user-facing change?
Yes. After the changes for the example above:
```sql
...
spark-sql> ALTER TABLE tbl1 PARTITION (part0=0) RENAME TO PARTITION (part0=2);
spark-sql> SELECT * FROM tbl1;
0	2
1	1
```

### How was this patch tested?
By running the affected test suite:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite"
```

Closes #31044 from MaxGekk/rename-partition-refresh-cache.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-06 11:19:44 +09:00
angerszhu e279ed3044 [SPARK-34012][SQL] Keep behavior consistent when conf spark.sql.legacy.parser.havingWithoutGroupByAsWhere is true with migration guide
### What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/22696 we made HAVING without GROUP BY mean a global aggregate.
However, since HAVING was previously treated as a Filter, that approach caused a lot of analysis errors; https://github.com/apache/spark/pull/28294 then used `UnresolvedHaving` instead of `Filter` to solve that problem, but it broke the original legacy behavior of treating `SELECT 1 FROM range(10) HAVING true` as `SELECT 1 FROM range(10) WHERE true`.
This PR fixes this issue and adds a UT.
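
A hedged illustration of the restored legacy behavior (row counts assume a fresh session):

```scala
// With the legacy flag on, HAVING without GROUP BY is parsed as WHERE,
// so the query keeps all 10 rows instead of collapsing to a global aggregate.
spark.sql("SET spark.sql.legacy.parser.havingWithoutGroupByAsWhere=true")
spark.sql("SELECT 1 FROM range(10) HAVING true").count()  // 10
```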

### Why are the changes needed?
Keep the behavior consistent with the migration guide.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added UT

Closes #31039 from AngersZhuuuu/SPARK-25780-Follow-up.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-06 08:48:24 +09:00
Holden Karau 171db85aa2 [SPARK-33874][K8S][FOLLOWUP] Handle long lived sidecars - clean up logging
### What changes were proposed in this pull request?

Switch log level from warn to debug when the spark container is not present in the pod's container statuses.

### Why are the changes needed?

There are many non-critical situations where the Spark container may not be present, and the warning log level is too high.

### Does this PR introduce _any_ user-facing change?

Log message change.

### How was this patch tested?

N/A

Closes #31047 from holdenk/SPARK-33874-follow-up.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 13:48:52 -08:00
gengjiaan cc1d9d25fb [SPARK-33542][SQL] Group exception messages in catalyst/catalog
### What changes were proposed in this pull request?
This PR groups exception messages in `/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog`.

### Why are the changes needed?
It will largely help with standardization of error messages and its maintenance.

### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.

### How was this patch tested?
No new tests - pass all original tests to make sure it doesn't break any existing behavior.

Closes #30870 from beliefer/SPARK-33542.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 16:15:33 +00:00
huangtianhua 14c2edae7e [SPARK-34009][BUILD] To activate profile 'aarch64' based on OS settings
Instead of requiring the '-Paarch64' parameter for Maven builds, activate the profile automatically based on OS settings, so that we can use the same command to build on aarch64.

### What changes were proposed in this pull request?
Activate profile 'aarch64' based on OS

### Why are the changes needed?
After this change, we can build Spark for aarch64 using the same command as for x86.

### Does this PR introduce _any_ user-facing change?
No.
After this change, there is no need to pass the '-Paarch64' parameter when building, but passing the parameter still works.

### How was this patch tested?
ARM daily CI.

Closes #31036 from huangtianhua/SPARK-34009.

Authored-by: huangtianhua <huangtianhua223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 21:50:21 +09:00
HyukjinKwon 8d09f96495 [SPARK-34010][SQL][DOCS] Use python3 instead of python in SQL documentation build
### What changes were proposed in this pull request?

This PR proposes to use python3 instead of python in SQL documentation build.
After SPARK-29672, python3 is used everywhere in Spark dev. We should fix it in `sql/create-docs.sh` too.
This blocks release because the release container does not have `python` but only `python3`.

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

I manually ran the script

Closes #31041 from HyukjinKwon/SPARK-34010.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 19:48:10 +09:00
Kousuke Saruta acf0a4fac2 [SPARK-33999][BUILD] Make sbt unidoc success with JDK11
### What changes were proposed in this pull request?

This PR fixes an issue that `sbt unidoc` fails with JDK11.
With the current master, `sbt unidoc` fails because the generated Java sources cause syntax error.
As of JDK11, the default doclet seems to refuse such syntax error.

Usually, it's enough to specify  `--ignore-source-errors` option when `javadoc` runs to suppress the syntax error but unfortunately, we will then get an internal error.

```
[error] javadoc: error - An internal exception has occurred.
[error] 	(java.lang.NullPointerException)
[error] Please file a bug against the javadoc tool via the Java bug reporting page
[error] (http://bugreport.java.com) after checking the Bug Database (http://bugs.java.com)
[error] for duplicates. Include error messages and the following diagnostic in your report. Thank you.
[error] java.lang.NullPointerException
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2340)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types$14.visitTypeVar(Types.java:2398)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types$14.visitTypeVar(Types.java:2348)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Type$TypeVar.accept(Type.java:1659)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types$DefaultTypeVisitor.visit(Types.java:4857)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2343)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Types.erasure(Types.java:2329)
[error] 	at jdk.compiler/com.sun.tools.javac.model.JavacTypes.erasure(JavacTypes.java:134)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils$5.visitTypeVariable(Utils.java:1069)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils$5.visitTypeVariable(Utils.java:1048)
[error] 	at jdk.compiler/com.sun.tools.javac.code.Type$TypeVar.accept(Type.java:1695)
[error] 	at java.compiler11.0.9.1/javax.lang.model.util.AbstractTypeVisitor6.visit(AbstractTypeVisitor6.java:104)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.Utils.asTypeElement(Utils.java:1086)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkInfoImpl.setContext(LinkInfoImpl.java:410)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkInfoImpl.<init>(LinkInfoImpl.java:285)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkFactoryImpl.getTypeParameterLink(LinkFactoryImpl.java:184)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.LinkFactoryImpl.getTypeParameterLinks(LinkFactoryImpl.java:167)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.util.links.LinkFactory.getLink(LinkFactory.java:196)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.getLink(HtmlDocletWriter.java:679)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.addPreQualifiedClassLink(HtmlDocletWriter.java:814)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDocletWriter.addPreQualifiedStrongClassLink(HtmlDocletWriter.java:839)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addPartialInfo(AbstractTreeWriter.java:185)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addLevelInfo(AbstractTreeWriter.java:92)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addLevelInfo(AbstractTreeWriter.java:94)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addTree(AbstractTreeWriter.java:129)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.AbstractTreeWriter.addTree(AbstractTreeWriter.java:112)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.PackageTreeWriter.generatePackageTreeFile(PackageTreeWriter.java:115)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.PackageTreeWriter.generate(PackageTreeWriter.java:92)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.formats.html.HtmlDoclet.generatePackageFiles(HtmlDoclet.java:312)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.AbstractDoclet.startGeneration(AbstractDoclet.java:210)
[error] 	at jdk.javadoc/jdk.javadoc.internal.doclets.toolkit.AbstractDoclet.run(AbstractDoclet.java:114)
[error] 	at jdk.javadoc/jdk.javadoc.doclet.StandardDoclet.run(StandardDoclet.java:72)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Start.parseAndExecute(Start.java:588)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Start.begin(Start.java:432)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Start.begin(Start.java:345)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Main.execute(Main.java:63)
[error] 	at jdk.javadoc/jdk.javadoc.internal.tool.Main.main(Main.java:52)
```

I found the internal error happens when a generated Java class is from a Scala class which is package private and generic.
I also found that if we don't generate class hierarchy tree in the JavaDoc, we can suppress the internal error for JDK11 and later.

### Why are the changes needed?

Make the build success with sbt and JDK11.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the following command successfully finish with JDK8 and JDK11.
```
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pkubernetes -Pmesos -Pspark-ganglia-lgpl -Pkinesis-asl -Phadoop-cloud clean unidoc
```

I also confirmed html files are successfully generated under `target/javaunidoc`.

Closes #31023 from sarutak/fix-genjavadoc-java11.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 19:03:28 +09:00
HyukjinKwon 329850c667 [SPARK-32017][PYTHON][FOLLOW-UP] Rename HADOOP_VERSION to PYSPARK_HADOOP_VERSION in pip installation option
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/29703.
It renames the `HADOOP_VERSION` environment variable to `PYSPARK_HADOOP_VERSION` in case `HADOOP_VERSION` is already being used somewhere. Arguably `HADOOP_VERSION` is a pretty common name; I see it here and there:
- https://www.ibm.com/support/knowledgecenter/SSZUMP_7.2.1/install_grid_sym/understanding_advanced_edition.html
- https://cwiki.apache.org/confluence/display/ARROW/HDFS+Filesystem+Support
- http://crs4.github.io/pydoop/_pydoop1/installation.html

### Why are the changes needed?

To avoid unexpected conflicts with existing environment variables.

### Does this PR introduce _any_ user-facing change?

It renames the environment variable but it's not released yet.

### How was this patch tested?

Existing unittests will test.

Closes #31028 from HyukjinKwon/SPARK-32017-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 17:21:32 +09:00
HyukjinKwon 356fdc9a7f [SPARK-34007][BUILD] Downgrade scala-maven-plugin to 4.3.0
### What changes were proposed in this pull request?

This PR is a partial revert of https://github.com/apache/spark/pull/30456 by downgrading scala-maven-plugin from 4.4.0 to 4.3.0.

Currently, when you run the docker release script (`./dev/create-release/do-release-docker.sh`), it fails to compile as below during incremental compilation with zinc for an unknown reason:

```
[INFO] Compiling 21 Scala sources and 3 Java sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes ...
[ERROR] ## Exception when compiling 24 sources to /opt/spark-rm/output/spark-3.1.0-bin-hadoop2.7/resource-managers/yarn/target/scala-2.12/test-classes
java.lang.SecurityException: class "javax.servlet.SessionCookieConfig"'s signer information does not match signer information of other classes in the same package
java.lang.ClassLoader.checkCerts(ClassLoader.java:891)
java.lang.ClassLoader.preDefineClass(ClassLoader.java:661)
java.lang.ClassLoader.defineClass(ClassLoader.java:754)
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
java.net.URLClassLoader.access$100(URLClassLoader.java:74)
java.net.URLClassLoader$1.run(URLClassLoader.java:369)
java.net.URLClassLoader$1.run(URLClassLoader.java:363)
java.security.AccessController.doPrivileged(Native Method)
java.net.URLClassLoader.findClass(URLClassLoader.java:362)
java.lang.ClassLoader.loadClass(ClassLoader.java:418)
java.lang.ClassLoader.loadClass(ClassLoader.java:351)
java.lang.Class.getDeclaredMethods0(Native Method)
java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
java.lang.Class.privateGetPublicMethods(Class.java:2902)
java.lang.Class.getMethods(Class.java:1615)
sbt.internal.inc.ClassToAPI$.toDefinitions0(ClassToAPI.scala:170)
sbt.internal.inc.ClassToAPI$.$anonfun$toDefinitions$1(ClassToAPI.scala:123)
scala.collection.mutable.HashMap.getOrElseUpdate(HashMap.scala:86)
sbt.internal.inc.ClassToAPI$.toDefinitions(ClassToAPI.scala:123)
sbt.internal.inc.ClassToAPI$.$anonfun$process$1(ClassToAPI.scala:3
```

This happens when it builds Spark with Hadoop 2. It doesn't reproduce when you build this alone. It should follow the sequence of build in the release script.

This is fixed by downgrading. It looks like there is a regression in scala-maven-plugin somewhere between 4.3.0 and 4.4.0.

### Why are the changes needed?

To unblock the release.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

It can be tested as below:

```bash
./dev/create-release/do-release-docker.sh -d $WORKING_DIR
```

Closes #31031 from HyukjinKwon/SPARK-34007.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 17:20:08 +09:00
Max Gekk 122f8f0fdb [SPARK-33919][SQL][TESTS] Unify v1 and v2 SHOW NAMESPACES tests
### What changes were proposed in this pull request?
1. Port DS V2 tests from `DataSourceV2SQLSuite` to the base test suite `ShowNamespacesSuiteBase` to run those tests for v1 catalogs.
2. Port DS v1 tests from `DDLSuite` to `ShowNamespacesSuiteBase` to run the tests for v2 catalogs too.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowNamespacesSuite"
```

Closes #30937 from MaxGekk/unify-show-namespaces-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 07:30:59 +00:00
tanel.kiis@gmail.com f252a9334e [SPARK-33935][SQL] Fix CBO cost function
### What changes were proposed in this pull request?

Changed the cost function in CBO to match documentation.

### Why are the changes needed?

The parameter `spark.sql.cbo.joinReorder.card.weight` is documented as:
```
The weight of cardinality (number of rows) for plan cost comparison in join reorder: rows * weight + size * (1 - weight).
```
The implementation in `JoinReorderDP.betterThan` does not match this documentation:
```
def betterThan(other: JoinPlan, conf: SQLConf): Boolean = {
      if (other.planCost.card == 0 || other.planCost.size == 0) {
        false
      } else {
        val relativeRows = BigDecimal(this.planCost.card) / BigDecimal(other.planCost.card)
        val relativeSize = BigDecimal(this.planCost.size) / BigDecimal(other.planCost.size)
        relativeRows * conf.joinReorderCardWeight +
          relativeSize * (1 - conf.joinReorderCardWeight) < 1
      }
    }
```

This different implementation has an unfortunate consequence:
given two plans A and B, both A betterThan B and B betterThan A might give the same result. This happens when one plan has many rows with small size and the other has few rows with large size.

Example values that exhibit this phenomenon with the default weight value (0.7):
A.card = 500, B.card = 300
A.size = 30, B.size = 80
Both A betterThan B and B betterThan A would have a score above 1 and would return false.
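
Working through those numbers with the current comparison (a hedged sketch of the arithmetic; weight = 0.7):

```scala
val w = BigDecimal(0.7)
val (cardA, sizeA) = (BigDecimal(500), BigDecimal(30))
val (cardB, sizeB) = (BigDecimal(300), BigDecimal(80))
val aVsB = (cardA / cardB) * w + (sizeA / sizeB) * (1 - w)  // ≈ 1.28, not < 1
val bVsA = (cardB / cardA) * w + (sizeB / sizeA) * (1 - w)  // = 1.22, not < 1
// Neither score is below 1, so `betterThan` returns false in both directions.
```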

This happens with several of the TPCDS queries.

The new implementation does not have this behavior.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New and existing UTs

Closes #30965 from tanelk/SPARK-33935_cbo_cost_function.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-05 16:00:24 +09:00
fwang12 a071826f72 [SPARK-33100][SQL] Ignore a semicolon inside a bracketed comment in spark-sql
### What changes were proposed in this pull request?
Currently, spark-sql does not support parsing SQL statements that contain bracketed comments.
For the sql statements:
```
/* SELECT 'test'; */
SELECT 'test';
```
They would be split into two statements:
The first one: `/* SELECT 'test'`
The second one: `*/ SELECT 'test'`

Then it would throw an exception because the first one is illegal.
In this PR, we ignore the content of bracketed comments while splitting the SQL statements.
Besides, we ignore comments without any content.

### Why are the changes needed?
spark-sql might split statements inside bracketed comments, which is not correct.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #29982 from turboFei/SPARK-33110.

Lead-authored-by: fwang12 <fwang12@ebay.com>
Co-authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-01-05 15:55:30 +09:00
LantaoJin a7d3fcd354 [SPARK-34000][CORE] Fix stageAttemptToNumSpeculativeTasks java.util.NoSuchElementException
### What changes were proposed in this pull request?
From the log below, stage 600 could be removed from `stageAttemptToNumSpeculativeTasks` by `onStageCompleted()`, but the speculative task 306.1 in stage 600 threw a `NoSuchElementException` when it entered `onTaskEnd()`.
```
21/01/04 03:00:32,259 WARN [task-result-getter-2] scheduler.TaskSetManager:69 : Lost task 306.1 in stage 600.0 (TID 283610, hdc49-mcc10-01-0510-4108-039-tess0097.stratus.rno.ebay.com, executor 27): TaskKilled (another attempt succeeded)
21/01/04 03:00:32,259 INFO [task-result-getter-2] scheduler.TaskSetManager:57 : Task 306.1 in stage 600.0 (TID 283610) failed, but the task will not be re-executed (either because the task failed with a shuffle data fetch failure, so the
previous stage needs to be re-run, or because a different copy of the task has already succeeded).
21/01/04 03:00:32,259 INFO [task-result-getter-2] cluster.YarnClusterScheduler:57 : Removed TaskSet 600.0, whose tasks have all completed, from pool default
21/01/04 03:00:32,259 INFO [HiveServer2-Handler-Pool: Thread-5853] thriftserver.SparkExecuteStatementOperation:190 : Returning result set with 50 rows from offsets [5378600, 5378650) with 1fe245f8-a7f9-4ec0-bcb5-8cf324cbbb47
21/01/04 03:00:32,260 ERROR [spark-listener-group-executorManagement] scheduler.AsyncEventQueue:94 : Listener ExecutorAllocationListener threw an exception
java.util.NoSuchElementException: key not found: Stage 600 (Attempt 0)
        at scala.collection.MapLike.default(MapLike.scala:235)
        at scala.collection.MapLike.default$(MapLike.scala:234)
        at scala.collection.AbstractMap.default(Map.scala:63)
        at scala.collection.mutable.HashMap.apply(HashMap.scala:69)
        at org.apache.spark.ExecutorAllocationManager$ExecutorAllocationListener.onTaskEnd(ExecutorAllocationManager.scala:621)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent(SparkListenerBus.scala:45)
        at org.apache.spark.scheduler.SparkListenerBus.doPostEvent$(SparkListenerBus.scala:28)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
        at org.apache.spark.scheduler.AsyncEventQueue.doPostEvent(AsyncEventQueue.scala:38)
        at org.apache.spark.util.ListenerBus.postToAll(ListenerBus.scala:115)
        at org.apache.spark.util.ListenerBus.postToAll$(ListenerBus.scala:99)
        at org.apache.spark.scheduler.AsyncEventQueue.super$postToAll(AsyncEventQueue.scala:116)
        at org.apache.spark.scheduler.AsyncEventQueue.$anonfun$dispatch$1(AsyncEventQueue.scala:116)
        at scala.util.DynamicVariable.withValue(DynamicVariable.scala:62)
        at org.apache.spark.scheduler.AsyncEventQueue.org$apache$spark$scheduler$AsyncEventQueue$$dispatch(AsyncEventQueue.scala:102)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.$anonfun$run$1(AsyncEventQueue.scala:97)
        at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1320)
        at org.apache.spark.scheduler.AsyncEventQueue$$anon$2.run(AsyncEventQueue.scala:97)
```
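
A protective lookup along the following lines avoids the exception (the names below are simplified stand-ins, not the actual `ExecutorAllocationListener` code):

```scala
import scala.collection.mutable

// Simplified stand-ins for the listener state (illustrative only):
case class StageAttempt(stageId: Int, attemptId: Int)
val stageAttemptToNumSpeculativeTasks = mutable.HashMap.empty[StageAttempt, Int]

// onStageCompleted may already have removed the key when a late speculative
// task-end event arrives, so look the key up safely instead of using Map.apply.
def onSpeculativeTaskEnd(stageAttempt: StageAttempt): Unit = {
  stageAttemptToNumSpeculativeTasks.get(stageAttempt) match {
    case Some(n) if n > 1 => stageAttemptToNumSpeculativeTasks(stageAttempt) = n - 1
    case Some(_)          => stageAttemptToNumSpeculativeTasks.remove(stageAttempt)
    case None             => // stage already completed and cleaned up: ignore the event
  }
}
```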

### Why are the changes needed?
To avoid throwing java.util.NoSuchElementException.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
This is a protective patch, and it is hard to reproduce in a UT because the event order is not fixed in an async queue.

Closes #31025 from LantaoJin/SPARK-34000.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 21:37:26 -08:00
Kent Yao f0ffe0cd65 [SPARK-33992][SQL] override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer
### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/29643, we moved the plan rewriting methods to QueryPlan. We need to override transformUpWithNewOutput to add allowInvokingTransformsInAnalyzer,
because it and resolveOperatorsUpWithNewOutput are called in the analyzer.
For example,

PaddingAndLengthCheckForCharVarchar could fail a query when it calls resolveOperatorsUpWithNewOutput, with:
```
[info] - char/varchar resolution in sub query  *** FAILED *** (367 milliseconds)
[info]   java.lang.RuntimeException: This method should not be called in the analyzer
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule(AnalysisHelper.scala:150)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.assertNotAnalysisRule$(AnalysisHelper.scala:146)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.assertNotAnalysisRule(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown(AnalysisHelper.scala:161)
[info]   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.transformDown$(AnalysisHelper.scala:160)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.transformDown(LogicalPlan.scala:29)
[info]   at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$updateOuterReferencesInSubquery(QueryPlan.scala:267)
```
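
The guard-and-allow pattern behind that assertion can be sketched in a self-contained way as follows (all names and the ThreadLocal wiring here are illustrative stand-ins, not Spark's actual AnalysisHelper internals). Overriding transformUpWithNewOutput so that it runs inside allowInvokingTransformsInAnalyzer is what keeps the assertion from firing:

```scala
// Illustrative guard: transforms invoked while an analyzer rule is running
// throw unless the call is wrapped in allowInvokingTransformsInAnalyzer.
object TransformGuard {
  private val allowDepth = new ThreadLocal[Int] { override def initialValue(): Int = 0 }
  private val analyzerDepth = new ThreadLocal[Int] { override def initialValue(): Int = 0 }

  def runAsAnalyzerRule[T](body: => T): T = {
    analyzerDepth.set(analyzerDepth.get + 1)
    try body finally analyzerDepth.set(analyzerDepth.get - 1)
  }

  def allowInvokingTransformsInAnalyzer[T](body: => T): T = {
    allowDepth.set(allowDepth.get + 1)
    try body finally allowDepth.set(allowDepth.get - 1)
  }

  def assertNotAnalysisRule(): Unit =
    if (analyzerDepth.get > 0 && allowDepth.get == 0) {
      throw new RuntimeException("This method should not be called in the analyzer")
    }
}
```
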
### Why are the changes needed?

trivial bugfix
### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

new tests

Closes #31013 from yaooqinn/SPARK-33992.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:34:11 +00:00
Terry Kim 15a863fd54 [SPARK-34001][SQL][TESTS] Remove unused runShowTablesSql() in DataSourceV2SQLSuite.scala
### What changes were proposed in this pull request?

After #30287, `runShowTablesSql()` in `DataSourceV2SQLSuite.scala` is no longer used. This PR removes the unused method.

### Why are the changes needed?

To remove unused method.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test.

Closes #31022 from imback82/33382-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 21:32:49 -08:00
Terry Kim 6b00fdc756 [SPARK-33998][SQL] Provide an API to create an InternalRow in V2CommandExec
### What changes were proposed in this pull request?

There are many v2 commands such as `SHOW TABLES`, `DESCRIBE TABLE`, etc. that require creating `InternalRow`s. Currently, the code to create `InternalRow`s is duplicated across many commands, so it can be moved into `V2CommandExec` to remove the duplication.
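
A hedged sketch of the kind of shared helper this describes (the method name and the conversion rules below are assumptions for illustration, not necessarily the exact API added to `V2CommandExec`):

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.unsafe.types.UTF8String

// One place to build InternalRows for v2 command output, instead of every
// command repeating the String -> UTF8String conversion itself.
def toInternalRow(values: Any*): InternalRow =
  InternalRow.fromSeq(values.map {
    case s: String => UTF8String.fromString(s)
    case other     => other
  })

// e.g. a SHOW TABLES-style command could emit: toInternalRow(namespace, tableName)
```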

### Why are the changes needed?

To clean up duplicate code.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing test since this is just refactoring.

Closes #31020 from imback82/refactor_v2_command.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:32:36 +00:00
Chongguang LIU 976e97a80d [SPARK-33794][SQL] NextDay expression throw runtime IllegalArgumentException when receiving invalid input under ANSI mode
### What changes were proposed in this pull request?

Instead of returning NULL, the next_day function throws a runtime IllegalArgumentException when ANSI mode is enabled and it receives invalid input for the dayOfWeek parameter.

### Why are the changes needed?

For ansiMode.

### Does this PR introduce _any_ user-facing change?

Yes.
When spark.sql.ansi.enabled = true, the next_day function will throw an IllegalArgumentException when receiving invalid input for the dayOfWeek parameter.
When spark.sql.ansi.enabled = false, the behaviour is the same as before.
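
An illustrative spark-shell snippet of the described behaviour (the literal dates and comments are assumptions based on the description, not output copied from the PR):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT next_day('2021-01-05', 'MON')").show()   // 2021-01-11 (Jan 5, 2021 is a Tuesday)
// With an invalid day name the query now fails at runtime under ANSI mode:
// spark.sql("SELECT next_day('2021-01-05', 'xx')").collect()
//   => java.lang.IllegalArgumentException

spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT next_day('2021-01-05', 'xx')").show()    // NULL, unchanged behaviour
```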

### How was this patch tested?

Ansi mode is tested with existing tests.
End-to-end tests have been added.

Closes #30807 from chongguang/SPARK-33794.

Authored-by: Chongguang LIU <chongguang.liu@laposte.fr>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-01-05 05:20:16 +00:00
tanel.kiis@gmail.com bb6d6b5602 [SPARK-33964][SQL] Combine distinct unions in more cases
### What changes were proposed in this pull request?

Added the `RemoveNoopOperators` rule to the optimization batch `Union`. Also made sure that `RemoveNoopOperators` is idempotent.

### Why are the changes needed?

In several TPCDS queries the `CombineUnions` rule does not manage to combine unions, because they have noop `Project`s between them.
The `Project`s will be removed by `RemoveNoopOperators`, but by then `ReplaceDistinctWithAggregate` has been applied and there are aggregates between the unions. Adding a copy of `RemoveNoopOperators` earlier in the optimization chain allows `CombineUnions` to work on more queries.
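
A hedged illustration of the query shape this targets (table and column names are hypothetical):

```scala
// Distinct unions whose branches can carry no-op Projects after analysis.
val u = spark.table("t1").select("id")
  .union(spark.table("t2").select("id"))
  .union(spark.table("t3").select("id"))
  .distinct()

// The goal described above: CombineUnions should still merge all three branches
// into a single Union beneath one distinct Aggregate, even when no-op Projects
// sit between them.
u.queryExecution.optimizedPlan
```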

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UTs and the output of `PlanStabilitySuite`

Closes #30996 from tanelk/SPARK-33964_combine_unions.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-05 11:01:31 +09:00
angerszhu 559f411da8 [SPARK-33908][CORE][FOLLOWUP] Correct Scaladoc of resolveDependencyPaths/resolveMavenDependencies
### What changes were proposed in this pull request?
Fix the incorrect doc from the last change https://github.com/apache/spark/pull/30922#discussion_r551453193

### Why are the changes needed?
Fix doc.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Builds finished correctly.

Closes #31016 from AngersZhuuuu/SPARK-33908-FOLLOW-UP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 15:44:42 -08:00
Koert Kuipers 9b4173fa95 [SPARK-33894][SQL] Change visibility of private case classes in mllib to avoid runtime compilation errors with Scala 2.13
### What changes were proposed in this pull request?
Change visibility modifier of two case classes defined inside objects in mllib from private to private[OuterClass]

### Why are the changes needed?
Without this change, running the tests with Scala 2.13 produces runtime code generation errors. These errors look like this:
```
[info] Cause: java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 73, Column 65: No applicable constructor/method found for zero actual parameters; candidates are: "public java.lang.String org.apache.spark.ml.feature.Word2VecModel$Data.word()"
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests now pass for Scala 2.13

Closes #31018 from koertkuipers/feat-visibility-scala213.

Authored-by: Koert Kuipers <koert@tresata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 15:40:32 -08:00
Max Gekk 84c1f43669 [SPARK-33987][SQL] Refresh cache in v2 ALTER TABLE .. DROP PARTITION
### What changes were proposed in this pull request?
1. Refresh the cache associated with tables from v2 table catalogs in the `ALTER TABLE .. DROP PARTITION` command.
2. Port the test for v1 catalogs to the base suite so that it also runs for v2 table catalogs.

### Why are the changes needed?
The changes fix incorrect query results from cached V2 table altered by `ALTER TABLE .. DROP PARTITION`, see the added test and SPARK-33987.
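
A hedged sketch of the stale-cache scenario this fixes (catalog, table, and partition names are hypothetical):

```scala
// Cache a partitioned v2 table, then drop a partition.
spark.sql("CACHE TABLE testcat.ns.tbl")
spark.sql("ALTER TABLE testcat.ns.tbl DROP PARTITION (part = 1)")

// Before this change the cached data was not refreshed, so a subsequent read
// could still return rows from the dropped partition; with the change the
// cache associated with the table is refreshed first.
spark.sql("SELECT * FROM testcat.ns.tbl").show()
```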

### Does this PR introduce _any_ user-facing change?
Yes, it could if users have v2 table catalogs.

### How was this patch tested?
By running unified tests for `ALTER TABLE .. DROP PARTITION`:
```
$ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableDropPartitionSuite"
```

Closes #31017 from MaxGekk/drop-partition-refresh-cache-v2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-04 15:00:48 -08:00