### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">
### Why are the changes needed?
To inform users about actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By generating docs, and checking results manually.
Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c4e739fb4b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for `CREATE FUNCTION` statement.
`ARCHIVE` is acceptable as of SPARK-35236 (#32359).
### Why are the changes needed?
To maintain the document.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)
Closes #33823 from sarutak/create-function-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bd0a4950ae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert 397b843890 and 5a48eb8d00
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch-3.2 and fix it on the master branch later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests
Closes #33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
As Jacek Laskowski pointed out on the dev list, there is a StackOverflowError when compiling Spark with the current MAVEN_OPTS in the Spark documentation.
We should update it with `-Xss64m` to avoid the error.
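For reference, a sketch of the documented setting (only `-Xss64m` is the addition here; the heap and code-cache sizes are illustrative values, so check the building guide for the exact ones):

```shell
# Enlarge the JVM thread stack so scalac doesn't hit StackOverflowError during compilation
export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
```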
### Why are the changes needed?
Correct the documentation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test. The MAVEN_OPTS value is consistent with our GitHub Actions build.
Closes #33804 from gengliangwang/updateBuildDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3da0e9500f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Minor change renaming the config key from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`. This was missed in https://github.com/apache/spark/pull/33615.
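For illustration, the renamed key in `spark-defaults.conf` might look like this (the resolver class value is a sketch based on the push-based shuffle docs):

```
spark.shuffle.push.server.mergedShuffleFileManagerImpl  org.apache.spark.network.shuffle.RemoteBlockPushResolver
```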
### Why are the changes needed?
To keep the config names consistent
### Does this PR introduce _any_ user-facing change?
Yes, this is a change in the config key name, but the new config name has not been released yet, so technically there is no user-facing change.
### How was this patch tested?
Existing tests.
Closes #33799 from venkata91/SPARK-36374-follow-up.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 7b2842e986)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This patch fixes an error about streaming deduplication in Structured Streaming, and also updates an item about an unsupported operation.
### Why are the changes needed?
Update the user document.
### Does this PR introduce _any_ user-facing change?
No. It's a doc only change.
### How was this patch tested?
Doc only change.
Closes #33801 from viirya/minor-ss-deduplication.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 5876e04de2)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
* improve docs in `docs/job-scheduling.md`
* add migration guide docs in `docs/core-migration-guide.md`
### Why are the changes needed?
Help users migrate.
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
Pass CI
Closes #33794 from ulysses-you/SPARK-35083-f.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 90cbf9ca3e)
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`.
This PR installs it in `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
Also, this PR mentions it in the README of docs.
### Why are the changes needed?
Fix release script and update README of docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test locally.
Closes #33797 from gengliangwang/fixReleaseDocker.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 42eebb84f5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add more documents and checking logic for the new options `minOffsetsPerTrigger` and `maxTriggerDelay`.
### Why are the changes needed?
Have a clear description of the behavior introduced in SPARK-35312
### Does this PR introduce _any_ user-facing change?
Yes.
If the user sets minOffsetsPerTrigger > maxOffsetsPerTrigger, the new code will throw an AnalysisException. The original behavior was to ignore maxOffsetsPerTrigger silently.
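As a sketch, the interacting Kafka source options look like this (the values are hypothetical):

```
minOffsetsPerTrigger  10000    # wait until at least this many new offsets are available
maxTriggerDelay       15m      # ...but trigger anyway after this much delay
maxOffsetsPerTrigger  100000   # must be >= minOffsetsPerTrigger, else AnalysisException
```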
### How was this patch tested?
Existing tests.
Closes #33792 from xuanyuanking/SPARK-35312-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit a0b24019ed)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…mote scheduler pool files support"
This reverts commit e3902d1975. The feature is an improvement rather than a behavior change.
Closes #33789 from gengliangwang/revertDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit b36b1c7e8a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add remote scheduler pool files support to the migration guide.
### Why are the changes needed?
To highlight this useful support.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass existing tests.
Closes #33785 from Ngone51/SPARK-35083-follow-up.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit e3902d1975)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/30648
ANALYZE TABLE and ANALYZE TABLES are essentially the same command, so it's odd to put them on 2 different doc pages. This PR proposes to merge them into one doc page.
### Why are the changes needed?
simplify the doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes #33781 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 07d173a8b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Refine the SQL reference doc:
- remove useless subitems in the sidebar
- remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`)
- avoid using `#####` in `sql-ref-literals.md`
### Why are the changes needed?
The subitems in the sidebar are quite useless, as the menu page serves the same functionality:
<img width="1040" alt="WX20210817-2358402x" src="https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png">
It's also extra work to keep the menu page and sidebar subitems in sync (the ANSI compliance page is already out of sync).
The sub-menu-pages are only referenced by the sidebar, and duplicate the content of the menu page. As a result, `sql-ref-syntax-aux.md` is already outdated compared to the menu page. It's easier to just look at the menu page.
The `#####` is not rendered properly:
<img width="776" alt="WX20210818-0001192x" src="https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png">
It's better to avoid using it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #33767 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4b015e8d7d)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)
### Why are the changes needed?
It turns out that many users are using the group by alias feature. Spark has its own precedence rule when alias names conflict with column names in the GROUP BY clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.
As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.
### Does this PR introduce _any_ user-facing change?
No, the feature is not released yet.
### How was this patch tested?
Unit tests
Closes #33758 from gengliangwang/revertGroupByAlias.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 8bfb4f1e72)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add the document for the new RocksDBStateStoreProvider.
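For context, the documented way to opt in to the new provider is via the state store provider config; a sketch (see the Structured Streaming guide for the full option list):

```
spark.sql.streaming.stateStore.providerClass  org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider
```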
### Why are the changes needed?
User guide for the new feature.
### Does this PR introduce _any_ user-facing change?
No, doc only.
### How was this patch tested?
Doc only.
Closes #33683 from xuanyuanking/SPARK-36041.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 3d57e00a7f)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Document the push-based shuffle feature with a high level overview of the feature and the corresponding configuration options for both the shuffle server side and the client side. This is how the changes to the doc look in the browser ([img](https://user-images.githubusercontent.com/8871522/129231582-ad86ee2f-246f-4b42-9528-4ccd693e86d2.png))
### Why are the changes needed?
Helps users understand the feature
### Does this PR introduce _any_ user-facing change?
Docs
### How was this patch tested?
N/A
Closes #33615 from venkata91/SPARK-36374.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2270ecf32f)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This patch supports dynamic gap duration in session window.
### Why are the changes needed?
The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration, which determines the gap by looking at the current data. For example, in our use case, we may want a different gap depending on a certain column in the input rows.
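A minimal SQL sketch of a dynamic gap, assuming an `events` table with `eventTime`, `eventType` and `userId` columns (all names here are hypothetical):

```
SELECT session_window.start, session_window.end, userId, COUNT(*) AS cnt
FROM events
GROUP BY
  session_window(
    eventTime,
    CASE WHEN eventType = 'click' THEN INTERVAL 5 MINUTES
         ELSE INTERVAL 30 MINUTES END),
  userId
```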
### Does this PR introduce _any_ user-facing change?
Yes, users can specify dynamic gap duration.
### How was this patch tested?
Modified existing tests and new test.
Closes #33691 from viirya/dynamic-session-window-gap.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to prohibit update mode in streaming aggregation with session window.
UnsupportedOperationChecker will check and prohibit the case. As a side effect, this PR also simplifies the code as we can remove the implementation of iterator to support outputs of update mode.
This PR also cleans up test code via deduplicating.
### Why are the changes needed?
The semantic of "update" mode for session window based streaming aggregation is quite unclear.
For normal streaming aggregation, Spark provides outputs which can be "upsert"ed based on the grouping key. This relies on the fact that the grouping key won't change.
This doesn't hold for session-window-based streaming aggregation, as the session range changes.
If end users leverage their knowledge about streaming aggregation, they will consider the key to be grouping key + session (since they specify both in groupBy), and it's quite possible that the existing row is not updated (overwritten), ending up with different rows.
If end users consider the key to be just the grouping key, there's a small chance to upsert the session correctly, though only the last updated session will be stored, so it won't work with event-time processing where there could be multiple active sessions.
### Does this PR introduce _any_ user-facing change?
No, as we haven't released this feature.
### How was this patch tested?
Updated tests.
Closes #33689 from HeartSaVioR/SPARK-36463.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit ed60aaa9f1)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Add local-cluster mode option to submitting-applications.md
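The master URL format being documented is `local-cluster[N, cores, memoryMB]`; a hypothetical invocation (app name and jar are placeholders):

```
./bin/spark-submit --master "local-cluster[2,1,1024]" --class MyTestSuite my-tests.jar
```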
### Why are the changes needed?
Help users to find/use this option for unit tests.
### Does this PR introduce _any_ user-facing change?
Yes, docs changed.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
<img width="460" alt="docchange" src="https://user-images.githubusercontent.com/87687356/127125380-6beb4601-7cf4-4876-b2c6-459454ce2a02.png">
Closes #33537 from yutoacts/SPARK-595.
Lead-authored-by: Yuto Akutsu <yuto.akutsu@jp.nttdata.com>
Co-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
(cherry picked from commit 41b011e416)
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Add documentation for new functions try_cast/try_add/try_divide
### Why are the changes needed?
Better documentation. These new functions are useful when migrating to the ANSI dialect.
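A few illustrative queries, per the behavior these docs describe (the try_* variants return NULL where the plain operation would raise an error under the ANSI dialect):

```
SELECT try_cast('abc' AS INT);   -- NULL instead of a cast error
SELECT try_add(2147483647, 1);   -- NULL instead of an integer overflow error
SELECT try_divide(1, 0);         -- NULL instead of a divide-by-zero error
```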
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/128209312-34a6cc6a-a73d-4aed-8646-22b1cb7ce702.png)
Closes #33638 from gengliangwang/addDocForTry.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a35243fa7)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add doc for the shuffle checksum configs in `configuration.md`.
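A sketch of the configs being documented (the values shown are the defaults as I understand them; treat them as illustrative):

```
spark.shuffle.checksum.enabled    true      # compute checksums for shuffle blocks
spark.shuffle.checksum.algorithm  ADLER32   # algorithm used to compute the checksum
```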
### Why are the changes needed?
doc
### Does this PR introduce _any_ user-facing change?
No, since Spark 3.2 hasn't been released.
### How was this patch tested?
Pass existing tests.
Closes #33637 from Ngone51/SPARK-36384.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3b92c721b5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR removes obsolete `contributing-to-spark.md` which is not referenced from anywhere.
### Why are the changes needed?
Just clean up.
### Does this PR introduce _any_ user-facing change?
No. Users can't access contributing-to-spark.html unless they point directly to the URL.
### How was this patch tested?
Built the document and confirmed that this change doesn't affect the result.
Closes #33619 from sarutak/remove-obsolete-contribution-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c31b653806)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is follow-up of https://github.com/apache/spark/pull/31522.
It adds docs for the new metrics of task/job commit time
### Why are the changes needed?
So that users can understand the metrics better and know that the new metrics are only for file table writes.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/127198210-2ab201d3-5fca-4065-ace6-0b930390380f.png)
Closes #33542 from gengliangwang/addDocForMetrics.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c9a7ff3f36)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update the Java doc and the JDBC data source doc, and address follow-up comments.
### Why are the changes needed?
Update the doc and address follow-up comments.
### Does this PR introduce _any_ user-facing change?
Yes, add the new JDBC option `pushDownAggregate` in JDBC data source doc.
### How was this patch tested?
manually checked
Closes #33526 from huaxingao/aggPD_followup.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c8dd97d456)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Update the tables at https://spark.apache.org/docs/latest/sql-ref-datatypes.html about mapping ANSI interval types to Java/Scala/SQL types.
2. Remove `CalendarIntervalType` from the table of mapping Catalyst types to SQL types.
<img width="1028" alt="Screenshot 2021-07-27 at 20 52 57" src="https://user-images.githubusercontent.com/1580697/127204790-7ccb9c64-daf2-427d-963e-b7367aaa3439.png">
<img width="1017" alt="Screenshot 2021-07-27 at 20 53 22" src="https://user-images.githubusercontent.com/1580697/127204806-a0a51950-3c2d-4198-8a22-0f6614bb1487.png">
### Why are the changes needed?
To inform users which types from language APIs should be used as ANSI interval types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually checking by building the docs:
```
$ SKIP_RDOC=1 SKIP_API=1 SKIP_PYTHONDOC=1 bundle exec jekyll build
```
Closes #33543 from MaxGekk/doc-interval-type-lang-api.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit 1614d00417)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
In the PR, I propose to update the page https://spark.apache.org/docs/latest/sql-ref-datatypes.html and add information about the year-month and day-time interval types introduced by SPARK-27790.
<img width="932" alt="Screenshot 2021-07-27 at 10 38 23" src="https://user-images.githubusercontent.com/1580697/127115289-e633ca3a-2c18-49a0-a7c0-22421ae5c363.png">
### Why are the changes needed?
To inform users about new ANSI interval types, and improve UX with Spark SQL.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Should be tested by a GitHub action.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit f4837961a9)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Closes #33539 from sarutak/backport-SPARK-34619-3.2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Bugfix: link to the correct location of the PySpark DataFrame documentation
### Why are the changes needed?
The current link returns "Not found"
### Does this PR introduce _any_ user-facing change?
Website fix
### How was this patch tested?
Documentation change
Closes #33420 from dominikgehl/feature/SPARK-36209.
Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 3a1db2ddd4)
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Spark 3.2.0 will use parquet-mr 1.12.0 (or higher), which contains the column encryption feature that can be called from Spark SQL. The aim of this PR is to document the use of Parquet encryption in Spark.
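For illustration only, a sketch of the kind of Hadoop configuration the new docs cover (the key names follow parquet-mr's properties-driven crypto factory; the class and column values below are hypothetical):

```
spark.hadoop.parquet.crypto.factory.class        org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory
spark.hadoop.parquet.encryption.kms.client.class com.example.MyKmsClient
spark.hadoop.parquet.encryption.column.keys      keyA:creditCard,ssn;keyB:address
spark.hadoop.parquet.encryption.footer.key       footerKey
```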
### Why are the changes needed?
- To provide information on how to use Parquet column encryption
### Does this PR introduce _any_ user-facing change?
Yes, documents a new feature.
### How was this patch tested?
bundle exec jekyll build
Closes #32895 from ggershinsky/parquet-encryption-doc.
Authored-by: Gidon Gershinsky <ggershinsky@apple.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Update TRANSFORM's doc to match the latest code.
![image](https://user-images.githubusercontent.com/46485123/126175747-672cccbc-4e42-440f-8f1e-f00b6dc1be5f.png)
### Why are the changes needed?
Keep the doc consistent with the code.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Closes #33362 from AngersZhuuuu/SPARK-36153.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR documents a new feature "native support of session window" into Structured Streaming guide doc.
Screenshots are following:
![스크린샷 2021-07-20 오후 5 04 20](https://user-images.githubusercontent.com/1317309/126284848-526ec056-1028-4a70-a1f4-ae275d4b5437.png)
![스크린샷 2021-07-20 오후 3 34 38](https://user-images.githubusercontent.com/1317309/126276763-763cf841-aef7-412a-aa03-d93273f0c850.png)
### Why are the changes needed?
This change is needed to explain a new feature to the end users.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Documentation changes.
Closes #33433 from HeartSaVioR/SPARK-36172.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 0eb31a06d6)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR documents the unicode literals added in SPARK-34051 (#31096) and a past PR in `sql-ref-literals.md`.
### Why are the changes needed?
To notify users about the literals.
### Does this PR introduce _any_ user-facing change?
Yes, but it just adds a sentence.
### How was this patch tested?
Built the document and confirmed the result.
```
SKIP_API=1 bundle exec jekyll build
```
![unicode-literals](https://user-images.githubusercontent.com/4736016/126283923-944dc162-1817-47bc-a7e8-c3145225586b.png)
Closes #33434 from sarutak/unicode-literal-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ba1294ea5a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add reference to kubernetes-client's version
### Why are the changes needed?
Running Spark on Kubernetes potentially has an upper limit on the supported Kubernetes version.
I think it is better for users to be aware of it because Kubernetes updates so fast that users tend to run Spark jobs on unsupported versions.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
SKIP_API=1 bundle exec jekyll build
Closes #33255 from yoda-mon/add-reference-kubernetes-client.
Authored-by: yoda-mon <yodal@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit eea69c122f)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR modifies the comment for `UTF8String.trimAll` and `sql-migration-guide.md`.
The comment for `UTF8String.trimAll` reads as follows.
```
Trims whitespaces ({literal <=} ASCII 32) from both ends of this string.
```
Similarly, `sql-migration-guide.md` describes the behavior of `cast` as follows.
```
In Spark 3.0, when casting string value to integral types(tinyint, smallint, int and bigint),
datetime types(date, timestamp and interval) and boolean type,
the leading and trailing whitespaces (<= ASCII 32) will be trimmed before converted to these type values,
for example, `cast(' 1\t' as int)` results `1`, `cast(' 1\t' as boolean)` results `true`,
`cast('2019-10-10\t' as date)` results the date value `2019-10-10`.
In Spark version 2.4 and below, when casting string to integrals and booleans,
it does not trim the whitespaces from both ends; the foregoing results are `null`,
while to datetimes, only the trailing spaces (= ASCII 32) are removed.
```
But SPARK-32559 (#29375) changed the behavior, and only ASCII whitespace characters have been trimmed since Spark 3.0.1.
### Why are the changes needed?
To follow the previous change.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Confirmed the document built by the following command.
```
SKIP_API=1 bundle exec jekyll build
```
Closes #33287 from sarutak/fix-utf8string-trim-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 57a4f310df)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR allows the parser to parse unit list interval literals like `'3' day '10' hours '3' seconds` or `'8' years '3' months` as `YearMonthIntervalType` or `DayTimeIntervalType`.
### Why are the changes needed?
For ANSI compliance.
### Does this PR introduce _any_ user-facing change?
Yes. I noted the following things in the `sql-migration-guide.md`.
* Unit list interval literals are parsed as `YearMonthIntervalType` or `DayTimeIntervalType` instead of `CalendarIntervalType`.
* `WEEK`, `MILLISECONDS`, `MICROSECOND` and `NANOSECOND` are not valid units for unit list interval literals.
* Units of year-month and day-time cannot be mixed like `1 YEAR 2 MINUTES`.
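Some illustrative queries based on the notes above (result types as this PR describes):

```
SELECT INTERVAL '8' YEARS '3' MONTHS;            -- YearMonthIntervalType
SELECT INTERVAL '3' DAY '10' HOURS '3' SECONDS;  -- DayTimeIntervalType
-- SELECT INTERVAL '1' YEAR '2' MINUTES;         -- error: year-month and day-time units cannot be mixed
```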
### How was this patch tested?
New tests and modified tests.
Closes #32949 from sarutak/day-time-multi-units.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 8e92ef825a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to update the SQL migration guide, in particular the section about the migration from Spark 2.4 to 3.0. The new item informs users about the following issue:
**What**: Spark doesn't detect the encoding (charset) in CSV files with a BOM correctly. Such files can be read only in multiLine mode, and only when the CSV option `encoding` matches the actual encoding of the files. For example, Spark cannot read UTF-16BE CSV files when `encoding` is set to UTF-8, which is the default. This is the case of the current ES ticket.
**Why**: In previous Spark versions, the encoding wasn't propagated to the underlying library, which means the library tried to detect the file encoding automatically. It could succeed for some encodings when a BOM is present at the beginning of the file. Starting from version 3.0, users can specify the file encoding via the CSV option `encoding`, which has UTF-8 as the default value. Spark propagates this default to the underlying library (uniVocity), and as a consequence encoding auto-detection is turned off.
**When**: Since Spark 3.0. In particular, the commit 2df34db586 causes the issue.
**Workaround**: Enable the encoding auto-detection mechanism in uniVocity by passing null as the value of the CSV option `encoding`. A more recommended approach is to set the `encoding` option explicitly.
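A sketch of the recommended workaround in SQL (the path, view name, and `header` value are hypothetical):

```
CREATE TEMPORARY VIEW utf16_csv
USING csv
OPTIONS (path '/data/file_utf16be.csv', encoding 'UTF-16BE', multiLine 'true', header 'true');
```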
### Why are the changes needed?
To improve user experience with Spark SQL. This should help users in their migration from Spark 2.4.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Should be checked by building docs in GA/jenkins.
Closes #33300 from MaxGekk/csv-encoding-migration-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e788a3fa88)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Update the sql-performance-tuning docs to reflect that AQE is `enabled` by default rather than `disabled`.
### Why are the changes needed?
Make docs correct.
### Does this PR introduce _any_ user-facing change?
yes, docs changed.
### How was this patch tested?
Not needed.
Closes #33295 from ulysses-you/enable-AQE.
Lead-authored-by: ulysses-you <ulyssesyou18@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 286c231c1e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add new configs in sql-performance-tuning docs.
* spark.sql.adaptive.coalescePartitions.parallelismFirst
* spark.sql.adaptive.coalescePartitions.minPartitionSize
* spark.sql.adaptive.autoBroadcastJoinThreshold
* spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold
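For illustration, these might appear in `spark-defaults.conf` as follows (all values here are hypothetical, not recommendations):

```
spark.sql.adaptive.coalescePartitions.parallelismFirst    false
spark.sql.adaptive.coalescePartitions.minPartitionSize    1m
spark.sql.adaptive.autoBroadcastJoinThreshold             10m
spark.sql.adaptive.maxShuffledHashJoinLocalMapThreshold   0
```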
### Why are the changes needed?
Help users find them.
### Does this PR introduce _any_ user-facing change?
yes, docs changed.
### How was this patch tested?
![image](https://user-images.githubusercontent.com/12025282/125152379-be506200-e17e-11eb-80fe-68328ba1c8f5.png)
![image](https://user-images.githubusercontent.com/12025282/125152388-d1fbc880-e17e-11eb-8515-d4a5ed33159d.png)
Closes #32960 from ulysses-you/SPARK-35813.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0e9786c712)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to remove the automatic build guides of documentation in `docs/README.md`.
### Why are the changes needed?
This doesn't work very well:
1. It doesn't detect changes in RST files. But PySpark internally generates RST files, so we can't simply include them in the detection. Otherwise, it goes into an infinite loop.
2. During PySpark documentation generation, it launches some jobs to generate plot images now. This is broken with the `entr` command, and the job fails. It seems related to how `entr` creates the process internally.
3. A minor issue, but the documentation build directory was changed (`_build` -> `build` in `python/docs`).
I don't think it's worthwhile testing and fixing the docs to show a working example because dev people are already able to do it manually.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested.
Closes #33266 from HyukjinKwon/SPARK-36051.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a1ce64904f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As described in #32831, Spark has compatibility issues when querying a view created by an
older version. The root cause is that Spark changed the auto-generated alias name. To avoid
this in the future, we could ask the user to specify explicit column names when creating
a view.
### Why are the changes needed?
Avoid compatibility issues when querying a view
### Does this PR introduce _any_ user-facing change?
Yes. Users will get an error when running the query below after this change:
```
CREATE OR REPLACE VIEW v AS SELECT CAST(t.a AS INT), to_date(t.b, 'yyyyMMdd') FROM t
```
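A version with explicit column names, which would still be accepted (a sketch; the alias names are hypothetical):

```
CREATE OR REPLACE VIEW v AS
SELECT CAST(t.a AS INT) AS a, to_date(t.b, 'yyyyMMdd') AS b FROM t
```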
### How was this patch tested?
not yet
Closes #32832 from linhongliu-db/SPARK-35686-no-auto-alias.
Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In https://issues.apache.org/jira/browse/SPARK-34862, we added support for the ORC nested column vectorized reader, and it is disabled by default for now. So we would like to add user-facing documentation for it, and users can opt in to use it if they want.
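The opt-in is a single config; a sketch (config name as introduced by SPARK-34862):

```
spark.sql.orc.enableNestedColumnVectorizedReader  true
```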
### Why are the changes needed?
To make users aware of the feature, and give them instructions for using it.
### Does this PR introduce _any_ user-facing change?
Yes, the documentation itself.
### How was this patch tested?
Manually checked the generated documentation, as below.
<img width="1153" alt="Screen Shot 2021-07-01 at 12 19 40 AM" src="https://user-images.githubusercontent.com/4629931/124083422-b0724280-da02-11eb-93aa-a25d118ba56e.png">
<img width="1147" alt="Screen Shot 2021-07-01 at 12 19 52 AM" src="https://user-images.githubusercontent.com/4629931/124083442-b5cf8d00-da02-11eb-899f-827d55b8558d.png">
Closes #33168 from c21/orc-doc.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>