We found an issue where a user configured both AQE and push-based shuffle, but the job started to hang after running some stages. We took thread dumps from the executors, which showed that tasks were still waiting to fetch shuffle blocks.
This PR proposes changes to fix the issue.
### What changes were proposed in this pull request?
Disable batch fetch when push-based shuffle is enabled.
### Why are the changes needed?
Without this patch, enabling both AQE and push-based shuffle can cause tasks to hang.
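The intent of the fix can be sketched as a guard on the batch-fetch decision. This is a minimal illustration in plain Python, not the actual Spark code; the function name and both parameter names are assumptions made for this sketch.

```python
def should_batch_fetch(batch_fetch_requested: bool, push_based_shuffle_enabled: bool) -> bool:
    """Return True only when batch fetch is requested and push-based shuffle is off."""
    # Hypothetical guard: when push-based shuffle is on, batch fetch is skipped.
    return batch_fetch_requested and not push_based_shuffle_enabled


print(should_batch_fetch(True, False))  # True
print(should_batch_fetch(True, True))   # False
```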
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested the PR with the Spark shell, using the following queries:
sql("""SELECT CASE WHEN rand() < 0.8 THEN 100 ELSE CAST(rand() * 30000000 AS INT) END AS s_item_id, CAST(rand() * 100 AS INT) AS s_quantity, DATE_ADD(current_date(), - CAST(rand() * 360 AS INT)) AS s_date FROM RANGE(1000000000)""").createOrReplaceTempView("sales")
// Dynamically coalesce partitions
sql("""SELECT s_date, sum(s_quantity) AS q FROM sales GROUP BY s_date ORDER BY q DESC""").collect
Unit tests to be added.
Closes #34156 from zhouyejoe/SPARK-36892.
Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 31b6f614d3)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes SPARK-36791 by replacing `JHS_POST` with `JHS_HOST`.
### Why are the changes needed?
The `running-on-yarn.md` file contains a typo: `JHS_POST` should be `JHS_HOST`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Not needed for docs
Closes #34031 from jiaoqingbo/jiaoqingbo.
Authored-by: jiaoqb <jiaoqb@asiainfo.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a1a91bd71)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up to fix a leftover from switching the Scala version.
### Why are the changes needed?
This should be consistent.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is not tested by UT. We need to check manually. There is no more `2.12.14`.
```
$ git grep 2.12.14
R/pkg/tests/fulltests/test_sparkSQL.R: c(as.Date("2012-12-14"), as.Date("2013-12-15"), as.Date("2014-12-16")))
data/mllib/ridge-data/lpsa.data:3.5307626,0.987291634724086 -0.36279314978779 -0.922212414640967 0.232904453212813 -0.522940888712441 1.79270085261407 0.342627053981254 1.26288870310799
sql/hive/src/test/resources/data/files/over10k:-3|454|65705|4294967468|62.12|14.32|true|mike white|2013-03-01 09:11:58.703087|40.18|joggying
```
Closes #34020 from dongjoon-hyun/SPARK-36759-2.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit adbea252db)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add documentation for the ANSI store assignment rules, covering:
- the valid source/target type combinations
- the runtime error raised on numeric overflow
### Why are the changes needed?
Better docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/133554600-8c80c0a9-8753-4c01-94d0-994d8082e319.png)
Closes #34014 from gengliangwang/addStoreAssignDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ff7705ad2a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add `zstandard` to the list of supported Avro codecs.
### Why are the changes needed?
To improve the document.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Doc only.
Closes #33943 from viirya/minor-doc.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 647ffe655f)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR fixes the description about the possible values of `spark.sql.catalogImplementation` property.
It was added in SPARK-36153 (#33362) but the possible values are `hive` or `in-memory` rather than `true` or `false`.
### Why are the changes needed?
To fix wrong description.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I just confirmed `in-memory` and `hive` are the valid values with SparkShell.
Closes #33923 from sarutak/fix-doc-about-catalogImplementation.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a5fe5d368c)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to ask users whether they want to download and install Spark when they use SparkR installed from CRAN.
`SPARKR_ASK_INSTALLATION` environment variable was added in case other notebook projects are affected.
### Why are the changes needed?
This is required for CRAN. Currently SparkR is removed: https://cran.r-project.org/web/packages/SparkR/index.html.
See also https://lists.apache.org/thread.html/r02b9046273a518e347dfe85f864d23d63d3502c6c1edd33df17a3b86%40%3Cdev.spark.apache.org%3E
### Does this PR introduce _any_ user-facing change?
Yes, `sparkR.session(...)` will ask whether users want to download and install the Spark package when they are in the plain R shell or `Rscript`.
### How was this patch tested?
**R shell**
Valid input (`n`):
```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
```
Invalid input:
```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```
Valid input (`y`):
```
> sparkR.session(master="local")
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
...
```
**Rscript**
```
cat tmp.R
```
```
library(SparkR, lib.loc = c(file.path(".", "R", "lib")))
sparkR.session(master="local")
```
```
Rscript tmp.R
```
Valid input (`n`):
```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
Calls: sparkR.session -> sparkCheckInstall
```
Invalid input:
```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```
Valid input (`y`):
```
...
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
...
```
`bin/sparkR` and `bin/spark-submit *.R` are not affected (tested).
Closes #33887 from HyukjinKwon/SPARK-36631.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e983ba8fce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Improve the user guide document.
### Why are the changes needed?
Make the user guide clear.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc change only.
Closes #33854 from xuanyuanking/SPARK-35611-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit dd3f0fa8c2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">
### Why are the changes needed?
To inform users about actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By generating docs, and checking results manually.
Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c4e739fb4b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for `CREATE FUNCTION` statement.
`ARCHIVE` is acceptable as of SPARK-35236 (#32359).
### Why are the changes needed?
To maintain the document.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)
Closes #33823 from sarutak/create-function-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bd0a4950ae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert 397b843890 and 5a48eb8d00
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests
Closes #33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
As Jacek Laskowski pointed out on the dev list, a StackOverflowError occurs when compiling Spark with the current MAVEN_OPTS in the Spark documentation.
We should add `-Xss64m` to it to avoid the error.
### Why are the changes needed?
Correct the documentation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test. The MAVEN_OPTS is consistent with our GitHub Actions build.
Closes #33804 from gengliangwang/updateBuildDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3da0e9500f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Minor change to rename the config key from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`. This was missed in https://github.com/apache/spark/pull/33615.
### Why are the changes needed?
To keep the config names consistent
### Does this PR introduce _any_ user-facing change?
Yes, this changes the config key name. However, the new config name has not been released yet, so technically there is no user-facing change.
### How was this patch tested?
Existing tests.
Closes #33799 from venkata91/SPARK-36374-follow-up.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 7b2842e986)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This patch fixes an error about streaming deduplication in Structured Streaming, and also updates an item about an unsupported operation.
### Why are the changes needed?
Update the user document.
### Does this PR introduce _any_ user-facing change?
No. It's a doc only change.
### How was this patch tested?
Doc only change.
Closes #33801 from viirya/minor-ss-deduplication.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 5876e04de2)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
* improve docs in `docs/job-scheduling.md`
* add migration guide docs in `docs/core-migration-guide.md`
### Why are the changes needed?
Help users migrate.
### Does this PR introduce _any_ user-facing change?
yes
### How was this patch tested?
Pass CI
Closes #33794 from ulysses-you/SPARK-35083-f.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 90cbf9ca3e)
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`.
This PR is to install it from `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
Also, this PR mentions it in the README of docs.
### Why are the changes needed?
Fix release script and update README of docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test locally.
Closes #33797 from gengliangwang/fixReleaseDocker.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 42eebb84f5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add more documentation and validation logic for the new options `minOffsetsPerTrigger` and `maxTriggerDelay`.
### Why are the changes needed?
Have a clear description of the behavior introduced in SPARK-35312
### Does this PR introduce _any_ user-facing change?
Yes.
If the user sets `minOffsetsPerTrigger` > `maxOffsetsPerTrigger`, the new code throws an AnalysisException. The original behavior was to silently ignore `maxOffsetsPerTrigger`.
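The new check can be illustrated with a small standalone sketch. It is written in plain Python rather than Spark's own code: the function name is hypothetical, and `ValueError` stands in for Spark's AnalysisException. Treating equal values as valid follows from the strict `>` comparison described above.

```python
from typing import Optional


def validate_trigger_offsets(min_offsets: Optional[int], max_offsets: Optional[int]) -> None:
    """Reject option combinations where the minimum exceeds the maximum."""
    # Only validate when both options are set; either one may be absent.
    if min_offsets is not None and max_offsets is not None and min_offsets > max_offsets:
        # ValueError is a stand-in for AnalysisException in this sketch.
        raise ValueError(
            f"minOffsetsPerTrigger ({min_offsets}) must not be greater than "
            f"maxOffsetsPerTrigger ({max_offsets})"
        )


validate_trigger_offsets(100, 1000)   # passes
validate_trigger_offsets(None, 1000)  # passes: only one option is set
```

Failing fast here avoids the earlier situation where one of the two options was dropped without any signal to the user.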
### How was this patch tested?
Existing tests.
Closes #33792 from xuanyuanking/SPARK-35312-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit a0b24019ed)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
…mote scheduler pool files support"
This reverts commit e3902d1975. The feature is an improvement rather than a behavior change.
Closes #33789 from gengliangwang/revertDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit b36b1c7e8a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add remote scheduler pool files support to the migration guide.
### Why are the changes needed?
To highlight this useful support.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass existing tests.
Closes #33785 from Ngone51/SPARK-35083-follow-up.
Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit e3902d1975)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/30648
ANALYZE TABLE and ANALYZE TABLES are essentially the same command, so it is odd to put them on two different doc pages. This PR proposes to merge them into one doc page.
### Why are the changes needed?
simplify the doc
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
N/A
Closes #33781 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 07d173a8b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Refine the SQL reference doc:
- remove useless subitems in the sidebar
- remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`)
- avoid using `#####` in `sql-ref-literals.md`
### Why are the changes needed?
The subitems in the sidebar are of little use, as the menu page serves the same purpose:
<img width="1040" alt="WX20210817-2358402x" src="https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png">
It's also extra work to keep the menu page and sidebar subitems in sync (the ANSI compliance page is already out of sync).
The sub-menu-pages are only referenced by the sidebar, and duplicate the content of the menu page. As a result, `sql-ref-syntax-aux.md` is already outdated compared to the menu page. It's easier to just look at the menu page.
The `#####` is not rendered properly:
<img width="776" alt="WX20210818-0001192x" src="https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png">
It's better to avoid using it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #33767 from cloud-fan/doc.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4b015e8d7d)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases](https://github.com/apache/spark/pull/32129)
### Why are the changes needed?
It turns out that many users rely on the group-by-alias feature. Spark has a precedence rule for when alias names conflict with column names in the GROUP BY clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.
As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.
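The precedence rule described above can be sketched as a small name resolver. This is an illustrative model in plain Python, not Spark's analyzer; the function name, its parameters, and the example query are assumptions made for this sketch.

```python
from typing import Dict, Set


def resolve_group_by_name(name: str, table_columns: Set[str], select_aliases: Dict[str, str]) -> str:
    """Resolve a GROUP BY name, preferring table columns over SELECT aliases."""
    if name in table_columns:
        # Conflict or not, a table column always wins.
        return name
    if name in select_aliases:
        # Fall back to the expression behind the alias.
        return select_aliases[name]
    raise KeyError(f"cannot resolve '{name}' in GROUP BY")


# Hypothetical query: SELECT a + 1 AS a, b * 2 AS c FROM t GROUP BY ...
columns = {"a", "b"}
aliases = {"a": "a + 1", "c": "b * 2"}
print(resolve_group_by_name("a", columns, aliases))  # a
print(resolve_group_by_name("c", columns, aliases))  # b * 2
```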
### Does this PR introduce _any_ user-facing change?
No, the feature is not released yet.
### How was this patch tested?
Unit tests
Closes #33758 from gengliangwang/revertGroupByAlias.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 8bfb4f1e72)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Add the document for the new RocksDBStateStoreProvider.
### Why are the changes needed?
User guide for the new feature.
### Does this PR introduce _any_ user-facing change?
No, doc only.
### How was this patch tested?
Doc only.
Closes #33683 from xuanyuanking/SPARK-36041.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 3d57e00a7f)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Document the push-based shuffle feature with a high-level overview of the feature and the corresponding configuration options for both the shuffle server side and the client side. This is how the changes to the doc look in the browser ([img](https://user-images.githubusercontent.com/8871522/129231582-ad86ee2f-246f-4b42-9528-4ccd693e86d2.png))
### Why are the changes needed?
Helps users understand the feature
### Does this PR introduce _any_ user-facing change?
Docs
### How was this patch tested?
N/A
Closes #33615 from venkata91/SPARK-36374.
Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2270ecf32f)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
### What changes were proposed in this pull request?
This patch supports dynamic gap duration in session window.
### Why are the changes needed?
The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration, which determines the gap by looking at the current data. For example, in our use case, the gap may differ depending on a certain column in the input rows.
### Does this PR introduce _any_ user-facing change?
Yes, users can specify dynamic gap duration.
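The idea of a per-row gap can be shown with a standalone sketch of sessionization. It is a simplified model in plain Python, not the Spark implementation: the `sessionize` function, the `ts` and `kind` fields, and the gap rule are all hypothetical, and treating the window end as exclusive is an assumption of this sketch.

```python
from typing import Any, Callable, Dict, List, Tuple

Event = Dict[str, Any]


def sessionize(events: List[Event], gap_for: Callable[[Event], int]) -> List[Tuple[int, int]]:
    """Merge events into sessions, using a per-event gap duration."""
    sessions: List[Tuple[int, int]] = []
    for event in sorted(events, key=lambda e: e["ts"]):
        start = event["ts"]
        end = start + gap_for(event)  # each event opens a window [start, end)
        if sessions and start < sessions[-1][1]:
            # The event falls inside the open session: extend it if needed.
            prev_start, prev_end = sessions[-1]
            sessions[-1] = (prev_start, max(prev_end, end))
        else:
            # No overlap with the previous session: start a new one.
            sessions.append((start, end))
    return sessions


events = [
    {"ts": 0, "kind": "short"},
    {"ts": 3, "kind": "long"},
    {"ts": 20, "kind": "short"},
]
# Hypothetical rule: the gap depends on a column of the input row.
gap_rule = lambda e: 10 if e["kind"] == "long" else 5
print(sessionize(events, gap_rule))  # [(0, 13), (20, 25)]
```

With a static gap, `gap_for` would return the same value for every event; the dynamic case only changes how `end` is computed for each row.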
### How was this patch tested?
Modified existing tests and added a new test.
Closes #33691 from viirya/dynamic-session-window-gap.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>