Commit graph

3267 commits

Author SHA1 Message Date
Gengliang Wang 5d45a415f3 Preparing Spark release v3.2.0-rc7 2021-10-06 11:45:26 +00:00
Ye Zhou 88f4809142 [SPARK-36892][CORE] Disable batch fetch for a shuffle when push based shuffle is enabled
We found an issue where a user configured both AQE and push-based shuffle, and the job started to hang after running some stages. Thread dumps taken from the executors showed the tasks still waiting to fetch shuffle blocks.
This PR proposes changes to fix the issue.

### What changes were proposed in this pull request?
Disable batch fetch for a shuffle when push-based shuffle is enabled.

### Why are the changes needed?
Without this patch, enabling both AQE and push-based shuffle can cause tasks to hang.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested the PR in our environment with spark-shell; the queries are:

sql("""SELECT CASE WHEN rand() < 0.8 THEN 100 ELSE CAST(rand() * 30000000 AS INT) END AS s_item_id, CAST(rand() * 100 AS INT) AS s_quantity, DATE_ADD(current_date(), - CAST(rand() * 360 AS INT)) AS s_date FROM RANGE(1000000000)""").createOrReplaceTempView("sales")
// Dynamically coalesce partitions
sql("""SELECT s_date, sum(s_quantity) AS q FROM sales GROUP BY s_date ORDER BY q DESC""").collect

Unit tests to be added.

Closes #34156 from zhouyejoe/SPARK-36892.

Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 31b6f614d3)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-10-06 15:42:40 +08:00
Hyukjin Kwon 939c4d93b5 [MINOR][DOCS] Mention other Python dependency tools in documentation
### What changes were proposed in this pull request?

Self-contained.

### Why are the changes needed?

To give users more information about the available Python dependency management options in PySpark.
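
For illustration, the kind of workflow such docs cover, sketched with the conda-pack approach (environment name, packages, and file names are illustrative):

```bash
# Package a Conda environment and ship it with a PySpark job.
# Note: `conda activate` in a script assumes conda has been initialized for the shell.
conda create -y -n pyspark_env -c conda-forge pandas conda-pack
conda activate pyspark_env
conda pack -f -o pyspark_env.tar.gz
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_env.tar.gz#environment app.py
```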

### Does this PR introduce _any_ user-facing change?
Yes, documentation change.

### How was this patch tested?
Manually built the docs and checked the results:
<img width="918" alt="Screen Shot 2021-09-29 at 10 11 56 AM" src="https://user-images.githubusercontent.com/6477701/135186536-2f271378-d06b-4c6b-a4be-691ce395db9f.png">
<img width="976" alt="Screen Shot 2021-09-29 at 10 12 22 AM" src="https://user-images.githubusercontent.com/6477701/135186541-0f4c5615-bc49-48e2-affd-dc2f5c0334bf.png">
<img width="920" alt="Screen Shot 2021-09-29 at 10 12 42 AM" src="https://user-images.githubusercontent.com/6477701/135186551-0b613096-7c86-4562-b345-ddd60208367b.png">

Closes #34134 from HyukjinKwon/minor-docs-py-deps.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 13c2b711e4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-29 14:46:18 +09:00
Gengliang Wang 4bd358474b Preparing development version 3.2.1-SNAPSHOT 2021-09-28 10:53:42 +00:00
Gengliang Wang dde73e2e1c Preparing Spark release v3.2.0-rc6 2021-09-28 10:53:35 +00:00
Gengliang Wang 0c57bb8f7f Preparing development version 3.2.1-SNAPSHOT 2021-09-27 08:24:50 +00:00
Gengliang Wang 49aea14c5a Preparing Spark release v3.2.0-rc5 2021-09-27 08:24:44 +00:00
Gengliang Wang 2348cce37e Preparing development version 3.2.1-SNAPSHOT 2021-09-26 12:28:46 +00:00
Gengliang Wang 2ed8c08c5b Preparing Spark release v3.2.0-rc5 2021-09-26 12:28:40 +00:00
Gengliang Wang da722d43cb Preparing development version 3.2.1-SNAPSHOT 2021-09-24 10:03:23 +00:00
Gengliang Wang 9e35703211 Preparing Spark release v3.2.0-rc5 2021-09-24 10:03:16 +00:00
Gengliang Wang 0fb7127f85 Preparing development version 3.2.1-SNAPSHOT 2021-09-23 08:46:28 +00:00
Gengliang Wang b609f2fe0c Preparing Spark release v3.2.0-rc4 2021-09-23 08:46:22 +00:00
jiaoqb d203ed51ca [SPARK-36791][DOCS] Fix spelling mistakes in running-on-yarn.md file where JHS_POST should be JHS_HOST
### What changes were proposed in this pull request?
This PR fixes SPARK-36791 by replacing `JHS_POST` with `JHS_HOST`.

### Why are the changes needed?
There is a spelling mistake in the running-on-yarn.md file: `JHS_POST` should be `JHS_HOST`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Not needed for docs

Closes #34031 from jiaoqingbo/jiaoqingbo.

Authored-by: jiaoqb <jiaoqb@asiainfo.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8a1a91bd71)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-23 12:48:06 +09:00
Gengliang Wang b0249851f6 Preparing development version 3.2.1-SNAPSHOT 2021-09-18 11:30:12 +00:00
Gengliang Wang 96044e9735 Preparing Spark release v3.2.0-rc3 2021-09-18 11:30:06 +00:00
Dongjoon Hyun fbd24621ce [SPARK-36759][BUILD][FOLLOWUP] Update version in scala-2.12 profile and doc
### What changes were proposed in this pull request?

This is a follow-up to fix leftovers from switching the Scala version.

### Why are the changes needed?

The Scala version should be consistent across the build profiles and docs.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is not covered by unit tests; it needs a manual check. There are no remaining references to `2.12.14` (the matches below are coincidental substrings in test data):
```
$ git grep 2.12.14
R/pkg/tests/fulltests/test_sparkSQL.R:               c(as.Date("2012-12-14"), as.Date("2013-12-15"), as.Date("2014-12-16")))
data/mllib/ridge-data/lpsa.data:3.5307626,0.987291634724086 -0.36279314978779 -0.922212414640967 0.232904453212813 -0.522940888712441 1.79270085261407 0.342627053981254 1.26288870310799
sql/hive/src/test/resources/data/files/over10k:-3|454|65705|4294967468|62.12|14.32|true|mike white|2013-03-01 09:11:58.703087|40.18|joggying
```

Closes #34020 from dongjoon-hyun/SPARK-36759-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit adbea252db)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-16 05:11:05 -07:00
Gengliang Wang d20ed030a8 [SPARK-36775][DOCS] Add documentation for ANSI store assignment rules
### What changes were proposed in this pull request?

Add documentation for ANSI store assignment rules for
- the valid source/target type combinations
- the runtime errors raised on numeric overflow (see the sketch below)
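
For illustration, a minimal sketch of the overflow behavior this page documents (the table name is illustrative):

```scala
// Under ANSI store assignment, long -> int is an allowed store assignment,
// but an out-of-range value fails at runtime instead of being silently truncated.
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
sql("CREATE TABLE t(i INT) USING parquet")
sql("INSERT INTO t VALUES (2147483648)")  // BIGINT literal: expect a runtime overflow error
```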

### Why are the changes needed?

Better docs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Built the docs and previewed:
![image](https://user-images.githubusercontent.com/1097932/133554600-8c80c0a9-8753-4c01-94d0-994d8082e319.png)

Closes #34014 from gengliangwang/addStoreAssignDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit ff7705ad2a)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-16 15:50:57 +08:00
Liang-Chi Hsieh d22182e474 [SPARK-34479][SQL][DOC][FOLLOWUP] Add zstandard to avro supported codecs
### What changes were proposed in this pull request?

Adding `zstandard` to the supported Avro codecs.
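
For example, a minimal sketch of using the codec (assumes the external spark-avro module is on the classpath; the output path is illustrative):

```scala
spark.conf.set("spark.sql.avro.compression.codec", "zstandard")
spark.range(10).write.format("avro").save("/tmp/avro_zstd")
```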

### Why are the changes needed?

To improve the document.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Doc only.

Closes #33943 from viirya/minor-doc.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 647ffe655f)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-08 23:21:38 -07:00
Kousuke Saruta a41dc4516e [SPARK-36153][SQL][DOCS][FOLLOWUP] Fix the description about the possible values of spark.sql.catalogImplementation property
### What changes were proposed in this pull request?

This PR fixes the description about the possible values of `spark.sql.catalogImplementation` property.
It was added in SPARK-36153 (#33362) but the possible values are `hive` or `in-memory` rather than `true` or `false`.

### Why are the changes needed?

To fix wrong description.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed `in-memory` and `hive` are the valid values with spark-shell.
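
For example, a sketch of that check:

```scala
// started e.g. with: bin/spark-shell --conf spark.sql.catalogImplementation=in-memory
spark.conf.get("spark.sql.catalogImplementation")  // "in-memory" or "hive"
```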

Closes #33923 from sarutak/fix-doc-about-catalogImplementation.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a5fe5d368c)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-07 11:39:53 +09:00
Hyukjin Kwon e9f2e34261 [SPARK-36631][R] Ask users if they want to download and install SparkR in non Spark scripts
### What changes were proposed in this pull request?

This PR proposes to ask users whether they want to download and install Spark when SparkR is installed from CRAN.

`SPARKR_ASK_INSTALLATION` environment variable was added in case other notebook projects are affected.
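
A sketch of how such a project might opt out (assumption: setting the variable to `"FALSE"` suppresses the prompt):

```r
# Suppress the interactive prompt, e.g. in a notebook environment (assumed semantics).
Sys.setenv(SPARKR_ASK_INSTALLATION = "FALSE")
library(SparkR)
sparkR.session(master = "local")
```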

### Why are the changes needed?

This is required for CRAN. Currently SparkR is removed: https://cran.r-project.org/web/packages/SparkR/index.html.
See also https://lists.apache.org/thread.html/r02b9046273a518e347dfe85f864d23d63d3502c6c1edd33df17a3b86%40%3Cdev.spark.apache.org%3E

### Does this PR introduce _any_ user-facing change?

Yes, `sparkR.session(...)` will ask whether users want to download and install the Spark package when they are in a plain R shell or `Rscript`.

### How was this patch tested?

**R shell**

Valid input (`n`):

```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
```

Invalid input:

```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```

Valid input (`y`):

```
> sparkR.session(master="local")
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
...
```

**Rscript**

```
cat tmp.R
```
```
library(SparkR, lib.loc = c(file.path(".", "R", "lib")))
sparkR.session(master="local")
```

```
Rscript tmp.R
```

Valid input (`n`):

```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
  Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
Calls: sparkR.session -> sparkCheckInstall
```

Invalid input:

```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```

Valid input (`y`):

```
...
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
...
```

`bin/sparkR` and `bin/spark-submit *.R` are not affected (tested).

Closes #33887 from HyukjinKwon/SPARK-36631.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e983ba8fce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-02 13:27:55 +09:00
Gengliang Wang 1bad04d028 Preparing development version 3.2.1-SNAPSHOT 2021-08-31 17:04:14 +00:00
Gengliang Wang 03f5d23e96 Preparing Spark release v3.2.0-rc2 2021-08-31 17:04:08 +00:00
Yuanjian Li f50f2d474c [SPARK-35611][SS][FOLLOW-UP] Improve the user guide document
### What changes were proposed in this pull request?
Improve the user guide document.

### Why are the changes needed?
Make the user guide clear.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Doc change only.

Closes #33854 from xuanyuanking/SPARK-35611-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit dd3f0fa8c2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:27:37 +09:00
Leona Yoda 36be232eea [SPARK-36541][DOCS][PYTHON] Replace the word Koalas to pandas-on-Spark
### What changes were proposed in this pull request?

Replace images in the pandas-on-Spark documentation because those images use the word Koalas.

### Why are the changes needed?

Images in the "Transform and apply a function" documentation still use the word Koalas, although the word was replaced with pandas-on-Spark by https://github.com/apache/spark/pull/32835.

The images should be updated to match the new wording.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Screenshots:
![130179112-8485fdde-b422-4834-8b23-fe69e7402118](https://user-images.githubusercontent.com/14937752/130186051-d6ff65f0-c121-40bd-b4f1-2fbc10e76f3e.png)
![130179239-8dae7812-4d81-4f8c-8558-b75e4eae3787](https://user-images.githubusercontent.com/14937752/130186063-17d4a95f-0b9d-49d3-85c7-13ea07e4b6bb.png)
![130179273-10f9fbc3-0a62-4e1a-ab6e-7049d75653a1](https://user-images.githubusercontent.com/14937752/130186074-7d684669-b9ef-4a4e-8a2d-c63bb9800ddb.png)
![130179311-616545af-dde2-4dec-807f-dde0a0d4bfbe](https://user-images.githubusercontent.com/14937752/130186095-20669673-b1d3-4552-97bf-86bbc1a5d43b.png)
Environment
- Windows 10
- Google Chrome 92.0.4515.159

[images.pptx](https://github.com/apache/spark/files/7029087/images.pptx)

Closes #33786 from yoda-mon/replace-pyspark-doc-images.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit aeb3da2798)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 19:03:11 +09:00
Max Gekk 5198c0c316 [SPARK-35581][SPARK-36567][SQL][DOCS][FOLLOWUP] Update the SQL migration guide about foldable special datetime values
### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">

### Why are the changes needed?
To inform users about actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By generating docs, and checking results manually.

Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit c4e739fb4b)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-26 10:02:15 +08:00
Kousuke Saruta beabf91ea1 [SPARK-35236][SQL][DOCS][FOLLOWUP] Mention ARCHIVE as an acceptable resource type for CREATE FUNCTION statement
### What changes were proposed in this pull request?

This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for `CREATE FUNCTION` statement.
`ARCHIVE` is acceptable as of SPARK-35236 (#32359).
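
For illustration, a hedged sketch of the statement the page now documents (the class name and paths are hypothetical):

```scala
sql("""
  CREATE FUNCTION my_func AS 'com.example.MyUDF'
  USING JAR '/path/to/udf.jar', ARCHIVE '/path/to/deps.zip'
""")
```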

### Why are the changes needed?

To keep the document up to date.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)

Closes #33823 from sarutak/create-function-archive.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bd0a4950ae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:05:00 +09:00
Gengliang Wang 5463caac0d Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843890 and 5a48eb8d00

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch-3.2 and fix it on the master branch later.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests.

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:39:29 -07:00
Gengliang Wang eea7d0037e [SPARK-36557][DOCS] Update the MAVEN_OPTS in Spark build docs
### What changes were proposed in this pull request?

As Jacek Laskowski pointed out on the dev list, compiling Spark with the current MAVEN_OPTS from the Spark documentation leads to a StackOverflowError.
We should update it with `-Xss64m` to avoid this.
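
The updated setting would look like the sketch below (the other flags are assumed to match the GitHub Actions build referenced under testing):

```bash
export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
./build/mvn -DskipTests clean package
```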

### Why are the changes needed?

Correct the documentation

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test. The MAVEN_OPTS is now consistent with our GitHub Actions build.

Closes #33804 from gengliangwang/updateBuildDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3da0e9500f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 09:46:41 +09:00
Venkata krishnan Sowrirajan 0f2e318894 [SPARK-36374][FOLLOW-UP] Change config key spark.shuffle.server.mergedShuffleFileManagerImpl to spark.shuffle.push.server.mergedShuffleFileManagerImpl
### What changes were proposed in this pull request?

Minor changes to change the config key name from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`. This is missed out in https://github.com/apache/spark/pull/33615.
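
With this rename, the server-side (external shuffle service) setting would look like the sketch below (the resolver class name comes from the push-based shuffle implementation and is an assumption here):

```
spark.shuffle.push.server.mergedShuffleFileManagerImpl=org.apache.spark.network.shuffle.RemoteBlockPushResolver
```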

### Why are the changes needed?

To keep the config names consistent

### Does this PR introduce _any_ user-facing change?

Yes, this is a change in the config key name, but the new name has not been released yet, so technically there is no user-facing change.

### How was this patch tested?

Existing tests.

Closes #33799 from venkata91/SPARK-36374-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 7b2842e986)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-22 01:29:36 -05:00
Liang-Chi Hsieh 212a21ee4f [MINOR][SS][DOCS] Update doc for streaming deduplication
### What changes were proposed in this pull request?

This patch fixes an error about streaming deduplication in Structured Streaming, and also updates an item about unsupported operations.
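
For context, the documented pattern looks roughly like this (a sketch; `events` is assumed to be a streaming DataFrame with `guid` and `eventTime` columns):

```scala
// Deduplicate on guid within the event-time watermark, so old state can be dropped.
val deduped = events
  .withWatermark("eventTime", "10 minutes")
  .dropDuplicates("guid", "eventTime")
```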

### Why are the changes needed?

Update the user document.

### Does this PR introduce _any_ user-facing change?

No. It's a doc only change.

### How was this patch tested?

Doc only change.

Closes #33801 from viirya/minor-ss-deduplication.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 5876e04de2)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-21 18:20:27 -07:00
Angerszhuuuu 45c4b751f3 [SPARK-36549][SQL] Add taskStatus supports multiple value to monitoring doc
### What changes were proposed in this pull request?
In the stage-related REST API, we support the `taskStatus` parameter as a list:
```
 QueryParam("taskStatus") taskStatus: JList[TaskStatus]
```
In a REST call it should be written like:
```
taskStatus=SUCCESS&taskStatus=FAILED
```

This is useful but not shown in the doc, and many users don't know how to write list parameters.
So this adds the feature to the monitoring doc too.

### Why are the changes needed?
Make doc clear

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
With the REST request:
```
http://localhost:4040/api/v1/applications/local-1629432414554/stages/0?details=true&taskStatus=FAILED
```
The tasks in the response:
```
tasks" : {
    "0" : {
      "taskId" : 0,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.515GMT",
      "duration" : 273,
      "executorId" : "driver",
      "host" : "host",
      "status" : "FAILED",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "errorMessage" : "java.lang.RuntimeException\n\tat org.apache.spark.ui.UISuite.$anonfun$new$8(UISuite.scala:95)\n\tat scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)\n\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)\n\tat org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorDeserializeCpuTime" : 0,
        "executorRunTime" : 206,
        "executorCpuTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 67,
      "gettingResultTime" : 0
    }
  },
```

With the REST request:
```
http://localhost:4040/api/v1/applications/local-1629432414554/stages/0?details=true&taskStatus=FAILED&taskStatus=SUCCESS
```
The tasks in the response:
```
"tasks" : {
    "1" : {
      "taskId" : 1,
      "index" : 1,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.786GMT",
      "duration" : 16,
      "executorId" : "driver",
      "host" : "host",
      "status" : "SUCCESS",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "taskMetrics" : {
        "executorDeserializeTime" : 2,
        "executorDeserializeCpuTime" : 2638000,
        "executorRunTime" : 2,
        "executorCpuTime" : 1993000,
        "resultSize" : 837,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 12,
      "gettingResultTime" : 0
    },
    "0" : {
      "taskId" : 0,
      "index" : 0,
      "attempt" : 0,
      "launchTime" : "2021-08-20T04:06:55.515GMT",
      "duration" : 273,
      "executorId" : "driver",
      "host" : "host",
      "status" : "FAILED",
      "taskLocality" : "PROCESS_LOCAL",
      "speculative" : false,
      "accumulatorUpdates" : [ ],
      "errorMessage" : "java.lang.RuntimeException\n\tat org.apache.spark.ui.UISuite.$anonfun$new$8(UISuite.scala:95)\n\tat scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)\n\tat scala.collection.Iterator.foreach(Iterator.scala:943)\n\tat scala.collection.Iterator.foreach$(Iterator.scala:943)\n\tat org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2(RDD.scala:1003)\n\tat org.apache.spark.rdd.RDD.$anonfun$foreach$2$adapted(RDD.scala:1003)\n\tat org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2254)\n\tat org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)\n\tat org.apache.spark.scheduler.Task.run(Task.scala:136)\n\tat org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:507)\n\tat org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1468)\n\tat org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:510)\n\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)\n\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)\n\tat java.lang.Thread.run(Thread.java:748)\n",
      "taskMetrics" : {
        "executorDeserializeTime" : 0,
        "executorDeserializeCpuTime" : 0,
        "executorRunTime" : 206,
        "executorCpuTime" : 0,
        "resultSize" : 0,
        "jvmGcTime" : 0,
        "resultSerializationTime" : 0,
        "memoryBytesSpilled" : 0,
        "diskBytesSpilled" : 0,
        "peakExecutionMemory" : 0,
        "inputMetrics" : {
          "bytesRead" : 0,
          "recordsRead" : 0
        },
        "outputMetrics" : {
          "bytesWritten" : 0,
          "recordsWritten" : 0
        },
        "shuffleReadMetrics" : {
          "remoteBlocksFetched" : 0,
          "localBlocksFetched" : 0,
          "fetchWaitTime" : 0,
          "remoteBytesRead" : 0,
          "remoteBytesReadToDisk" : 0,
          "localBytesRead" : 0,
          "recordsRead" : 0
        },
        "shuffleWriteMetrics" : {
          "bytesWritten" : 0,
          "writeTime" : 0,
          "recordsWritten" : 0
        }
      },
      "executorLogs" : { },
      "schedulerDelay" : 67,
      "gettingResultTime" : 0
    }
  },
```

Closes #33793 from AngersZhuuuu/SPARK-36549.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 5740d5641d)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-22 09:45:34 +09:00
ulysses-you e0d2d8f1a6 [SPARK-35083][CORE][FOLLLOWUP] Improve docs and migration guide
### What changes were proposed in this pull request?

* improve docs in `docs/job-scheduling.md`
* add migration guide docs in `docs/core-migration-guide.md`

### Why are the changes needed?

Helps users migrate.
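
For context, SPARK-35083 allows the fair scheduler allocation file to be loaded from a remote filesystem. A sketch of such a configuration (paths illustrative):

```
spark.scheduler.mode=FAIR
spark.scheduler.allocation.file=hdfs:///path/to/fairscheduler.xml
```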

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Pass CI

Closes #33794 from ulysses-you/SPARK-35083-f.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
(cherry picked from commit 90cbf9ca3e)
Signed-off-by: Kent Yao <yao@apache.org>
2021-08-20 21:33:06 +08:00
Gengliang Wang 69be513c5e Preparing development version 3.2.1-SNAPSHOT 2021-08-20 12:40:47 +00:00
Gengliang Wang 6bb3523d8e Preparing Spark release v3.2.0-rc1 2021-08-20 12:40:40 +00:00
Gengliang Wang fafdc1482b Revert "Preparing Spark release v3.2.0-rc1"
This reverts commit 8e58fafb05.
2021-08-20 20:07:02 +08:00
Gengliang Wang c829ed53ff Revert "Preparing development version 3.2.1-SNAPSHOT"
This reverts commit 4f1d21571d.
2021-08-20 20:07:01 +08:00
Gengliang Wang f47a519721 [SPARK-36551][BUILD] Add sphinx-plotly-directive in Spark release Dockerfile
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/32726, Python doc build requires `sphinx-plotly-directive`.
This PR installs it in `spark-rm/Dockerfile` to make sure `do-release-docker.sh` can run successfully.
Also, this PR mentions it in the README of docs.
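
The Dockerfile addition amounts to something like the following (a sketch; the exact pin and invocation in `spark-rm/Dockerfile` may differ):

```bash
pip install sphinx-plotly-directive
```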

### Why are the changes needed?

Fix release script and update README of docs

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manual test locally.

Closes #33797 from gengliangwang/fixReleaseDocker.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 42eebb84f5)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-20 20:02:44 +08:00
Yuanjian Li 36c24a03bd [SPARK-35312][SS][FOLLOW-UP] More documents and checking logic for the new options
### What changes were proposed in this pull request?
Add more documentation and checking logic for the new options `minOffsetsPerTrigger` and `maxTriggerDelay`.
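
For context, a sketch of a Kafka source using the new options (server, topic, and values are illustrative):

```scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "events")
  .option("minOffsetsPerTrigger", "10000")  // wait for at least this many offsets...
  .option("maxTriggerDelay", "15m")         // ...but fire anyway after this delay
  .load()
```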

### Why are the changes needed?
Have a clear description of the behavior introduced in SPARK-35312

### Does this PR introduce _any_ user-facing change?
Yes.
If the user sets minOffsetsPerTrigger > maxOffsetsPerTrigger, the new code throws an AnalysisException. The original behavior was to silently ignore maxOffsetsPerTrigger.

### How was this patch tested?
Existing tests.

Closes #33792 from xuanyuanking/SPARK-35312-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit a0b24019ed)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-20 10:41:54 +09:00
Gengliang Wang 4f1d21571d Preparing development version 3.2.1-SNAPSHOT 2021-08-19 14:08:32 +00:00
Gengliang Wang 8e58fafb05 Preparing Spark release v3.2.0-rc1 2021-08-19 14:08:26 +00:00
Gengliang Wang fb56627f21 Revert "[SPARK-35083][FOLLOW-UP][CORE] Add migration guide for the remote scheduler pool files support"

This reverts commit e3902d1975. The feature is an improvement rather than a behavior change.

Closes #33789 from gengliangwang/revertDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit b36b1c7e8a)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 21:30:19 +08:00
yi.wu 9544c24560 [SPARK-35083][FOLLOW-UP][CORE] Add migration guide for the remote scheduler pool files support
### What changes were proposed in this pull request?

Add remote scheduler pool files support to the migration guide.

### Why are the changes needed?

To highlight this useful support.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass exiting tests.

Closes #33785 from Ngone51/SPARK-35083-follow-up.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit e3902d1975)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-19 16:29:19 +08:00
Wenchen Fan 8f3b4c4b7d [SPARK-33687][SQL][DOC][FOLLOWUP] Merge the doc pages of ANALYZE TABLE and ANALYZE TABLES
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/30648

ANALYZE TABLE and ANALYZE TABLES are essentially the same command, so it is odd to document them on two different pages. This PR proposes to merge them into one doc page.
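
For reference, the two variants now documented on one page (a sketch; table and database names are illustrative):

```scala
sql("ANALYZE TABLE my_table COMPUTE STATISTICS")  // one table
sql("ANALYZE TABLES IN my_db COMPUTE STATISTICS") // all tables in a database
```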

### Why are the changes needed?

simplify the doc

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #33781 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 07d173a8b0)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-19 11:04:20 +08:00
Wenchen Fan 5107ad3157 [SPARK-36535][SQL] Refine the sql reference doc
### What changes were proposed in this pull request?

Refine the SQL reference doc:
- remove useless subitems in the sidebar
- remove useless sub-menu-pages (e.g. `sql-ref-syntax-aux.md`)
- avoid using `#####` in `sql-ref-literals.md`

### Why are the changes needed?

The subitems in the sidebar are quite useless, as the menu page serves the same functionalities:
<img width="1040" alt="WX20210817-2358402x" src="https://user-images.githubusercontent.com/3182036/129765924-d7e69bc1-e351-4581-a6de-f2468022f372.png">
It's also extra work to keep the menu page and sidebar subitems in sync (the ANSI compliance page is already out of sync).

The sub-menu-pages are only referenced by the sidebar, and duplicates the content of the menu page. As a result, the `sql-ref-syntax-aux.md` is already outdated compared to the menu page. It's easier to just look at the menu page.

The `#####` is not rendered properly:
<img width="776" alt="WX20210818-0001192x" src="https://user-images.githubusercontent.com/3182036/129766760-6f385443-e597-44aa-888d-14d128d45f84.png">
It's better to avoid using it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #33767 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 4b015e8d7d)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-17 12:46:49 -07:00
Gengliang Wang 70635b4b26 Revert "[SPARK-35028][SQL] ANSI mode: disallow group by aliases"
### What changes were proposed in this pull request?

Revert [[SPARK-35028][SQL] ANSI mode: disallow group by aliases ](https://github.com/apache/spark/pull/32129)

### Why are the changes needed?

It turns out that many users are using the group-by-alias feature. Spark has its own precedence rule when alias names conflict with column names in the GROUP BY clause: always use the table column. This should be reasonable and acceptable.
Also, external DBMS such as PostgreSQL and MySQL allow grouping by alias, too.

As we are going to announce ANSI mode GA in Spark 3.2, I suggest allowing the group by alias in ANSI mode.
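
For illustration, the kind of query this revert keeps working under ANSI mode (a sketch):

```scala
// 'parity' is a select-list alias; grouping by it remains allowed.
sql("SELECT id % 2 AS parity, count(*) AS cnt FROM range(10) GROUP BY parity").show()
```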

### Does this PR introduce _any_ user-facing change?

No, the feature is not released yet.

### How was this patch tested?

Unit tests

Closes #33758 from gengliangwang/revertGroupByAlias.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 8bfb4f1e72)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-17 20:24:09 +08:00
Yuanjian Li 4caa43e398 [SPARK-36041][SS][DOCS] Introduce the RocksDBStateStoreProvider in the programming guide
### What changes were proposed in this pull request?
Add the document for the new RocksDBStateStoreProvider.
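
For context, the provider is enabled through a SQL config, roughly as below (key and class name as documented in the programming guide):

```scala
spark.conf.set(
  "spark.sql.streaming.stateStore.providerClass",
  "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider")
```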

### Why are the changes needed?
User guide for the new feature.

### Does this PR introduce _any_ user-facing change?
No, doc only.

### How was this patch tested?
Doc only.

Closes #33683 from xuanyuanking/SPARK-36041.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 3d57e00a7f)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-08-16 12:32:19 -07:00
Venkata krishnan Sowrirajan 233af3d239 [SPARK-36374][SHUFFLE][DOC] Push-based shuffle high level user documentation
### What changes were proposed in this pull request?

Document the push-based shuffle feature with a high-level overview and the corresponding configuration options for both the shuffle server side and the client side. This is how the doc changes look in the browser ([img](https://user-images.githubusercontent.com/8871522/129231582-ad86ee2f-246f-4b42-9528-4ccd693e86d2.png)).

### Why are the changes needed?

Helps users understand the feature

### Does this PR introduce _any_ user-facing change?

Docs

### How was this patch tested?

N/A

Closes #33615 from venkata91/SPARK-36374.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2270ecf32f)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-16 10:25:33 -05:00
Liang-Chi Hsieh 3aa933b162 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session windows is currently a static value. To support more complex usage, it is better to support a dynamic gap duration determined by looking at the current data. For example, in our use case, we may derive a different gap from a certain column in the input rows.
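
A sketch of what a dynamic gap looks like (`events` and its column names and durations are illustrative):

```scala
import org.apache.spark.sql.functions.{col, count, session_window, when}

// Gap duration computed per row from a column, instead of one static value.
val sessions = events
  .groupBy(
    session_window(col("eventTime"),
      when(col("userType") === "premium", "20 minutes").otherwise("5 minutes")),
    col("userId"))
  .agg(count("*").as("numEvents"))
```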

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:16 +09:00
Max Gekk 8dbcbebc36 [SPARK-36468][SQL][DOCS] Update docs about ANSI interval literals
### What changes were proposed in this pull request?
In the PR, I propose to update the doc page https://spark.apache.org/docs/latest/sql-ref-literals.html#interval-literal, and describe formats of ANSI interval literals.

<img width="1032" alt="Screenshot 2021-08-11 at 10 31 36" src="https://user-images.githubusercontent.com/1580697/128988454-7a6ac435-409b-4961-9b79-ebecfb141d5e.png">
<img width="1030" alt="Screenshot 2021-08-10 at 20 58 04" src="https://user-images.githubusercontent.com/1580697/128912018-a4ea3ee5-f252-49c7-a90e-5beaf7ac868f.png">

### Why are the changes needed?
To improve UX with Spark SQL, and inform users about recently added ANSI interval literals.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually checked the generated docs:
```
$ SKIP_API=1 SKIP_RDOC=1 SKIP_PYTHONDOC=1 SKIP_SCALADOC=1 bundle exec jekyll build
```

Closes #33693 from MaxGekk/doc-ansi-interval-literals.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit bbf988bd73)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-08-11 13:38:52 +03:00