Commit graph

8100 commits

Author SHA1 Message Date
Gengliang Wang b0249851f6 Preparing development version 3.2.1-SNAPSHOT 2021-09-18 11:30:12 +00:00
Gengliang Wang 96044e9735 Preparing Spark release v3.2.0-rc3 2021-09-18 11:30:06 +00:00
Ye Zhou d4d8a6320f [SPARK-36772] FinalizeShuffleMerge fails with an exception due to attempt id not matching
### What changes were proposed in this pull request?
Remove the appAttemptId from TransportConf, and parsing through SparkEnv.

### Why are the changes needed?
Push based shuffle will fail if there are any attemptId set in the SparkConf, as the attemptId is not set correctly in Driver.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tested within our Yarn cluster. Without this PR, the Driver will fail to finalize the shuffle merge on all the mergers. After the patch, Driver can successfully finalize the shuffle merge and the push based shuffle can work fine.
Also with unit test to verify the attemptId is being set in the BlockStoreClient in Driver.

Closes #34018 from zhouyejoe/SPARK-36772.

Authored-by: Ye Zhou <yezhou@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit cabc36b54d)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-18 15:52:14 +08:00
Josh Rosen 3502fda783 [SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite
### What changes were proposed in this pull request?

This PR refactors test code in order to improve the debugability of `SparkSubmitSuite`.

The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`.

In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary.

### Why are the changes needed?

Previously, `SparkSubmitSuite` tests would fail with messages like:

```
[info] - launch simple application with spark-submit *** FAILED *** (1 second, 832 milliseconds)
[info]   Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command.

After this change, those tests will fail with detailed error messages that include the text of failed command plus timestamped logs captured from the failed proces:

```
[info] - launch simple application with spark-submit *** FAILED *** (2 seconds, 800 milliseconds)
[info]   spark-submit returned with exit code 101.
[info]   Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar'
[info]
[info]   2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[info]   2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually ran the affected test suites.

Closes #34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
(cherry picked from commit 3ae6e6775b)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2021-09-16 14:31:22 -07:00
Minchu Yang 4a486f40cf [SPARK-36705][FOLLOW-UP] Fix unnecessary logWarning when PUSH_BASED_SHUFFLE_ENABLED is set to false
### What changes were proposed in this pull request?

Only throw logWarning when `PUSH_BASED_SHUFFLE_ENABLED` is set to true and `canDoPushBasedShuffle` is false

### Why are the changes needed?

Currently, this logWarning will still be printed out even when `PUSH_BASED_SHUFFLE_ENABLED` is set to false, which is unnecessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed existing UT.

Closes #33984 from rmcyang/SPARK-36705-follow-up.

Authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2d7dc7c7ce)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-09-13 23:24:25 -05:00
Minchu Yang 79e148ee93 [SPARK-36705][SHUFFLE] Disable push based shuffle when IO encryption is enabled or serializer is not relocatable
### What changes were proposed in this pull request?

Disable push-based shuffle when IO encryption is enabled or serializer does not support relocation of serialized objects.

### Why are the changes needed?

Push based shuffle is not compatible with IO encryption or non-relocatable serialization.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some tests to check whether push-based shuffle can be disabled successfully when IO encryption is enabled or a serializer that does not support relocation of serialized object is used.

Closes #33976 from rmcyang/SPARK-36705.

Authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 999473b1a5)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-09-13 16:15:42 -05:00
Lukas Rytz 2e7583799e [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile)
As [reported on `devspark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom.

### What changes were proposed in this pull request?

This PR suggests to work around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the the `change-scala-version.sh` script.

I included an upgrade to scala-parallel-collections version 1.0.3, the changes compared to 0.2.0 are minor.
  - removed OSGi metadata
  - renamed some internal inner classes
  - added `Automatic-Module-Name`

### Why are the changes needed?

According to the posts, this solves issues for developers that write unit tests for their applications.

Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally

Closes #33948 from lrytz/parCollDep.

Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 1a62e6a2c1)
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-13 11:06:58 -05:00
yangjie01 90c4cf3cf3 [SPARK-36636][CORE][TEST] LocalSparkCluster change to use tmp workdir in test to avoid directory name collision
### What changes were proposed in this pull request?
As described in SPARK-36636,if the test cases with config `local-cluster[n, c, m]`  are run continuously within 1 second, the workdir name collision will occur because appid use format as `app-yyyyMMddHHmmss-0000` and workdir name associated with it  in test now,  the related logs are as follows:

```
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/1
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
21/09/08 22:44:32.266 dispatcher-event-loop-0 INFO Worker: Asked to launch executor app-20210908074432-0000/0 for test
21/09/08 22:44:32.266 dispatcher-event-loop-0 ERROR Worker: Failed to launch executor app-20210908074432-0000/0 for test.
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/0
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

Since the default value of s`park.deploy.maxExecutorRetries` is 10, the test failed will occur when 5 consecutive cases with local-cluster[3, 1, 1024] are completed within 1 second:

1. case 1: use worker directories: `/app-202109102324-0000/0`, `/app-202109102324-0000/1`, `/app-202109102324-0000/2`
2. case 2: retry 3 times then use worker directories: `/app-202109102324-0000/3`, `/app-202109102324-0000/4`, `/app-202109102324-0000/5`
3. case 3: retry 6 times then use worker directories: `/app-202109102324-0000/6`, `/app-202109102324-0000/7`, `/app-202109102324-0000/8`
4. case 4: retry 9 times then use worker directories: `/app-202109102324-0000/9`, `/app-202109102324-0000/10`, `/app-202109102324-0000/11`
5. case 5: retry more than **10** times then **failed**

To avoid this issue, this pr change to use tmp workdir in test with  config `local-cluster[n, c, m]`.

### Why are the changes needed?
Avoid UT failures caused by continuous workdir name collision.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.
- Manual test: `build/mvn clean install -Pscala-2.13 -pl core -am` or `build/mvn clean install -pl core -am`, with Scala 2.13 is easier to reproduce this problem

**Before**

The test failed error logs as follows and randomness in test failure:
```
- SPARK-33084: Add jar support Ivy URI -- test exclude param when transitive=true *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$138(SparkContextSuite.scala:1109)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test different version *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$142(SparkContextSuite.scala:1118)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test invalid param *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$146(SparkContextSuite.scala:1129)
  at org.apache.spark.SparkFunSuite.withLogAppender(SparkFunSuite.scala:235)
  at org.apache.spark.SparkContextSuite.$anonfun$new$145(SparkContextSuite.scala:1127)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  ...
- SPARK-33084: Add jar support Ivy URI -- test multiple transitive params *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$149(SparkContextSuite.scala:1140)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test param key case sensitive *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1155)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test transitive value case insensitive *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$159(SparkContextSuite.scala:1166)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)

```

**After**

```
Run completed in 26 minutes, 38 seconds.
Total number of tests run: 2863
Suites: completed 276, aborted 0
Tests: succeeded 2863, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
```

Closes #33963 from LuciferYang/SPARK-36636.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
(cherry picked from commit 0e1157df06)
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-12 09:57:18 -05:00
dgd_contributor 6b62459c09 [SPARK-35822][UI] Spark UI-Executor tab is empty in IE11
### What changes were proposed in this pull request?
Refactor some functions in utils.js to fix the empty UI-Executor tab in yarn mode in IE11.

### Why are the changes needed?
Spark UI-Executor tab is empty in IE11: So this PR to fix this.
![Executortab_IE](https://user-images.githubusercontent.com/84778052/132786964-b17b6d12-457f-4ba3-894f-3f2e1c285b1e.PNG)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT Testcase

Closes #33937 from dgd-contributor/SPARK-35822-v2.

Authored-by: dgd_contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit ebca01f03e)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 15:58:42 -07:00
yi.wu dcb57ab42b Revert "[SPARK-35011][CORE] Avoid Block Manager registrations when StopExecutor msg is in-flight"
This reverts commit b9e53f8937.

### What changes were proposed in this pull request?

Revert https://github.com/apache/spark/pull/32114

### Why are the changes needed?

It breaks the expected `BlockManager` re-registration (e.g., heartbeat loss of an active executor) due to deferred removal of `BlockManager`, see the check:
9cefde8db3/core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala (L551)

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass existing tests.

Closes #33959 from Ngone51/revert-35011-3.2.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-10 09:29:55 -07:00
Liang-Chi Hsieh b52fbeee2d [SPARK-36669][SQL] Add Lz4 wrappers for Hadoop Lz4 codec
### What changes were proposed in this pull request?

This patch proposes to add a few LZ4 wrapper classes for Parquet Lz4 compression output that uses Hadoop Lz4 codec.

### Why are the changes needed?

Currently we use Hadop 3.3.1's shaded client libraries. Lz4 is a provided dependency in Hadoop Common 3.3.1 for Lz4Codec. But it isn't excluded from relocation in these libraries. So to use lz4 as Parquet codec, we will hit the exception even we include lz4 as dependency.

```
[info]   Cause: java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/net/jpountz/lz4/LZ4Factory
[info]   at org.apache.hadoop.io.compress.lz4.Lz4Compressor.<init>(Lz4Compressor.java:66)
[info]   at org.apache.hadoop.io.compress.Lz4Codec.createCompressor(Lz4Codec.java:119)
[info]   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:152)
[info]   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)
```

Before the issue is fixed at Hadoop new release, we can add a few wrapper classes for Lz4 codec.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified test.

Closes #33940 from viirya/lz4-wrappers.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 6bcf330191)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-09 09:31:11 -07:00
Liang-Chi Hsieh acd3856215 [SPARK-36682][CORE][TEST] Add Hadoop sequence file test for different Hadoop codecs
### What changes were proposed in this pull request?

This patch proposes to add e2e tests for using Hadoop codecs to write sequence files.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests.

Closes #33924 from viirya/hadoop-seq-test.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
(cherry picked from commit 6745d77818)
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-07 13:20:09 -07:00
Gengliang Wang 1bad04d028 Preparing development version 3.2.1-SNAPSHOT 2021-08-31 17:04:14 +00:00
Gengliang Wang 03f5d23e96 Preparing Spark release v3.2.0-rc2 2021-08-31 17:04:08 +00:00
yi.wu 8be53c3298 [SPARK-36614][CORE][UI] Correct executor loss reason caused by decommission in UI
### What changes were proposed in this pull request?

Post the correct executor loss reason to UI.

### Why are the changes needed?

To show the accurate loss reason.

### Does this PR introduce _any_ user-facing change?

Yes. Users can see the difference from the UI.

Before:
<img width="509" alt="WeChataad8d1f27d9f9aa7cf93ced4bcc820e2" src="https://user-images.githubusercontent.com/16397174/131341692-6f412607-87b8-405e-822d-0d28f07928da.png">
<img width="1138" alt="WeChat13c9f1345a096ff83d193e4e9853b165" src="https://user-images.githubusercontent.com/16397174/131341699-f2c9de09-635f-49df-8e27-2495f34276c0.png">

After:

<img width="599" alt="WeChata4313fa2dbf27bf2dbfaef5c1d4a19cf" src="https://user-images.githubusercontent.com/16397174/131341754-e3c93b5d-5252-4006-a4cc-94d76f41303b.png">
<img width="1182" alt="WeChat5559d52fd3070ae6c42fe32d56f9dc94" src="https://user-images.githubusercontent.com/16397174/131341761-e1e0644f-1e76-49c0-915a-26aad77ec272.png">

### How was this patch tested?

Manully tested.

Closes #33868 from Ngone51/fix-executor-remove-reason.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit ebe7bb6217)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-30 09:09:34 -07:00
Kousuke Saruta 93f2b00501 [SPARK-36509][CORE] Fix the issue that executors are never re-scheduled if the worker stops with standalone cluster
### What changes were proposed in this pull request?

This PR fixes an issue that executors are never re-scheduled if the worker which the executors run on stops.
As a result, the application stucks.
You can easily reproduce this issue by the following procedures.

```
# Run master
$ sbin/start-master.sh

# Run worker 1
$ SPARK_LOG_DIR=/tmp/worker1 SPARK_PID_DIR=/tmp/worker1/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker1 --webui-port 8081 spark://<hostname>:7077

# Run worker 2
$ SPARK_LOG_DIR=/tmp/worker2 SPARK_PID_DIR=/tmp/worker2/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker2 --webui-port 8082 spark://<hostname>:7077

# Run Spark Shell
$ bin/spark-shell --master spark://<hostname>:7077 --executor-cores 1 --total-executor-cores 1

# Check which worker the executor runs on and then kill the worker.
$ kill <worker pid>
```

With the procedure above, we will expect that the executor is re-scheduled on the other worker but it won't.

The reason seems that `Master.schedule` cannot be called after the worker is marked as `WorkerState.DEAD`.
So, the solution this PR proposes is to call `Master.schedule` whenever `Master.removeWorker` is called.

This PR also fixes an issue that `ExecutorRunner` can send `ExecutorStateChanged` message without changing its state.
This issue causes assertion error.
```
2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring errorjava.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
```

### Why are the changes needed?

It's a critical bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually tested with the procedure shown above and confirmed the executor is re-scheduled.

Closes #33818 from sarutak/fix-scheduling-stuck.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
(cherry picked from commit ea8c31e5ea)
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-08-28 18:02:11 +09:00
Gengliang Wang eca81cc0ae [SPARK-36457][DOCS][3.2] Review and fix issues in Scala/Java API docs
### What changes were proposed in this pull request?

Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:

- Add missing `Since` annotation for new APIs
- Remove the leaking class/object in API doc

### Why are the changes needed?

Improve API docs

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #33845 from gengliangwang/SPARK-36457-3.2.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-27 10:51:27 +08:00
yi.wu 36df86c0d0 [SPARK-36564][CORE] Fix NullPointerException in LiveRDDDistribution.toApi
### What changes were proposed in this pull request?

This PR fixes `NullPointerException` in `LiveRDDDistribution.toApi`.

### Why are the changes needed?

Looking at the stack trace, the NPE is caused by the null `exec.hostPort`. I can't get the complete log to take a close look but only guess that it might be due to the event `SparkListenerBlockManagerAdded` is dropped or out of order.

```
21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception
java.lang.NullPointerException
	at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
	at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
	at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85)
	at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696)
	at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563)
	at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158)
	at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
	at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
	at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
	at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
	at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629)
	at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51)
	at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206)
	at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212)
	at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956)
	...
```

### Does this PR introduce _any_ user-facing change?

Yes, users will see the expected RDD info in UI instead of the NPE error.

 ### How was this patch tested?

Pass existing tests.

Closes #33812 from Ngone51/fix-hostport-npe.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit d6c453aaea)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:33:51 -07:00
Venkata krishnan Sowrirajan 0f2e318894 [SPARK-36374][FOLLOW-UP] Change config key spark.shuffle.server.mergedShuffleFileManagerImpl to spark.shuffle.push.server.mergedShuffleFileManagerImpl
### What changes were proposed in this pull request?

Minor changes to change the config key name from `spark.shuffle.server.mergedShuffleFileManagerImpl` to `spark.shuffle.push.server.mergedShuffleFileManagerImpl`. This is missed out in https://github.com/apache/spark/pull/33615.

### Why are the changes needed?

To keep the config names consistent

### Does this PR introduce _any_ user-facing change?

Yes, this is a change in the config key name. But the new config name changes are yet to be released. Technically there is no user facing change because of this change.

### How was this patch tested?

Existing tests.

Closes #33799 from venkata91/SPARK-36374-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 7b2842e986)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-22 01:29:36 -05:00
Gengliang Wang 69be513c5e Preparing development version 3.2.1-SNAPSHOT 2021-08-20 12:40:47 +00:00
Gengliang Wang 6bb3523d8e Preparing Spark release v3.2.0-rc1 2021-08-20 12:40:40 +00:00
Gengliang Wang fafdc1482b Revert "Preparing Spark release v3.2.0-rc1"
This reverts commit 8e58fafb05.
2021-08-20 20:07:02 +08:00
Gengliang Wang c829ed53ff Revert "Preparing development version 3.2.1-SNAPSHOT"
This reverts commit 4f1d21571d.
2021-08-20 20:07:01 +08:00
Gengliang Wang 4f1d21571d Preparing development version 3.2.1-SNAPSHOT 2021-08-19 14:08:32 +00:00
Gengliang Wang 8e58fafb05 Preparing Spark release v3.2.0-rc1 2021-08-19 14:08:26 +00:00
yi.wu 181d33e16e [SPARK-36532][CORE] Fix deadlock in CoarseGrainedExecutorBackend.onDisconnected to avoid executor shutdown hang
### What changes were proposed in this pull request?

Instead of exiting the executor within the RpcEnv's thread, exit the executor in a separate thread.

### Why are the changes needed?

The current exit way in `onDisconnected` can cause the deadlock, which has the exact same root cause with https://github.com/apache/spark/pull/12012:

* `onDisconnected` -> `System.exit` are called in sequence in the thread of `MessageLoop.threadpool`
* `System.exit` triggers shutdown hooks and `executor.stop` is one of the hooks.
* `executor.stop` stops the `Dispatcher`, which waits for the `MessageLoop.threadpool`  to shutdown further.
* Thus, the thread which runs `System.exit` waits for hooks to be done, but the `MessageLoop.threadpool` in the hook waits that thread to finish. Finally, this mutual dependence results in the deadlock.

### Does this PR introduce _any_ user-facing change?

Yes, the executor shutdown won't hang.

### How was this patch tested?

Pass existing tests.

Closes #33759 from Ngone51/fix-executor-shutdown-hang.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 996551fece)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-18 22:47:06 +08:00
zhuqi-lucas 2fb62e0e3d [SPARK-35548][CORE][SHUFFLE] Handling new attempt has started error message in BlockPushErrorHandler in client
### What changes were proposed in this pull request?
Add a new type of error message in BlockPushErrorHandler which indicates the PushblockStream message is received after a new application attempt has started. This error message should be correctly handled in client without retrying the block push.

### Why are the changes needed?
When we get a block push failure because of the too old attempt, we will not retry pushing the block nor log the exception on the client side.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add the corresponding unit test.

Closes #33617 from zhuqi-lucas/master.

Authored-by: zhuqi-lucas <821684824@qq.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 05cd5f97c3)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-16 13:59:54 -05:00
Kazuyuki Tanimura 9149cad57d [SPARK-32210][CORE] Fix NegativeArraySizeException in MapOutputTracker with large spark.default.parallelism
### What changes were proposed in this pull request?
The current `MapOutputTracker` class may throw `NegativeArraySizeException` with a large number of partitions. Within the serializeOutputStatuses() method, it is trying to compress an array of mapStatuses and outputting the binary data into (Apache)ByteArrayOutputStream . Inside the (Apache)ByteArrayOutputStream.toByteArray(), negative index exception happens because the index is int and overflows (2GB limit) when the output binary size is too large.

This PR proposes two high-level ideas:
  1. Use `org.apache.spark.util.io.ChunkedByteBufferOutputStream`, which has a way to output the underlying buffer as `Array[Array[Byte]]`.
  2. Change the signatures from `Array[Byte]` to `Array[Array[Byte]]` in order to handle over 2GB compressed data.

### Why are the changes needed?
This issue seems to be missed out in the earlier effort of addressing 2GB limitations [SPARK-6235](https://issues.apache.org/jira/browse/SPARK-6235)

Without this fix, `spark.default.parallelism` needs to be kept at the low number. The drawback of setting smaller spark.default.parallelism is that it requires more executor memory (more data per partition). Setting `spark.io.compression.zstd.level` to higher number (default 1) hardly helps.

That essentially means we have the data size limit that for shuffling and does not scale.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Passed existing tests
```
build/sbt "core/testOnly org.apache.spark.MapOutputTrackerSuite"
```
Also added a new unit test
```
build/sbt "core/testOnly org.apache.spark.MapOutputTrackerSuite  -- -z SPARK-32210"
```
Ran the benchmark using GitHub Actions and didn't not observe any performance penalties. The results are attached in this PR
```
core/benchmarks/MapStatusesSerDeserBenchmark-jdk11-results.txt
core/benchmarks/MapStatusesSerDeserBenchmark-results.txt
```

Closes #33721 from kazuyukitanimura/SPARK-32210.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 8ee464cd7a)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-16 09:11:51 -07:00
Venkata krishnan Sowrirajan 233af3d239 [SPARK-36374][SHUFFLE][DOC] Push-based shuffle high level user documentation
### What changes were proposed in this pull request?

Document the push-based shuffle feature with a high level overview of the feature and corresponding configuration options for both shuffle server side as well as client side. This is how the changes to the doc looks on the browser ([img](https://user-images.githubusercontent.com/8871522/129231582-ad86ee2f-246f-4b42-9528-4ccd693e86d2.png))

### Why are the changes needed?

Helps users understand the feature

### Does this PR introduce _any_ user-facing change?

Docs

### How was this patch tested?

N/A

Closes #33615 from venkata91/SPARK-36374.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2270ecf32f)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-16 10:25:33 -05:00
yi.wu 101f720cc9 [SPARK-32920][CORE][FOLLOW-UP] Fix string interpolator in the log
### What changes were proposed in this pull request?

fix string interpolator

### Why are the changes needed?

To log the correct stage info.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existed tests.

Closes #33738 from Ngone51/fix.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a47ceaf549)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 21:44:30 +09:00
Xingbo Jiang 09a1ddba41 [SPARK-36500][CORE] Fix temp_shuffle file leaking when a task is interrupted
### What changes were proposed in this pull request?

When a task thread is interrupted, the underlying output stream referred by `DiskBlockObjectWriter.mcs` may have been closed, then we get IOException when flushing the buffered data. This breaks the assumption that `revertPartialWritesAndClose()` should not throw exceptions.

To fix the issue, we can catch the IOException in `ManualCloseOutputStream.manualClose()`.

### Why are the changes needed?

Previously the IOException was not captured, thus `revertPartialWritesAndClose()` threw an exception. When this happens, `BypassMergeSortShuffleWriter.stop()` would stop deleting the temp_shuffle files tracked by `partitionWriters`, hens lead to temp_shuffle file leak issues.

### Does this PR introduce _any_ user-facing change?

No, this is an internal bug fix.

### How was this patch tested?

Tested by running a longevity stress test. After the fix, there is no more leaked temp_shuffle files.

Closes #33731 from jiangxb1987/temp_shuffle.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ec5f3a17e3)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-13 19:25:27 +09:00
Min Shen c6b683e5a2 [SPARK-36378][SHUFFLE] Switch to using RPCResponse to communicate common block push failures to the client
We have run performance evaluations on the version of push-based shuffle committed to upstream so far, and have identified a few places for further improvements:
1. On the server side, we have noticed that the usage of `String.format`, especially when receiving a block push request, has a much higher overhead compared with string concatenation.
2. On the server side, the usage of `Throwables.getStackTraceAsString` in the `ErrorHandler.shouldRetryError` and `ErrorHandler.shouldLogError` has generated quite some overhead.

These 2 issues are related to how we are currently handling certain common block push failures.
We are communicating such failures via `RPCFailure` by transmitting the exception stack trace.
This generates the overhead on both server and client side for creating these exceptions and makes checking the type of failures fragile and inefficient with string matching of exception stack trace.
To address these, this PR also proposes to encode the common block push failure as an error code and send that back to the client with a proper RPC message.

Improve shuffle service efficiency for push-based shuffle.
Improve code robustness for handling block push failures.

No

Existing unit tests.

Closes #33613 from Victsm/SPARK-36378.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Min Shen <victor.nju@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 3f09093a21)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-10 16:54:21 -05:00
Kazuyuki Tanimura c230ca95e6 [SPARK-36464][CORE] Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
### What changes were proposed in this pull request?
The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`.
That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream`

This PR proposes to change the underlying `_size` variable from `Int` to `Long` at the initialization

### Why are the changes needed?
Be cause the `size` method of `ChunkedByteBufferOutputStream` incorrectly returns a negative value when over 2GB data is written.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passed existing tests
```
build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite"
```
Also added a new unit test
```
build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite – -z SPARK-36464"
```

Closes #33690 from kazuyukitanimura/SPARK-36464.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit c888bad6a1)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-10 10:30:07 -07:00
Min Shen 552a332dd4 [SPARK-36423][SHUFFLE] Randomize order of blocks in a push request to improve block merge ratio for push-based shuffle
### What changes were proposed in this pull request?

On the client side, we are currently randomizing the order of push requests before processing each request. In addition we can further randomize the order of blocks within each push request before pushing them.
In our benchmark, this has resulted in a 60%-70% reduction of blocks that fail to be merged due to bock collision (the existing block merge ratio is already pretty good in general, and this further improves it).

### Why are the changes needed?

Improve block merge ratio for push-based shuffle

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Straightforward small change, no additional test needed.

Closes #33649 from Victsm/SPARK-36423.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Min Shen <victor.nju@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 6e729515fd)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-06 09:48:31 -05:00
yi.wu cc2a5abf7d [SPARK-36384][CORE][DOC] Add doc for shuffle checksum
### What changes were proposed in this pull request?

Add doc for the shuffle checksum configs in `configuration.md`.

### Why are the changes needed?

doc

### Does this PR introduce _any_ user-facing change?

No, since Spark 3.2 hasn't been released.

### How was this patch tested?

Pass existed tests.

Closes #33637 from Ngone51/SPARK-36384.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3b92c721b5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-05 10:16:53 +09:00
Dongjoon Hyun 7d685dfd6f [SPARK-36354][CORE] EventLogFileReader should skip rolling event log directories with no logs
### What changes were proposed in this pull request?

This PR aims to skip rolling event log directories which has only `appstatus` file.

### Why are the changes needed?

Currently, Spark History server shows `IllegalArgumentException` warning, but the event log might arrive later. The situation also can happen when the job is killed before uploading its first log to the remote storages like S3.
```
21/07/30 07:38:26 WARN FsHistoryProvider:
Error while reading new log s3a://.../eventlog_v2_spark-95b5c736c8e44037afcf152534d08771
java.lang.IllegalArgumentException: requirement failed:
Log directory must contain at least one event log file!
...
at org.apache.spark.deploy.history.RollingEventLogFilesFileReader.files$lzycompute(EventLogFileReaders.scala:216)
```

### Does this PR introduce _any_ user-facing change?

Yes. Users will not see `IllegalArgumentException` warnings.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #33586 from dongjoon-hyun/SPARK-36354.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 28a2a2238f)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-04 20:26:24 +09:00
yi.wu 144040ab28 [SPARK-36383][CORE][3.2] Avoid NullPointerException during executor shutdown
### What changes were proposed in this pull request?

Fix `NullPointerException` in `Executor.stop()`.

### Why are the changes needed?

Some initialization steps could fail before the initialization of `metricsPoller`, `heartbeater`, `threadPool`, which results in the null of `metricsPoller`, `heartbeater`, `threadPool`. For example, I encountered a failure of:

c20af53580/core/src/main/scala/org/apache/spark/executor/Executor.scala (L137)

where the executor itself failed to register at the driver.

This PR helps to eliminate the error messages when the issue happens to not confuse users:

<details>
  <summary><mark><font color=darkred>[click to see the detailed error message]</font></mark></summary>
  <pre>
21/07/23 16:04:10 WARN Executor: Unable to stop executor metrics poller
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.stop(Executor.scala:318)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
21/07/23 16:04:10 WARN Executor: Unable to stop heartbeater
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.stop(Executor.scala:324)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
21/07/23 16:04:10 ERROR Utils: Uncaught exception in thread shutdown-hook-0
java.lang.NullPointerException
        at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:334)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:231)
        at org.apache.spark.executor.Executor.stop(Executor.scala:334)
        at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:76)
        at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$2(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:2025)
        at org.apache.spark.util.SparkShutdownHookManager.$anonfun$runAll$1(ShutdownHookManager.scala:188)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
        at scala.util.Try$.apply(Try.scala:213)
        at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
        at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
  </pre>
</details>

### Does this PR introduce _any_ user-facing change?

Yes, users won't see error messages of `NullPointerException` after this fix.

### How was this patch tested?

Pass existing tests.

Closes #33620 from Ngone51/spark-36383-3.2.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-03 08:50:11 -07:00
Chandni Singh 369781a331 [SPARK-36389][CORE][SHUFFLE] Revert the change that accepts negative mapId in ShuffleBlockId
### What changes were proposed in this pull request?
With SPARK-32922, we added a change that ShuffleBlockId can have a negative mapId. This was to support push-based shuffle where -1 as mapId indicated a push-merged block. However with SPARK-32923, a different type of BlockId was introduced - ShuffleMergedId, but reverting the change to ShuffleBlockId was missed.

### Why are the changes needed?
This reverts the changes to `ShuffleBlockId` which will never have a negative mapId.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified the unit test to verify the newly added ShuffleMergedBlockId.

Closes #33616 from otterc/SPARK-36389.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 2712343a27)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-03 00:18:06 -07:00
Karen Feng de837a0ebb [SPARK-36331][CORE] Add standard SQLSTATEs to error guidelines
### What changes were proposed in this pull request?

Adds ANSI/ISO SQLSTATE standards to the error guidelines.

### Why are the changes needed?

Provides visibility and consistency to the SQLSTATEs assigned to error classes.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Not needed; docs only

Closes #33560 from karenfeng/sqlstate-manual.

Authored-by: Karen Feng <karen.feng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 63517eb430)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 13:58:07 +09:00
yi.wu df43300227 [SPARK-36206][CORE] Support shuffle data corruption diagnosis via shuffle checksum
### What changes were proposed in this pull request?

This PR adds support to diagnose shuffle data corruption. Basically, the diagnosis mechanism works like this:
The shuffler reader would calculate the checksum (c1) for the corrupted shuffle block and send it to the server where the block is stored. At the server, it would read back the checksum (c2) that is stored in the checksum file and recalculate the checksum (c3) for the corresponding shuffle block. Then, if c2 != c3, we suspect the corruption is caused by the disk issue. Otherwise, if c1 != c3, we suspect the corruption is caused by the network issue. Otherwise, the checksum verifies pass. In any case of the error, the cause remains unknown.

After the shuffle reader receives the diagnosis response, it'd take the action bases on the type of cause. Only in case of the network issue, we'd give a retry. Otherwise, we'd throw the fetch failure directly. Also note that, if the corruption happens inside BufferReleasingInputStream, the reducer will throw the fetch failure immediately no matter what the cause is since the data has been partially consumed by downstream RDDs. If corruption happens again after retry, the reducer will throw the fetch failure directly this time without the diagnosis.

Please check out https://github.com/apache/spark/pull/32385 to see the completed proposal of the shuffle checksum project.

### Why are the changes needed?

Shuffle data corruption is a long-standing issue in Spark. For example, in SPARK-18105, people continually reports corruption issue. However, data corruption is difficult to reproduce in most cases and even harder to tell the root cause. We don't know if it's a Spark issue or not. With the diagnosis support for the shuffle corruption, Spark itself can at least distinguish the cause between disk and network, which is very important for users.

### Does this PR introduce _any_ user-facing change?

Yes, users may know the cause of the shuffle corruption after this change.

### How was this patch tested?

Added tests.

Closes #33451 from Ngone51/SPARK-36206.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit a98d919da4)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-02 09:59:30 -05:00
Venkata krishnan Sowrirajan a9b32b390c [SPARK-32923][CORE][SHUFFLE] Handle indeterminate stage retries for push-based shuffle
[[SPARK-23243](https://issues.apache.org/jira/browse/SPARK-23243)] and [[SPARK-25341](https://issues.apache.org/jira/browse/SPARK-25341)] addressed cases of stage retries for indeterminate stage involving operations like repartition. This PR addresses the same issues in the context of push-based shuffle. Currently there is no way to distinguish the current execution of a stage for a shuffle ID. Therefore the changes explained below are necessary.

Core changes are summarized as follows:

1. Introduce a new variable `shuffleMergeId` in `ShuffleDependency` which is monotonically increasing value tracking the temporal ordering of execution of <stage-id, stage-attempt-id> for a shuffle ID.
2. Correspondingly make changes in the push-based shuffle protocol layer in `MergedShuffleFileManager`, `BlockStoreClient` passing the `shuffleMergeId` in order to keep track of the shuffle output in separate files on the shuffle service side.
3. `DAGScheduler` increments the `shuffleMergeId` tracked in `ShuffleDependency` in the cases of a indeterministic stage execution
4. Deterministic stage will have `shuffleMergeId` set to 0 as no special handling is needed in this case and indeterminate stage will have `shuffleMergeId` starting from 1.

New protocol changes are needed due to the reasons explained above.

No

Added new unit tests in `RemoteBlockPushResolverSuite, DAGSchedulerSuite, BlockIdSuite, ErrorHandlerSuite`

Closes #33034 from venkata91/SPARK-32923.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit c039d99812)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-01 23:30:02 -05:00
Venkata krishnan Sowrirajan cd3ab00382 [SPARK-32919][FOLLOW-UP] Filter out driver in the merger locations and fix the return type of RemoveShufflePushMergerLocations
### What changes were proposed in this pull request?

SPARK-32919 added support for fetching shuffle push merger locations with push-based shuffle. Filter out driver host in the shuffle push merger locations as driver won't participate in the shuffle merge also fix ClassCastException in the RemoveShufflePushMergerLocations.
### Why are the changes needed?

No

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests.

Closes #33425 from venkata91/SPARK-32919-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit 2a18f82940)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-08-01 13:12:22 -05:00
zhuqi-lucas d37f732a68 [SPARK-36344][CORE][SHUFFLE] Fix some typos in ShuffleBlockPusher class
### What changes were proposed in this pull request?
Just to fix some typos in ShuffleBlockPusher class.

### Why are the changes needed?
Fix the typos, make code clear.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No need test.

Closes #33575 from zhuqi-lucas/master.

Authored-by: zhuqi-lucas <821684824@qq.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 900b38d5fa)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 09:11:25 +09:00
Min Shen bbec381f5e [SPARK-36266][SHUFFLE] Rename classes in shuffle RPC used for block push operations
### What changes were proposed in this pull request?
This is a follow-up to #29855 according to the [comments](https://github.com/apache/spark/pull/29855/files#r505536514)
In this PR, the following changes are made:

1. A new `BlockPushingListener` interface is created specifically for block push. The existing `BlockFetchingListener` interface is left as is, since it might be used by external shuffle solutions. These 2 interfaces are unified under `BlockTransferListener` to enable code reuse.
2. `RetryingBlockFetcher`, `BlockFetchStarter`, and `RetryingBlockFetchListener` are renamed to `RetryingBlockTransferor`, `BlockTransferStarter`, and `RetryingBlockTransferListener` respectively. This makes their names more generic to be reused across both block fetch and push.
3. Comments in `OneForOneBlockPusher` are further clarified to better explain how we handle retries for block push.

### Why are the changes needed?
To make code cleaner without sacrificing backward compatibility.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing unit tests.

Closes #33340 from Victsm/SPARK-32915-followup.

Lead-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Min Shen <victor.nju@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit c4aa54ed4e)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-07-26 17:40:19 -05:00
Hyukjin Kwon a77c9d6d17 [SPARK-36217][SQL] Rename CustomShuffleReader and OptimizeLocalShuffleReader in AQE
### What changes were proposed in this pull request?

This PR proposes to rename:

- Rename `*Reader`/`*reader` to `*Read`/`*read` for rules and execution plan (user-facing doc/config name remain untouched)
  - `*ShuffleReaderExec` ->`*ShuffleReadExec`
  - `isLocalReader` -> `isLocalRead`
  - ...
- Rename `CustomShuffle*` prefix to `AQEShuffle*`
- Rename `OptimizeLocalShuffleReader` rule to `OptimizeShuffleWithLocalRead`

### Why are the changes needed?

There are multiple problems in the current naming:

- `CustomShuffle*` -> `AQEShuffle*`
    it sounds like it is a pluggable API. However, this is actually only used by AQE.
- `OptimizeLocalShuffleReader` -> `OptimizeShuffleWithLocalRead`
    it is the name of a rule but it can be misread as a reader, which is counterintuative
- `*ReaderExec` -> `*ReadExec`
    Reader execution reads a bit odd. It should better be read execution (like `ScanExec`, `ProjectExec` and `FilterExec`). I can't find the reason to name it with something that performs an action. See also the generated plans:

    Before:

    ```
    ...
    * HashAggregate (12)
       +- CustomShuffleReader (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ...
    ```

    After:

    ```
    ...
    * HashAggregate (12)
       +- AQEShuffleRead (11)
          +- ShuffleQueryStage (10)
             +- Exchange (9)
    ..
    ```

### Does this PR introduce _any_ user-facing change?

No, internal refactoring.

### How was this patch tested?

Existing unittests should cover the changes.

Closes #33429 from HyukjinKwon/SPARK-36217.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 6e3d404cec)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-26 22:42:16 +08:00
Venkata krishnan Sowrirajan 39d6e87bd9 [SPARK-32920][FOLLOW-UP] Fix shuffleMergeFinalized directly calling rdd.getNumPartitions as RDD is not serialized to executor
### What changes were proposed in this pull request?

`ShuffleMapTask` should not push blocks if a shuffle is already merge finalized. Currently block push is disabled for retry cases. Also fix `shuffleMergeFinalized` calling `rdd.getNumPartitions` as RDD is not serialized causing issues.

### Why are the changes needed?

No

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #33426 from venkata91/SPARK-32920-follow-up.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
(cherry picked from commit ba1a7ce5ec)
Signed-off-by: Mridul Muralidharan <mridulatgmail.com>
2021-07-26 09:17:55 -05:00
yi.wu 1e17a5bc19 [SPARK-32920][FOLLOW-UP][CORE] Shutdown shuffleMergeFinalizeScheduler when DAGScheduler stop
### What changes were proposed in this pull request?

Call `shuffleMergeFinalizeScheduler.shutdownNow()` in `DAGScheduler.stop()`.

### Why are the changes needed?

Avoid the thread leak.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass existing tests.

Closes #33495 from Ngone51/SPARK-32920-followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 21450b3254)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-24 17:41:20 -07:00
Chandni Singh 96944ac17d [SPARK-36255][SHUFFLE][CORE] Stop pushing and retrying on FileNotFound exceptions
### What changes were proposed in this pull request?
Once the shuffle is cleaned up by the `ContextCleaner`, the shuffle files are deleted by the executors. In this case, the push of the shuffle data by the executors can throw `FileNotFoundException`s because the shuffle files are deleted. When this exception is thrown from the `shuffle-block-push-thread`, it causes the executor to exit. Both the `shuffle-block-push` threads and the netty event-loops will encounter `FileNotFoundException`s in this case.  The fix here stops these threads from pushing more blocks when they encounter `FileNotFoundException`. When the exception is from the `shuffle-block-push-thread`, it will get handled and logged as warning instead of failing the executor.

### Why are the changes needed?
This fixes the bug which causes executor to exits when they are instructed to clean up shuffle data.
Below is the stacktrace of this exception:
```
21/06/17 16:03:57 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[block-push-thread-1,5,main]
java.lang.Error: java.io.IOException: Error in opening FileSegmentManagedBuffer

{file=********/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: Error in opening FileSegmentManagedBuffer\{file=*******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data, offset=10640, length=190}

at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:89)
at org.apache.spark.shuffle.ShuffleWriter.sliceReqBufferIntoBlockBuffers(ShuffleWriter.scala:294)
at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$sendRequest(ShuffleWriter.scala:270)
at org.apache.spark.shuffle.ShuffleWriter.org$apache$spark$shuffle$ShuffleWriter$$pushUpToMax(ShuffleWriter.scala:191)
at org.apache.spark.shuffle.ShuffleWriter$$anon$2$$anon$4.run(ShuffleWriter.scala:244)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
Caused by: java.io.FileNotFoundException: ******/application_1619720975011_11057757/blockmgr-560cb4cf-9918-4ea7-a007-a16c5e3a35fe/0a/shuffle_1_690_0.data (No such file or directory)
at java.io.RandomAccessFile.open0(Native Method)
at java.io.RandomAccessFile.open(RandomAccessFile.java:316)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:243)
at org.apache.spark.network.buffer.FileSegmentManagedBuffer.nioByteBuffer(FileSegmentManagedBuffer.java:62)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit to verify no more data is pushed when `FileNotFoundException` is encountered. Have also verified in our environment.

Closes #33477 from otterc/SPARK-36255.

Authored-by: Chandni Singh <singh.chandni@gmail.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit 09e1c61272)
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-24 21:10:01 +08:00
yangjie01 b46a9f3b0f [SPARK-36242][CORE] Ensure spill file closed before set success = true in ExternalSorter.spillMemoryIteratorToDisk method
### What changes were proposed in this pull request?
The main change of this pr is move `writer.close()` before `success = true` to ensure spill file closed before set `success = true` in `ExternalSorter.spillMemoryIteratorToDisk` method.

### Why are the changes needed?
Avoid setting `success = true` first and then failure of close spill file

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass the Jenkins or GitHub Action
- Add a new Test case to check `The spill file should not exists if writer close fails`

Closes #33460 from LuciferYang/external-sorter-spill-close.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit f61d5993ea)
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-23 23:18:51 +08:00
Holden Karau e9dd2969c2 [SPARK-36246][CORE][TEST] GHA WorkerDecommissionExtended flake
### What changes were proposed in this pull request?

GHA probably doesn't have the same resources as jenkins so move down from 5 to 3 execs and give a bit more time for them to come up.

### Why are the changes needed?

Test is timing out in GHA

### Does this PR introduce _any_ user-facing change?
No, test only change.

### How was this patch tested?

Run through GHA verify no OOM during WorkerDecommissionExtended

Closes #33467 from holdenk/SPARK-36246-WorkerDecommissionExtendedSuite-flakes-in-GHA.

Lead-authored-by: Holden Karau <holden@pigscanfly.ca>
Co-authored-by: Holden Karau <hkarau@netflix.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 89a83196ac)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 15:18:03 +09:00