Commit graph

637 commits

Author SHA1 Message Date
Gengliang Wang 5d45a415f3 Preparing Spark release v3.2.0-rc7 2021-10-06 11:45:26 +00:00
Gengliang Wang 4bd358474b Preparing development version 3.2.1-SNAPSHOT 2021-09-28 10:53:42 +00:00
Gengliang Wang dde73e2e1c Preparing Spark release v3.2.0-rc6 2021-09-28 10:53:35 +00:00
Gengliang Wang 0c57bb8f7f Preparing development version 3.2.1-SNAPSHOT 2021-09-27 08:24:50 +00:00
Gengliang Wang 49aea14c5a Preparing Spark release v3.2.0-rc5 2021-09-27 08:24:44 +00:00
Chao Sun 228d12e30e [SPARK-36835][FOLLOWUP][BUILD][TEST-HADOOP2.7] Move spark.yarn.isHadoopProvided to parent pom
### What changes were proposed in this pull request?

Move `spark.yarn.isHadoopProvided` to Spark parent pom, so that under `resource-managers/yarn` we can make `hadoop-3.2` as the default profile.

### Why are the changes needed?

Currently under `resource-managers/yarn` there are 3 Maven profiles: `hadoop-provided`, `hadoop-2.7`, and `hadoop-3.2`, of which `hadoop-3.2` is activated by default (via `activeByDefault`). The activation, however, doesn't work when other profiles are explicitly activated. Specifically, if users build Spark with `hadoop-provided`, Maven will fail because it can't find the Hadoop 3.2 related dependencies, which are defined in the `hadoop-3.2` profile section.

To fix the issue, this PR moves the `hadoop-provided` section to the parent pom. Currently that section only defines the property `spark.yarn.isHadoopProvided`, so it shouldn't matter where it is defined.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Tested via running the command:
```
build/mvn clean package -DskipTests -B -Pmesos -Pyarn -Pkubernetes -Pscala-2.12 -Phadoop-provided
```
which was failing before this PR but is succeeding with it.

Also checked active profiles with the command:
```
build/mvn -Pyarn -Phadoop-provided help:active-profiles
```
and it shows that `hadoop-3.2` is active for `spark-yarn` module now.

Closes #34110 from sunchao/SPARK-36835-followup2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit f9efdeea8c)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-27 15:17:28 +08:00
Gengliang Wang 2348cce37e Preparing development version 3.2.1-SNAPSHOT 2021-09-26 12:28:46 +00:00
Gengliang Wang 2ed8c08c5b Preparing Spark release v3.2.0-rc5 2021-09-26 12:28:40 +00:00
Chao Sun 540e45c3cc [SPARK-36835][FOLLOWUP][BUILD][TEST-HADOOP2.7] Fix maven issue for Hadoop 2.7 profile after enabling dependency reduced pom
### What changes were proposed in this pull request?

Fix an issue where Maven may get stuck in an infinite loop when building Spark with the Hadoop 2.7 profile.

### Why are the changes needed?

After re-enabling `createDependencyReducedPom` for `maven-shade-plugin`, the Spark build stopped working for the Hadoop 2.7 profile and gets stuck in an infinite loop, likely due to a Maven shade plugin bug similar to https://issues.apache.org/jira/browse/MSHADE-148. This seems to be caused by the fact that, under the `hadoop-2.7` profile, the variables `hadoop-client-runtime.artifact` and `hadoop-client-api.artifact` are both `hadoop-client`, which triggers the issue.

As a workaround, this changes `hadoop-client-runtime.artifact` to be `hadoop-yarn-api` when using `hadoop-2.7`. Since `hadoop-yarn-api` is a dependency of `hadoop-client`, this essentially moves the former to the same level as the latter. It should have no effect as both are dependencies of Spark.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #34100 from sunchao/SPARK-36835-followup.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit 937a74e6e7)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-26 13:39:50 +08:00
Gengliang Wang da722d43cb Preparing development version 3.2.1-SNAPSHOT 2021-09-24 10:03:23 +00:00
Gengliang Wang 9e35703211 Preparing Spark release v3.2.0-rc5 2021-09-24 10:03:16 +00:00
Gengliang Wang 09a8535cc4 Revert "[SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line"

### What changes were proposed in this pull request?
This reverts commit 866df69c62.

### Why are the changes needed?
After the change, environment variables were no longer substituted in user classpath entries. Please find an example in SPARK-35672.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #34088 from gengliangwang/revertSPARK-35672.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-24 12:46:22 +09:00
Gengliang Wang 0fb7127f85 Preparing development version 3.2.1-SNAPSHOT 2021-09-23 08:46:28 +00:00
Gengliang Wang b609f2fe0c Preparing Spark release v3.2.0-rc4 2021-09-23 08:46:22 +00:00
Dongjoon Hyun 5d0e51e943 [SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image
### What changes were proposed in this pull request?

This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image.

### Why are the changes needed?

The `openjdk:11-jre-slim` image has been upgraded to `Debian 11`.

```
$ docker run -it openjdk:11-jre-slim cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 11 (bullseye)"
NAME="Debian GNU/Linux"
VERSION_ID="11"
VERSION="11 (bullseye)"
VERSION_CODENAME=bullseye
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
```

It causes `R 3.5` installation failures in our K8s integration test environment.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/
```
The following packages have unmet dependencies:
 r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable
               Depends: libreadline7 (>= 6.0) but it is not installable
E: Unable to correct problems, you have held broken packages.
The command '/bin/sh -c apt-get update &&   apt install -y gnupg &&   echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list &&   apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' &&   apt-get update &&
apt install -y -t buster-cran35 r-base r-base-dev &&   rm -rf
```

### Does this PR introduce _any_ user-facing change?

Yes, this will recover the installation.

### How was this patch tested?

Succeeded in building the SparkR Docker image in the K8s integration test in Jenkins CI.

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/
```
Successfully built 32e1a0cd5ff8
Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51
```

Closes #34048 from dongjoon-hyun/SPARK-36806.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a178752540)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-20 14:38:32 -07:00
Gengliang Wang b0249851f6 Preparing development version 3.2.1-SNAPSHOT 2021-09-18 11:30:12 +00:00
Gengliang Wang 96044e9735 Preparing Spark release v3.2.0-rc3 2021-09-18 11:30:06 +00:00
Gengliang Wang 1bad04d028 Preparing development version 3.2.1-SNAPSHOT 2021-08-31 17:04:14 +00:00
Gengliang Wang 03f5d23e96 Preparing Spark release v3.2.0-rc2 2021-08-31 17:04:08 +00:00
Gengliang Wang 69be513c5e Preparing development version 3.2.1-SNAPSHOT 2021-08-20 12:40:47 +00:00
Gengliang Wang 6bb3523d8e Preparing Spark release v3.2.0-rc1 2021-08-20 12:40:40 +00:00
Gengliang Wang fafdc1482b Revert "Preparing Spark release v3.2.0-rc1"
This reverts commit 8e58fafb05.
2021-08-20 20:07:02 +08:00
Gengliang Wang c829ed53ff Revert "Preparing development version 3.2.1-SNAPSHOT"
This reverts commit 4f1d21571d.
2021-08-20 20:07:01 +08:00
Gengliang Wang 4f1d21571d Preparing development version 3.2.1-SNAPSHOT 2021-08-19 14:08:32 +00:00
Gengliang Wang 8e58fafb05 Preparing Spark release v3.2.0-rc1 2021-08-19 14:08:26 +00:00
attilapiros eb09be9e68 [SPARK-36052][K8S] Introducing a limit for pending PODs
Introducing a limit for pending PODs (newly created/requested executors included).
This limit is global for all the resource profiles. So first we count all the newly created and pending PODs (minus the ones requested to be deleted), then we share the remaining pending POD slots among the resource profiles.

Without this PR, dynamic allocation could request too many PODs, the K8S scheduler could be overloaded, and POD scheduling would be affected by the load.

No.

With new unit tests.

Closes #33492 from attilapiros/SPARK-36052.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit 1dced492fb)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-16 16:06:29 -07:00
Venkata krishnan Sowrirajan 1a432fe6bb [SPARK-36460][SHUFFLE] Pull out NoOpMergedShuffleFileManager inner class outside
### What changes were proposed in this pull request?

Pull the NoOpMergedShuffleFileManager inner class out to the top level. This is required since passing a dollar sign ($) in the value of the config (`spark.shuffle.server.mergedShuffleFileManagerImpl`) can be an issue. Currently `spark.shuffle.server.mergedShuffleFileManagerImpl` is by default set to `org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager`. After this change the default value will be set to `org.apache.spark.network.shuffle.NoOpMergedShuffleFileManager`.
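
As a quick illustration of why the `$` matters (an illustrative snippet, not code from the PR): JVM binary names of inner classes contain a `$`, which is easy to mangle when the value passes through shells, property interpolation, or other config machinery, whereas a top-level class name does not.

```scala
// Illustration only: the old default carries a '$' (inner-class binary name), the new one doesn't.
val oldDefault = "org.apache.spark.network.shuffle.ExternalBlockHandler$NoOpMergedShuffleFileManager"
val newDefault = "org.apache.spark.network.shuffle.NoOpMergedShuffleFileManager"
assert(oldDefault.contains("$") && !newDefault.contains("$"))
```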

### Why are the changes needed?

Passing `$` for the config value can be an issue.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified existing unit tests.

Closes #33688 from venkata91/SPARK-36460.

Authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
(cherry picked from commit df0de83c46)
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-08-10 10:19:41 +08:00
Erik Krogen 23e48dbf77 [SPARK-35259][SHUFFLE] Update ExternalBlockHandler Timer variables to expose correct units
### What changes were proposed in this pull request?
`ExternalBlockHandler` exposes 4 metrics which are Dropwizard `Timer` metrics, and are named with a `millis` suffix:
```
    private final Timer openBlockRequestLatencyMillis = new Timer();
    private final Timer registerExecutorRequestLatencyMillis = new Timer();
    private final Timer fetchMergedBlocksMetaLatencyMillis = new Timer();
    private final Timer finalizeShuffleMergeLatencyMillis = new Timer();
```
However these Dropwizard Timers by default use nanoseconds ([documentation](https://metrics.dropwizard.io/3.2.3/getting-started.html#timers)).

This causes `YarnShuffleServiceMetrics` to expose confusingly-named metrics like `openBlockRequestLatencyMillis_nanos_max` (the actual values are currently in nanos).

This PR adds a new `Timer` subclass, `TimerWithCustomTimeUnit`, which accepts a `TimeUnit` at creation time and exposes timing information using this time unit when values are read. Internally, values are still stored with nanosecond-level precision. The `Timer` metrics within `ExternalBlockHandler` are updated to use the new class with milliseconds as the unit. The logic to include the `nanos` suffix in the metric name within `YarnShuffleServiceMetrics` has also been removed, with the assumption that the metric name itself includes the units.
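
A minimal Scala sketch of the idea (names and details here are illustrative, not the actual `TimerWithCustomTimeUnit` code): durations are still recorded in nanoseconds, but snapshot values are converted to the requested unit when read.

```scala
import java.io.OutputStream
import java.util.concurrent.TimeUnit
import com.codahale.metrics.{Snapshot, Timer}

// Sketch only: a Dropwizard Timer whose Snapshot reports values in a chosen time unit.
class TimerWithUnitSketch(unit: TimeUnit) extends Timer {
  override def getSnapshot(): Snapshot = new ConvertedSnapshot(super.getSnapshot(), unit)
}

// Wraps a nanosecond-based Snapshot and converts every value it exposes.
class ConvertedSnapshot(inner: Snapshot, unit: TimeUnit) extends Snapshot {
  private val nanosPerUnit = unit.toNanos(1).toDouble
  private def conv(nanos: Double): Double = nanos / nanosPerUnit

  override def getValue(quantile: Double): Double = conv(inner.getValue(quantile))
  override def getValues(): Array[Long] = inner.getValues().map(v => (v / nanosPerUnit).toLong)
  override def size(): Int = inner.size()
  override def getMax(): Long = (inner.getMax() / nanosPerUnit).toLong
  override def getMean(): Double = conv(inner.getMean())
  override def getMin(): Long = (inner.getMin() / nanosPerUnit).toLong
  override def getStdDev(): Double = conv(inner.getStdDev())
  override def dump(output: OutputStream): Unit = inner.dump(output)
}
```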

### Does this PR introduce _any_ user-facing change?
Yes, there are two changes.
First, the names for metrics exposed by `ExternalBlockHandler` via `YarnShuffleServiceMetrics` such as `openBlockRequestLatencyMillis_nanos_max` and `openBlockRequestLatencyMillis_nanos_50thPercentile` have been changed to remove the `_nanos` suffix. This would be considered a breaking change, but these names were only exposed as part of #32388, which has not yet been released (slated for 3.2.0). New names are like `openBlockRequestLatencyMillis_max` and `openBlockRequestLatencyMillis_50thPercentile`
Second, the values of the metrics themselves have changed, to expose milliseconds instead of nanoseconds. Note that this does not affect metrics such as `openBlockRequestLatencyMillis_count` or `openBlockRequestLatencyMillis_rate1`, only the `Snapshot`-related metrics (`max`, `median`, percentiles, etc.). For the YARN case, these metrics were also introduced by #32388, and thus also have not yet been released. It was possible for the nanosecond values to be consumed by some other metrics reporter reading the Dropwizard metrics directly, but I'm not aware of any such usages.

### How was this patch tested?
Unit tests have been updated.

Closes #33116 from xkrogen/xkrogen-SPARK-35259-ess-fix-metric-unit-prefix.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: yi.wu <yi.wu@databricks.com>
(cherry picked from commit 70a15868fc)
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-07-24 21:27:56 +08:00
Kousuke Saruta 3c825a18b6 [SPARK-36067][BUILD][TEST][YARN] YarnClusterSuite fails due to NoClassDefFoundError unless hadoop-3.2 profile is activated explicitly
### What changes were proposed in this pull request?

This PR fixes an issue that `YarnClusterSuite` fails due to `NoClassDefFoundError` unless the `hadoop-3.2` profile is activated explicitly, regardless of whether Spark is built with SBT or Maven.

```
build/sbt -Pyarn "yarn/testOnly org.apache.spark.deploy.yarn.YarnClusterSuite"
...
[info] YarnClusterSuite:
[info] org.apache.spark.deploy.yarn.YarnClusterSuite *** ABORTED *** (598 milliseconds)
[info]   java.lang.NoClassDefFoundError: org/bouncycastle/operator/OperatorCreationException
[info]   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceInit(ResourceManager.java:888)
[info]   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
[info]   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.createAndInitActiveServices(ResourceManager.java:1410)
[info]   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceInit(ResourceManager.java:344)
[info]   at org.apache.hadoop.service.AbstractService.init(AbstractService.java:164)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.initResourceManager(MiniYARNCluster.java:359)
```

The solution is to modify `yarn/pom.xml` to activate the `hadoop-3.2` profile by default.

### Why are the changes needed?

The `hadoop-3.2` profile should be enabled by default, so `YarnClusterSuite` should also finish successfully without `-Phadoop-3.2`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran `YarnClusterSuite` with both SBT and Maven without `-Phadoop-3.2` and it finished successfully.
```
build/sbt -Pyarn "yarn/testOnly org.apache.spark.deploy.yarn.YarnClusterSuite"
...
[info] Run completed in 5 minutes, 38 seconds.
[info] Total number of tests run: 27
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.

build/mvn -Pyarn -pl resource-managers/yarn test -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite
...
Run completed in 5 minutes, 49 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #33276 from sarutak/fix-bouncy-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 9aa65a6e0e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 15:19:03 +09:00
Kent Yao 94c1e3c38c [SPARK-35969][K8S] Make the pod prefix more readable and tallied with K8S DNS Label Names
### What changes were proposed in this pull request?

By default, the executor pod prefix is generated from the app name. Characters matching `[^a-z0-9\\-]` are currently handled inconsistently: '.' and all whitespace are converted to '-', while other characters are replaced with the empty string. This matters because characters like '_' and '|' are commonly used as word separators in many languages.

According to the K8S DNS Label Names, see https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names, we can convert all special characters to `-`.

 
For example,

```
scala> "xyz_abc_i_am_a_app_name_w/_some_abbrs".replaceAll("[^a-z0-9\\-]", "-").replaceAll("-+", "-")
res11: String = xyz-abc-i-am-a-app-name-w-some-abbrs

scala> "xyz_abc_i_am_a_app_name_w/_some_abbrs".replaceAll("\\s+", "-").replaceAll("\\.", "-").replaceAll("[^a-z0-9\\-]", "").replaceAll("-+", "-")
res12: String = xyzabciamaappnamewsomeabbrs
```

```scala
scala> "time.is%the¥most$valuable_——————thing,it's about time.".replaceAll("[^a-z0-9\\-]", "-").replaceAll("-+", "-")
res9: String = time-is-the-most-valuable-thing-it-s-about-time-

scala> "time.is%the¥most$valuable_——————thing,it's about time.".replaceAll("\\s+", "-").replaceAll("\\.", "-").replaceAll("[^a-z0-9\\-]", "").replaceAll("-+", "-")
res10: String = time-isthemostvaluablethingits-about-time-

```
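
A minimal sketch of the proposed normalization (an illustrative helper, not the exact Spark code; lower-casing and the 63-character DNS label cap are assumptions here):

```scala
// Map every character outside [a-z0-9-] to '-', then collapse and trim the dashes.
def toPodNamePrefix(appName: String): String =
  appName.toLowerCase
    .replaceAll("[^a-z0-9\\-]", "-")
    .replaceAll("-+", "-")
    .stripPrefix("-")
    .stripSuffix("-")
    .take(63) // assumption: stay within the DNS-1123 label length limit

// toPodNamePrefix("xyz_abc_i_am_a_app_name_w/_some_abbrs") == "xyz-abc-i-am-a-app-name-w-some-abbrs"
```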

### Why are the changes needed?

For better UX

### Does this PR introduce _any_ user-facing change?

Yes, the executor pod name might look better.

### How was this patch tested?

add new ones

Closes #33171 from yaooqinn/SPARK-35969.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-01 08:15:00 -07:00
Erik Krogen 3255511d52 [SPARK-35258][SHUFFLE][YARN] Add new metrics to ExternalShuffleService for better monitoring
### What changes were proposed in this pull request?
This adds two new additional metrics to `ExternalBlockHandler`:
- `blockTransferRate` -- for indicating the rate of transferring blocks, vs. the data within them
- `blockTransferAvgSize_1min` -- a 1-minute trailing average of block sizes transferred by the ESS

Additionally, this enhances `YarnShuffleServiceMetrics` to expose the histogram/`Snapshot` information from `Timer` metrics within `ExternalBlockHandler`.
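
A hedged sketch of how such a derived metric can be wired up with Dropwizard primitives (illustrative, not necessarily the exact implementation): the 1-minute trailing average block size falls out as the ratio of the one-minute byte rate to the one-minute block rate.

```scala
import com.codahale.metrics.{MetricRegistry, RatioGauge}
import com.codahale.metrics.RatioGauge.Ratio

val registry = new MetricRegistry()
val blockTransferRateBytes = registry.meter("blockTransferRateBytes")
val blockTransferRate = registry.meter("blockTransferRate")

// Derived gauge: average block size over the last minute.
registry.register("blockTransferAvgSize_1min", new RatioGauge {
  override protected def getRatio(): Ratio =
    Ratio.of(blockTransferRateBytes.getOneMinuteRate(), blockTransferRate.getOneMinuteRate())
})

// On each block transfer of `size` bytes:
//   blockTransferRate.mark(); blockTransferRateBytes.mark(size)
```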

### Why are the changes needed?
Currently `ExternalBlockHandler` exposes some useful metrics, but lacks metrics for the rate of block transfers. We have `blockTransferRateBytes` to tell us the rate of _bytes_, but no metric to tell us the rate of _blocks_, which is especially relevant when running the ESS on HDDs that are sensitive to random reads. Many small block transfers can have a negative impact on performance, but won't show up as a spike in `blockTransferRateBytes` since the sizes are small. Thus the new metrics showing average block size and block transfer rate are very useful for monitoring the health/performance of the ESS, especially when running on HDDs.

For the `YarnShuffleServiceMetrics`, the three `Timer` metrics exposed by `ExternalBlockHandler` are currently underutilized in a YARN-based environment -- they are basically treated as a `Meter`, only exposing rate-based information, even though the metrics themselves collect detailed histograms of timing information. We should expose this information for better observability.

### Does this PR introduce _any_ user-facing change?
Yes, there are two entirely new metrics for the ESS, as documented in `monitoring.md`. Additionally in a YARN environment, `Timer` metrics exposed by the ESS will include more rich timing information.

### How was this patch tested?
New unit tests are added to verify that new metrics are showing up as expected.

We have been running this patch internally for approx. 1 year and have found it to be useful for monitoring the health of ESS and diagnosing performance issues.

Closes #32388 from xkrogen/xkrogen-SPARK-35258-ess-new-metrics.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-28 02:36:17 -05:00
Erik Krogen 866df69c62 [SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line
### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.
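
The shape of the refactoring, as a simplified standalone sketch (the class names below are placeholders, not the real Spark classes):

```scala
import java.io.File
import java.net.URL

// Non-YARN backends configure no extra user classpath.
class ExecutorBackendSketch {
  def getUserClassPath: Seq[URL] = Nil
}

// The YARN backend rebuilds the user classpath from configuration instead of
// receiving one --user-class-path argument per jar on the executor command line.
class YarnExecutorBackendSketch(appJar: Option[String], secondaryJars: Seq[String])
    extends ExecutorBackendSketch {
  override def getUserClassPath: Seq[URL] =
    (appJar.toSeq ++ secondaryJars).map(jar => new File(jar).toURI.toURL)
}
```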

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Closes #32810 from xkrogen/xkrogen-SPARK-35672-classpath-scalable.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-06-25 08:53:57 -05:00
Dongjoon Hyun 59ec7a20b0 [SPARK-35885][K8S][R] Use keyserver.ubuntu.com as a keyserver for CRAN
### What changes were proposed in this pull request?

This PR aims to use `keyserver.ubuntu.com` as a keyserver for CRAN.

### Why are the changes needed?

Currently, both keyservers fail and the K8s IT fails at the SparkR image building phase.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/801/console
```
$ docker run -it --rm openjdk:11 /bin/bash
root@3e89a8d05378:/# echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list
root@3e89a8d05378:/# (apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' || apt-key adv --keyserver keys.openpgp.org --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF')
Executing: /tmp/apt-key-gpghome.8lNIiUuhoE/gpg.1.sh --keyserver keys.gnupg.net --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
gpg: keyserver receive failed: No name
Executing: /tmp/apt-key-gpghome.stxb8XUlx8/gpg.1.sh --keyserver keys.openpgp.org --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
gpg: key AD5F960A256A04AF: new key but contains no user ID - skipped
gpg: Total number processed: 1
gpg:           w/o user IDs: 1
root@3e89a8d05378:/# apt-get update
...
Err:3 http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY FCAE2A0E115C3D8A
...
W: GPG error: http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY FCAE2A0E115C3D8A
E: The repository 'http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
```

`keyserver.ubuntu.com` is a recommended backup keyserver in the CRAN documentation.
- http://cloud.r-project.org/bin/linux/debian/
```
$ docker run -it --rm openjdk:11 /bin/bash
root@c9b183e45ffe:/# echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list
root@c9b183e45ffe:/# apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
Executing: /tmp/apt-key-gpghome.P6cxYkOge7/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
gpg: key AD5F960A256A04AF: public key "Johannes Ranke (Wissenschaftlicher Berater) <johannes.ranke@jrwb.de>" imported
gpg: Total number processed: 1
gpg:               imported: 1
root@c9b183e45ffe:/# apt-get update
Get:1 http://deb.debian.org/debian buster InRelease [122 kB]
Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:3 http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease [4375 B]
Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:5 http://cloud.r-project.org/bin/linux/debian buster-cran35/ Packages [53.3 kB]
Get:6 http://security.debian.org/debian-security buster/updates/main arm64 Packages [287 kB]
Get:7 http://deb.debian.org/debian buster/main arm64 Packages [7735 kB]
Get:8 http://deb.debian.org/debian buster-updates/main arm64 Packages [14.5 kB]
Fetched 8334 kB in 2s (4537 kB/s)
Reading package lists... Done
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass K8s IT Jenkins SparkR image building. Or, manually do the following.

```
$ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build
```

Closes #33071 from dongjoon-hyun/SPARK-35885.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-24 19:51:49 -07:00
Kevin Su 765106cb80 [SPARK-35699][K8S] Improve error message when creating k8s pod failed
### What changes were proposed in this pull request?

Improve the error message when clients use a wrong master URL to submit a job to K8s.

### Why are the changes needed?

Current error messages are not clear for users.
```
(base) ➜ spark git:(master) ./bin/spark-submit \
--master k8s://https://192.168.49.3:8443 \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=pingsutw/spark:testing \
local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160)
21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
```
Command to reproduce:
```
./bin/spark-submit \
  --master k8s://https://192.168.49.2:8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=pingsutw/spark:testing \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
```

### Does this PR introduce _any_ user-facing change?

Yes, users will see clearer error messages.

### How was this patch tested?

Pass the CIs.

Closes #32874 from pingsutw/SPARK-35699.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-21 19:10:11 -07:00
Dongjoon Hyun 2b9902d26a [SPARK-35831][YARN][TEST-MAVEN] Handle PathOperationException in copyFileToRemote on the same src and dest
### What changes were proposed in this pull request?

This PR aims to be more robust against underlying Hadoop library changes. Apache Spark's `copyFileToRemote` has an option, `force`, to always invoke copying, and it can hit `org.apache.hadoop.fs.PathOperationException` in some Hadoop versions.

In Apache Hadoop 3.3.1, [HADOOP-16878](https://issues.apache.org/jira/browse/HADOOP-16878) was reverted as the last revert commit on `branch-3.3.1`. However, it's still in Apache Hadoop 3.4.0.
- a3b9c37a39

### Why are the changes needed?

Currently, Apache Spark Jenkins hits a flakiness issue.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/lastCompletedBuild/testReport/org.apache.spark.deploy.yarn/ClientSuite/distribute_jars_archive/history/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/2459/testReport/junit/org.apache.spark.deploy.yarn/ClientSuite/distribute_jars_archive/

```
org.apache.hadoop.fs.PathOperationException:
`Source (file:/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar) and destination (/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar)
are equal in the copy command.': Operation not supported
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:403)
```

Apache Spark has three cases.
- `!compareFs(srcFs, destFs)`: This is safe because we will not have this exception.
- `"file".equals(srcFs.getScheme)`: This is safe because this cannot be a `false` alarm.
- `force=true`:
    - For the `good` alarm part, Spark works in the same way.
    - For the `false` alarm part, Spark is safe because we use `force = true` only for copying `localConfArchive` instead of a general copy between two random clusters.

```scala
val localConfArchive = new Path(createConfArchive(confsToOverride).toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
destName = Some(LOCALIZED_CONF_ARCHIVE))
```
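
Roughly, the defensive idea can be sketched as below (an assumption-laden sketch, not the literal Spark change): even with `force = true`, a copy is unnecessary when the fully qualified source and destination are the same path, which is exactly the case the newer Hadoop versions reject.

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: skip the copy when source and destination resolve to the same file.
def needsCopy(srcFs: FileSystem, src: Path, destFs: FileSystem, dest: Path): Boolean =
  srcFs.makeQualified(src) != destFs.makeQualified(dest)
```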

### Does this PR introduce _any_ user-facing change?

No. This preserves the previous Apache Spark behavior.

### How was this patch tested?

Pass the Jenkins with Maven.

Closes #32983 from dongjoon-hyun/SPARK-35831.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-21 23:28:27 +08:00
Chandni Singh 8ce1e344e5 [SPARK-35671][SHUFFLE][CORE] Add support in the ESS to serve merged shuffle block meta and data to executors
### What changes were proposed in this pull request?
This adds support in the ESS to serve merged shuffle block meta and data requests to executors.
This change is needed for fetching remote merged shuffle data from the remote shuffle services. This is part of push-based shuffle SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

This change introduces new messages between clients and the external shuffle service:

1. `MergedBlockMetaRequest`: The client sends this to external shuffle to get the meta information for a merged block. The response to this is one of these :
  - `MergedBlockMetaSuccess` : contains request id, number of chunks, and a `ManagedBuffer` which is a `FileSegmentBuffer` backed by the merged block meta file.
  - `RpcFailure`: this is sent back to the client in case of failure. This is an existing message.

2. `FetchShuffleBlockChunks`: This is similar to `FetchShuffleBlocks` message but it is to fetch merged shuffle chunks instead of blocks.
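
For orientation only, a highly simplified sketch of what these messages carry (the real messages are Java classes in the `network-shuffle` module; the field sets below are partly hypothetical):

```scala
// Hypothetical shapes for illustration; buffers and some ids are omitted.
sealed trait MergedShuffleMessageSketch
case class MergedBlockMetaRequestSketch(appId: String, shuffleId: Int, reduceId: Int)
  extends MergedShuffleMessageSketch
case class MergedBlockMetaSuccessSketch(requestId: Long, numChunks: Int)
  extends MergedShuffleMessageSketch // plus a ManagedBuffer over the merged meta file
case class FetchShuffleBlockChunksSketch(appId: String, execId: String, shuffleId: Int,
    reduceIds: Array[Int], chunkIds: Array[Array[Int]])
  extends MergedShuffleMessageSketch
```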

### Why are the changes needed?
These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Chandni Singh chsingh@linkedin.com
Co-authored-by: Min Shen mshen@linkedin.com

Closes #32811 from otterc/SPARK-35671.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-20 17:22:37 -05:00
Dongjoon Hyun 4f51e0045e [SPARK-35832][CORE][ML][K8S][TESTS] Add LocalRootDirsTest trait
### What changes were proposed in this pull request?

To make the test suite more robust, this PR aims to add a new trait, `LocalRootDirsTest`, by refactoring `SortShuffleSuite`'s helper functions and applying it to the following:
- ShuffleNettySuite
- ShuffleOldFetchProtocolSuite
- ExternalShuffleServiceSuite
- KubernetesLocalDiskShuffleDataIOSuite
- LocalDirsSuite
- RDDCleanerSuite
- ALSCleanerSuite

In addition, this fixes a UT in `KubernetesLocalDiskShuffleDataIOSuite`.
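
Roughly, such a trait can look like the sketch below (assumptions: ScalaTest's `BeforeAndAfterEach`, a fresh temp dir per test wired into `spark.local.dir`; the real trait differs in detail):

```scala
import java.io.File
import java.nio.file.Files
import org.scalatest.{BeforeAndAfterEach, Suite}

// Every test gets its own local root dir, removed afterwards, so suites
// cannot leak shuffle files into one another.
trait LocalRootDirsTestSketch extends BeforeAndAfterEach { self: Suite =>
  protected var tempDir: File = _

  private def deleteRecursively(f: File): Unit = {
    if (f.isDirectory) f.listFiles().foreach(deleteRecursively)
    f.delete()
  }

  override protected def beforeEach(): Unit = {
    super.beforeEach()
    tempDir = Files.createTempDirectory("spark-local-dirs-").toFile
    sys.props("spark.local.dir") = tempDir.getAbsolutePath
  }

  override protected def afterEach(): Unit = {
    try {
      sys.props -= "spark.local.dir"
      deleteRecursively(tempDir)
    } finally {
      super.afterEach()
    }
  }
}
```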

### Why are the changes needed?

`ShuffleSuite` is extended by four classes but only `SortShuffleSuite` does the clean-up correctly.
```
ShuffleSuite
- SortShuffleSuite
- ShuffleNettySuite
- ShuffleOldFetchProtocolSuite
- ExternalShuffleServiceSuite
```

Since `KubernetesLocalDiskShuffleDataIOSuite` looks for the other storage directory, leftovers from `ShuffleSuite` cause flakiness.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/2649/testReport/junit/org.apache.spark.shuffle/KubernetesLocalDiskShuffleDataIOSuite/recompute_is_not_blocked_by_the_recovery/
```
org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 in stage 1.0 (TID 3) had a not serializable result: org.apache.spark.ShuffleSuite$NonJavaSerializableClass
...
org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIOSuite.$anonfun$new$2(KubernetesLocalDiskShuffleDataIOSuite.scala:52)
```

For the other suites, a clean-up implementation is used but it is incomplete, so they are refactored to use the new trait.

### Does this PR introduce _any_ user-facing change?

No, this is a test-only change.

### How was this patch tested?

Pass the CIs.

Closes #32986 from dongjoon-hyun/SPARK-35832.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-20 10:53:53 -07:00
Dongjoon Hyun b9d6473e89 [SPARK-35593][K8S][TESTS][FOLLOWUP] Increase timeout in KubernetesLocalDiskShuffleDataIOSuite
### What changes were proposed in this pull request?

This increases the timeout from 10 seconds to 60 seconds in KubernetesLocalDiskShuffleDataIOSuite to reduce the flakiness.

### Why are the changes needed?

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/140003/testReport/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs

Closes #32967 from dongjoon-hyun/SPARK-35593-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-06-19 15:22:29 +09:00
Chao Sun 506ef9aad7 [SPARK-29250][BUILD] Upgrade to Hadoop 3.3.1
### What changes were proposed in this pull request?

This upgrades the default Hadoop version from 3.2.1 to 3.3.1. The changes here simply update the version number and the dependency file.

### Why are the changes needed?

Hadoop 3.3.1 just came out and brings many client-side improvements, such as for S3A/ABFS (20% faster when accessing S3). These are important for users who want to use Spark in a cloud environment.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Existing unit tests in Spark
- Manually tested using my S3 bucket for event log dir:
```
bin/spark-shell \
  -c spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  -c spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
  -c spark.eventLog.enabled=true \
  -c spark.eventLog.dir=s3a://<my-bucket>
```
- Manually tested against docker-based YARN dev cluster, by running `SparkPi`.

Closes #30135 from sunchao/SPARK-29250.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-16 13:28:07 -07:00
Kent Yao 1125afd462 [MINOR][K8S] Print the driver pod name instead of Some(name) if absent
### What changes were proposed in this pull request?

Print the driver pod name instead of Some(name) if absent.

### Why are the changes needed?

fix error hint

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #32889 from yaooqinn/minork8s.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-13 09:11:14 -07:00
Dongjoon Hyun cf07036d9b [SPARK-35593][K8S][CORE] Support shuffle data recovery on the reused PVCs
### What changes were proposed in this pull request?

Previously, the following two commits allow driver-owned on-demand PVC reuse.
- SPARK-35182 Support driver-owned on-demand PVC
- SPARK-35416 Support PersistentVolumeClaim Reuse

This PR aims to recover the shuffle data on those remounted PVCs. The lifecycle of the PVCs is tied to that of the Spark job. Since this is a K8s-specific feature, the `ShuffleDataIO` plugin is used.

### Why are the changes needed?

Although a pod is killed, we can remount its PVCs and recover some data from them.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the newly added test cases.

Closes #32730 from dongjoon-hyun/SPARK-RECOVER-SHUFFLE-DATA.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 16:06:58 -07:00
Kent Yao bc1edba8f6 [SPARK-35692][K8S] Use AtomicInteger for executor id generating
### What changes were proposed in this pull request?

AtomicInteger is enough for executor ids. In this PR, we use it to replace AtomicLong, like other cluster managers (e.g. YARN, standalone) do.
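
As a minimal sketch of the pattern (illustrative, not the exact allocator code):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Monotonically increasing executor ids from an AtomicInteger, as other
// cluster managers (YARN, standalone) already do.
object ExecutorIdsSketch {
  private val counter = new AtomicInteger(0)
  def nextExecutorId(): String = counter.incrementAndGet().toString
}
```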

### Why are the changes needed?

See the discussion here https://github.com/apache/spark/pull/32610#discussion_r648007320

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass CI with existing tests

Closes #32837 from yaooqinn/SPARK-35692.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 13:42:07 -07:00
Kent Yao b4b78ce265 [SPARK-32975][K8S][FOLLOWUP] Avoid None.get exception
### What changes were proposed in this pull request?

A follow-up for SPARK-32975 to avoid an unexpected `None.get` exception.

Running SparkPi with Docker Desktop, since podName is an Option, we get:
```logtalk
21/06/09 01:09:12 ERROR Utils: Uncaught exception in thread main
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1(ExecutorPodsAllocator.scala:110)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1417)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.start(ExecutorPodsAllocator.scala:111)
	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:99)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2686)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:948)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:942)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```
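
The general shape of the fix is to handle the absent pod name explicitly instead of calling `.get` on the `Option` (sketch below; `spark.kubernetes.driver.pod.name` may legitimately be unset, e.g. outside a real driver pod):

```scala
import org.apache.spark.SparkConf

// Sketch only: treat the optional driver pod name defensively.
val conf = new SparkConf()
conf.getOption("spark.kubernetes.driver.pod.name") match {
  case Some(name) => println(s"driver pod: $name")     // e.g. use it as the owner reference
  case None       => println("no driver pod name set") // e.g. Docker Desktop / client mode
}
```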

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

Manual.

Closes #32830 from yaooqinn/SPARK-32975.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 13:39:39 -07:00
Chris Wu 497c80a1ad [SPARK-32975][K8S] Add config for driver readiness timeout before executors start
### What changes were proposed in this pull request?
Add a new config that controls the timeout for waiting for the driver pod's readiness before allocating executor pods. This wait only happens once, on application start.

### Why are the changes needed?
The driver's headless service can be resolved by DNS only after the driver pod is ready. If an executor tries to connect to the headless service before the driver pod is ready, it will hit `UnknownHostException`, get into an error state, and not be restarted. **This case usually happens when the driver pod has sidecar containers that haven't finished their creation when executors start.** So basically there is a race condition. This issue can be mitigated by tweaking this config.
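
For example, an application whose driver pod carries slow-starting sidecars could raise the new timeout (the value below is illustrative):

```scala
import org.apache.spark.SparkConf

// Wait up to 60s for the driver pod (including sidecars) to become ready
// before executor pods are allocated.
val conf = new SparkConf()
  .set("spark.kubernetes.allocation.driver.readinessTimeout", "60s")
```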

### Does this PR introduce _any_ user-facing change?
A new config `spark.kubernetes.allocation.driver.readinessTimeout` added.

### How was this patch tested?
Existing tests.

Closes #32752 from cchriswu/SPARK-32975-fix.

Lead-authored-by: Chris Wu <wucaowei19@gmail.com>
Co-authored-by: Chris Wu <wcaowei@vmware.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-04 06:59:49 -07:00
Dongjoon Hyun 4f0db872a0 [SPARK-35416][K8S][FOLLOWUP] Use Set instead of ArrayBuffer
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32564 .

### Why are the changes needed?

To use Set instead of ArrayBuffer and add a return type.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #32758 from dongjoon-hyun/SPARK-35416-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-03 10:41:11 -05:00
Kousuke e04883880f [SPARK-35586][K8S][TESTS] Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests
### What changes were proposed in this pull request?

This PR sets a default value for `spark.kubernetes.test.sparkTgz` in `kubernetes/integration-tests/pom.xml` for Kubernetes integration tests.

### Why are the changes needed?

In the current master, running the integration tests with the following command will fail because there is no default value set for the property.
```
build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes -Pkubernetes-integration-tests -Psparkr  -pl resource-managers/kubernetes/integration-tests integration-test
```
```
+ mkdir -p /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
+ tar -xzvf --test-exclude-tags --strip-components=1 -C /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
tar (child): --test-exclude-tags: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
[ERROR] Command execution failed.
```

According to `setup-integration-test-env.sh`, `N/A` is intended as the default value, so this PR chooses it.
```
SPARK_TGZ="N/A"
MVN="$TEST_ROOT_DIR/build/mvn"
EXCLUDE_TAGS=""
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The build and tests finish successfully with the command shown above.

Closes #32722 from sarutak/fix-pom-for-kube-integ.

Authored-by: Kousuke <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-01 00:40:02 -07:00
Kent Yao 96b0548ab6 [SPARK-35493][K8S] make spark.blockManager.port fallback for spark.driver.blockManager.port as same as other cluster managers
### What changes were proposed in this pull request?

`spark.blockManager.port` does not work for K8s driver pods now; we should make it work as it does for other cluster managers.

### Why are the changes needed?

`spark.blockManager.port` should work for the Spark driver pod.

### Does this PR introduce _any_ user-facing change?

Yes, `spark.blockManager.port` will be respected iff it is present and `spark.driver.blockManager.port` is absent.
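
A minimal sketch of the intended fallback order (not the literal Spark code):

```scala
import org.apache.spark.SparkConf

// Prefer spark.driver.blockManager.port, fall back to spark.blockManager.port,
// as other cluster managers already do.
def driverBlockManagerPort(conf: SparkConf, default: Int = 0): Int =
  conf.getOption("spark.driver.blockManager.port")
    .orElse(conf.getOption("spark.blockManager.port"))
    .map(_.toInt)
    .getOrElse(default)
```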

### How was this patch tested?

new tests

Closes #32639 from yaooqinn/SPARK-35493.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-23 08:07:57 -07:00
Kent Yao d957426351 [SPARK-35482][K8S] Use spark.blockManager.port not the wrong spark.blockmanager.port in BasicExecutorFeatureStep
### What changes were proposed in this pull request?

Most Spark conf keys are case-sensitive, including `spark.blockManager.port`, so we cannot get the correct port number with `spark.blockmanager.port`.

This PR changes the wrong key to `spark.blockManager.port` in `BasicExecutorFeatureStep`.

This PR also ensures a fast fail when the port value is invalid for executor containers. When 0 is specified (it is valid as a random port, but invalid as a K8s request), it should not be put in the `containerPort` field of the executor pod description. We do not expect executor pods to continuously fail to create because of invalid requests.
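
A sketch of both points (the correctly cased key, and no `containerPort` entry for a random port); the helper name and port bounds here are assumptions:

```scala
import org.apache.spark.SparkConf

// Read the correctly cased key, fail fast on invalid values, and only expose a
// containerPort for a concrete (non-zero) port: 0 means "random" and is not a
// valid K8s containerPort request.
def executorBlockManagerContainerPorts(conf: SparkConf): Seq[Int] = {
  val port = conf.getInt("spark.blockManager.port", 0)
  require(port == 0 || (port >= 1024 && port <= 65535), s"Invalid blockManager port: $port")
  if (port != 0) Seq(port) else Seq.empty
}
```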

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #32621 from yaooqinn/SPARK-35482.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-21 08:27:49 -07:00
Ashray Jain de59e01aa4 [SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by Spark as immutable
Kubernetes supports marking secrets and config maps as immutable to gain performance.

https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable
https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable

For K8s clusters that run many thousands of Spark applications, this can yield a significant reduction in load on the kube-apiserver.

From the K8s docs:

> For clusters that extensively use Secrets (at least tens of thousands of unique Secret to Pod mounts), preventing changes to their data has the following advantages:
> - protects you from accidental (or unwanted) updates that could cause applications outages
> - improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for secrets marked as immutable.

For any secrets and config maps we create in Spark that are immutable, we could mark them as immutable by including the following when building the secret/config map
```
.withImmutable(true)
```
This feature has been supported in K8s as beta since K8s 1.19 and as GA since K8s 1.21
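
For reference, with the fabric8 client this is a single extra call when the config map is built (the names and data below are illustrative):

```scala
import io.fabric8.kubernetes.api.model.ConfigMapBuilder

// An immutable ConfigMap: the kube-apiserver can close its watches on it.
val configMap = new ConfigMapBuilder()
  .withNewMetadata()
    .withName("spark-exec-conf-map") // illustrative name
  .endMetadata()
  .withImmutable(true)
  .addToData("spark.properties", "spark.app.name=example")
  .build()
```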

### What changes were proposed in this pull request?
All K8s secrets and config maps created by Spark are marked "immutable".

### Why are the changes needed?
See description above.

### Does this PR introduce _any_ user-facing change?
Don't think so

### How was this patch tested?
Augmented existing unit tests.

Closes #32588 from ashrayjain/patch-1.

Authored-by: Ashray Jain <ashrayjain@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-19 21:25:33 -07:00
Dongjoon Hyun 4c015555da [SPARK-35416][K8S] Support PersistentVolumeClaim Reuse
### What changes were proposed in this pull request?

This PR aims to add a new configuration, `spark.kubernetes.driver.reusePersistentVolumeClaim`, to reuse driver-owned `PersistentVolumeClaims` of the **deleted** executor pods.

Note also that `driver-owned PersistentVolumeClaims` is controlled by `spark.kubernetes.driver.ownPersistentVolumeClaim` which is recently added.

### Why are the changes needed?

PVC creation takes some time. This feature can reduce it by reusing PVCs.

For example, we can start `Pi` app with two executors with PVCs.
```
$ k logs -f pi | grep ExecutorPodsAllocator
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 0.
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 0 PVCs
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-1-pvc-0 with StorageClass scaleio
21/05/16 23:36:33 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-2-pvc-0 with StorageClass scaleio
```

After killing one executor, Spark tries to look up reusable PVCs, but the dead executor's PVC may not be returned yet because K8s works asynchronously. In this case, Spark creates a new PVC as a normal operation.
```
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1.
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 2 PVCs
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-3-pvc-0 with StorageClass scaleio
```

After killing another executor, Spark found one reusable PVC, `pi-exec-1-pvc-0`, and reuse it.
```
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1.
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Found 1 reusable PVCs from 3 PVCs
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim pi-exec-1-pvc-0
```

In this case, we can easily notice the remounted PVC because `ClaimName`, `pi-exec-1-pvc-0`, doesn't have the prefix of pod name, `pi-exec-4`.
```
$ k describe pod pi-exec-4 | grep pi-exec-1-pvc-0
    ClaimName:  pi-exec-1-pvc-0
```

### Does this PR introduce _any_ user-facing change?

Yes, but this is a new feature which is disabled by default (guarded by the new conf).

### How was this patch tested?

Pass the CIs with the newly added test case.

K8S IT test also passed.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 7 seconds.
Total number of tests run: 26
Suites: completed 2, aborted 0
Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  24:14 min
[INFO] Finished at: 2021-05-16T17:24:40-07:00
[INFO] ------------------------------------------------------------------------
```

Closes #32564 from dongjoon-hyun/SPARK-35416.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-17 00:20:48 -07:00