Commit graph

606 commits

Author SHA1 Message Date
Erik Krogen 866df69c62 [SPARK-35672][CORE][YARN] Pass user classpath entries to executors using config instead of command line
### What changes were proposed in this pull request?
Refactor the logic for constructing the user classpath from `yarn.ApplicationMaster` into `yarn.Client` so that it can be leveraged on the executor side as well, instead of having the driver construct it and pass it to the executor via command-line arguments. A new method, `getUserClassPath`, is added to `CoarseGrainedExecutorBackend` which defaults to `Nil` (consistent with the existing behavior where non-YARN resource managers do not configure the user classpath). `YarnCoarseGrainedExecutorBackend` overrides this to construct the user classpath from the existing `APP_JAR` and `SECONDARY_JARS` configs.

### Why are the changes needed?
User-provided JARs are made available to executors using a custom classloader, so they do not appear on the standard Java classpath. Instead, they are passed as a list to the executor which then creates a classloader out of the URLs. Currently in the case of YARN, this list of JARs is crafted by the Driver (in `ExecutorRunnable`), which then passes the information to the executors (`CoarseGrainedExecutorBackend`) by specifying each JAR on the executor command line as `--user-class-path /path/to/myjar.jar`. This can cause extremely long argument lists when there are many JARs, which can cause the OS argument length to be exceeded, typically manifesting as the error message:

> /bin/bash: Argument list too long

A [Google search](https://www.google.com/search?q=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22&oq=spark%20%22%2Fbin%2Fbash%3A%20argument%20list%20too%20long%22) indicates that this is not a theoretical problem and afflicts real users, including ours. Passing this list using the configurations instead resolves this issue.

### Does this PR introduce _any_ user-facing change?
No, except for fixing the bug, allowing for larger JAR lists to be passed successfully. Configuration of JARs is identical to before.

### How was this patch tested?
New unit tests were added in `YarnClusterSuite`. Also, we have been running a similar fix internally for 4 months with great success.

Closes #32810 from xkrogen/xkrogen-SPARK-35672-classpath-scalable.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-06-25 08:53:57 -05:00
Dongjoon Hyun 59ec7a20b0 [SPARK-35885][K8S][R] Use keyserver.ubuntu.com as a keyserver for CRAN
### What changes were proposed in this pull request?

This PR aims to use `keyserver.ubuntu.com` as a keyserver for CRAN.

### Why are the changes needed?

Currently, both servers fail and K8s IT fails at SparkR image building phase.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/801/console
```
$ docker run -it --rm openjdk:11 /bin/bash
root3e89a8d05378:/# echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list
root3e89a8d05378:/# (apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' || apt-key adv --keyserver keys.openpgp.org --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF')
Executing: /tmp/apt-key-gpghome.8lNIiUuhoE/gpg.1.sh --keyserver keys.gnupg.net --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
gpg: keyserver receive failed: No name
Executing: /tmp/apt-key-gpghome.stxb8XUlx8/gpg.1.sh --keyserver keys.openpgp.org --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
gpg: key AD5F960A256A04AF: new key but contains no user ID - skipped
gpg: Total number processed: 1
gpg:           w/o user IDs: 1
root3e89a8d05378:/# apt-get update
...
Err:3 http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease
  The following signatures couldn't be verified because the public key is not available: NO_PUBKEY FCAE2A0E115C3D8A
...
W: GPG error: http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY FCAE2A0E115C3D8A
E: The repository 'http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease' is not signed.
N: Updating from such a repository can't be done securely, and is therefore disabled by default.
N: See apt-secure(8) manpage for repository creation and user configuration details.
```

`keyserver.ubuntu.com` is a recommended backup server in CRAN document.
- http://cloud.r-project.org/bin/linux/debian/
```
$ docker run -it --rm openjdk:11 /bin/bash
rootc9b183e45ffe:/# echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list
rootc9b183e45ffe:/# apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF'
Executing: /tmp/apt-key-gpghome.P6cxYkOge7/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF
gpg: key AD5F960A256A04AF: public key "Johannes Ranke (Wissenschaftlicher Berater) <johannes.rankejrwb.de>" imported
gpg: Total number processed: 1
gpg:               imported: 1
rootc9b183e45ffe:/# apt-get update
Get:1 http://deb.debian.org/debian buster InRelease [122 kB]
Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB]
Get:3 http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease [4375 B]
Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:5 http://cloud.r-project.org/bin/linux/debian buster-cran35/ Packages [53.3 kB]
Get:6 http://security.debian.org/debian-security buster/updates/main arm64 Packages [287 kB]
Get:7 http://deb.debian.org/debian buster/main arm64 Packages [7735 kB]
Get:8 http://deb.debian.org/debian buster-updates/main arm64 Packages [14.5 kB]
Fetched 8334 kB in 2s (4537 kB/s)
Reading package lists... Done
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass K8s IT Jenkins SparkR image building. Or, manually do the following.

```
$ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build
```

Closes #33071 from dongjoon-hyun/SPARK-35885.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-24 19:51:49 -07:00
Kevin Su 765106cb80 [SPARK-35699][K8S] Improve error message when creating k8s pod failed
### What changes were proposed in this pull request?

Improve error message when clients use wrong master URL to submit a job to k8s.

### Why are the changes needed?

Current error messages are not clear for users.
```
(base) ➜ spark git:(master) ./bin/spark-submit \
--master k8s://https://192.168.49.3:8443 \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.executor.instances=3 \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=pingsutw/spark:testing \
local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160)
21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file
21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image.
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed.
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380)
at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86)
```
Below command to reproduce;
```
./bin/spark-submit \
  --master k8s://https://192.168.49.2:8443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=3 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=pingsutw/spark:testing \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
```

### Does this PR introduce _any_ user-facing change?

Yes, users will see more clear error messages.

### How was this patch tested?

Pass the CIs.

Closes #32874 from pingsutw/SPARK-35699.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-21 19:10:11 -07:00
Dongjoon Hyun 2b9902d26a [SPARK-35831][YARN][TEST-MAVEN] Handle PathOperationException in copyFileToRemote on the same src and dest
### What changes were proposed in this pull request?

This PR aims to be more robust on the underlying Hadoop library changes. Apache Spark's `copyFileToRemote` has an option, `force`, to invoke copying always and it can hit `org.apache.hadoop.fs.PathOperationException` in some Hadoop versions.

From Apache Hadoop 3.3.1, we reverted [HADOOP-16878](https://issues.apache.org/jira/browse/HADOOP-16878) as the last revert commit on `branch-3.3.1`. However, it's still in Apache Hadoop 3.4.0.
- a3b9c37a39

### Why are the changes needed?

Currently, Apache Spark Jenkins hits a flakiness issue.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2/lastCompletedBuild/testReport/org.apache.spark.deploy.yarn/ClientSuite/distribute_jars_archive/history/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/2459/testReport/junit/org.apache.spark.deploy.yarn/ClientSuite/distribute_jars_archive/

```
org.apache.hadoop.fs.PathOperationException:
`Source (file:/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar) and destination (/home/jenkins/workspace/spark-master-test-maven-hadoop-3.2/resource-managers/yarn/target/tmp/spark-703b8e99-63cc-4ba6-a9bc-25c7cae8f5f9/testJar9120517778809167117.jar)
are equal in the copy command.': Operation not supported
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:403)
```

Apache Spark has three cases.
- `!compareFs(srcFs, destFs)`: This is safe because we will not have this exception.
- `"file".equals(srcFs.getScheme)`: This is safe because this cannot be a `false` alarm.
- `force=true`:
    - For the `good` alarm part, Spark works in the same way.
    - For the `false` alarm part, Spark is safe because we use `force = true` only for copying `localConfArchive` instead of a general copy between two random clusters.

```scala
val localConfArchive = new Path(createConfArchive(confsToOverride).toURI())
copyFileToRemote(destDir, localConfArchive, replication, symlinkCache, force = true,
destName = Some(LOCALIZED_CONF_ARCHIVE))
```

### Does this PR introduce _any_ user-facing change?

No. This preserves the previous Apache Spark behavior.

### How was this patch tested?

Pass the Jenkins with Maven.

Closes #32983 from dongjoon-hyun/SPARK-35831.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-21 23:28:27 +08:00
Chandni Singh 8ce1e344e5 [SPARK-35671][SHUFFLE][CORE] Add support in the ESS to serve merged shuffle block meta and data to executors
### What changes were proposed in this pull request?
This adds support in the ESS to serve merged shuffle block meta and data requests to executors.
This change is needed for fetching remote merged shuffle data from the remote shuffle services. This is part of push-based shuffle SPIP [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

This change introduces new messages between clients and the external shuffle service:

1. `MergedBlockMetaRequest`: The client sends this to external shuffle to get the meta information for a merged block. The response to this is one of these :
  - `MergedBlockMetaSuccess` : contains request id, number of chunks, and a `ManagedBuffer` which is a `FileSegmentBuffer` backed by the merged block meta file.
  - `RpcFailure`: this is sent back to client in case of failure. This is an existing message.

2. `FetchShuffleBlockChunks`: This is similar to `FetchShuffleBlocks` message but it is to fetch merged shuffle chunks instead of blocks.

### Why are the changes needed?
These changes are needed for push-based shuffle. Refer to the SPIP in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added unit tests.
The reference PR with the consolidated changes covering the complete implementation is also provided in [SPARK-30602](https://issues.apache.org/jira/browse/SPARK-30602).
We have already verified the functionality and the improved performance as documented in the SPIP doc.

Lead-authored-by: Chandni Singh chsinghlinkedin.com
Co-authored-by: Min Shen mshenlinkedin.com

Closes #32811 from otterc/SPARK-35671.

Lead-authored-by: Chandni Singh <singh.chandni@gmail.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Co-authored-by: Chandni Singh <chsingh@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-20 17:22:37 -05:00
Dongjoon Hyun 4f51e0045e [SPARK-35832][CORE][ML][K8S][TESTS] Add LocalRootDirsTest trait
### What changes were proposed in this pull request?

To make the test suite more robust, this PR aims to add a new trait, `LocalRootDirsTest`, by refactoring `SortShuffleSuite`'s helper functions and applying it to the following:
- ShuffleNettySuite
- ShuffleOldFetchProtocolSuite
- ExternalShuffleServiceSuite
- KubernetesLocalDiskShuffleDataIOSuite
- LocalDirsSuite
- RDDCleanerSuite
- ALSCleanerSuite

In addition, this fixes a UT in `KubernetesLocalDiskShuffleDataIOSuite`.

### Why are the changes needed?

`ShuffleSuite` is extended by four classes but only `SortShuffleSuite` does the clean-up correctly.
```
ShuffleSuite
- SortShuffleSuite
- ShuffleNettySuite
- ShuffleOldFetchProtocolSuite
- ExternalShuffleServiceSuite
```

Since `KubernetesLocalDiskShuffleDataIOSuite` is looking for the other storage directory, the leftover of `ShuffleSuite` causes flakiness.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/2649/testReport/junit/org.apache.spark.shuffle/KubernetesLocalDiskShuffleDataIOSuite/recompute_is_not_blocked_by_the_recovery/
```
org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 in stage 1.0 (TID 3) had a not serializable result: org.apache.spark.ShuffleSuite$NonJavaSerializableClass
...
org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIOSuite.$anonfun$new$2(KubernetesLocalDiskShuffleDataIOSuite.scala:52)
```

For the other suites, the clean-up implementation is used but not complete. So, they are refactored to use new trait.

### Does this PR introduce _any_ user-facing change?

No, this is a test-only change.

### How was this patch tested?

Pass the CIs.

Closes #32986 from dongjoon-hyun/SPARK-35832.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-20 10:53:53 -07:00
Dongjoon Hyun b9d6473e89 [SPARK-35593][K8S][TESTS][FOLLOWUP] Increase timeout in KubernetesLocalDiskShuffleDataIOSuite
### What changes were proposed in this pull request?

This increases the timeout from 10 seconds to 60 seconds in KubernetesLocalDiskShuffleDataIOSuite to reduce the flakiness.

### Why are the changes needed?

- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/140003/testReport/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs

Closes #32967 from dongjoon-hyun/SPARK-35593-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-06-19 15:22:29 +09:00
Chao Sun 506ef9aad7 [SPARK-29250][BUILD] Upgrade to Hadoop 3.3.1
### What changes were proposed in this pull request?

This upgrade default Hadoop version from 3.2.1 to 3.3.1. The changes here are simply update the version number and dependency file.

### Why are the changes needed?

Hadoop 3.3.1 just came out, which comes with many client-side improvements such as for S3A/ABFS (20% faster when accessing S3). These are important for users who want to use Spark in a cloud environment.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

- Existing unit tests in Spark
- Manually tested using my S3 bucket for event log dir:
```
bin/spark-shell \
  -c spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  -c spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
  -c spark.eventLog.enabled=true
  -c spark.eventLog.dir=s3a://<my-bucket>
```
- Manually tested against docker-based YARN dev cluster, by running `SparkPi`.

Closes #30135 from sunchao/SPARK-29250.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-16 13:28:07 -07:00
Kent Yao 1125afd462 [MINOR][K8S] Print the driver pod name instead of Some(name) if absent
Print the driver pod name instead of Some(name) if absent

### What changes were proposed in this pull request?

### Why are the changes needed?

fix error hint

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new test

Closes #32889 from yaooqinn/minork8s.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-13 09:11:14 -07:00
Dongjoon Hyun cf07036d9b [SPARK-35593][K8S][CORE] Support shuffle data recovery on the reused PVCs
### What changes were proposed in this pull request?

Previously, the following two commits allow driver-owned on-demand PVC reuse.
- SPARK-35182 Support driver-owned on-demand PVC
- SPARK-35416 Support PersistentVolumeClaim Reuse

This PR aims to recover the shuffle data on those remounted PVCs. The lifecycle of PVCs are tied to the one of Spark jobs. Since this is K8s specific feature, `ShuffleDataIO` plugin is used.

### Why are the changes needed?

Although Pod is killed, we can remount PVCs and recover some data from it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the newly added test cases.

Closes #32730 from dongjoon-hyun/SPARK-RECOVER-SHUFFLE-DATA.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 16:06:58 -07:00
Kent Yao bc1edba8f6 [SPARK-35692][K8S] Use AtomicInteger for executor id generating
### What changes were proposed in this pull request?

AtomicInteger is enough for executor ids, in this PR, we use it to replace AtomicLong like other cluster managers, e.g. yarn, standalone

### Why are the changes needed?

See the discussion here https://github.com/apache/spark/pull/32610#discussion_r648007320

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

pass CI with existing tests

Closes #32837 from yaooqinn/SPARK-35692.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 13:42:07 -07:00
Kent Yao b4b78ce265 [SPARK-32975][K8S][FOLLOWUP] Avoid None.get exception
### What changes were proposed in this pull request?

A follow-up for SPARK-32975 to avoid unexpected the `None.get` exception

Run SparkPi with docker desktop, as podName is an option, we will got
```logtalk
21/06/09 01:09:12 ERROR Utils: Uncaught exception in thread main
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1(ExecutorPodsAllocator.scala:110)
	at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1417)
	at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.start(ExecutorPodsAllocator.scala:111)
	at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:99)
	at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:581)
	at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2686)
	at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:948)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:942)
	at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30)
	at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
	at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955)
	at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
	at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
	at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
```

### Why are the changes needed?

fix a regression

### Does this PR introduce _any_ user-facing change?

no
### How was this patch tested?

Manual.

Closes #32830 from yaooqinn/SPARK-32975.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-10 13:39:39 -07:00
Chris Wu 497c80a1ad [SPARK-32975][K8S] Add config for driver readiness timeout before executors start
### What changes were proposed in this pull request?
Add a new config that controls the timeout of waiting for driver pod's readiness before allocating executor pods. This wait only happens once on application start.

### Why are the changes needed?
The driver's headless service can be resolved by DNS only after the driver pod is ready. If the executor tries to connect to the headless service before driver pod is ready, it will hit UnkownHostException and get into error state but will not be restarted. **This case usually happens when the driver pod has sidecar containers but hasn't finished their creation when executors start.** So basically there is a race condition. This issue can be mitigated by tweaking this config.

### Does this PR introduce _any_ user-facing change?
A new config `spark.kubernetes.allocation.driver.readinessTimeout` added.

### How was this patch tested?
Exisiting tests.

Closes #32752 from cchriswu/SPARK-32975-fix.

Lead-authored-by: Chris Wu <wucaowei19@gmail.com>
Co-authored-by: Chris Wu <wcaowei@vmware.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-04 06:59:49 -07:00
Dongjoon Hyun 4f0db872a0 [SPARK-35416][K8S][FOLLOWUP] Use Set instead of ArrayBuffer
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/32564 .

### Why are the changes needed?

To use Set instead of ArrayBuffer and add a return type.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #32758 from dongjoon-hyun/SPARK-35416-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-06-03 10:41:11 -05:00
Kousuke e04883880f [SPARK-35586][K8S][TESTS] Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests
### What changes were proposed in this pull request?

This PR set a default value for `spark.kubernetes.test.sparkTgz` in `kubernetes/integration-tests/pom.xml` for Kubernetes integration tests.

### Why are the changes needed?

In the current master, running the integration tests with the following command will fail because there is no default value set for the property.
```
build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes -Pkubernetes-integration-tests -Psparkr  -pl resource-managers/kubernetes/integration-tests integration-test
```
```
+ mkdir -p /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
+ tar -xzvf --test-exclude-tags --strip-components=1 -C /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked
tar (child): --test-exclude-tags: Cannot open: No such file or directory
tar (child): Error is not recoverable: exiting now
tar: Child returned status 2
tar: Error is not recoverable: exiting now
[ERROR] Command execution failed.
```

According to `setup-integration-test-env.sh`, `N/A` is intended as the default value so this PR choose it.
```
SPARK_TGZ="N/A"
MVN="$TEST_ROOT_DIR/build/mvn"
EXCLUDE_TAGS=""
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Build and tests successfully finish with the command shown above.

Closes #32722 from sarutak/fix-pom-for-kube-integ.

Authored-by: Kousuke <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-06-01 00:40:02 -07:00
Kent Yao 96b0548ab6 [SPARK-35493][K8S] make spark.blockManager.port fallback for spark.driver.blockManager.port as same as other cluster managers
### What changes were proposed in this pull request?

`spark.blockManager.port` does not work for k8s driver pods now, we should make it work as other cluster managers.

### Why are the changes needed?

`spark.blockManager.port` should be able to work for spark driver pod

### Does this PR introduce _any_ user-facing change?

yes, `spark.blockManager.port` will be respect iff it is present  && `spark.driver.blockManager.port` is absent

### How was this patch tested?

new tests

Closes #32639 from yaooqinn/SPARK-35493.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-23 08:07:57 -07:00
Kent Yao d957426351 [SPARK-35482][K8S] Use spark.blockManager.port not the wrong spark.blockmanager.port in BasicExecutorFeatureStep
### What changes were proposed in this pull request?

most spark conf keys are case sensitive, including `spark.blockManager.port`, we can not get the correct port number with `spark.blockmanager.port`.

This PR changes the wrong key to `spark.blockManager.port` in `BasicExecutorFeatureStep`.

This PR also ensures a fast fail when the port value is invalid for executor containers. When 0 is specified(it is valid as random port, but invalid as a k8s request), it should not be put in the `containerPort` field of executor pod desc. We do not expect executor pods to continuously fail to create because of invalid requests.

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new tests

Closes #32621 from yaooqinn/SPARK-35482.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-21 08:27:49 -07:00
Ashray Jain de59e01aa4 [SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by Spark as immutable
Kubernetes supports marking secrets and config maps as immutable to gain performance.

https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable
https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable

For K8s clusters that run many thousands of Spark applications, this can yield significant reduction in load on the kube-apiserver.

From the K8s docs:

> For clusters that extensively use Secrets (at least tens of thousands of unique Secret to Pod mounts), preventing changes to their data has the following advantages:
> - protects you from accidental (or unwanted) updates that could cause applications outages
> - improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for secrets marked as immutable.

For any secrets and config maps we create in Spark that are immutable, we could mark them as immutable by including the following when building the secret/config map
```
.withImmutable(true)
```
This feature has been supported in K8s as beta since K8s 1.19 and as GA since K8s 1.21

### What changes were proposed in this pull request?
All K8s secrets and config maps created by Spark are marked "immutable".

### Why are the changes needed?
See description above.

### Does this PR introduce _any_ user-facing change?
Don't think so

### How was this patch tested?
Augmented existing unit tests.

Closes #32588 from ashrayjain/patch-1.

Authored-by: Ashray Jain <ashrayjain@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-19 21:25:33 -07:00
Dongjoon Hyun 4c015555da [SPARK-35416][K8S] Support PersistentVolumeClaim Reuse
### What changes were proposed in this pull request?

This PR aims to add a new configuration, `spark.kubernetes.driver.reusePersistentVolumeClaim`, to reuse driver-owned `PersistentVolumeClaims` of the **deleted** executor pods.

Note also that `driver-owned PersistentVolumeClaims` is controlled by `spark.kubernetes.driver.ownPersistentVolumeClaim` which is recently added.

### Why are the changes needed?

PVC creations take some times. This feature can reduce it by reusing it.

For example, we can start `Pi` app with two executors with PVCs.
```
$ k logs -f pi | grep ExecutorPodsAllocator
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 0.
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 0 PVCs
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-1-pvc-0 with StorageClass scaleio
21/05/16 23:36:33 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-2-pvc-0 with StorageClass scaleio
```

After killing one executor, Spark is trying to look up the reusable PVCs, but the dead-executor's PVC may not returned yet because K8s works asynchronously. In this case, Spark is trying to create a new PVC as a normal operation.
```
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1.
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 2 PVCs
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-3-pvc-0 with StorageClass scaleio
```

After killing another executor, Spark found one reusable PVC, `pi-exec-1-pvc-0`, and reuse it.
```
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1.
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Found 1 reusable PVCs from 3 PVCs
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim pi-exec-1-pvc-0
```

In this case, we can easily notice the remounted PVC because `ClaimName`, `pi-exec-1-pvc-0`, doesn't have the prefix of pod name, `pi-exec-4`.
```
$ k describe pod pi-exec-4 | grep pi-exec-1-pvc-0
    ClaimName:  pi-exec-1-pvc-0
```

### Does this PR introduce _any_ user-facing change?

Yes, but this is a new feature which is disabled by the new conf.

### How was this patch tested?

Pass the CIs with the newly added test case.

K8S IT test also passed.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 7 seconds.
Total number of tests run: 26
Suites: completed 2, aborted 0
Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  24:14 min
[INFO] Finished at: 2021-05-16T17:24:40-07:00
[INFO] ------------------------------------------------------------------------
```

Closes #32564 from dongjoon-hyun/SPARK-35416.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-17 00:20:48 -07:00
Holden Karau 160b3bee71 [SPARK-34764][CORE][K8S][UI] Propagate reason for exec loss to Web UI
### What changes were proposed in this pull request?

Adds the exec loss reason to the Spark web UI & in doing so also fix the Kube integration to pass exec loss reason into core.

UI change:

![image](https://user-images.githubusercontent.com/59893/117045762-b975ba80-acc4-11eb-9679-8edab3cfadc2.png)

### Why are the changes needed?

Debugging Spark jobs is *hard*, making it clearer why executors have exited could help.

### Does this PR introduce _any_ user-facing change?

Yes a new column on the executor page.

### How was this patch tested?

K8s unit test updated to validate exec loss reasons are passed through regardless of exec alive state, manual testing to validate the UI.

Closes #32436 from holdenk/SPARK-34764-propegate-reason-for-exec-loss.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-05-13 16:02:31 -07:00
Dongjoon Hyun dd5464976f [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
### What changes were proposed in this pull request?

This PR aims to unify two K8s version variables in two `pom.xml`s into one. `kubernetes-client.version` is correct because the artifact ID is `kubernetes-client`.

```
kubernetes.client.version (kubernetes/core module)
kubernetes-client.version (kubernetes/integration-test module)
```

### Why are the changes needed?

Having two variables for the same value is confusing and inconvenient when we upgrade K8s versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs. (The compilation test passes are enough.)

Closes #32531 from dongjoon-hyun/SPARK-35394.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-13 00:40:53 -07:00
“attilapiros” 8b94eff1ca [SPARK-34736][K8S][TESTS] Kubernetes and Minikube version upgrade for integration tests
### What changes were proposed in this pull request?

This PR upgrades Kubernetes and Minikube version for integration tests and removes/updates the old code for this new version.

Details of this changes:

- As [discussed in the mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html): updating Minikube version from v0.34.1 to v1.7.3 and kubernetes version from v1.15.12 to v1.17.3.
- making Minikube version checked and fail with an explanation when the test is started with on a version <  v1.7.3.
- removing minikube status checking code related to old Minikube versions
- in the Minikube backend using fabric8's `Config.autoConfigure()` method to configure the kubernetes client to use the `minikube` k8s context (like it was in [one of the Minikube's example](https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/kubectl/equivalents/ConfigUseContext.java#L36))
- Introducing `persistentVolume` test tag: this would be a temporary change to skip PVC tests in the Kubernetes integration test, as currently the PCV tests are blocking the move to Docker as Minikube's driver (for details please check https://issues.apache.org/jira/browse/SPARK-34738).

### Why are the changes needed?

With the current suggestion one can run into several problems without noticing the Minikube/kubernetes version is the problem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It was tested on Mac with [this script](https://gist.github.com/attilapiros/cd58a16bdde833c80c5803c337fffa94#file-check_minikube_versions-zsh) which installs each Minikube versions from v1.7.2 (including this version to test the negative case of the version check) and runs the integration tests.

It was started with:
```
./check_minikube_versions.zsh > test_log 2>&1
```

And there was only one build failure the rest was successful:

```
$ grep "BUILD SUCCESS" test_log | wc -l
      26
$ grep "BUILD FAILURE" test_log | wc -l
       1
```

It was for Minikube v1.7.2  and the log is:

```
KubernetesSuite:
*** RUN ABORTED ***
  java.lang.AssertionError: assertion failed: Unsupported Minikube version is detected: minikube version: v1.7.2.For integration testing Minikube version 1.7.3 or greater is expected.
  at scala.Predef$.assert(Predef.scala:223)
  at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:52)
  at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:163)
  at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
  at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
  at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:43)
  at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:273)
  at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:271)
  ...
```

Moreover I made a test with having multiple k8s cluster contexts, too.

Closes #31829 from attilapiros/SPARK-34736.

Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
2021-05-10 18:56:52 +02:00
Dongjoon Hyun a0c76a8755 [SPARK-35319][K8S][BUILD] Upgrade K8s client to 5.3.1
### What changes were proposed in this pull request?

This PR aims to upgrade K8s client to 5.3.1.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.3.1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

K8s IT is manually tested like the following.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 18 minutes, 33 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.2.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.959 s]
[INFO] Spark Project Tags ................................. SUCCESS [  7.830 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  3.457 s]
[INFO] Spark Project Networking ........................... SUCCESS [  5.496 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  3.239 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  9.006 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  2.422 s]
[INFO] Spark Project Core ................................. SUCCESS [02:17 min]
[INFO] Spark Project Kubernetes Integration Tests ......... SUCCESS [21:05 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  23:59 min
[INFO] Finished at: 2021-05-05T11:59:19-07:00
[INFO] ------------------------------------------------------------------------
```

Closes #32443 from dongjoon-hyun/SPARK-35319.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 19:50:37 -07:00
Dongjoon Hyun 4e8701a77d [SPARK-35280][K8S] Promote KubernetesUtils to DeveloperApi
### What changes were proposed in this pull request?

Since SPARK-22757, `KubernetesUtils` has been used as an important utility class by all K8s modules and `ExternalClusterManager`s. This PR aims to promote `KubernetesUtils` to `DeveloperApi` in order to maintain it officially in a backward compatible way at Apache Spark 3.2.0.

### Why are the changes needed?

Apache Spark 3.1.1 makes `Kubernetes` module GA and provides an extensible external cluster manager framework. To have `ExternalClusterManager` for K8s environment, `KubernetesUtils` class is crucial and needs to be stable. By promoting to a subset of K8s developer API, we can maintain these more sustainable way and give a better and stable functionality to K8s users.

In this PR, `Since` annotations denote the last function signature changes because these are going to become public at Apache Spark 3.2.0.

| Version | Function Name |
|-|-|
| 2.3.0 | parsePrefixedKeyValuePairs |
| 2.3.0 | requireNandDefined |
| 2.3.0 | parsePrefixedKeyValuePairs |
| 2.4.0 | parseMasterUrl |
| 3.0.0 | requireBothOrNeitherDefined |
| 3.0.0 | requireSecondIfFirstIsDefined |
| 3.0.0 | selectSparkContainer |
| 3.0.0 | formatPairsBundle |
| 3.0.0 | formatPodState |
| 3.0.0 | containersDescription |
| 3.0.0 | containerStatusDescription |
| 3.0.0 | formatTime |
| 3.0.0 | uniqueID |
| 3.0.0 | buildResourcesQuantities |
| 3.0.0 | uploadAndTransformFileUris |
| 3.0.0 | uploadFileUri |
| 3.0.0 | requireBothOrNeitherDefined |
| 3.0.0 | buildPodWithServiceAccount |
| 3.0.0 | isLocalAndResolvable |
| 3.1.1 | renameMainAppResource |
| 3.1.1 | addOwnerReference |
| 3.2.0 | loadPodFromTemplate |

### Does this PR introduce _any_ user-facing change?

Yes, but this is new API additions.

### How was this patch tested?

Pass the CIs.

Closes #32406 from dongjoon-hyun/SPARK-35280.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-30 11:39:18 -07:00
Dongjoon Hyun 6ab00488d0 [SPARK-35182][K8S] Support driver-owned on-demand PVC
### What changes were proposed in this pull request?

This PR aims to support driver-owned on-demand PVC(Persistent Volume Claim)s. It means dynamically-created PVCs will have the `ownerReference` to `driver` pod instead of `executor` pod.

### Why are the changes needed?

This allows K8s backend scheduler can reuse this later.

**BEFORE**
```
$ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
...
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: tpcds-pvc-exec-1
```

**AFTER**
```
$ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
...
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: tpcds-pvc
```

### Does this PR introduce _any_ user-facing change?

No. (The default is `false`)

### How was this patch tested?

Manually check the above and pass K8s IT.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 16 minutes, 40 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32288 from dongjoon-hyun/SPARK-35182.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-22 17:03:19 -07:00
Shardul Mahadik 83f753e4e1 [SPARK-34472][YARN] Ship ivySettings file to driver in cluster mode
### What changes were proposed in this pull request?

In YARN, ship the `spark.jars.ivySettings` file to the driver when using `cluster` deploy mode so that `addJar` is able to find it in order to resolve ivy paths.

### Why are the changes needed?

SPARK-33084 introduced support for Ivy paths in `sc.addJar` or Spark SQL `ADD JAR`. If we use a custom ivySettings file using `spark.jars.ivySettings`, it is loaded at b26e7b510b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L1280). However, this file is only accessible on the client machine. In YARN cluster mode, this file is not available on the driver and so `addJar` fails to find it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests to verify that the `ivySettings` file is localized by the YARN client and that a YARN cluster mode application is able to find to load the `ivySettings` file.

Closes #31591 from shardulm94/SPARK-34472.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-04-20 13:35:57 -05:00
SaurabhChawla 1e64b4fa27 [SPARK-34877][CORE][YARN] Add the code change for adding the Spark AM log link in spark UI
### What changes were proposed in this pull request?
On Running Spark job with yarn and deployment mode as client, Spark Driver and Spark Application master launch in two separate containers. In various scenarios there is need to see Spark Application master logs to see the resource allocation, Decommissioning status and other information shared between yarn RM and Spark Application master.

In Cluster mode Spark driver and Spark AM is on same container, So Log link of the driver already there to see the logs in Spark UI

This PR is for adding the spark AM log link for spark job running in the client mode for yarn. Instead of searching the container id and then find the logs. We can directly check in the Spark UI

This change is only for showing the AM log links in the Client mode when resource manager is yarn.

### Why are the changes needed?
Till now the only way to check this by finding the container id of the AM and check the logs either using Yarn utility or Yarn RM Application History server.

This PR is for adding the spark AM log link for spark job running in the client mode for yarn. Instead of searching the container id and then find the logs. We can directly check in the Spark UI

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added the unit test also checked the Spark UI
**In Yarn Client mode**
Before Change

![image](https://user-images.githubusercontent.com/34540906/112644861-e1733200-8e6b-11eb-939b-c76ca9902a4e.png)

After the Change - The AM info is there

![image](https://user-images.githubusercontent.com/34540906/115264198-b7075280-a153-11eb-98f3-2aed66ffad2a.png)

AM Log

![image](https://user-images.githubusercontent.com/34540906/112645680-c0f7a780-8e6c-11eb-8b82-4ccc0aee927b.png)

**In Yarn Cluster Mode**  - The AM log link will not be there

![image](https://user-images.githubusercontent.com/34540906/112649512-86900980-8e70-11eb-9b37-69d5c4b53ffa.png)

Closes #31974 from SaurabhChawla100/SPARK-34877.

Authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-04-20 08:56:07 -05:00
Dongjoon Hyun 00f06dd267 [SPARK-35131][K8S] Support early driver service clean-up during app termination
### What changes were proposed in this pull request?

This PR aims to support a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, to clean up `Driver Service` resource during app termination.

### Why are the changes needed?

The K8s service is one of the important resources and sometimes it's controlled by quota.
```
$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    1     3
```

Apache Spark creates a service for driver whose lifecycle is the same with driver pod.
It means a new Spark job submission fails if the number of completed Spark jobs equals the number of service quota.

**BEFORE**
```
$ k get pod
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver   0/1     Completed   0          31m
org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver   0/1     Completed   0          78s

$ k get svc
NAME                                                            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
kubernetes                                                      ClusterIP   10.96.0.1    <none>        443/TCP                      80m
org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   31m
org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   80s

$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    3     3

$ bin/spark-submit...
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException:
Failure executing: POST at: https://192.168.64.50:8443/api/v1/namespaces/default/services.
Message: Forbidden! User minikube doesn't have permission.
services "org-apache-spark-examples-sparkpi-843f6978e722819c-driver-svc" is forbidden:
exceeded quota: service, requested: services=1, used: services=3, limited: services=3.
```

**AFTER**
```
$ k get pod
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-23d5f278e77731a7-driver   0/1     Completed   0          26s
org-apache-spark-examples-sparkpi-d1292278e7768ed4-driver   0/1     Completed   0          67s
org-apache-spark-examples-sparkpi-e5bedf78e776ea9d-driver   0/1     Completed   0          44s

$ k get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   172m

$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    1     3
```

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, and enables it by default.
The change is documented at the migration guide.

### How was this patch tested?

Pass the CIs.

This is tested with K8s IT manually.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 9 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32226 from dongjoon-hyun/SPARK-35131.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-19 12:11:08 -07:00
Dongjoon Hyun 425dc58c02 [SPARK-35125][K8S] Upgrade K8s client to 5.3.0 to support K8s 1.20
### What changes were proposed in this pull request?

Although AS-IS master branch already works with K8s 1.20, this PR aims to upgrade K8s client to 5.3.0 to support K8s 1.20 officially.
- https://github.com/fabric8io/kubernetes-client#compatibility-matrix

The following are the notable breaking API changes.

1. Remove Doneable (5.0+):
    - https://github.com/fabric8io/kubernetes-client/pull/2571
2. Change Watcher.onClose signature (5.0+):
    - https://github.com/fabric8io/kubernetes-client/pull/2616
3. Change Readiness (5.1+)
    - https://github.com/fabric8io/kubernetes-client/pull/2796

### Why are the changes needed?

According to the compatibility matrix, this makes Apache Spark and its external cluster manager extension support all K8s 1.20 features officially for Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

Yes, this is a dev dependency change which affects K8s cluster extension users.

### How was this patch tested?

Pass the CIs.

This is manually tested with K8s IT.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 44 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32221 from dongjoon-hyun/SPARK-K8S-530.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-19 07:39:38 -07:00
“attilapiros” 8a3815f722 [SPARK-34789][TEST] Introduce Jetty based construct for integration tests where HTTP server is used
### What changes were proposed in this pull request?

Introducing a new test construct:
```
  withHttpServer() { baseURL =>
    ...
  }
```
Which starts and stops a Jetty server to serve files via HTTP.

Moreover this PR uses this new construct in the test `Run SparkRemoteFileTest using a remote data file`.

### Why are the changes needed?

Before this PR github URLs was used like "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
This connects two Spark version in an unhealthy way like connecting the "master" branch which is moving part with the committed test code which is a non-moving (as it might be even released).
So this way a test running for an earlier version of Spark expects something (filename, content, path) from a the latter release and what is worse when the moving version is changed the earlier test will break.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test.

Closes #31935 from attilapiros/SPARK-34789.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-14 21:22:52 -07:00
HyukjinKwon a153efa643 [SPARK-35002][YARN][TESTS][FOLLOW-UP] Fix java.net.BindException in MiniYARNCluster
### What changes were proposed in this pull request?

This PR fixes two tests below:

https://github.com/apache/spark/runs/2320161984

```
[info] YarnShuffleIntegrationSuite:
[info] org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite *** ABORTED *** (228 milliseconds)
[info]   org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.webapp.WebAppException: Error starting http server
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:373)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:128)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:503)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:322)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:95)
...
[info]   Cause: java.net.BindException: Port in use: fv-az186-831:0
[info]   at org.apache.hadoop.http.HttpServer2.constructBindException(HttpServer2.java:1231)
[info]   at org.apache.hadoop.http.HttpServer2.bindForSinglePort(HttpServer2.java:1253)
[info]   at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:1316)
[info]   at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:1167)
[info]   at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:449)
[info]   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1247)
[info]   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1356)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:365)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:128)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:503)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:322)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:95)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
...
```

https://github.com/apache/spark/runs/2323342094

```
[info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret started
[error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret failed: java.lang.AssertionError: Connecting to /10.1.0.161:39895 timed out (120000 ms), took 120.081 sec
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret(ExternalShuffleSecuritySuite.java:85)
[error]     ...
[info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId started
[error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId failed: java.lang.AssertionError: Connecting to /10.1.0.198:44633 timed out (120000 ms), took 120.08 sec
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId(ExternalShuffleSecuritySuite.java:76)
[error]     ...
[info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid started
[error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid failed: java.io.IOException: Connecting to /10.1.0.119:43575 timed out (120000 ms), took 120.089 sec
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
[error]     at org.apache.spark.network.shuffle.ExternalBlockStoreClient.registerWithShuffleServer(ExternalBlockStoreClient.java:211)
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.validate(ExternalShuffleSecuritySuite.java:108)
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid(ExternalShuffleSecuritySuite.java:68)
[error]     ...
[info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption started
[error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption failed: java.io.IOException: Connecting to /10.1.0.248:35271 timed out (120000 ms), took 120.014 sec
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
[error]     at org.apache.spark.network.shuffle.ExternalBlockStoreClient.registerWithShuffleServer(ExternalBlockStoreClient.java:211)
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.validate(ExternalShuffleSecuritySuite.java:108)
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption(ExternalShu
```

For Yarn cluster suites, its difficult to fix. This PR makes it skipped if it fails to bind.
For shuffle related suites, it uses local host

### Why are the changes needed?

To make the tests stable

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Its tested in GitHub Actions: https://github.com/HyukjinKwon/spark/runs/2340210765

Closes #32126 from HyukjinKwon/SPARK-35002-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-04-14 17:13:48 +08:00
Dongjoon Hyun a42dc93a2a [SPARK-34948][K8S] Add ownerReference to executor configmap to fix leakages
### What changes were proposed in this pull request?

This PR aims to add `ownerReference` to the executor ConfigMap to fix leakage.

### Why are the changes needed?

SPARK-30985 maintains the executor config map explicitly inside Spark. However, this config map can be leaked when Spark drivers die accidentally or are killed by K8s. We need to add `ownerReference` to make K8s do the garbage collection these automatically.

The number of ConfigMap is one of the resource quota. So, the leaked configMaps currently cause Spark jobs submission failures.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and check manually.

K8s IT is tested manually.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 2 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

**BEFORE**
```
$ k get cm spark-exec-450b417895b3b2c7-conf-map -oyaml | grep ownerReferences
```

**AFTER**
```
$ k get cm spark-exec-bb37a27895b1c26c-conf-map -oyaml | grep ownerReferences
        f:ownerReferences:
```

Closes #32042 from dongjoon-hyun/SPARK-34948.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-03 00:00:17 -07:00
Erik Krogen 9f065ff375 [SPARK-34828][YARN] Make shuffle service name configurable on client side and allow for classpath-based config override on server side
### What changes were proposed in this pull request?
Add a new config, `spark.shuffle.service.name`, which allows for Spark applications to look for a YARN shuffle service which is defined at a name other than the default `spark_shuffle`.

Add a new config, `spark.yarn.shuffle.service.metrics.namespace`, which allows for configuring the namespace used when emitting metrics from the shuffle service into the NodeManager's `metrics2` system.

Add a new mechanism by which to override shuffle service configurations independently of the configurations in the NodeManager. When a resource `spark-shuffle-site.xml` is present on the classpath of the shuffle service, the configs present within it will be used to override the configs coming from `yarn-site.xml` (via the NodeManager).

### Why are the changes needed?
There are two use cases which can benefit from these changes.

One use case is to run multiple instances of the shuffle service side-by-side in the same NodeManager. This can be helpful, for example, when running a YARN cluster with a mixed workload of applications running multiple Spark versions, since a given version of the shuffle service is not always compatible with other versions of Spark (e.g. see SPARK-27780). With this PR, it is possible to run two shuffle services like `spark_shuffle` and `spark_shuffle_3.2.0`, one of which is "legacy" and one of which is for new applications. This is possible because YARN versions since 2.9.0 support the ability to run shuffle services within an isolated classloader (see YARN-4577), meaning multiple Spark versions can coexist.

Besides this, the separation of shuffle service configs into `spark-shuffle-site.xml` can be useful for administrators who want to change and/or deploy Spark shuffle service configurations independently of the configurations for the NodeManager (e.g., perhaps they are owned by two different teams).

### Does this PR introduce _any_ user-facing change?
Yes. There are two new configurations related to the external shuffle service, and a new mechanism which can optionally be used to configure the shuffle service. `docs/running-on-yarn.md` has been updated to provide user instructions; please see this guide for more details.

### How was this patch tested?
In addition to the new unit tests added, I have deployed this to a live YARN cluster and successfully deployed two Spark shuffle services simultaneously, one running a modified version of Spark 2.3.0 (which supports some of the newer shuffle protocols) and one running Spark 3.1.1. Spark applications of both versions are able to communicate with their respective shuffle services without issue.

Closes #31936 from xkrogen/xkrogen-SPARK-34828-shufflecompat-config-from-classpath.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-03-30 10:09:00 -05:00
“attilapiros” c8b7a09d39 [SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods output
### What changes were proposed in this pull request?

Extending "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with `kubectl describe pods` output for the failed test.

### Why are the changes needed?

PR builds frequently fails as the k8s integration tests are very flaky now in Amplab Jenkins environment.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Locally by making temporary one of the test fail. The output is:

```
21/03/25 16:55:16.722 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite:

===== EXTRA LOGS FOR THE FAILED TEST

21/03/25 16:55:17.167 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: BEGIN driver DESCRIBE POD
Name:         spark-test-app-a2b03971b7c049e8a2629f6a3198842b
Namespace:    35bdb17e308743afaec17538f89a7c3e
Priority:     0
Node:         minikube/192.168.64.119
Start Time:   Thu, 25 Mar 2021 16:52:10 +0100
Labels:       spark-app-locator=75f695685ae44314a99ec13bb39332bc
              spark-app-selector=spark-150230742d364a77927a08eed0222065
              spark-role=driver
Annotations:  <none>
Status:       Succeeded
IP:           172.17.0.4
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://d6d27b0551060d9b094f12d1e232dfb5ae78ce38559680c7126c548996da4d95
    Image:         docker.io/kubespark/spark:3.2.0-SNAPSHOT_9575B805-9CB0-4A16-8A31-AA2F8DDA8EE5
    Image ID:      docker://sha256:3fc556c73a0d5187b5a14dbdc2f69ef292e60b544b4b4d3715f6749417c20918
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 25 Mar 2021 16:52:11 +0100
      Finished:     Thu, 25 Mar 2021 16:52:20 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  1408Mi
    Requests:
      cpu:     1
      memory:  1408Mi
    Environment:
      SPARK_USER:                 attilazsoltpiros
      SPARK_APPLICATION_ID:       spark-150230742d364a77927a08eed0222065
      SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
      SPARK_LOCAL_DIRS:           /var/data/spark-dab6f1c9-e538-40c8-a7d9-3e88f9b82cfa
      SPARK_CONF_DIR:             /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume-driver (rw)
      /var/data/spark-dab6f1c9-e538-40c8-a7d9-3e88f9b82cfa from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nmfwl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  spark-local-dir-1:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  spark-conf-volume-driver:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-drv-c60832786a15ffbe-conf-map
    Optional:  false
  default-token-nmfwl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nmfwl
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  3m7s  default-scheduler  Successfully assigned 35bdb17e308743afaec17538f89a7c3e/spark-test-app-a2b03971b7c049e8a2629f6a3198842b to minikube
  Normal  Pulled     3m7s  kubelet, minikube  Container image "docker.io/kubespark/spark:3.2.0-SNAPSHOT_9575B805-9CB0-4A16-8A31-AA2F8DDA8EE5" already present on machine
  Normal  Created    3m7s  kubelet, minikube  Created container spark-kubernetes-driver
  Normal  Started    3m6s  kubelet, minikube  Started container spark-kubernetes-driver
21/03/25 16:55:17.168 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: END driver DESCRIBE POD

```

Closes #31962 from attilapiros/SPARK-34869.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-28 09:44:56 -07:00
hongdongdong 985c653b20 [SPARK-33720][K8S] Support submit to k8s only with token
### What changes were proposed in this pull request?

Support submit to k8s only with token.

### Why are the changes needed?

Now, sumbit to k8s always need oauth files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Before, submit job out of k8s cluster without correct ca.crt, we may get this exception:
```
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:439)
        at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:306)
        at sun.security.validator.Validator.validate(Validator.java:271)
        at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:312)
```
When set spark.kubernetes.trust.certificates = true, we can submit only with correct token, no need to config ca.crt in local env.
Submit as:
```
 bin/spark-submit \
     --master $master \
     --name pi \
     --deploy-mode cluster \
     --conf spark.kubernetes.container.image=$image \
     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
     --conf spark.kubernetes.authenticate.submission.oauthToken=$clusterToken \
     --conf spark.kubernetes.trust.certificates=true \
     local:///opt/spark/examples/src/main/python/pi.py 200
```

Closes #30684 from hddong/trust-certs.

Authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-23 22:07:27 -07:00
Yikun Jiang 31da90762e [SPARK-34820][K8S][R] add apt-update before gnupg install
### What changes were proposed in this pull request?
We added the gnupg installation in https://github.com/apache/spark/pull/30130 , we should do apt update before gnupg isntallation, otherwise we will get a fetch error when package is updated.

See more in:
[1] http://apache-spark-developers-list.1001551.n3.nabble.com/K8s-Integration-test-is-unable-to-run-because-of-the-unavailable-libs-td30986.html

### Why are the changes needed?
add a apt-update cmd before gnupg installation to avoid invaild package cache list.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
K8s Integration test passed

Closes #31923 from Yikun/SPARK-34820.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-22 10:13:31 -07:00
Dongjoon Hyun 2fa792aa64 [SPARK-34783][K8S] Support remote template files
### What changes were proposed in this pull request?

This PR aims to support remote driver/executor template files.

### Why are the changes needed?

Currently, `KubernetesUtils.loadPodFromTemplate` supports only local files.

With this PR, we can do the following.
```bash
bin/spark-submit \
...
-c spark.kubernetes.driver.podTemplateFile=s3a://dongjoon/driver.yml \
-c spark.kubernetes.executor.podTemplateFile=s3a://dongjoon/executor.yml \
...
```

### Does this PR introduce _any_ user-facing change?

Yes, this is an improvement.

### How was this patch tested?

Manual testing.

Closes #31877 from dongjoon-hyun/SPARK-34783-2.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-19 08:52:42 -07:00
“attilapiros” 124b5af114 [SPARK-34732][K8S][TESTS] Fix IndexOutOfBoundsException in logForFailedTest when driver is not started
### What changes were proposed in this pull request?

Fixing `IndexOutOfBoundsException` in `logForFailedTest` method when driver is not started.

### Why are the changes needed?

Before this PR when the driver is not started an `IndexOutOfBoundsException` as the first item is tried to be accessed from an empty list:

```
- PVs with local storage *** FAILED ***
  java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:659)
  at java.util.ArrayList.get(ArrayList.java:435)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.logForFailedTest(KubernetesSuite.scala:83)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:181)
  at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
  at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
  at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
  ...
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Running integration tests.
After this changes the above error become:

```
- PVs with local storage *** FAILED ***
  java.io.IOException: No such file or directory
  at java.io.UnixFileSystem.createFileExclusively(Native Method)
  at java.io.File.createTempFile(File.java:2026)
  at org.apache.spark.deploy.k8s.integrationtest.Utils$.createTempFile(Utils.scala:103)
  at org.apache.spark.deploy.k8s.integrationtest.PVTestsSuite.$anonfun$$init$$1(PVTestsSuite.scala:135)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
  ...
```

Closes #31824 from attilapiros/SPARK-34732.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-13 15:28:02 -08:00
“attilapiros” 6c5322de61 [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator
### What changes were proposed in this pull request?

This PR modifies the POD allocator to use the scheduler backend to get the known executors and remove those from the pending and newly created list.

This is different from the normal `ExecutorAllocationManager` requested killing of executors where the  `spark.dynamicAllocation.executorIdleTimeout` is used.
In this case POD allocator kills the executors which  should be only responsible for terminating not satisfied POD allocations (new requests where no POD state is received yet and PODs in pending state).

### Why are the changes needed?

Because there is race between executor POD allocator and cluster scheduler backend.
Running several experiment during downscaling we experienced a lot of killed fresh executors wich has already running task on them.

The pattern in the log was the following (see executor 312 and TID 2079):

```
21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new total is 138)
...
21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes)
21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests (408,312,307).
...
21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 100.100.18.138: The executor with id 312 was deleted by a user or the framework.
21/02/01 15:12:04 INFO TaskSetManager: Task 2079 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

#### Manually

With this change there was no executor lost with running task on it.

##### With unit test

A new test is added and existing test is modified to check these cases.

Closes #31513 from attilapiros/SPARK-34361.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-03-02 16:58:29 -08:00
Dongjoon Hyun 4d428a821b Revert "[SPARK-32617][K8S][TESTS] Configure kubernetes client based on kubeconfig settings in kubernetes integration tests"
This reverts commit b17754a8cb.
2021-02-25 17:10:58 -08:00
HyukjinKwon 8a1e172b51 [SPARK-34520][CORE] Remove unused SecurityManager references
### What changes were proposed in this pull request?

This is kind of a followup of https://github.com/apache/spark/pull/24033 and https://github.com/apache/spark/pull/30945.
Many of references in `SecurityManager` were introduced from SPARK-1189, and related usages were removed later from https://github.com/apache/spark/pull/24033 and https://github.com/apache/spark/pull/30945. This PR proposes to remove them out.

### Why are the changes needed?

For better readability of codes.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually complied. GitHub Actions and Jenkins build should test it out as well.

Closes #31636 from HyukjinKwon/SPARK-34520.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-24 20:38:03 -08:00
“attilapiros” b17754a8cb [SPARK-32617][K8S][TESTS] Configure kubernetes client based on kubeconfig settings in kubernetes integration tests
### What changes were proposed in this pull request?

From [minikube version v1.1.0](https://github.com/kubernetes/minikube/blob/v1.1.0/CHANGELOG.md) kubectl is available as a command. So the kubeconfig settings can be accessed like:

```
$ minikube kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority: /Users/attilazsoltpiros/.minikube/ca.crt
    server: https://127.0.0.1:32788
  name: minikube
contexts:
- context:
    cluster: minikube
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt
    client-key: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key
```

Here the vm-driver was docker and the server port (https://127.0.0.1:32788) is different from the hardcoded 8443.

So the main part of this PR is introducing kubernetes client configuration based on the kubeconfig (output of `minikube kubectl config view`) in case of minikube versions after v1.1.0 and the old legacy way of configuration is also kept as minikube version should be supported back to v0.34.1 .

Moreover as the old style of config parsing pattern wasn't sufficient in my case as when the `minikube kubectl config view` is called kubectl downloading message might be included before the first key I changed it even for the existent keys to be a consistent pattern in this file.

The old parsing in an example:
```
private val HOST_PREFIX = "host:"

val hostString = statusString.find(_.contains(s"$HOST_PREFIX "))

val status1 = hostString.get.split(HOST_PREFIX)(1)
```

The new parsing:
```
private val HOST_PREFIX = "host: "

val hostString = statusString.find(_.contains(HOST_PREFIX))

hostString.get.split(HOST_PREFIX)(1)
```

So the PREFIX is extended with the extra space at the declaration (this way the two separate string operation are more safe and consistent with each other) and the replace is changed to split and getting the 2nd string from the result (which is guaranteed to contain only the text after the PREFIX when the PREFIX is a contained substring).

Finally there is tiny change in `dev-run-integration-tests.sh` to introduce `--skip-building-dependencies` which switchs off building of maven dependencies of `kubernetes-integration-tests` from the Spark project.
This could be used when only the `kubernetes-integration-tests` should be rebuilded as only the tests are modified.

### Why are the changes needed?

Kubernetes client configuration based on kubeconfig settings is more reliable and provides a solution which is minikube version independent.

### Does this PR introduce _any_ user-facing change?

No. This is only test code.

### How was this patch tested?

tested manually on two minikube versions.

Minikube  v0.34.1:

```
$ minikube version
minikube version: v0.34.1

$ grep "version\|building" resource-managers/kubernetes/integration-tests/target/integration-tests.log
20/12/12 12:52:25.135 ScalaTest-main-running-DiscoverySuite INFO Minikube: minikube version: v0.34.1
20/12/12 12:52:25.761 ScalaTest-main-running-DiscoverySuite INFO Minikube: building kubernetes config with apiVersion: v1, masterUrl: https://192.168.99.103:8443, caCertFile: /Users/attilazsoltpiros/.minikube/ca.crt, clientCertFile: /Users/attilazsoltpiros/.minikube/apiserver.crt, clientKeyFile: /Users/attilazsoltpiros/.minikube/apiserver.key
```

Minikube v1.15.1
```
$ minikube version

minikube version: v1.15.1
commit: 23f40a012abb52eff365ff99a709501a61ac5876

$ grep "version\|building" resource-managers/kubernetes/integration-tests/target/integration-tests.log

20/12/13 06:25:55.086 ScalaTest-main-running-DiscoverySuite INFO Minikube: minikube version: v1.15.1
20/12/13 06:25:55.597 ScalaTest-main-running-DiscoverySuite INFO Minikube: building kubernetes config with apiVersion: v1, masterUrl: https://192.168.64.4:8443, caCertFile: /Users/attilazsoltpiros/.minikube/ca.crt, clientCertFile: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt, clientKeyFile: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key

$ minikube kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority: /Users/attilazsoltpiros/.minikube/ca.crt
    server: https://192.168.64.4:8443
  name: minikube
contexts:
- context:
    cluster: minikube
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt
    client-key: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key
```

Closes #30751 from attilapiros/SPARK-32617.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-24 11:46:27 -08:00
Dongjoon Hyun 9942548c37 [SPARK-34487][K8S][TESTS] Use the runtime Hadoop version in K8s IT
### What changes were proposed in this pull request?

This PR aims to use the runtime Hadoop version in K8s integration test.

### Why are the changes needed?

SPARK-33212 upgrades Hadoop dependency from 3.2.0 to 3.2.2 and we will upgrade to 3.3.x+.
We had better use the runtime Hadoop version instead of having a static string.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the K8s IT.

This is tested locally like the following.
```
KubernetesSuite:
...
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
...
```

Closes #31604 from dongjoon-hyun/SPARK-34487.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-21 08:57:02 -08:00
Dongjoon Hyun 020e84e92f [SPARK-34486][K8S] Upgrade kubernetes-client to 4.13.2
### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` library from 4.12.0 to 4.13.2 for Apache Spark 3.2.0.

### Why are the changes needed?

This will bring [K8s 1.19.1](https://github.com/fabric8io/kubernetes-client/pull/2541) models officially and the latest bug fixes.

- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.1
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.2

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the K8s IT and UT.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 25 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #31602 from dongjoon-hyun/SPARK-34486.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 18:35:38 +09:00
yi.wu 546d2eb5d4 [SPARK-34384][CORE] Add missing docs for ResourceProfile APIs
### What changes were proposed in this pull request?

This PR adds missing docs for ResourceProfile related APIs. Besides, it includes a few minor changes on API:

* ResourceProfileBuilder.build -> ResourceProfileBuilder.builder()
* Provides java specific API `allSupportedExecutorResourcesJList`
* private `ResourceAllocator` since it was mistakenly exposed previously

### Why are the changes needed?

Add missing API docs

### Does this PR introduce _any_ user-facing change?

No, as Apache Spark 3.1 hasn't officially released.

### How was this patch tested?

Updated unit tests due to the signature change of `build()`.

Closes #31496 from Ngone51/resource-profile-api-cleanup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 18:29:44 +09:00
Dongjoon Hyun 484a83e73e [SPARK-34469][K8S] Ignore RegisterExecutor when SparkContext is stopped
### What changes were proposed in this pull request?

This PR aims to make `KubernetesClusterSchedulerBackend` ignore `RegisterExecutor` message when `SparkContext` is stopped already.

### Why are the changes needed?

If `SparkDriver` is terminated, the executors will be removed by K8s automatically.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the newly added test case.

Closes #31587 from dongjoon-hyun/SPARK-34469.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-19 09:36:07 -08:00
“attilapiros” 76e5d75e36 [SPARK-33763] Add metrics for better tracking of dynamic allocation
### What changes were proposed in this pull request?

This PR adds the following metrics to track executor remove reasons during dynamic allocation:
-  `numberExecutorsGracefullyDecommissioned`: number of executors which reached the finished decommissioning state and shut itself down cleanly
- `numberExecutorsDecommissionUnfinished`: executors which requested to decommission but they stopped without reaching the finished decommissioning state
- `numberExecutorsKilledByDriver`: executors killed by the driver (requested to stop)
-  `numberExecutorsExitedUnexpectedly`: executors exited without driver request

### Why are the changes needed?

For supporting monitoring of dynamic allocation better with these metrics.

### Does this PR introduce _any_ user-facing change?

Yes. The new metrics will be available for monitoring.

### How was this patch tested?

With unit and integration tests.

Finally manually checked the new metrics in jconsole:
<img width="1054" alt="jmx" src="https://user-images.githubusercontent.com/2017933/107458686-de8adf00-6b54-11eb-86f7-41faf2fb638f.png">

Closes #31450 from attilapiros/SPARK-33763-final.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-17 13:44:36 -08:00
“attilapiros” 5f91245cc2 [SPARK-34426][K8S][TESTS] Add driver and executors POD logs to integration tests log when the test fails
### What changes were proposed in this pull request?

This PR introduces a new protected method in `SparkFunSuite` which is only called when the test failed and can be used to collect logs for failed test. By this PR it is implemented in the Kubernetes tests by `KubernetesSuite` class where it collects all the POD logs and logs them out.

This unfortunately cannot be realized with a simple "after" method as in the "after" method the test outcome is not available.

Moreover this PR removes the `appLocator` as a method argument as `appLocator` is available as a member variable.

### Why are the changes needed?

Currently both the driver and executors logs are lost.

In [developer-tools](https://spark.apache.org/developer-tools.html) there is a hint:
"Getting logs from the pods and containers directly is an exercise left to the reader."

But when the test is executed by Jenkins and a failure happened we really need the POD logs to analyze problem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By integration testing. I have checked what would happen if one test fails, the output would be:

```
21/02/14 11:05:34.261 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite:

===== EXTRA LOGS FOR THE FAILED TEST

21/02/14 11:05:34.278 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: BEGIN driver POD log
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/tests/decommissioning.py
21/02/14 10:02:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting decom test
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/02/14 10:02:29 INFO SparkContext: Running Spark version 3.2.0-SNAPSHOT
21/02/14 10:02:29 INFO ResourceUtils: ==============================================================
21/02/14 10:02:29 INFO ResourceUtils: No custom resources configured for spark.driver.
21/02/14 10:02:29 INFO ResourceUtils: ==============================================================
...
21/02/14 10:03:17 INFO ShutdownHookManager: Deleting directory /var/data/spark-fa6961ed-a2c1-444c-bfeb-20e63ba0b5cf/spark-ab4b0287-6e24-4b39-837e-9b0b62c1f26f
21/02/14 10:03:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-d6b11e7d-6a03-4a1d-8559-37cb853319bf

21/02/14 11:05:34.279 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: END driver POD log
```

Closes #31561 from attilapiros/SPARK-34426.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-17 05:49:16 +09:00
Holden Karau 5248ecb5ab [SPARK-34104][SPARK-34105][CORE][K8S] Maximum decommissioning time & allow decommissioning for excludes
### What changes were proposed in this pull request?

Allow users to have Spark attempt to decommission excluded executors.
Since excluded executors may be flaky, this also adds the ability for users to specify a time limit after which a decommissioning executor will be killed by Spark.

### Why are the changes needed?

This may help prevent fetch failures from excluded executors, and also handle the situation in which executors

### Does this PR introduce _any_ user-facing change?

Yes, two new configuration flags for the behaviour.

### How was this patch tested?

Extended unit and integration tests.

Closes #31539 from holdenk/re=enable-SPARK-34104-SPARK-34105.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 18:16:09 -08:00
HyukjinKwon c8628c943c Revert "[SPARK-34104][SPARK-34105][CORE][K8S] Maximum decommissioning time & allow decommissioning for excludes"
This reverts commit 50641d2e3d.
2021-02-10 08:00:03 +09:00