### What changes were proposed in this pull request?
This PR aims to add `spark.kubernetes.driver.service.annotation`, analogous to the existing `spark.kubernetes.driver.annotation`.
### Why are the changes needed?
Annotations are used in many ways. One example is that the Prometheus monitoring system discovers metric endpoints via annotations:
- https://github.com/helm/charts/tree/master/stable/prometheus#scraping-pod-metrics-via-annotations
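For example, once this is supported, a Prometheus-style scrape annotation could be attached to the driver service the same way pod annotations are attached today; the annotation keys below are illustrative and not mandated by this PR:
```bash
bin/spark-submit \
  --master k8s://$K8S_MASTER \
  --deploy-mode cluster \
  --conf spark.kubernetes.driver.service.annotation.prometheus.io/scrape=true \
  --conf spark.kubernetes.driver.service.annotation.prometheus.io/port=4040 \
  ...
```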
### Does this PR introduce _any_ user-facing change?
Yes. The documentation is added.
### How was this patch tested?
Pass Jenkins with the updated unit tests.
Closes #28518 from dongjoon-hyun/SPARK-31696.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This patch re-obtains tokens at the start of the AM for yarn-cluster mode, if a principal and keytab are available. It basically transfers the credentials from the original user, so this patch puts the new tokens into the original user's credentials by overwriting them.
To obtain tokens from providers in the user application, this patch leverages the user classloader as the context classloader while initializing the token manager during AM startup.
### Why are the changes needed?
The submitter obtains delegation tokens for yarn-cluster mode and adds these credentials to the launch context. The AM is launched with these credentials, and the AM and driver are able to leverage these tokens.
In yarn-cluster mode, the driver is launched in the AM, which in turn initializes the token manager (while initializing SparkContext) and obtains delegation tokens (and schedules their renewal) if both principal and keytab are available.
That said, even if we provide a principal and keytab when running an application in yarn-cluster mode, the AM always starts with the initial tokens from the launch context until the token manager runs and obtains new delegation tokens.
So there's a "gap": if user code (the driver) accesses an external system that requires delegation tokens (e.g. HDFS) before initializing SparkContext, it cannot leverage the tokens the token manager will obtain. This makes the application fail if the AM is killed and relaunched *after* the initial tokens have expired.
This is even a regression; see the following code in branch-2.4:
https://github.com/apache/spark/blob/branch-2.4/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala
https://github.com/apache/spark/blob/branch-2.4/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/security/AMCredentialRenewer.scala
In Spark 2.4.x, the AM runs AMCredentialRenewer at initialization, and AMCredentialRenewer obtains tokens and merges them with the credentials provided in the AM launch context. So it guarantees fresh tokens when the driver runs.
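A sketch of the affected setup, i.e. a yarn-cluster submission with a principal and keytab whose driver code may touch HDFS before `SparkContext` is created (the class name, realm, and paths below are placeholders):
```bash
bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --principal someuser@EXAMPLE.COM \
  --keytab /path/to/someuser.keytab \
  --class com.example.LongRunningHdfsApp \
  long-running-app.jar
```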
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested with specifically crafted application (simple reproducer) - https://github.com/HeartSaVioR/spark-delegation-token-experiment/blob/master/src/main/scala/net/heartsavior/spark/example/LongRunningAppWithHDFSConfig.scala
Before this patch, the new AM attempt failed when I killed the AM after the tokens expired. After this patch, the new AM attempt runs fine.
Closes #28336 from HeartSaVioR/SPARK-31559.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@apache.org>
### What changes were proposed in this pull request?
Register `NettyMemoryMetrics` into Node Manager's metrics system through `YarnShuffleServiceMetrics`.
- usedDirectMemory
- usedHeapMemory
### Why are the changes needed?
Such that `NettyMemoryMetrics` can be exposed through Node Manager's JMX.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Update UT to ensure NettyMemoryMetrics are registered into Node Manager's metrics system.
Closes #28416 from manuzhang/spark-31611.
Authored-by: tianlzhang <tianlzhang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Update the input parameters for instantiating `RMAppManager` and `ClientRMService`
### Why are the changes needed?
For Hadoop 3.2, if `RMAppManager` is not created correctly, the following exception occurs:
```
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:135)
at org.apache.hadoop.yarn.security.YarnAuthorizationProvider.getInstance(YarnAuthorizationProvider.java:55)
at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.<init>(RMAppManager.java:117)
```
### How was this patch tested?
UTs
Closes #28456 from wang-zhun/Fix-SPARK-31235.
Authored-by: wang-zhun <wangzhun6103@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Remove unused `registeredConnections` counter from `ExternalBlockHandler#ShuffleMetrics`
This was added by SPARK-25642 at 3.0.0
- 8dd29fe36b
### Why are the changes needed?
It's the `registeredConnections` counter created in `TransportContext` that actually tracks the connection count, and the unused one is misleading for people who want to add new metrics like `registeredConnections`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add UTs to ensure all expected metrics are registered for `ExternalShuffleService` and `YarnShuffleService`
Closes #28457 from manuzhang/spark-31611-pre.
Lead-authored-by: tianlzhang <tianlzhang@ebay.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds `spark.yarn.applicationType` to identify the application type
### Why are the changes needed?
Currently, applications always default to the SPARK type.
In fact, different types of applications have different characteristics and are suitable for different scenarios, for example SPARK-SQL and SPARK-STREAMING.
I recommend distinguishing them by the parameter `spark.yarn.applicationType` so that we can more easily manage and maintain different types of applications.
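A minimal sketch of how the new parameter would be supplied (the type value is only an example; note the length limit described in the testing section below):
```bash
bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.applicationType=SPARK-SQL \
  ...
```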
### How was this patch tested?
1. Added a UT.
2. Tested by verifying the Yarn UI `ApplicationType` in the following cases:
- client and cluster mode
Additional explanation: the value cannot exceed 20 characters, and it can be empty or whitespace.
The reason is the following YARN code:
```
// org.apache.hadoop.yarn.server.resourcemanager.submitApplication.
if (submissionContext.getApplicationType() == null) {
  submissionContext
      .setApplicationType(YarnConfiguration.DEFAULT_APPLICATION_TYPE);
} else {
  // APPLICATION_TYPE_LENGTH = 20
  if (submissionContext.getApplicationType().length() > YarnConfiguration.APPLICATION_TYPE_LENGTH) {
    submissionContext.setApplicationType(submissionContext
        .getApplicationType().substring(0,
            YarnConfiguration.APPLICATION_TYPE_LENGTH));
  }
}
```
Closes #28009 from wang-zhun/SPARK-31235.
Authored-by: wang-zhun <wangzhun6103@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This PR aims to fix `spark.kubernetes.executor.podNamePrefix` to work.
### Why are the changes needed?
Currently, the configuration is broken like the following.
```
bin/spark-submit \
--master k8s://$K8S_MASTER \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
-c spark.kubernetes.container.image=spark:pr \
-c spark.kubernetes.driver.pod.name=mypod \
-c spark.kubernetes.executor.podNamePrefix=mypod \
local:///opt/spark/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar
```
**BEFORE SPARK-31601**
```
pod/mypod 1/1 Running 0 9s
pod/spark-pi-7469dd71c499fafb-exec-1 1/1 Running 0 4s
pod/spark-pi-7469dd71c499fafb-exec-2 1/1 Running 0 4s
```
**AFTER SPARK-31601**
```
pod/mypod 1/1 Running 0 8s
pod/mypod-exec-1 1/1 Running 0 3s
pod/mypod-exec-2 1/1 Running 0 3s
```
### Does this PR introduce any user-facing change?
Yes. This is a bug fix. The conf will work as described in the documentation.
### How was this patch tested?
Pass Jenkins and run the above command manually.
Closes #28401 from dongjoon-hyun/SPARK-31601.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com>
### What changes were proposed in this pull request?
We are adding a new Spark YARN configuration, `spark.yarn.populateHadoopClasspath`, to allow not populating the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`.
### Why are the changes needed?
The Spark YARN client populates extra Hadoop classpath entries from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to a YARN Hadoop cluster.
However, for a `with-hadoop` Spark build that embeds the Hadoop runtime, this can cause jar conflicts because the Spark distribution may contain a different version of the Hadoop jars.
One case we have is when a user uses an Apache Spark distribution with its own embedded Hadoop and submits a job to a Cloudera or Hortonworks YARN cluster; because of two incompatible versions of the Hadoop jars in the classpath, the job runs into errors.
Not populating the Hadoop classpath from the cluster addresses this issue.
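A minimal sketch of the intended usage with a `with-hadoop` build, assuming the config is a boolean that defaults to today's behavior of populating the classpath:
```bash
bin/spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.populateHadoopClasspath=false \
  ...
```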
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
A UT is added, but it is very hard to add a new integration test since this requires using different, incompatible versions of Hadoop.
We also manually tested this PR, and we were able to submit a Spark job using a Spark distribution built with Apache Hadoop 2.10 to CDH 5.6 without populating the CDH classpath.
Closes #28376 from dbtsai/yarn-classpath.
Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
This PR fixes a typo in deploy/yarn/LocalityPreferredContainerPlacementStrategy.scala file.
### Why are the changes needed?
To deliver a correct explanation of how the placement policy works.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
UT as specified, although this shouldn't influence any functionality since the change is in a comment.
Closes #28267 from asclepiusaka/master.
Authored-by: Cong Du <asclepius1993@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
This issue mainly shows up when you enable dynamic allocation:
because there are many executor state changes (because of executors
being requested and starting to run, and later stopped), the lifecycle
manager class could end up logging information about the same executor
multiple times, since the different events would cause the same
executor update to be present in multiple pod snapshots. On top of that,
it could end up making multiple redundant calls into the API server
for the same pod.
Another issue was when the config was set to not delete executor
pods; with dynamic allocation, that means pods keep accumulating
in the API server, and every time the full sync is done by the
polling source, all executors, even the finished ones that Spark
technically does not care about anymore, would be processed.
The change modifies the lifecycle monitor so that it:
- logs executor updates a single time, even if it shows up in
multiple snapshots, by checking whether the state change
happened before.
- marks finished-but-not-deleted-in-k8s executors with a label
so that they can be easily filtered out.
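As a hedged sketch of what that filtering enables (the inactive label name below is hypothetical, not necessarily the one this change uses), a label selector can exclude the finished-but-not-deleted pods:
```bash
kubectl get pods -n <spark-namespace> \
  -l 'spark-role=executor,spark-exec-inactive!=true'
```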
This reduces the amount of logging done by the lifecycle manager,
which is a minor thing in general since the logs are at debug level.
But it also reduces the amount of data that needs to be fetched
from the API server under certain configurations, and overall
reduces interaction with the API server when dynamic allocation is on.
There's also a change in the snapshot store to ensure that the
same subscriber is not called concurrently. That is kind of a bug,
since it means subscribers could be processing snapshots out of order,
or even that they could block multiple threads (e.g. the allocator
callback was synchronized). I actually ran into the "concurrent calls"
situation in the lifecycle manager during testing, and while it did not
seem to cause problems, it did make for some head scratching while
looking at the logs. It seemed safer to fix that.
Unit tests were updated to check for the changes. Also tested in real
cluster with dynamic allocation on.
Closes #26535 from vanzin/SPARK-29905.
Lead-authored-by: Marcelo Vanzin <vanzin@apache.org>
Co-authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR (SPARK-31394) aims to add a new feature that enables mounting of Kubernetes NFS volumes. Most of the code consists of slight modifications of the existing code for the emptyDir/hostPath/PVC support.
### Why are the changes needed?
Kubernetes supports various kinds of volumes, but Spark on Kubernetes supports only emptyDir/hostPath/PVC. By adding support for NFS, we can use Spark on Kubernetes with NFS storage.
In order to use NFS with the current Spark using PVC, the user needs to first create an empty new PVC with NFS. Kubernetes' NFS provisioner will create a new empty dir in NFS under some pre-configured dir for this PVC, for example, `/nfs/k8s/sjcho-my-notebook-pvc-dce84888-7a9d-11e6-b1ee-5254001e0c1b`. Then the user should add files to process in the newly created PVC using some file-copying job, and then run the desired Spark job using that populated PVC. And then to get the final results out, the user should run another file-copying job.
This in theory works, but for data analysis tasks, is quite cumbersome. With this change, one could simply use existing files in NFS, say `/nfs/home/sjcho/myfiles10.sstable` from the Spark job directly, and also write the results directly to some existing dir under NFS such as `/nfs/home/sjcho/output`.
This PR doesn't use any features other than the features already provided by Kubernetes itself, so there should be no compatibility issues (other than limited by k8s) between the wide variety of NFS choices. This PR merely enables an existing volume type `nfs` supported officially by Kubernetes, just like Spark is currently supporting `hostPath` and `persistentVolumeClaim` right now.
### Does this PR introduce any user-facing change?
Users can now mount NFS volumes by running commands like:
```
spark-submit \
--conf spark.kubernetes.driver.volumes.nfs.myshare.mount.path=/myshare \
--conf spark.kubernetes.driver.volumes.nfs.myshare.mount.readOnly=false \
--conf spark.kubernetes.driver.volumes.nfs.myshare.options.server=nfs.example.com \
--conf spark.kubernetes.driver.volumes.nfs.myshare.options.path=/storage/myshare \
...
```
### How was this patch tested?
Test cases were added just like the existing EmptyDir support.
The code was tested on minikube using the following script:
https://gist.github.com/w4-sjcho/4ba48f8c35a9685f5307fbd46b2c0656#file-run-test-sh
The script creates a new minikube cluster, launches an NFS server inside the cluster, copies the `README.md` file to the NFS share, and runs the `JavaWordCount` example against the file located in NFS.
Closes #27364 from w4-sjcho/master.
Authored-by: Seongjin Cho <sjcho@wisefour.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
[Delay scheduling](http://elmeleegy.com/khaled/papers/delay_scheduling.pdf) is an optimization that sacrifices fairness for data locality in order to improve cluster and workload throughput.
One useful definition of "delay" here is how much time has passed since the TaskSet was using its fair share of resources.
However it is impractical to calculate this delay, as it would require running simulations assuming no delay scheduling. Tasks would be run in different orders with different run times.
Currently the heuristic used to estimate this delay is the time since a task was last launched for a TaskSet. The problem is that it essentially does not account for resource utilization, potentially leaving the cluster heavily underutilized.
This PR modifies the heuristic in an attempt to move closer to the useful definition of delay above.
The newly proposed delay is the time since a TaskSet last launched a task **and** did not reject any resources due to delay scheduling when offered its "fair share".
See the last comments of #26696 for more discussion.
### Why are the changes needed?
The cluster can become heavily underutilized, as described in [SPARK-18886](https://issues.apache.org/jira/browse/SPARK-18886?jql=project%20%3D%20SPARK%20AND%20text%20~%20delay).
### How was this patch tested?
TaskSchedulerImplSuite
cloud-fan
tgravescs
squito
Closes #27207 from bmarcott/nmarcott-fulfill-slots-2.
Authored-by: Nicholas Marcott <481161+bmarcott@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to add `m01` as a node name additionally to `PVTestsSuite`.
### Why are the changes needed?
minikube 1.8.0 ~ 1.8.2 generates a cluster with the node name `m01`, while all the other versions use `minikube`. This causes `PVTestsSuite` failures.
```
$ minikube --vm-driver=hyperkit start --memory 6000 --cpus 8
* minikube v1.8.2 on Darwin 10.15.3
- MINIKUBE_ACTIVE_DOCKERD=minikube
* Using the hyperkit driver based on user configuration
* Creating hyperkit VM (CPUs=8, Memory=6000MB, Disk=20000MB) ...
* Preparing Kubernetes v1.18.0 on Docker 19.03.6 ...
* Launching Kubernetes ...
* Enabling addons: default-storageclass, storage-provisioner
* Waiting for cluster to come online ...
* Done! kubectl is now configured to use "minikube"
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
m01 Ready master 22s v1.17.3
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This only adds a new node name, so the K8S Jenkins job should pass.
In addition, the `K8s` integration test suite should be run manually on `minikube 1.8.2`.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
- Run SparkR on simple dataframe.R example
Run completed in 10 minutes, 23 seconds.
Total number of tests run: 20
Suites: completed 2, aborted 0
Tests: succeeded 20, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
For the above test, Minikube 1.8.2 and K8s v1.18.0 were used.
```
$ minikube version
minikube version: v1.8.2
commit: eb13446e786c9ef70cb0a9f85a633194e62396a1
$ kubectl version --short
Client Version: v1.18.0
Server Version: v1.18.0
```
Closes #28080 from dongjoon-hyun/SPARK-31313.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
A small change to fix an error in Docker `entrypoint.sh`
### Why are the changes needed?
When Spark was running on Kubernetes, I got the following logs:
```log
+ '[' -n ']'
+ '[' -z ']'
++ /bin/hadoop classpath
/opt/entrypoint.sh: line 62: /bin/hadoop: No such file or directory
+ export SPARK_DIST_CLASSPATH=
+ SPARK_DIST_CLASSPATH=
```
This is because some quotes are missing in the Bash comparisons.
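With an empty variable, an unquoted test such as `[ -n $HADOOP_HOME ]` collapses to `[ -n ]`, which evaluates to true, so the branch runs and `${HADOOP_HOME}/bin/hadoop` resolves to `/bin/hadoop` as seen in the log above. A minimal sketch of the quoting fix (variable names assumed to mirror `entrypoint.sh`):
```bash
# quote the expansions so empty/unset variables are handled correctly
if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
  export SPARK_DIST_CLASSPATH="$("${HADOOP_HOME}/bin/hadoop" classpath)"
fi
```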
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
CI
Closes #28075 from dungdm93/patch-1.
Authored-by: Đặng Minh Dũng <dungdm93@live.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is the core scheduler changes to support Stage level scheduling.
The main changes here include modification to the DAGScheduler to look at the ResourceProfiles associated with an RDD and have those applied inside the scheduler.
Currently, if multiple RDDs in a stage have conflicting ResourceProfiles, we throw an error; logic to allow this will come in SPARK-29153. I added the interfaces to RDD to add and get the ResourceProfile so that I could add unit tests for the scheduler. These are marked as private for now until we finish the feature and will be exposed in SPARK-29150. If you think this is confusing, I can remove those, remove the tests, and add them back later.
I modified the task scheduler to make sure to only schedule on executor that exactly match the resource profile. It will then check those executors to make sure the current resources meet the task needs before assigning it. In here I changed the way we do the custom resource assignment.
Other changes here include having the cpus per task passed around so that we can properly account for them. Previously we just used the one global config, but now it can change based on the ResourceProfile.
I removed the exceptions that require the cores to be the limiting resource. With this change, all the places I found that used executor cores / task cpus as slots have been updated to use the ResourceProfile logic and to look at which resource is limiting.
### Why are the changes needed?
Stage level scheduling feature.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
unit tests and lots of manual testing
Closes #27773 from tgravescs/SPARK-29154.
Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This PR (SPARK-31244) replaces `Ceph` with `Minio` in the K8S `DepsTestsSuite`.
### Why are the changes needed?
Currently, `DepsTestsSuite` is using `ceph` for S3 storage. However, the version used and all newer releases are broken on new `minikube` releases. We had better use a more robust and smaller alternative.
```
$ minikube version
minikube version: v1.8.2
$ minikube -p minikube docker-env | source
$ docker run -it --rm -e NETWORK_AUTO_DETECT=4 -e RGW_FRONTEND_PORT=8000 -e SREE_PORT=5001 -e CEPH_DEMO_UID=nano -e CEPH_DAEMON=demo ceph/daemon:v4.0.3-stable-4.0-nautilus-centos-7-x86_64 /bin/sh
2020-03-25 04:26:21 /opt/ceph-container/bin/entrypoint.sh: ERROR- it looks like we have not been able to discover the network settings
$ docker run -it --rm -e NETWORK_AUTO_DETECT=4 -e RGW_FRONTEND_PORT=8000 -e SREE_PORT=5001 -e CEPH_DEMO_UID=nano -e CEPH_DAEMON=demo ceph/daemon:v4.0.11-stable-4.0-nautilus-centos-7 /bin/sh
2020-03-25 04:20:30 /opt/ceph-container/bin/entrypoint.sh: ERROR- it looks like we have not been able to discover the network settings
```
Also, the ceph image size is unnecessarily big (almost `1GB`) and growing, while `minio` is `55.8MB` with the same features.
```
$ docker images | grep ceph
ceph/daemon v4.0.3-stable-4.0-nautilus-centos-7-x86_64 a6a05ccdf924 6 months ago 852MB
ceph/daemon v4.0.11-stable-4.0-nautilus-centos-7 87f695550d8e 12 hours ago 901MB
$ docker images | grep minio
minio/minio latest 95c226551ea6 5 days ago 55.8MB
```
### Does this PR introduce any user-facing change?
No. (This is a test case change)
### How was this patch tested?
Pass the existing Jenkins K8s integration test job and test with the latest minikube.
```
$ minikube version
minikube version: v1.8.2
$ kubectl version --short
Client Version: v1.17.4
Server Version: v1.17.4
$ NO_MANUAL=1 ./dev/make-distribution.sh --r --pip --tgz -Pkubernetes
$ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz $PWD/spark-*.tgz
...
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage *** FAILED *** // This is irrelevant to this PR.
- Launcher client dependencies // This is the fixed test case by this PR.
- Test basic decommissioning
- Run SparkR on simple dataframe.R example
Run completed in 12 minutes, 4 seconds.
...
```
The following is a working snapshot of the `DepsTestsSuite` test.
```
$ kubectl get all -ncf9438dd8a65436686b1196a6b73000f
NAME READY STATUS RESTARTS AGE
pod/minio-0 1/1 Running 0 70s
pod/spark-test-app-8494bddca3754390b9e59a2ef47584eb 1/1 Running 0 55s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/minio-s3 NodePort 10.109.54.180 <none> 9000:30678/TCP 70s
service/spark-test-app-fd916b711061c7b8-driver-svc ClusterIP None <none> 7078/TCP,7079/TCP,4040/TCP 55s
NAME READY AGE
statefulset.apps/minio 1/1 70s
```
Closes #28015 from dongjoon-hyun/SPARK-31244.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a follow-up for SPARK-30715: keep the Kubernetes client version in sync between integration-tests and kubernetes/core.
### Why are the changes needed?
More than once, the kubernetes client version has gone out of sync between the integration tests and kubernetes/core. So this brings them back in sync and adds a comment to save us from needing this follow-up again in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually.
Closes #27948 from ScrapCodes/follow-up-spark-30715.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Change what we consider a deleted pod so that it does not include pods in the "Terminating" state.
### Why are the changes needed?
If we get a new snapshot while a pod is in the process of being cleaned up we shouldn't delete the executor until it is fully terminated.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
This should be covered by the decommissioning tests, which are currently flaky because we sometimes delete the executor instead of allowing it to decommission all the way.
I also ran this in a loop locally ~80 times, with the only failures being in the PV suite because of unrelated minikube mount issues.
Closes #27905 from holdenk/SPARK-31125-Processing-state-snapshots-incorrect.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR adds the proxy user from the spark-submit command to the childArgs, so the proxy user can be retrieved and used in the KubernetesApplication to add the proxy user to the driver container args.
### Why are the changes needed?
The proxy user, when used with spark-submit, doesn't work in the Kubernetes environment since the `--proxy-user` argument is not added to the driver container; when I added it manually to the Pod definition, it worked just fine.
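A sketch of the invocation this fix targets (the image name, jar path, and user name below are placeholders):
```bash
bin/spark-submit \
  --master k8s://$K8S_MASTER \
  --deploy-mode cluster \
  --proxy-user someuser \
  --class org.apache.spark.examples.SparkPi \
  -c spark.kubernetes.container.image=spark:pr \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar
```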
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Tests were added
Closes #27422 from PedroRossi/SPARK-25355.
Authored-by: Pedro Rossi <pgrr@cin.ufpe.br>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Spark's Web UI is using an older version of Bootstrap (v. 2.3.2) for the portal pages. Bootstrap 2.x was moved to EOL in Aug 2013 and Bootstrap 3.x was moved to EOL in July 2019 (https://github.com/twbs/release). Older versions of Bootstrap are also getting flagged in security scans for various CVEs:
- https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-72889
- https://snyk.io/vuln/SNYK-JS-BOOTSTRAP-173700
- https://snyk.io/vuln/npm:bootstrap:20180529
- https://snyk.io/vuln/npm:bootstrap:20160627
I haven't validated each CVE, but it would be nice to resolve any potential issues and get on a supported release.
The bad news is that there have been quite a few changes between Bootstrap 2 and Bootstrap 4. I've tried updating the library, refactoring/tweaking the CSS and JS to maintain a similar appearance and functionality, and testing the UI for functionality and appearance. This is a fairly large change so I'm sure additional testing and fixes will be needed.
### How was this patch tested?
This has been manually tested, but there is a ton of functionality and there are many pages and detail pages so it is very possible bugs introduced from the upgrade were missed. Additional testing and feedback is welcomed. If it appears a whole page was missed let me know and I'll take a pass at addressing that page/section.
Closes #27370 from clarkead/bootstrap4-core-upgrade.
Authored-by: Dale Clarke <a.dale.clarke@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Replace a sleep with waiting for the first collect to happen to try and make the K8s test code more reliable.
### Why are the changes needed?
Currently the Decommissioning test appears to be flaky in Jenkins.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran K8s test suite in a loop on minikube on my desktop for 10 iterations without this test failing on any of the runs.
Closes #27858 from holdenk/SPARK-31062-Improve-Spark-Decommissioning-K8s-test-teliability.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Yarn side changes for Stage level scheduling. The previous PR for dynamic allocation changes was https://github.com/apache/spark/pull/27313
Modified the data structures to store things on a per ResourceProfile basis.
I tried to keep the code changes to a minimum; the main request loop just goes through each ResourceProfile, and the logic inside for each one stayed very close to the same.
On submission we now have to give each ResourceProfile a separate YARN priority because YARN doesn't support asking for containers with different resources at the same priority. We just use the profile id as the priority level.
Using a different priority actually makes it easier, when the containers come back, to match them to the ResourceProfile they were requested for.
The expectation is that yarn will only give you a container with resource amounts you requested or more. It should never give you a container if it doesn't satisfy your resource requests.
If you want to see the full feature changes you can look at https://github.com/apache/spark/pull/27053/files for reference
### Why are the changes needed?
For stage level scheduling YARN support.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Tested manually on YARN cluster and then unit tests.
Closes #27583 from tgravescs/SPARK-29149.
Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.
### Why are the changes needed?
N/A
### Does this PR introduce any user-facing change?
N/A
### How was this patch tested?
N/A
Closes #27698 from gatorsmile/updateVersion.
Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This PR is based on an existing/previous PR - https://github.com/apache/spark/pull/19045
### What changes were proposed in this pull request?
This change adds a decommissioning state that we can enter when the cloud provider/scheduler lets us know we aren't going to be removed immediately but instead will be removed soon. This concept fits nicely with K8s and also with spot instances on AWS / preemptible instances, all of which can give us notice that our host is going away. For now we simply stop scheduling jobs; in the future we could perform some kind of migration of data during scale-down, or at least stop accepting new blocks to cache.
There is a design document at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit?usp=sharing
### Why are the changes needed?
With more move to preemptible multi-tenancy, serverless environments, and spot-instances better handling of node scale down is required.
### Does this PR introduce any user-facing change?
There is no API change, however an additional configuration flag is added to enable/disable this behaviour.
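A hedged sketch of turning the behaviour on; the exact flag name here is an assumption and may not match the final config key:
```bash
# assumed config key; check the merged docs for the real name
bin/spark-submit \
  --conf spark.worker.decommission.enabled=true \
  ...
```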
### How was this patch tested?
New integration tests in the Spark K8s integration testing. Extension of the AppClientSuite to test decommissioning separately from K8s.
Closes #26440 from holdenk/SPARK-20628-keep-track-of-nodes-which-are-going-to-be-shutdown-r4.
Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
This PR aims to fix `dev-run-integration-tests.sh` to ignore empty params correctly.
### Why are the changes needed?
The script currently runs the `mvn` integration test like the following.
```
$ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh
...
build/mvn integration-test
-f /Users/dongjoon/APACHE/spark/pom.xml
-pl resource-managers/kubernetes/integration-tests
-am
-Pscala-2.12
-Pkubernetes
-Pkubernetes-integration-tests
-Djava.version=8
-Dspark.kubernetes.test.sparkTgz=N/A
-Dspark.kubernetes.test.imageTag=N/A
-Dspark.kubernetes.test.imageRepo=docker.io/kubespark
-Dspark.kubernetes.test.deployMode=minikube
-Dtest.include.tags=k8s
-Dspark.kubernetes.test.namespace=
-Dspark.kubernetes.test.serviceAccountName=
-Dspark.kubernetes.test.kubeConfigContext=
-Dspark.kubernetes.test.master=
-Dtest.exclude.tags=
-Dspark.kubernetes.test.jvmImage=spark
-Dspark.kubernetes.test.pythonImage=spark-py
-Dspark.kubernetes.test.rImage=spark-r
```
After this PR, empty parameters like the following will be skipped, as originally designed.
```
-Dspark.kubernetes.test.namespace=
-Dspark.kubernetes.test.serviceAccountName=
-Dspark.kubernetes.test.kubeConfigContext=
-Dspark.kubernetes.test.master=
-Dtest.exclude.tags=
```
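A minimal sketch of how such skipping can be done in the script (variable names are illustrative; the real script may structure this differently):
```bash
properties=()
if [ -n "$NAMESPACE" ]; then
  properties+=(-Dspark.kubernetes.test.namespace="$NAMESPACE")
fi
if [ -n "$EXCLUDE_TAGS" ]; then
  properties+=(-Dtest.exclude.tags="$EXCLUDE_TAGS")
fi
# only the non-empty -D flags reach the mvn command line
build/mvn integration-test -Pkubernetes -Pkubernetes-integration-tests "${properties[@]}"
```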
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins K8S integration test.
Closes #27566 from dongjoon-hyun/SPARK-30816.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to support JDK11 test in K8S integration tests.
- This is an update to the testing framework rather than to individual tests.
- This will enable JDK11 runtime testing even if you haven't installed JDK11 on your local system.
### Why are the changes needed?
Apache Spark 3.0.0 adds JDK11 support, but K8s integration tests use JDK8 until now.
### Does this PR introduce any user-facing change?
No. This is a dev-only test-related PR.
### How was this patch tested?
This is irrelevant to Jenkins UT, but Jenkins K8S IT (JDK8) should pass.
- https://github.com/apache/spark/pull/27559#issuecomment-585903489 (JDK8 Passed)
And, manually do the following for JDK11 test.
```
$ NO_MANUAL=1 ./dev/make-distribution.sh --r --pip --tgz -Phadoop-3.2 -Pkubernetes
$ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --java-image-tag 11-jre-slim --spark-tgz $PWD/spark-*.tgz
```
```
$ docker run -it --rm kubespark/spark:1318DD8A-2B15-4A00-BC69-D0E90CED235B /usr/local/openjdk-11/bin/java --version | tail -n1
OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode)
```
Closes #27559 from dongjoon-hyun/SPARK-30807.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is another PR for stage level scheduling. In particular, this adds changes to the dynamic allocation manager and the scheduler backend to be able to track what executors are needed per ResourceProfile. Note the api is still private to Spark until the entire feature gets in, so this functionality will be there but only usable by tests for profiles other than the DefaultProfile.
The main changes here are simply tracking things on a ResourceProfile basis as well as sending the executor requests to the scheduler backend for all ResourceProfiles.
I introduce a ResourceProfileManager in this PR that will track all the actual ResourceProfile objects so that we can keep them all in a single place and just pass around, and use in data structures, the resource profile id. The resource profile id can be used with the ResourceProfileManager to get the actual ResourceProfile contents.
There are various places in the code that use executor "slots" for things. The ResourceProfile adds functionality to keep that calculation in it. This logic is more complex than it should be due to standalone mode and Mesos coarse-grained mode not setting the executor cores config. They default to all cores on the worker, so calculating slots is harder there.
This PR keeps the functionality to make the cores the limiting resource because the scheduler still uses that for "slots" for a few things.
This PR does also add the resource profile id to the Stage and stage info classes to be able to test things easier. That full set of changes will come with the scheduler PR that will be after this one.
The PR stops at the scheduler backend pieces for the cluster manager and the real YARN support hasn't been added in this PR, that again will be in a separate PR, so this has a few of the API changes up to the cluster manager and then just uses the default profile requests to continue.
The code for the entire feature is here for reference: https://github.com/apache/spark/pull/27053/files although it needs to be upmerged again as well.
### Why are the changes needed?
Needed for stage level scheduling feature.
### Does this PR introduce any user-facing change?
No user facing api changes added here.
### How was this patch tested?
Lots of unit tests and manually testing. I tested on yarn, k8s, standalone, local modes. Ran both failure and success cases.
Closes #27313 from tgravescs/SPARK-29148.
Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Currently, it doesn't seem to be possible to have Spark Driver set the serviceAccountName for executor pods it launches.
### Why are the changes needed?
It will allow setting serviceAccountName for executor pods.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
It was covered by unit test.
Closes #27034 from ayudovin/srevice-account-name-for-executor-pods.
Authored-by: yudovin <artsiom.yudovin@profitero.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR uses a specific version of the docker image instead of `latest`. As of today, when I run the K8s integration test locally, this test case always fails.
Also, in this PR, I show two consecutive failures with a dummy change.
- https://github.com/apache/spark/pull/27465#issuecomment-582326614
- https://github.com/apache/spark/pull/27465#issuecomment-582329114
```
- Launcher client dependencies *** FAILED ***
```
After that, I added the patch and K8s Integration test passed.
- https://github.com/apache/spark/pull/27465#issuecomment-582361696
### Why are the changes needed?
[SPARK-28465](https://github.com/apache/spark/pull/25222) switched from `v4.0.0-stable-4.0-master-centos-7-x86_64` to `latest` to catch up with the API change. However, the API change seems to have occurred again. We had better use a specific version to prevent accidental failures.
```scala
- .withImage("ceph/daemon:v4.0.0-stable-4.0-master-centos-7-x86_64")
+ .withImage("ceph/daemon:latest")
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass `Launcher client dependencies` test in Jenkins K8s Integration Suite.
Or, run K8s Integration test locally.
Closes #27465 from dongjoon-hyun/SPARK-K8S-IT.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Bump fabric8 kubernetes-client to 4.7.1
### Why are the changes needed?
New fabric8 version brings support for Kubernetes 1.17 clusters.
Full release notes:
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.7.1
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit and integration tests cover creation of K8S objects. Adjusted them to work with the new fabric8 version
Closes #27443 from onursatici/os/bump-fabric8.
Authored-by: Onur Satici <onursatici@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This change is to allow custom resource scheduler (GPUs, FPGAs, etc.) resource discovery to be more flexible. Users are asking for it to work with Hadoop 2.x versions that do not support resource scheduling in YARN and/or when they do not run in an isolated environment.
This change creates a plugin api that users can write their own resource discovery class that allows a lot more flexibility. The user can chain plugins for different resource types. The user specified plugins execute in the order specified and will fall back to use the discovery script plugin if they don't return information for a particular resource.
I had to open up a few of the classes to be public, change them to not be case classes, and make them developer API in order for the plugin to get enough of the information it needs.
I also relaxed the yarn side so that if yarn isn't configured for resource scheduling we just warn and go on. This helps users that have yarn 3.1 but haven't configured the resource scheduling side on their cluster yet, or aren't running in isolated environment.
The user would configure this like:
--conf spark.resources.discovery.plugin="org.apache.spark.resource.ResourceDiscoveryFPGAPlugin, org.apache.spark.resource.ResourceDiscoveryGPUPlugin"
Note the executor side had to be wrapped with a classloader to make sure we include the user classpath for jars they specified on submission.
Note this is more flexible because the discovery script has limitations such as spawning it in a separate process. This means if you are trying to allocate resources in that process they might be released when the script returns. Other things are the class makes it more flexible to be able to integrate with existing systems and solutions for assigning resources.
### Why are the changes needed?
To more easily use Spark resource scheduling with older versions of Hadoop or in non-isolated environments.
### Does this PR introduce any user-facing change?
Yes, a plugin API.
### How was this patch tested?
Unit tests added and manual testing done on yarn and standalone modes.
Closes #27410 from tgravescs/hadoop27spark3.
Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Add the allocated resources as parameters to the PluginContext so that any plugin in the driver or executor can use this information to initialize devices or use it in some other useful manner.
### Why are the changes needed?
To allow users to initialize/track devices once at the executor level, before the tasks that use them run.
### Does this PR introduce any user-facing change?
Yes to the people using the Executor/Driver plugin interface.
### How was this patch tested?
Unit tests and manually by writing a plugin that initialized GPU's using this interface.
Closes #27367 from tgravescs/pluginWithResources.
Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Add the SPARK_APPLICATION_ID environment variable when Spark configures the driver pod.
### Why are the changes needed?
Currently, the driver doesn't have this in its environment, and it's not convenient to retrieve the Spark application id.
The use case is that we want to look up the Spark application id, create an application folder, and redirect driver logs to that folder.
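A hedged sketch of that use case from inside the driver container (the log location is hypothetical):
```bash
# create a per-application folder named after the new env var;
# log collection can then be pointed at this directory
log_dir="/var/log/spark/${SPARK_APPLICATION_ID}"
mkdir -p "$log_dir"
```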
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Unit tested. I also built a new distribution and container image to kick off a job in Kubernetes, and I do see SPARK_APPLICATION_ID added there.
Closes #27347 from Jeffwan/SPARK-30626.
Authored-by: Jiaxin Shan <seedjeffwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Make KUBERNETES_MASTER_INTERNAL_URL configurable.
### Why are the changes needed?
We do not always use the default port number 443 to access our kube-apiserver, and in some multi-tenant clusters people do not use the service `kubernetes.default.svc` to access the kube-apiserver, so making the internal master configurable is necessary.
### Does this PR introduce any user-facing change?
The user can configure the internal master URL with:
```
--conf spark.kubernetes.internal.master=https://kubernetes.default.svc:6443
```
### How was this patch tested?
Ran in a multi-tenant cluster that does not use https://kubernetes.default.svc to access the kube-apiserver.
Closes #27029 from wackxu/internalmaster.
Authored-by: xushiwei 00425595 <xushiwei5@huawei.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is the second PR for the Stage Level Scheduling. This is adding in the necessary executor side changes:
1) executors to know what ResourceProfile they should be using
2) handle parsing the resource profile settings - these are not in the global configs
3) then reporting back to the driver what resource profile it was started with.
This PR adds all the piping for YARN to pass the information all the way to executors, but it just uses the default ResourceProfile (which is the global application level configs).
At a high level these changes include:
1) adding a new --resourceProfileId option to the CoarseGrainedExecutorBackend
2) Add the ResourceProfile settings to new internal confs that gets passed into the Executor
3) Executor changes that use the resource profile id passed in to read the corresponding ResourceProfile confs and then parse those requests and discover resources as necessary
4) Executor registers to Driver with the Resource profile id so that the ExecutorMonitor can track how many executor with each profile are running
5) YARN side changes to show that passing the resource profile id and confs actually works. Just uses the DefaultResourceProfile for now.
I also removed a check from the CoarseGrainedExecutorBackend that used to check to make sure there were task requirements before parsing any custom resource executor requests. With the resource profiles this becomes much more expensive because we would then have to pass the task requests to each executor and the check was just a short cut and not really needed. It was much cleaner just to remove it.
Note there were some changes to the ResourceProfile, ExecutorResourceRequests, and TaskResourceRequests in this PR as well because I discovered some issues with things not being immutable. That api now look like:
val rpBuilder = new ResourceProfileBuilder()
val ereq = new ExecutorResourceRequests()
val treq = new TaskResourceRequests()
ereq.cores(2).memory("6g").memoryOverhead("2g").pysparkMemory("2g").resource("gpu", 2, "/home/tgraves/getGpus")
treq.cpus(2).resource("gpu", 2)
val resourceProfile = rpBuilder.require(ereq).require(treq).build
This makes it so that ResourceProfile is immutable and Spark can use it directly without worrying about the user changing it.
### Why are the changes needed?
These changes are needed for the executor to report which ResourceProfile they are using so that ultimately the dynamic allocation manager can use that information to know how many with a profile are running and how many more it needs to request. Its also needed to get the resource profile confs to the executor so that it can run the appropriate discovery script if needed.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit tests and manually on YARN.
Closes #26682 from tgravescs/SPARK-29306.
Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
The issue here is that when Spark is downscaling the application and deletes
a few pod requests that aren't needed anymore, it may actually race with the
K8S scheduler, who may be bringing up those executors. So they may have enough
time to connect back to the driver, register, to just be deleted soon after.
This wastes resources and causes misleading entries in the driver log.
The change (ab)uses the blacklisting mechanism to consider the deleted excess
pods as blacklisted, so that if they try to connect back, the driver will deny
it.
It also changes the executor registration slightly, since even with the above
change there were misleading logs. That was because the executor registration
message was an RPC that always succeeded (bar network issues), so the executor
would always try to send an unregistration message to the driver, which would
then log several messages about not knowing anything about the executor. The
change makes the registration RPC succeed or fail directly, instead of using
the separate failure message that would lead to this issue.
Note the last change required some changes in a standalone test suite related
to dynamic allocation, since it relied on the driver not throwing exceptions
when a duplicate executor registration happened.
Tested with existing unit tests, and with live cluster with dyn alloc on.
Closes #26586 from vanzin/SPARK-29950.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Remove `executorsPendingToRemove.clear()` from `CoarseGrainedSchedulerBackend.reset()`.
### Why are the changes needed?
Clearing `executorsPendingToRemove` before removing the executors causes all tasks running on those "pending to remove" executors to be counted as failures, which is not correct for the case of `executorsPendingToRemove(execId)=true`.
Besides, `executorsPendingToRemove` will be cleaned up within `removeExecutor()` at the end, just the same as `executorsPendingLossReason`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a new test in `TaskSetManagerSuite`.
Closes #27017 from Ngone51/dont-clear-eptr-in-reset.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The shutdown hook of YarnClientSchedulerBackend prints just "Stopped", which can be improved to "YarnClientSchedulerBackend Stopped" for better understanding.
### Why are the changes needed?
While stopping or gracefully exiting spark-shell/spark-sql with --master yarn, printing only `stopped` is not very informative.
### Does this PR introduce any user-facing change?
Yes. Log info message change.
### How was this patch tested?
Manually
Closes #27049 from jobitmathew/imp_stop_message.
Authored-by: Jobit Mathew <jobit.mathew@huawei.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79
2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1
### Why are the changes needed?
Shouldn't change master.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master
Closes #26915 from wangyum/revert-master.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Include `$SPARK_DIST_CLASSPATH` in class path when launching `CoarseGrainedExecutorBackend` on Kubernetes executors using the provided `entrypoint.sh`
### Why are the changes needed?
For user provided Hadoop, `$SPARK_DIST_CLASSPATH` contains the required jars.
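A simplified sketch of the described change; the real `entrypoint.sh` assembles its command differently, but the essential part is appending `$SPARK_DIST_CLASSPATH` to the executor's `-cp`:
```bash
# include user-provided Hadoop jars on the executor classpath
exec java -cp "${SPARK_CLASSPATH}:${SPARK_DIST_CLASSPATH}" \
  org.apache.spark.executor.CoarseGrainedExecutorBackend "$@"
```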
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Kubernetes 1.14, Spark 2.4.4, Hadoop 3.2.1. Adding $SPARK_DIST_CLASSPATH to `-cp ` param of entrypoint.sh enables launching the executors correctly.
Closes #26493 from sshakeri/master.
Authored-by: Shahin Shakeri <shahin.shakeri@pwc.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Added apt-get update as per [docker best-practices](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#apt-get)
### Why are the changes needed?
The builder is failing because:
Without doing `apt-get update`, the APT lists get outdated and begin referring to package versions that no longer exist, hence the 404 when trying to download them (Debian does not keep old versions in the archive when a package is updated).
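A minimal sketch following that best practice: `update` and `install` in the same layer, with the package lists cleaned afterwards (the packages shown are only illustrative):
```bash
apt-get update && \
  apt-get install -y --no-install-recommends bash tini && \
  rm -rf /var/lib/apt/lists/*
```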
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
k8s builder
Closes #26753 from ifilonenko/SPARK-30111.
Authored-by: Ilan Filonenko <ifilonenko@bloomberg.net>
Signed-off-by: shane knapp <incomplete@gmail.com>