### What changes were proposed in this pull request?
We added the gnupg installation in https://github.com/apache/spark/pull/30130. We should run `apt-get update` before the gnupg installation; otherwise we get a fetch error when a package has been updated.
See more in:
[1] http://apache-spark-developers-list.1001551.n3.nabble.com/K8s-Integration-test-is-unable-to-run-because-of-the-unavailable-libs-td30986.html
### Why are the changes needed?
Add an `apt-get update` command before the gnupg installation to avoid an invalid package cache list.
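As a rough sketch of the kind of change involved (the exact package list and Dockerfile layer are illustrative, not the real file contents), the index refresh and the install belong in the same step:
```bash
# Refresh the APT index right before installing gnupg, in the same layer,
# so a stale cached package list can no longer cause 404 fetch errors.
apt-get update && \
  apt-get install -y gnupg && \
  rm -rf /var/lib/apt/lists/*
```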
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
K8s Integration test passed
Closes#31923 from Yikun/SPARK-34820.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes:
- Respect `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations, in Kubernetes just like other cluster types in Spark.
- Deprecate `spark.kubernetes.pyspark.pythonVersion` and guide users to set the environment variables and configurations for Python executables.
NOTE that `spark.kubernetes.pyspark.pythonVersion` is already a no-op configuration without this PR. Default is `3` and other values are disallowed.
- In order for Python executable settings to be consistently used, fix the `spark.archives` option to unpack into the current working directory in the driver of Kubernetes' cluster mode. This behaviour is identical to YARN's cluster mode. By doing this, users can leverage Conda or virtualenv in cluster mode as below (a configuration-based variant is sketched after this list):
```python
conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz
PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
```
- Removed several unused or useless pieces of code such as `extractS3Key` and `renameResourcesToLocalFS`.
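For completeness, a hedged sketch of the configuration-based variant mentioned above (file and archive names mirror the example and are illustrative):
```bash
# Same submission as above, but selecting the Python executable through
# configuration instead of the PYSPARK_PYTHON environment variable.
spark-submit \
  --conf spark.pyspark.python=./environment/bin/python \
  --archives pyspark_conda_env.tar.gz#environment \
  app.py
```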
### Why are the changes needed?
- To provide a consistent support of PySpark by using `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations.
- To provide Conda and virtualenv support via `spark.archives` options.
### Does this PR introduce _any_ user-facing change?
Yes:
- `spark.kubernetes.pyspark.pythonVersion` is deprecated.
- `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, and `spark.pyspark.python` and `spark.pyspark.driver.python` configurations are respected.
### How was this patch tested?
Manually tested via:
```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz --service-account spark --namespace spark-integration-test
```
Unittests were also added.
Closes#30735 from HyukjinKwon/SPARK-33748.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
While building the R docker image, if we can't fetch the key from gnupg.net, fall back to openpgp.org.
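A hedged sketch of the fallback pattern (the key ID is a placeholder, and the exact keyserver hostnames and retry logic in the Dockerfile may differ):
```bash
# Try the primary keyserver first; if the fetch fails, retry against openpgp.org.
KEY_ID="<signing-key-id>"   # placeholder for the key fetched in the R Dockerfile
gpg --batch --keyserver hkps://keyserver.gnupg.net --recv-keys "$KEY_ID" || \
  gpg --batch --keyserver hkps://keys.openpgp.org --recv-keys "$KEY_ID"
```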
### Why are the changes needed?
gnupg.net key servers are flaky and sometimes fail to resolve or return keys.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tried to add the key on my desktop and it failed; then tried to add the key with openpgp.org and it succeeded.
Closes#30696 from holdenk/SPARK-33727-gnupg-server-is-flaky.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is an improvement: we mount all the user-specific configuration files (except the templates and Spark properties files) from `SPARK_CONF_DIR` at the point of spark-submit to both the executor and driver pods. Currently, only `spark.properties` is mounted, and only on the driver.
### Why are the changes needed?
`SPARK_CONF_DIR` hosts several configuration files, for example,
1) `spark-defaults.conf` - containing all the spark properties.
2) `log4j.properties` - Logger configuration.
3) `core-site.xml` - Hadoop related configuration.
4) `fairscheduler.xml` - Spark's fair scheduling policy at the job level.
5) `metrics.properties` - Spark metrics.
6) Any user specific - library or framework specific configuration file.
At the moment, we cannot propagate these files to the driver and executor configuration directory.
There is a design doc, with more details, and this patch is currently providing a reference implementation. Please take a look at the doc and comment, how we can improve. [google docs link to the doc](https://bit.ly/spark-30985)
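A simple, hedged way to check the propagation after deployment (the pod name is a placeholder; `/opt/spark/conf` is the usual `SPARK_CONF_DIR` inside the image):
```bash
# List the configuration files that ended up in the driver pod's SPARK_CONF_DIR.
kubectl exec <driver-pod-name> -- ls /opt/spark/conf
```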
### Further scope
Support user defined configMaps.
### Does this PR introduce any user-facing change?
Yes, previously the user configuration files (e.g. hdfs-site.xml, log4j.properties, etc.) were not propagated by default; after this patch they are propagated to the driver and executor pods' `SPARK_CONF_DIR`.
### How was this patch tested?
Added tests.
Also manually tested by deploying it to a minikube cluster and observing that the additional configuration files were present and taking effect. For example, changes to log4j.properties were properly applied to executors.
Closes#27735 from ScrapCodes/SPARK-30985/spark-conf-k8s-propagate.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This adds support for Stage level scheduling to kubernetes. Kubernetes can support dynamic allocation via the shuffle tracking option which means we can support stage level scheduling by getting new executors.
The main changes here are having the K8s cluster manager pass the resource profile id into the executors, and having the ExecutorPodsAllocator request executors based on the individual resource profiles. I tried to keep code changes here to a minimum. I specifically chose to leave the ExecutorPodsSnapshot the way it was and construct the resource-profile-to-pod states on the fly, with a fast path when not using other resource profiles, to keep the impact to a minimum. As a result, the main change required is just wrapping the allocation logic in a for loop over each profile. The other main change is in the basic feature step: we have to look at the resources in the ResourceProfile to request pods with the correct resources. Much of the other logic, such as the executor lifecycle manager, doesn't need to be resource-profile aware.
This also adds support for [SPARK-32661] (Spark executors on K8S should request extra memory for off-heap allocations), because the stage level scheduling API supports this and it made sense to make it consistent with YARN. This was started with PR https://github.com/apache/spark/pull/29477 but never updated, so I just did it here. To do this I moved a few functions around that are now used by both YARN and Kubernetes, so you will see some changes in Utils.
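As a hedged sketch of the prerequisite mentioned above (not the exact test command used for this patch), stage-level scheduling on K8s builds on dynamic allocation with shuffle tracking:
```bash
# Enable dynamic allocation with shuffle tracking so new executors (and hence
# new resource profiles) can be requested on Kubernetes.
spark-submit \
  --master k8s://https://<api-server> \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  ...
```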
### Why are the changes needed?
Add the feature to Kubernetes based on customer feedback.
### Does this PR introduce _any_ user-facing change?
Yes, the feature now works with K8s, but there are no underlying API changes.
### How was this patch tested?
Tested manually on kubernetes cluster and with unit tests.
Closes#30204 from tgravescs/stagek8sOrigSnapshotsRebase.
Lead-authored-by: Thomas Graves <tgraves@apache.org>
Co-authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This PR aims to use R 3.6.3 in K8s R image and re-enable `RTestsSuite`.
### Why are the changes needed?
Jenkins Server is using `R 3.6.3`.
```
+ SPARK_HOME=/home/jenkins/workspace/SparkPullRequestBuilder-K8s
+ /usr/bin/R CMD check --as-cran --no-tests SparkR_3.1.0.tar.gz
* using log directory ‘/home/jenkins/workspace/SparkPullRequestBuilder-K8s/R/SparkR.Rcheck’
* using R version 3.6.3 (2020-02-29)
```
OpenJDK docker image is using `R 3.5.2 (2018-12-20)` which is old and currently `spark-3.0.1` fails to run SparkR.
```
$ cd spark-3.0.1-bin-hadoop3.2
$ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile -n build
...
exit code: 1
termination reason: Error
...
$ bin/spark-submit --master k8s://https://192.168.64.49:8443 --deploy-mode cluster --conf spark.kubernetes.container.image=spark-r:latest local:///opt/spark/examples/src/main/r/dataframe.R
$ k logs dataframe-r-b1c14b75b0c09eeb-driver
...
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.4 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.RRunner local:///opt/spark/examples/src/main/r/dataframe.R
20/11/10 06:03:58 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
log4j:WARN No appenders could be found for logger (io.netty.util.internal.logging.InternalLoggerFactory).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
Error: package or namespace load failed for ‘SparkR’ in rbind(info, getNamespaceInfo(env, "S3methods")):
number of columns of matrices must match (see arg 2)
In addition: Warning message:
package ‘SparkR’ was built under R version 4.0.2
Execution halted
```
In addition, this PR aims to recover the test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass K8S IT Jenkins job.
Closes#30130 from dongjoon-hyun/SPARK-32354.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to use `openjdk:11-jre-slim` as default in K8s Dockerfile.
### Why are the changes needed?
Although Apache Spark supports both Java8/Java11, there is a difference.
1. A Java8-built distribution can run on both Java8 and Java11.
2. A Java11-built distribution can run on Java11, but not on Java8.
In short, we had better use Java11 in Dockerfile to embrace both cases without any issues.
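If a different base image is still wanted, it can be overridden at build time; a hedged example (assuming the `-b` build-arg option of `docker-image-tool.sh`):
```bash
# Build the JVM image against an explicit Java base image instead of the default.
bin/docker-image-tool.sh -r <repo> -t <tag> -b java_image_tag=8-jre-slim build
```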
### Does this PR introduce _any_ user-facing change?
Yes. This will remove the chance of user frustration when they build with JDK11 and then build the image without overriding the Java base image.
### How was this patch tested?
Pass the K8s IT.
Closes#30083 from dongjoon-hyun/SPARK-33176.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR makes `spark.kubernetes.pyspark.pythonVersion` allow only `3`. In other words, it will reject `2` for `Python 2`.
- [x] Configuration description and check is updated.
- [x] Documentation is updated
- [x] Unit test cases are updated.
- [x] Docker image script is updated.
### Why are the changes needed?
After SPARK-32138, Apache Spark 3.1 dropped Python 2 support.
### Does this PR introduce _any_ user-facing change?
Yes, but Python 2 support is already dropped officially.
### How was this patch tested?
Pass the CI.
Closes#30049 from dongjoon-hyun/SPARK-DROP-PYTHON2.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to remove the Python 2 installation in the K8s Python image because Spark 3.1 does not support Python 2.
### Why are the changes needed?
This will save disk space.
**BEFORE**
```
kubespark/spark-py ... 917MB
```
**AFTER**
```
kubespark/spark-py ... 823MB
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the K8s IT.
Closes#29751 from williamhyun/remove_py2.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
If graceful decommissioning is enabled, Spark's dynamic scaling uses this instead of directly killing executors.
### Why are the changes needed?
When scaling down Spark we should avoid triggering recomputes as much as possible.
### Does this PR introduce _any_ user-facing change?
Hopefully their jobs run faster or at the same speed. It also enables experimental shuffle service free dynamic scaling when graceful decommissioning is enabled (using the same code as the shuffle tracking dynamic scaling).
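A hedged sketch of enabling the behaviour (configuration names reflect my reading of the decommissioning and dynamic allocation settings, not verified against this exact patch):
```bash
# Dynamic allocation plus graceful decommissioning: scale-down decommissions
# executors instead of killing them outright.
spark-submit \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.decommission.enabled=true \
  ...
```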
### How was this patch tested?
For now I've extended the ExecutorAllocationManagerSuite for both core & streaming.
Closes#29367 from holdenk/SPARK-31198-use-graceful-decommissioning-as-part-of-dynamic-scaling.
Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
At the moment, we switch to `https` URLs for all the Debian mirrors, but it turns out some of the mirrors do not support it. In this patch, we turn on https only for the `deb.debian.org` mirror (as it supports SSL).
### Why are the changes needed?
It appears that security.debian.org does not support https.
```
curl https://security.debian.org
curl: (35) LibreSSL SSL_connect: SSL_ERROR_SYSCALL in connection to security.debian.org:443
```
While building the image, it fails in the following way.
```
MacBook-Pro:spark prashantsharma$ bin/docker-image-tool.sh -r scrapcodes -t v3.1.0-1 build
Sending build context to Docker daemon 222.1MB
Step 1/18 : ARG java_image_tag=8-jre-slim
Step 2/18 : FROM openjdk:${java_image_tag}
---> 381b20190cf7
Step 3/18 : ARG spark_uid=185
---> Using cache
---> 65c06f86753c
Step 4/18 : RUN set -ex && sed -i 's/http:/https:/g' /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*
---> Running in a3461dadd6eb
+ sed -i s/http:/https:/g /etc/apt/sources.list
+ apt-get update
Ign:1 https://security.debian.org/debian-security buster/updates InRelease
Err:2 https://security.debian.org/debian-security buster/updates Release
Could not handshake: The TLS connection was non-properly terminated. [IP: 151.101.0.204 443]
Get:3 https://deb.debian.org/debian buster InRelease [121 kB]
Get:4 https://deb.debian.org/debian buster-updates InRelease [51.9 kB]
Get:5 https://deb.debian.org/debian buster/main amd64 Packages [7905 kB]
Get:6 https://deb.debian.org/debian buster-updates/main amd64 Packages [7868 B]
Reading package lists...
E: The repository 'https://security.debian.org/debian-security buster/updates Release' does not have a Release file.
The command '/bin/sh -c set -ex && sed -i 's/http:/https:/g' /etc/apt/sources.list && apt-get update && ln -s /lib /lib64 && apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps && mkdir -p /opt/spark && mkdir -p /opt/spark/examples && mkdir -p /opt/spark/work-dir && touch /opt/spark/RELEASE && rm /bin/sh && ln -sv /bin/bash /bin/sh && echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && chgrp root /etc/passwd && chmod ug+rw /etc/passwd && rm -rf /var/cache/apt/*' returned a non-zero code: 100
Failed to build Spark JVM Docker image, please refer to Docker build output for details.
```
So, limiting the `https` support to deb.debian.org alone does the trick.
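A hedged sketch of the narrowed substitution (the exact Dockerfile line may differ):
```bash
# Switch only deb.debian.org to https; security.debian.org, which does not
# support TLS, stays on plain http.
sed -i 's|http://deb\.debian\.org|https://deb.debian.org|g' /etc/apt/sources.list
```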
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually, by building an image and testing it by running spark shell against it locally using kubernetes.
Closes#28834 from ScrapCodes/spark-31994/debian_mirror_fix.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
A small change to fix an error in Docker `entrypoint.sh`
### Why are the changes needed?
When spark running on Kubernetes, I got the following logs:
```log
+ '[' -n ']'
+ '[' -z ']'
++ /bin/hadoop classpath
/opt/entrypoint.sh: line 62: /bin/hadoop: No such file or directory
+ export SPARK_DIST_CLASSPATH=
+ SPARK_DIST_CLASSPATH=
```
This is because some quotes are missing in the bash comparisons.
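A hedged sketch of the quoted form (variable names follow the log above; the real entrypoint.sh may differ in detail):
```bash
# With quotes, an unset HADOOP_HOME makes the -n test fail cleanly instead of
# degenerating into a single-argument test that always succeeds and then
# invoking /bin/hadoop.
if [ -n "${HADOOP_HOME}" ] && [ -z "${SPARK_DIST_CLASSPATH}" ]; then
  export SPARK_DIST_CLASSPATH="$("${HADOOP_HOME}/bin/hadoop" classpath)"
fi
```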
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
CI
Closes#28075 from dungdm93/patch-1.
Authored-by: Đặng Minh Dũng <dungdm93@live.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
This PR is based on an existing/previous PR - https://github.com/apache/spark/pull/19045
### What changes were proposed in this pull request?
This change adds a decommissioning state that we can enter when the cloud provider/scheduler lets us know we aren't going to be removed immediately but instead will be removed soon. This concept fits nicely with K8s and also with spot instances on AWS / preemptible instances, all of which can give us notice that our host is going away. For now we simply stop scheduling jobs; in the future we could perform some kind of migration of data during scale-down, or at least stop accepting new blocks to cache.
There is a design document at https://docs.google.com/document/d/1xVO1b6KAwdUhjEJBolVPl9C6sLj7oOveErwDSYdT-pE/edit?usp=sharing
### Why are the changes needed?
With the increasing move to preemptible multi-tenancy, serverless environments, and spot instances, better handling of node scale-down is required.
### Does this PR introduce any user-facing change?
There is no API change, however an additional configuration flag is added to enable/disable this behaviour.
### How was this patch tested?
New integration tests in the Spark K8s integration testing. Extension of the AppClientSuite to test decommissioning separately from K8s.
Closes#26440 from holdenk/SPARK-20628-keep-track-of-nodes-which-are-going-to-be-shutdown-r4.
Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
This PR aims to support JDK11 test in K8S integration tests.
- This is an update in testing framework instead of individual tests.
- This will enable the JDK11 runtime test even when you don't have JDK11 installed on your local system.
### Why are the changes needed?
Apache Spark 3.0.0 adds JDK11 support, but the K8s integration tests have used JDK8 until now.
### Does this PR introduce any user-facing change?
No. This is a dev-only test-related PR.
### How was this patch tested?
This is irrelevant to Jenkins UT, but Jenkins K8S IT (JDK8) should pass.
- https://github.com/apache/spark/pull/27559#issuecomment-585903489 (JDK8 Passed)
And, manually do the following for JDK11 test.
```
$ NO_MANUAL=1 ./dev/make-distribution.sh --r --pip --tgz -Phadoop-3.2 -Pkubernetes
$ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --java-image-tag 11-jre-slim --spark-tgz $PWD/spark-*.tgz
```
```
$ docker run -it --rm kubespark/spark:1318DD8A-2B15-4A00-BC69-D0E90CED235B /usr/local/openjdk-11/bin/java --version | tail -n1
OpenJDK 64-Bit Server VM 18.9 (build 11.0.6+10, mixed mode)
```
Closes#27559 from dongjoon-hyun/SPARK-30807.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Include `$SPARK_DIST_CLASSPATH` in class path when launching `CoarseGrainedExecutorBackend` on Kubernetes executors using the provided `entrypoint.sh`
### Why are the changes needed?
For user provided Hadoop, `$SPARK_DIST_CLASSPATH` contains the required jars.
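A hedged sketch of the executor launch with the extra classpath entry (flags abbreviated; the real entrypoint.sh passes many more options):
```bash
# Append SPARK_DIST_CLASSPATH (user-provided Hadoop jars) to the executor classpath.
exec java \
  -cp "$SPARK_CLASSPATH:$SPARK_DIST_CLASSPATH" \
  org.apache.spark.executor.CoarseGrainedExecutorBackend ...
```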
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Kubernetes 1.14, Spark 2.4.4, Hadoop 3.2.1. Adding $SPARK_DIST_CLASSPATH to `-cp ` param of entrypoint.sh enables launching the executors correctly.
Closes#26493 from sshakeri/master.
Authored-by: Shahin Shakeri <shahin.shakeri@pwc.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Added apt-get update as per [docker best-practices](https://docs.docker.com/develop/develop-images/dockerfile_best-practices/#apt-get)
### Why are the changes needed?
Builder is failing because:
Without doing apt-get update, the APT lists get outdated and begin referring to package versions that no longer exist, hence the 404 when trying to download them (Debian does not keep old versions in the archive when a package is updated).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
k8s builder
Closes#26753 from ifilonenko/SPARK-30111.
Authored-by: Ilan Filonenko <ifilonenko@bloomberg.net>
Signed-off-by: shane knapp <incomplete@gmail.com>
### What changes were proposed in this pull request?
The current docker image used by Kubernetes is `openjdk:8-alpine`. It was not supported and was removed with the commit 3eb0351b20 (diff-f95ffa3d1377774732c33f7b8368e099).
This PR proposes to move to a supported docker image.
### Why are the changes needed?
I think there are at least two reasons:
1. According to the commit, Alpine/musl is not officially supported by the OpenJDK project.
2. As there are no more OpenJDK 8 Alpine images, new JDK updates, including security fixes, are not applied to it. See below:
```
docker run -it --rm openjdk:8-alpine java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (IcedTea 3.12.0) (Alpine 8.212.04-r0)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)
```
```
docker run -it --rm openjdk:8-jdk-slim java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
```
### Does this PR introduce any user-facing change?
Yes. This changes the base docker image of Spark.
### How was this patch tested?
Existing tests.
Closes#26037 from viirya/SPARK-28938.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
while performing some tests on our existing minikube and k8s infrastructure, i noticed that the integration tests were failing. i dug in and discovered the following message buried at the end of the stacktrace:
```
Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so
at sun.security.pkcs11.Secmod.initialize(Secmod.java:193)
at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:218)
... 81 more
```
after i added the `nss` package to `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile`, everything worked.
this is also impacting current builds. see: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8959/console
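for reference, a hedged sketch of the package install (assuming the Alpine-based image used at the time):
```bash
# install nss so the JVM's PKCS11 provider can find /usr/lib/libnss3.so
apk add --no-cache nss
```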
## How was this patch tested?
i tested locally before pushing, and the build system will test the rest.
Closes#24111 from shaneknapp/add-nss-package-to-dockerfile.
Authored-by: shane knapp <incomplete@gmail.com>
Signed-off-by: shane knapp <incomplete@gmail.com>
Running Spark in Docker image with Alpine Linux 3.9.0 throws errors when using snappy.
The issue can be reproduced for example as follows: `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`
The key part of the error stack is as follows `SparkException: Task failed while writing rows. .... Caused by: java.lang.UnsatisfiedLinkError: /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so: Error loading shared library ld-linux-x86-64.so.2: No such file or directory (needed by /tmp/snappy-1.1.7-2b4872f1-7c41-4b84-bda1-dbcb8dd0ce4c-libsnappyjava.so)`
The source of the error appears to be that libsnappyjava.so needs ld-linux-x86-64.so.2 and looks for it in /lib, while in Alpine Linux 3.9.0 with libc6-compat version 1.1.20-r3 ld-linux-x86-64.so.2 is located in /lib64.
Note: this issue is not present with Alpine Linux 3.8 and libc6-compat version 1.1.19-r10
## What changes were proposed in this pull request?
A possible workaround proposed with this PR is to modify the Dockerfile by adding a symbolic link between /lib and /lib64 so that ld-linux-x86-64.so.2 can be found in /lib. This is probably not the cleanest solution, but I have observed that this is what happened/happens already when using Alpine Linux 3.8.1 (a version of Alpine Linux which was not affected by the issue reported here).
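A hedged sketch of the workaround (its placement relative to the libc6-compat install in the Dockerfile is an assumption here):
```bash
# Make /lib64 resolve to /lib so ld-linux-x86-64.so.2 is visible where
# libsnappyjava.so expects to find it.
ln -sv /lib /lib64
```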
## How was this patch tested?
Manually tested by running a simple workload with spark-shell, using docker on a client machine and using Spark on a Kubernetes cluster. The test workload is: `Seq(1,2).toDF("id").write.format("parquet").save("DELETEME1")`
Added a test to the KubernetesSuite / BasicTestsSuite
Closes#23898 from LucaCanali/dockerfileUpdateSPARK26995.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Before this change, there was some code in the k8s backend to deal
with how to resolve dependencies and make them available to the
Spark application. It turns out that none of that code is necessary,
since spark-submit already handles all that for applications started
in client mode - like the k8s driver that is run inside a Spark-created
pod.
For that reason, specifically for pyspark, there's no need for the
k8s backend to deal with PYTHONPATH; or, in general, to change the URIs
provided by the user at all. spark-submit takes care of that.
For testing, I created a pyspark script that depends on another module
that is shipped with --py-files. Then I used:
- --py-files http://.../dep.py http://.../test.py
- --py-files http://.../dep.zip http://.../test.py
- --py-files local:/.../dep.py local:/.../test.py
- --py-files local:/.../dep.zip local:/.../test.py
Without this change, all of the above commands fail. With the change, the
driver is able to see the dependencies in all the above cases; but executors
don't see the dependencies in the last two. That's a bug in shared Spark code
that deals with local: dependencies in pyspark (SPARK-26934).
I also tested a Scala app using the main jar from an http server.
Closes#23793 from vanzin/SPARK-24736.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Merge both case statements, and remove unused variables that
are not set by the Scala code anymore.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#23655 from vanzin/SPARK-26733.
Latest Docker releases are stricter in their enforcement of build argument scope. The location of the `ARG spark_uid` declaration in the Python and R Dockerfiles means the variable is out of scope by the time it is used in a `USER` declaration resulting in a container running as root rather than the default/configured UID.
Also, with some of the refactoring of the script that has happened since my PR that introduced the configurable UID, it turns out the `-u <uid>` argument is not being properly passed to the Python and R image builds when those are opted into.
## What changes were proposed in this pull request?
This commit moves the `ARG` declaration to just before the argument is used such that it is in scope. It also ensures that Python and R image builds receive the build arguments that include the `spark_uid` argument where relevant
## How was this patch tested?
Prior to the patch images are produced where the Python and R images ignore the default/configured UID:
```
> docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
bash-4.4# whoami
root
bash-4.4# id -u
0
bash-4.4# exit
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
```
Note that the Python image is still running as `root` having ignored the configured UID of 456 while the base image has the correct UID because the relevant `ARG` declaration is correctly in scope.
After the patch the correct UID is observed:
```
> docker run -it --entrypoint /bin/bash rvesse/spark-r:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
exit
> docker run -it --entrypoint /bin/bash rvesse/spark-py:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
exit
> docker run -it --entrypoint /bin/bash rvesse/spark:uid456
bash-4.4$ id -u
456
bash-4.4$ exit
```
Closes#23611 from rvesse/SPARK-26685.
Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
When I try to run the `./bin/pyspark` command in a pod in Kubernetes (image built without change from the pyspark Dockerfile), I'm getting an error:
```
$SPARK_HOME/bin/pyspark --deploy-mode client --master k8s://https://$KUBERNETES_SERVICE_HOST:$KUBERNETES_SERVICE_PORT_HTTPS ...
Python 2.7.15 (default, Aug 22 2018, 13:24:18)
[GCC 6.4.0] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Could not open PYTHONSTARTUP
IOError: [Errno 2] No such file or directory: '/opt/spark/python/pyspark/shell.py'
```
This is because `pyspark` folder doesn't exist under `/opt/spark/python/`
## What changes were proposed in this pull request?
Added `COPY python/pyspark ${SPARK_HOME}/python/pyspark` to pyspark Dockerfile to resolve issue above.
## How was this patch tested?
Google Kubernetes Engine
Closes#23037 from AzureQ/master.
Authored-by: Qi Shao <qi.shao.nyu@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Adds USER directives to the Dockerfiles, configurable via a build argument (`spark_uid`) for easy customisation. A `-u` flag is added to `bin/docker-image-tool.sh` to make it easy to customise this e.g.
```
> bin/docker-image-tool.sh -r rvesse -t uid -u 185 build
> bin/docker-image-tool.sh -r rvesse -t uid push
```
If no UID is explicitly specified it defaults to `185` - this is per skonto's suggestion to align with the OpenShift standard reserved UID for Java apps (https://lists.openshift.redhat.com/openshift-archives/users/2016-March/msg00283.html)
Notes:
- We have to make the `WORKDIR` writable by the root group or otherwise jobs will fail with `AccessDeniedException`
To Do:
- [x] Debug and resolve issue with client mode test
- [x] Consider whether to always propagate `SPARK_USER_NAME` to environment of driver and executor pods so `entrypoint.sh` can insert that into `/etc/passwd` entry
- [x] Rebase once PR #23013 is merged and update documentation accordingly
Built the Docker images with the new Dockerfiles that include the `USER` directives. Ran the Spark on K8S integration tests against the new images. All pass except client mode which I am currently debugging further.
Also manually dropped myself into the resulting container images via `docker run` and checked `id -u` output to see that UID is as expected.
Tried customising the UID from the default via the new `-u` argument to `docker-image-tool.sh` and again checked the resulting image for the correct runtime UID.
cc felixcheung skonto vanzin
Closes#23017 from rvesse/SPARK-26015.
Authored-by: Rob Vesse <rvesse@dotnetrdf.org>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
The "build context" for a docker image - basically the whole contents of the
current directory where "docker" is invoked - can be huge in a dev build,
easily breaking a couple of gigs.
Doing that copy 3 times during the build of docker images severely slows
down the process.
This patch creates a smaller build context - basically mimicking what the
make-distribution.sh script does, so that when building the docker images,
only the necessary bits are in the current directory. For PySpark and R that
is optimized further, since those images are built based on the previously
built Spark main image.
In my current local clone, the dir size is about 2G, but with this script
the "context" sent to docker is about 250M for the main image, 1M for the
pyspark image and 8M for the R image. That speeds up the image builds
considerably.
I also snuck in a fix to the k8s integration test dependencies in the sbt
build, so that the examples are properly built (without having to do it
manually).
Closes#23019 from vanzin/SPARK-26025.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
Right now there are 3 different classes dealing with building the driver
command to run inside the pod, one for each "binding" supported by Spark.
This has two main shortcomings:
- the code in the 3 classes is very similar; changing things in one place
would probably mean making a similar change in the others.
- it gives the false impression that the step implementation is the only
place where binding-specific logic is needed. That is not true; there
was code in KubernetesConf that was binding-specific, and there's also
code in the executor-specific config step. So the 3 classes weren't really
working as a language-specific abstraction.
On top of that, the current code was propagating command line parameters in
a different way depending on the binding. That doesn't seem necessary, and
in fact using environment variables for command line parameters is in general
a really bad idea, since you can't handle special characters (e.g. spaces)
that way.
This change merges the 3 different code paths for Java, Python and R into
a single step, and also merges the 3 code paths to start the Spark driver
in the k8s entry point script. This increases the amount of shared code,
and also moves more feature logic into the step itself, so it doesn't live
in KubernetesConf.
Note that not all logic related to setting up the driver lives in that
step. For example, the memory overhead calculation still lives separately,
except it now happens in the driver config step instead of outside the
step hierarchy altogether.
Some of the noise in the diff is because of changes to KubernetesConf, which
will be addressed in a separate change.
Tested with new and updated unit tests + integration tests.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#22897 from vanzin/SPARK-25875.
This way the image generated from both environments has the same layout,
with just a difference in contents that should not affect functionality.
Also added some minor error checking to the image script.
Closes#22681 from vanzin/SPARK-25682.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
This is the work on setting up Secure HDFS interaction with Spark-on-K8S.
The architecture is discussed in this community-wide google [doc](https://docs.google.com/document/d/1RBnXD9jMDjGonOdKJ2bA1lN4AAV_1RwpU_ewFuCNWKg)
This initiative can be broken down into 4 Stages
**STAGE 1**
- [x] Detecting `HADOOP_CONF_DIR` environmental variable and using Config Maps to store all Hadoop config files locally, while also setting `HADOOP_CONF_DIR` locally in the driver / executors
**STAGE 2**
- [x] Grabbing `TGT` from `LTC` or using keytabs+principal and creating a `DT` that will be mounted as a secret or using a pre-populated secret
**STAGE 3**
- [x] Driver
**STAGE 4**
- [x] Executor
## How was this patch tested?
Locally tested on a single-node, pseudo-distributed Kerberized Hadoop Cluster
- [x] E2E Integration tests https://github.com/apache/spark/pull/22608
- [ ] Unit tests
## Docs and Error Handling?
- [x] Docs
- [x] Error Handling
## Contribution Credit
kimoonkim skonto
Closes#21669 from ifilonenko/secure-hdfs.
Lead-authored-by: Ilan Filonenko <if56@cornell.edu>
Co-authored-by: Ilan Filonenko <ifilondz@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
The docker file was referencing a path that only existed in the
distribution tarball; it needs to be parameterized so that the
right path can be used in a dev build.
Tested on local dev build.
Closes#22634 from vanzin/SPARK-25646.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
## What changes were proposed in this pull request?
Add spark.executor.pyspark.memory limit for K8S
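A hedged usage sketch (the value is illustrative):
```bash
# Cap the memory available to each executor's Python worker processes on K8s.
spark-submit \
  --master k8s://https://<api-server> \
  --conf spark.executor.pyspark.memory=2g \
  ...
```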
## How was this patch tested?
Unit and Integration tests
Closes#22298 from ifilonenko/SPARK-25021.
Authored-by: Ilan Filonenko <if56@cornell.edu>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>
## What changes were proposed in this pull request?
Add a PAM configuration in k8s dockerfile to require authentication into wheel to run as `su`
## How was this patch tested?
Verify against CI that PAM config succeeds & causes no regressions
Closes#22285 from erikerlandson/spark-25275.
Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Erik Erlandson <eerlands@redhat.com>
## What changes were proposed in this pull request?
Introducing R Bindings for Spark R on K8s
- [x] Running SparkR Job
## How was this patch tested?
This patch was tested with
- [x] Unit Tests
- [x] Integration Tests
## Example:
Commands to run example spark job:
1. `dev/make-distribution.sh --pip --r --tgz -Psparkr -Phadoop-2.7 -Pkubernetes`
2. `bin/docker-image-tool.sh -m -t testing build`
3.
```
bin/spark-submit \
--master k8s://https://192.168.64.33:8443 \
--deploy-mode cluster \
--name spark-r \
--conf spark.executor.instances=1 \
--conf spark.kubernetes.container.image=spark-r:testing \
local:///opt/spark/examples/src/main/r/dataframe.R
```
The above spark-submit command works given the distribution. (Will include this integration test in the PR once PRB is ready.)
Author: Ilan Filonenko <if56@cornell.edu>
Closes#21584 from ifilonenko/spark-r.
## What changes were proposed in this pull request?
Support client mode for the Kubernetes scheduler.
Client mode works more or less identically to cluster mode. However, in client mode, the Spark Context needs to be manually bootstrapped with certain properties which would have otherwise been set up by spark-submit in cluster mode. Specifically:
- If the user doesn't provide a driver pod name, we don't add an owner reference. This is for usage when the driver is not running in a pod in the cluster. In such a case, the driver can only provide a best effort to clean up the executors when the driver exits, but cleaning up the resources is not guaranteed. The executor JVMs should exit if the driver JVM exits, but the pods will still remain in the cluster in a COMPLETED or FAILED state.
- The user must provide a host (spark.driver.host) and port (spark.driver.port) that the executors can connect to. When using spark-submit in cluster mode, spark-submit generates the headless service automatically; in client mode, the user is responsible for setting up their own connectivity.
We also change the authentication configuration prefixes for client mode.
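A hedged sketch of a client-mode submission reflecting the requirements above (host, port, and image values are placeholders):
```bash
# Client mode: the executors must be able to reach the driver at the given
# host/port, e.g. through a headless service the user sets up themselves.
spark-submit \
  --master k8s://https://<api-server> \
  --deploy-mode client \
  --conf spark.driver.host=<driver-service-or-pod-ip> \
  --conf spark.driver.port=7078 \
  --conf spark.kubernetes.container.image=<image> \
  ...
```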
## How was this patch tested?
Adding an integration test to exercise client mode support.
Author: mcheah <mcheah@palantir.com>
Closes#21748 from mccheah/k8s-client-mode.
## What changes were proposed in this pull request?
Remove code that is misleading and is a leftover from a previous implementation.
## How was this patch tested?
Manually.
Author: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com>
Closes#21462 from skonto/fix-k8s-docs.
## What changes were proposed in this pull request?
This PR changes the entrypoint.sh to provide an option to run commands other than the spark-on-k8s ones (init, driver, executor), in order to let the user keep the normal workflow without hacking the image to bypass the entrypoint.
## How was this patch tested?
This patch was built manually in my local machine and I ran some tests with a combination of ```docker run``` commands.
Author: rimolive <ricardo.martinelli.oliveira@gmail.com>
Closes#21572 from rimolive/rimolive-spark-24534.
## What changes were proposed in this pull request?
Introducing Python Bindings for PySpark.
- [x] Running PySpark Jobs
- [x] Increased Default Memory Overhead value
- [ ] Dependency Management for virtualenv/conda
## How was this patch tested?
This patch was tested with
- [x] Unit Tests
- [x] Integration tests with [this addition](https://github.com/apache-spark-on-k8s/spark-integration/pull/46)
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- Run SparkPi with a test secret mounted into the driver and executor pods
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
Run completed in 4 minutes, 28 seconds.
Total number of tests run: 11
Suites: completed 2, aborted 0
Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Author: Ilan Filonenko <if56@cornell.edu>
Author: Ilan Filonenko <ifilondz@gmail.com>
Closes#21092 from ifilonenko/master.
## What changes were proposed in this pull request?
Removal of the init-container for downloading remote dependencies. Built off of the work done by vanzin in an attempt to refactor driver/executor configuration elaborated in [this](https://issues.apache.org/jira/browse/SPARK-22839) ticket.
## How was this patch tested?
This patch was tested with unit and integration tests.
Author: Ilan Filonenko <if56@cornell.edu>
Closes#20669 from ifilonenko/remove-init-container.
## What changes were proposed in this pull request?
As described in SPARK-23680, entrypoint.sh returns an error code because of a command pipeline execution that is expected to fail in OpenShift environments, where arbitrary UIDs are used to run containers.
## How was this patch tested?
This patch was manually tested by using the docker-image-tool.sh script to generate a Spark driver image and running an example against an OpenShift cluster.
Author: Ricardo Martinelli de Oliveira <rmartine@rmartine.gru.redhat.com>
Closes#20822 from rimolive/rmartine-spark-23680.
For some JVM options, like `-XX:+UnlockExperimentalVMOptions` ordering is necessary.
## What changes were proposed in this pull request?
Keep original `extraJavaOptions` ordering, when passing them through environment variables inside the Docker container.
## How was this patch tested?
Ran the base branch a couple of times and checked the startup command in the logs; ordering differed every time. Added sorting; ordering was consistent with what the user had in `extraJavaOptions`.
Author: Andrew Korzhuev <korzhuev@andrusha.me>
Closes#20628 from andrusha/patch-2.
Pass through spark java options to the executor in context of docker image.
Closes#20296
andrusha: Deployed two versions of containers to a local k8s, checked that java options were present in the updated image on the running executor.
Manual test
Author: Andrew Korzhuev <korzhuev@andrusha.me>
Closes#20322 from foxish/patch-1.
This change allows a user to submit a Spark application on kubernetes
by providing a single image, instead of one image for each type
of container. The image's entry point now takes an extra argument that
identifies the process that is being started.
The configuration still allows the user to provide different images
for each container type if they so desire.
On top of that, the entry point was simplified a bit to share more
code; mainly, the same env variable is used to propagate the user-defined
classpath to the different containers.
Aside from being modified to match the new behavior, the
'build-push-docker-images.sh' script was renamed to 'docker-image-tool.sh'
to more closely match its purpose; the old name was a little awkward
and now also not entirely correct, since there is a single image. It
was also moved to 'bin' since it's not necessarily an admin tool.
Docs have been updated to match the new behavior.
Tested locally with minikube.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#20192 from vanzin/SPARK-22994.