ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
shane knapp	13fd32c9a9	[SPARK-28701][TEST-HADOOP3.2][TEST-JAVA11][K8S] adding java11 support for pull request builds ## What changes were proposed in this pull request? we need to add the ability to test PRBs against java11. see comments here: https://github.com/apache/spark/pull/25405 ## How was this patch tested? the build system will test this. Closes #25423 from shaneknapp/spark-prb-java11. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-27 00:48:01 +09:00
Anton Kirillov	f17f1d01e2	[SPARK-28778][MESOS] Fixed executors advertised address when running in virtual network ### What changes were proposed in this pull request? Resolves [SPARK-28778: Shuffle jobs fail due to incorrect advertised address when running in a virtual network on Mesos](https://issues.apache.org/jira/browse/SPARK-28778). This patch fixes a bug which occurs when shuffle jobs are launched by Mesos in a virtual network. Mesos scheduler sets executor `--hostname` parameter to `0.0.0.0` in the case when `spark.mesos.network.name` is provided. This makes executors use `0.0.0.0` as their advertised address and, in the presence of shuffle, executors fail to fetch shuffle blocks from each other using `0.0.0.0` as the origin. When a virtual network is used the hostname or IP address is not known upfront and assigned to a container at its start time so the executor process needs to advertise the correct dynamically assigned address to be reachable by other executors. Changes: - added a fallback to `Utils.localHostName()` in Spark Executors when `--hostname` is not provided - removed setting executor address to `0.0.0.0` from Mesos scheduler - refactored the code related to building executor command in Mesos scheduler - added network configuration support to Docker containerizer - added unit tests ### Why are the changes needed? The bug described above prevents Mesos users from running any jobs which involve shuffle due to the inability of executors to fetch shuffle blocks because of incorrect advertised address when virtual network is used. ### Does this PR introduce any user-facing change? No ### How was this patch tested? - added unit test to `MesosCoarseGrainedSchedulerBackendSuite` which verifies the absence of `--hostname` parameter when `spark.mesos.network.name` is provided and its presence otherwise - added unit test to `MesosSchedulerBackendUtilSuite` which verifies that `MesosSchedulerBackendUtil.buildContainerInfo` sets network-related properties for Docker containerizer - unit tests from this repo launched with profiles: `./build/mvn test -Pmesos -Pnetlib-lgpl -Psparkr -Phive -Phive-thriftserver`, build log attached: [mvn.test.log](https://github.com/apache/spark/files/3516891/mvn.test.log) - integration tests from [DCOS Spark repo](https://github.com/mesosphere/spark-build), more specifically - [test_spark_cni.py](https://github.com/mesosphere/spark-build/blob/master/tests/test_spark_cni.py) which runs a specific [shuffle job](https://github.com/mesosphere/spark-build/blob/master/tests/jobs/scala/src/main/scala/ShuffleApp.scala) and verifies its successful completion, Mesos task network configuration, and IP addresses for both Mesos and Docker containerizers Closes #25500 from akirillov/DCOS-45840-fix-advertised-ip-in-virtual-networks. Authored-by: Anton Kirillov <akirillov@mesosophere.io> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-23 18:30:05 -07:00
Marcelo Vanzin	5f6eb5d20d	[SPARK-28634][YARN] Ignore kerberos login config in client mode AM This change makes the client mode AM ignore any login configuration, which is now always handled by the driver. The previous code tried to achieve that by modifying the configuration visible to the AM, but that missed the case where old configuration names were being used. Tested in real cluster with reproduction provided in the bug. Closes #25467 from vanzin/SPARK-28634. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-08-19 11:06:02 -07:00
Dongjoon Hyun	f7c9de9035	[SPARK-28765][BUILD] Add explict exclusions to avoid JDK11 dependency issue ### What changes were proposed in this pull request? This PR adds explicit exclusions to avoid Maven `JDK11` dependency issues. ### Why are the changes needed? Maven/Ivy seems to be confused during dependency generation on `JDK11` environment. This is not only wrong, but also causes a Jenkins failure during dependency manifest check on `JDK11` environment. JDK8 ``` $ cd core $ mvn -X dependency:tree -Dincludes=jakarta.activation:jakarta.activation-api ... [DEBUG] org.glassfish.jersey.core:jersey-server:jar:2.29:compile (version managed from 2.22.2) [DEBUG] org.glassfish.jersey.media:jersey-media-jaxb:jar:2.29:compile [DEBUG] javax.validation:validation-api:jar:2.0.1.Final:compile ``` JDK11 ``` [DEBUG] org.glassfish.jersey.core:jersey-server:jar:2.29:compile (version managed from 2.22.2) [DEBUG] org.glassfish.jersey.media:jersey-media-jaxb:jar:2.29:compile [DEBUG] javax.validation:validation-api:jar:2.0.1.Final:compile [DEBUG] jakarta.xml.bind:jakarta.xml.bind-api🫙2.3.2:compile [DEBUG] jakarta.activation:jakarta.activation-api🫙1.2.1:compile ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Do the following in both `JDK8` and `JDK11` environment. The dependency manifest should not be changed. In the current `master` branch, `JDK11` changes the dependency manifest. ``` $ dev/test-dependencies.sh --replace-manifest ``` Closes #25481 from dongjoon-hyun/SPARK-28765. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-17 10:16:22 -07:00
Marcelo Vanzin	0343854f54	[SPARK-28487][K8S] More responsive dynamic allocation with K8S This change implements a few changes to the k8s pod allocator so that it behaves a little better when dynamic allocation is on. (i) Allow the application to ramp up immediately when there's a change in the target number of executors. Without this change, scaling would only trigger when a change happened in the state of the cluster, e.g. an executor going down, or when the periodical snapshot was taken (default every 30s). (ii) Get rid of pending pod requests, both acknowledged (i.e. Spark knows that a pod is pending resource allocation) and unacknowledged (i.e. Spark has requested the pod but the API server hasn't created it yet), when they're not needed anymore. This avoids starting those executors to just remove them after the idle timeout, wasting resources in the meantime. (iii) Re-work some of the code to avoid unnecessary logging. While not bad without dynamic allocation, the existing logging was very chatty when dynamic allocation was on. With the changes, all the useful information is still there, but only when interesting changes happen. (iv) Gracefully shut down executors when they become idle. Just deleting the pod causes a lot of ugly logs to show up, so it's better to ask pods to exit nicely. That also allows Spark to respect the "don't delete pods" option when dynamic allocation is on. Tested on a small k8s cluster running different TPC-DS workloads. Closes #25236 from vanzin/SPARK-28487. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-08-13 17:29:54 -07:00
Liang-Chi Hsieh	37eedf6149	[SPARK-28652][TESTS][K8S] Add python version check for executor ## What changes were proposed in this pull request? Current two PySpark version tests in PythonTestsSuite, just test against Python version at driver side. Because the test script doesn't run any spark job requiring python worker, it doesn't actually do version check at worker side. This patch adds pieces of code to the test script, to run a simple job to verify Python version. ## How was this patch tested? Unit test. Locally manual test. Closes #25411 from viirya/SPARK-28652. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-12 14:43:32 +09:00
Jungtaek Lim (HeartSaVioR)	128ea37bda	[SPARK-28601][CORE][SQL] Use StandardCharsets.UTF_8 instead of "UTF-8" string representation, and get rid of UnsupportedEncodingException ## What changes were proposed in this pull request? This patch tries to keep consistency whenever UTF-8 charset is needed, as using `StandardCharsets.UTF_8` instead of using "UTF-8". If the String type is needed, `StandardCharsets.UTF_8.name()` is used. This change also brings the benefit of getting rid of `UnsupportedEncodingException`, as we're providing `Charset` instead of `String` whenever possible. This also changes some private Catalyst helper methods to operate on encodings as `Charset` objects rather than strings. ## How was this patch tested? Existing unit tests. Closes #25335 from HeartSaVioR/SPARK-28601. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-05 20:45:54 -07:00
Yuanjian Li	db39f45baf	[SPARK-28593][CORE] Rename ShuffleClient to BlockStoreClient which more close to its usage ## What changes were proposed in this pull request? After SPARK-27677, the shuffle client not only handles the shuffle block but also responsible for local persist RDD blocks. For better code scalability and precise semantics(as the [discussion](https://github.com/apache/spark/pull/24892#discussion_r300173331)), here we did several changes: - Rename ShuffleClient to BlockStoreClient. - Correspondingly rename the ExternalShuffleClient to ExternalBlockStoreClient, also change the server-side class from ExternalShuffleBlockHandler to ExternalBlockHandler. - Move MesosExternalBlockStoreClient to Mesos package. Note, we still keep the name of BlockTransferService, because the `Service` contains both client and server, also the name of BlockTransferService is not referencing shuffle client only. ## How was this patch tested? Existing UT. Closes #25327 from xuanyuanking/SPARK-28593. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-05 14:54:45 +08:00
HyukjinKwon	946aef0535	[SPARK-28550][K8S][TESTS] Unset SPARK_HOME environment variable in K8S integration preparation ## What changes were proposed in this pull request? Currently, if we run the Kubernetes integration tests with `SPARK_HOME` already set, it refers the `SPARK_HOME` even when `--spark-tgz` is specified. This PR proposes to unset `SPARK_HOME` to let the docker-image-tool script detect `SPARK_HOME`. Otherwise, it cannot indicate the unpacked directory as its home. ## How was this patch tested? ```bash export SPARK_HOME=`pwd` dev/make-distribution.sh --pip --tgz -Phadoop-2.7 -Pkubernetes resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --deploy-mode docker-for-desktop --spark-tgz $PWD/spark-.tgz ``` Before:* ``` + /.../spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh -r docker.io/kubespark -t 650B51C8-BBED-47C9-AEAB-E66FC9A0E64E -p /.../spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/kubernetes/dockerfiles/spark/bindings/python/Dockerfile build cp: resource-managers/kubernetes/docker/src/main/dockerfiles: No such file or directory cp: assembly/target/scala-2.12/jars: No such file or directory cp: resource-managers/kubernetes/integration-tests/tests: No such file or directory cp: examples/target/scala-2.12/jars/: No such file or directory cp: resource-managers/kubernetes/docker/src/main/dockerfiles: No such file or directory cp: resource-managers/kubernetes/docker/src/main/dockerfiles: No such file or directory Cannot find docker image. This script must be run from a runnable distribution of Apache Spark. ... [INFO] Spark Project Kubernetes Integration Tests ......... FAILURE [ 4.870 s] [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE ``` After:* ``` + /.../spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/bin/docker-image-tool.sh -r docker.io/kubespark -t 2BA5883A-A0AC-4D2B-8D00-702D31B59B23 -p /.../spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked/kubernetes/dockerfiles/spark/bindings/python/Dockerfile build Sending build context to Docker daemon 250.2MB Step 1/15 : FROM openjdk:8-alpine ---> a3562aa0b991 ... Successfully built 8614fb5ac279 Successfully tagged kubespark/spark:2BA5883A-A0AC-4D2B-8D00-702D31B59B23 ``` Closes #25283 from HyukjinKwon/SPARK-28550. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-29 10:47:28 -07:00
Junjie Chen	780d176136	[SPARK-28042][K8S] Support using volume mount as local storage ## What changes were proposed in this pull request? This pr is used to support using hostpath/PV volume mounts as local storage. In KubernetesExecutorBuilder.scala, the LocalDrisFeatureStep is built before MountVolumesFeatureStep which means we cannot use any volumes mount later. This pr adjust the order of feature building steps which moves localDirsFeature at last so that we can check if directories in SPARK_LOCAL_DIRS are set to volumes mounted such as hostPath, PV, or others. ## How was this patch tested? Unit tests Closes #24879 from chenjunjiedada/SPARK-28042. Lead-authored-by: Junjie Chen <jimmyjchen@tencent.com> Co-authored-by: Junjie Chen <cjjnjust@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-29 10:44:17 -07:00
Dongjoon Hyun	767802500c	[SPARK-28534][K8S][TEST] Update node affinity for DockerForDesktop backend in PVTestsSuite ## What changes were proposed in this pull request? This PR aims to recover our K8s integration test suite by extending node affinity in order to pass `PVTestsSuite` in `DockerForDesktop` environment, too. Previously, `PVTestsSuite` fails at `--deploy-mode docker-for-desktop` option because the node affinity requires `minibase` node. For `Docker Desktop`, there are two node names like the following. Note that Spark testing needs K8s v1.13 and above. So, this PR should be verified with `Docker Desktop (Edge)` version. This PR adds both because next stable `Docker Desktop` will have K8s v1.14.3. Docker Desktop (Stable, K8s v1.10.11) ``` $ kubectl get node NAME STATUS ROLES AGE VERSION docker-for-desktop Ready master 52s v1.10.11 ``` Docker Desktop 2.1.0.0 (Edge, K8s v1.14.3, Released 2019-07-26) ``` $ kubectl get node NAME STATUS ROLES AGE VERSION docker-desktop Ready master 16h v1.14.3 ``` ## How was this patch tested? Pass the Jenkins K8s integration test (`minibase`) and install `Docker Desktop 2.1.0.0 (Edge)` and run the integration test in `DockerForDesktop`. Note that this fixes only `PVTestsSuite`. ``` $ dev/make-distribution.sh --pip --tgz -Phadoop-2.7 -Pkubernetes $ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --deploy-mode docker-for-desktop --spark-tgz $PWD/spark-*.tgz ... KubernetesSuite: ... - PVs with local storage ... ``` Closes #25269 from dongjoon-hyun/SPARK-28534. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-07-29 16:28:56 +09:00
Stavros Kontopoulos	7504eab42f	[SPARK-28465][K8S] Fix integration tests which fail due to missing ceph-nano image ## What changes were proposed in this pull request? Fixes the tests. Follows instructions here: https://github.com/ceph/cn/issues/115#issuecomment-497384369 ## How was this patch tested? Manually by running the tests with minikube. Closes #25222 from skonto/fix-ceph. Authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-24 14:58:20 -07:00
Douglas R Colkitt	8fc5cb6285	[SPARK-28473][DOC] Stylistic consistency of build command in README ## What changes were proposed in this pull request? Change the format of the build command in the README to start with a `./` prefix ./build/mvn -DskipTests clean package This increases stylistic consistency across the README- all the other commands have a `./` prefix. Having a visible `./` prefix also makes it clear to the user that the shell command requires the current working directory to be at the repository root. ## How was this patch tested? README.md was reviewed both in raw markdown and in the Github rendered landing page for stylistic consistency. Closes #25231 from Mister-Meeseeks/master. Lead-authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com> Co-authored-by: Mister-Meeseeks <douglas.colkitt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-07-23 16:29:46 -07:00
Thomas Graves	43d68cd4ff	[SPARK-27959][YARN] Change YARN resource configs to use .amount ## What changes were proposed in this pull request? we are adding in generic resource support into spark where we have suffix for the amount of the resource so that we could support other configs. Spark on yarn already had added configs to request resources via the configs spark.yarn.{executor/driver/am}.resource=<some amount>, where the <some amount> is value and unit together. We should change those configs to have a `.amount` suffix on them to match the spark configs and to allow future configs to be more easily added. YARN itself already supports tags and attributes so if we want the user to be able to pass those from spark at some point having a suffix makes sense. it would allow for a spark.yarn.{executor/driver/am}.resource.{resource}.tag= type config. ## How was this patch tested? Tested via unit tests and manually on a yarn 3.x cluster with GPU resources configured on. Closes #24989 from tgravescs/SPARK-27959-yarn-resourceconfigs. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-16 10:56:07 -07:00
Gabor Somogyi	f83000597f	[SPARK-23472][CORE] Add defaultJavaOptions for driver and executor. ## What changes were proposed in this pull request? This PR adds two new config properties: `spark.driver.defaultJavaOptions` and `spark.executor.defaultJavaOptions`. These are intended to be set by administrators in a file of defaults for options like JVM garbage collection algorithm. Users will still set `extraJavaOptions` properties, and both sets of JVM options will be added to start a JVM (default options are prepended to extra options). ## How was this patch tested? Existing + additional unit tests. ``` cd docs/ SKIP_API=1 jekyll build ``` Manual webpage check. Closes #24804 from gaborgsomogyi/SPARK-23472. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-07-11 09:37:26 -07:00
Onur Satici	e7c97a3d86	[SPARK-28145][K8S] safe runnable in polling executor source ## What changes were proposed in this pull request? Add error handling to `ExecutorPodsPollingSnapshotSource` Closes #24952 from onursatici/os/polling-source. Authored-by: Onur Satici <onursatici@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-28 09:38:43 -05:00
Gabor Somogyi	8313015e8d	[SPARK-28005][YARN] Remove unnecessary log from SparkRackResolver ## What changes were proposed in this pull request? SparkRackResolver generates an INFO message every time is called with 0 arguments. In this PR I've deleted it because it's too verbose. ## How was this patch tested? Existing unit tests + spark-shell. Closes #24935 from gaborgsomogyi/SPARK-28005. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-06-26 09:50:54 -05:00
Xiangrui Meng	7056e004ee	[SPARK-27823][CORE] Refactor resource handling code ## What changes were proposed in this pull request? Continue the work from https://github.com/apache/spark/pull/24821. Refactor resource handling code to make the code more readable. Major changes: * Moved resource-related classes to `spark.resource` from `spark`. * Added ResourceUtils and helper classes so we don't need to directly deal with Spark conf. * ResourceID: resource identifier and it provides conf keys * ResourceRequest/Allocation: abstraction for requested and allocated resources * Added `TestResourceIDs` to reference commonly used resource IDs in tests like `spark.executor.resource.gpu`. cc: tgravescs jiangxb1987 Ngone51 ## How was this patch tested? Unit tests for added utils and existing unit tests. Closes #24856 from mengxr/SPARK-27823. Lead-authored-by: Xiangrui Meng <meng@databricks.com> Co-authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>	2019-06-18 17:18:17 -07:00
Stavros Kontopoulos	7912ab85a6	[SPARK-27872][K8S] Fix executor service account inconsistency ## What changes were proposed in this pull request? Fixes the service account inconsistency that breaks pull secrets. It gives the option to the user to setup a specific service account for the executors if he has to (via `spark.kubernetes.authenticate.executor.serviceAccountName`). Defaults to the driver's one. We are not supporting special authentication credentials for the executors with this PR. ## How was this patch tested? Tested manually by launching a Spark job exercising the introduced settings. Added a new integration tests for this fix. Closes #24748 from skonto/fix_executor_sa. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-06-09 16:28:37 -05:00
Thomas Graves	d30284b5a5	[SPARK-27760][CORE] Spark resources - change user resource config from .count to .amount ## What changes were proposed in this pull request? Change the resource config spark.{executor/driver}.resource.{resourceName}.count to .amount to allow future usage of containing both a count and a unit. Right now we only support counts - # of gpus for instance, but in the future we may want to support units for things like memory - 25G. I think making the user only have to specify a single config .amount is better then making them specify 2 separate configs of a .count and then a .unit. Change it now since its a user facing config. Amount also matches how the spark on yarn configs are setup. ## How was this patch tested? Unit tests and manually verified on yarn and local cluster mode Closes #24810 from tgravescs/SPARK-27760-amount. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-06-06 14:16:05 -05:00
Xingbo Jiang	ac808e2a02	[SPARK-27366][CORE] Support GPU Resources in Spark job scheduling ## What changes were proposed in this pull request? This PR adds support to schedule tasks with extra resource requirements (eg. GPUs) on executors with available resources. It also introduce a new method `TaskContext.resources()` so tasks can access available resource addresses allocated to them. ## How was this patch tested? * Added new end-to-end test cases in `SparkContextSuite`; * Added new test case in `CoarseGrainedSchedulerBackendSuite`; * Added new test case in `CoarseGrainedExecutorBackendSuite`; * Added new test case in `TaskSchedulerImplSuite`; * Added new test case in `TaskSetManagerSuite`; * Updated existing tests. Closes #24374 from jiangxb1987/gpu. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Xiangrui Meng <meng@databricks.com>	2019-06-04 16:57:47 -07:00
HyukjinKwon	8b18ef5c7b	[MINOR] Avoid hardcoded py4j-0.10.8.1-src.zip in Scala ## What changes were proposed in this pull request? This PR targets to deduplicate hardcoded `py4j-0.10.8.1-src.zip` in order to make py4j upgrade easier. ## How was this patch tested? N/A Closes #24770 from HyukjinKwon/minor-py4j-dedup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-06-02 21:23:17 -07:00
Thomas Graves	1277f8fa92	[SPARK-27362][K8S] Resource Scheduling support for k8s ## What changes were proposed in this pull request? Add ability to map the spark resource configs spark.{executor/driver}.resource.{resourceName} to kubernetes Container builder so that we request resources (gpu,s/fpgas/etc) from kubernetes. Note that the spark configs will overwrite any resource configs users put into a pod template. I added a generic vendor config which is only used by kubernetes right now. I intentionally didn't put it into the kubernetes config namespace just to avoid adding more config prefixes. I will add more documentation for this under jira SPARK-27492. I think it will be easier to do all at once to get cohesive story. ## How was this patch tested? Unit tests and manually testing on k8s cluster. Closes #24703 from tgravescs/SPARK-27362. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-05-31 15:26:14 -05:00
Steven Rand	568512cc82	[SPARK-27773][SHUFFLE] add metrics for number of exceptions caught in ExternalShuffleBlockHandler ## What changes were proposed in this pull request? Add a metric for number of exceptions caught in the `ExternalShuffleBlockHandler`, the idea being that spikes in this metric over some time window (or more desirably, the lack thereof) can be used as an indicator of the health of an external shuffle service. (Where "health" refers to its ability to successfully respond to client requests.) ## How was this patch tested? Deployed a build of this PR to a YARN cluster, and confirmed that the NodeManagers' JMX metrics include `numCaughtExceptions`. Closes #24645 from sjrand/SPARK-27773. Authored-by: Steven Rand <srand@palantir.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-05-30 13:57:15 -07:00
Thomas Graves	0ced4c0b13	[SPARK-27378][YARN] YARN support for GPU-aware scheduling ## What changes were proposed in this pull request? Add yarn support for GPU-aware scheduling. Since SPARK-20327 already added yarn custom resource support, this jira is really just making sure the spark resource configs get mapped into the yarn resource configs and user doesn't specify both yarn and spark config for the known types of resources (gpu and fpga are the known types on yarn). You can find more details on the design and requirements documented: https://issues.apache.org/jira/browse/SPARK-27376 Note that the running on yarn docs already state to use it, it must be yarn 3.0+. We will add any further documentation under SPARK-20327 ## How was this patch tested? Unit tests and manually testing on yarn cluster Closes #24634 from tgravescs/SPARK-27361. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-05-30 13:23:46 -07:00
Yuming Wang	db3e746b64	[SPARK-27875][CORE][SQL][ML][K8S] Wrap all PrintWriter with Utils.tryWithResource ## What changes were proposed in this pull request? This pr wrap all `PrintWriter` with `Utils.tryWithResource` to prevent resource leak. ## How was this patch tested? Existing test Closes #24739 from wangyum/SPARK-27875. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-30 19:54:32 +09:00
Stavros Kontopoulos	5e74570c8f	[SPARK-23153][K8S] Support client dependencies with a Hadoop Compatible File System ## What changes were proposed in this pull request? - solves the current issue with --packages in cluster mode (there is no ticket for it). Also note of some [issues](https://issues.apache.org/jira/browse/SPARK-22657) of the past here when hadoop libs are used at the spark submit side. - supports spark.jars, spark.files, app jar. It works as follows: Spark submit uploads the deps to the HCFS. Then the driver serves the deps via the Spark file server. No hcfs uris are propagated. The related design document is [here](https://docs.google.com/document/d/1peg_qVhLaAl4weo5C51jQicPwLclApBsdR1To2fgc48/edit). the next option to add is the RSS but has to be improved given the discussion in the past about it (Spark 2.3). ## How was this patch tested? - Run integration test suite. - Run an example using S3: ``` ./bin/spark-submit \ ... --packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6 \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.memory=1G \ --conf spark.kubernetes.namespace=spark \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-sa \ --conf spark.driver.memory=1G \ --conf spark.executor.instances=2 \ --conf spark.sql.streaming.metricsEnabled=true \ --conf "spark.driver.extraJavaOptions=-Divy.cache.dir=/tmp -Divy.home=/tmp" \ --conf spark.kubernetes.container.image.pullPolicy=Always \ --conf spark.kubernetes.container.image=skonto/spark:k8s-3.0.0 \ --conf spark.kubernetes.file.upload.path=s3a://fdp-stavros-test \ --conf spark.hadoop.fs.s3a.access.key=... \ --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \ --conf spark.hadoop.fs.s3a.fast.upload=true \ --conf spark.kubernetes.executor.deleteOnTermination=false \ --conf spark.hadoop.fs.s3a.secret.key=... \ --conf spark.files=client:///...resolv.conf \ file:///my.jar ** ``` Added integration tests based on [Ceph nano](https://github.com/ceph/cn). Looks very [active](http://www.sebastien-han.fr/blog/2019/02/24/Ceph-nano-is-getting-better-and-better/). Unfortunately minio needs hadoop >= 2.8. Closes #23546 from skonto/support-client-deps. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Erik Erlandson <eerlands@redhat.com>	2019-05-22 16:15:42 -07:00
wenxuanguan	e7443d6412	[SPARK-27774][CORE][MLLIB] Avoid hardcoded configs ## What changes were proposed in this pull request? avoid hardcoded configs in `SparkConf` and `SparkSubmit` and test ## How was this patch tested? N/A Closes #24631 from wenxuanguan/minor-fix. Authored-by: wenxuanguan <choose_home@126.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-05-22 10:45:11 +09:00
Arun Mahadevan	1a8c09334d	[SPARK-27754][K8S] Introduce additional config (spark.kubernetes.driver.request.cores) for driver request cores for spark on k8s ## What changes were proposed in this pull request? Spark on k8s supports config for specifying the executor cpu requests (spark.kubernetes.executor.request.cores) but a similar config is missing for the driver. Instead, currently `spark.driver.cores` value is used for integer value. Although `pod spec` can have `cpu` for the fine-grained control like the following, this PR proposes additional configuration `spark.kubernetes.driver.request.cores` for driver request cores. ``` resources: requests: memory: "64Mi" cpu: "250m" ``` ## How was this patch tested? Unit tests Closes #24630 from arunmahadevan/SPARK-27754. Authored-by: Arun Mahadevan <arunm@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-18 21:28:46 -07:00
Sean Owen	bfb3ffe9b3	[SPARK-27682][CORE][GRAPHX][MLLIB] Replace use of collections and methods that will be removed in Scala 2.13 with work-alikes ## What changes were proposed in this pull request? This replaces use of collection classes like `MutableList` and `ArrayStack` with workalikes that are available in 2.12, as they will be removed in 2.13. It also removes use of `.to[Collection]` as its uses was superfluous anyway. Removing `collection.breakOut` will have to wait until 2.13 ## How was this patch tested? Existing tests Closes #24586 from srowen/SPARK-27682. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-05-15 09:29:12 -05:00
Thomas Graves	db2e3c4341	[SPARK-27024] Executor interface for cluster managers to support GPU and other resources ## What changes were proposed in this pull request? Add in GPU and generic resource type allocation to the executors. Note this is part of a bigger feature for gpu-aware scheduling and is just how the executor find the resources. The general flow : - users ask for a certain set of resources, for instance number of gpus - each cluster manager has a specific way to do this. - cluster manager allocates a container or set of resources (standalone mode) - When spark launches the executor in that container, the executor either has to be told what resources it has or it has to auto discover them. - Executor has to register with Driver and tell the driver the set of resources it has so the scheduler can use that to schedule tasks that requires a certain amount of each of those resources In this pr I added configs and arguments to the executor to be able discover resources. The argument to the executor is intended to be used by standalone mode or other cluster managers that don't have isolation so that it can assign specific resources to specific executors in case there are multiple executors on a node. The argument is a file contains JSON Array of ResourceInformation objects. The discovery script is meant to be used in an isolated environment where the executor only sees the resources it should use. Note that there will be follow on PRs to add other parts like the scheduler part. See the epic high level jira: https://issues.apache.org/jira/browse/SPARK-24615 ## How was this patch tested? Added unit tests and manually tested. Please review http://spark.apache.org/contributing.html before opening a pull request. Closes #24406 from tgravescs/gpu-sched-executor-clean. Authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2019-05-14 08:41:41 -05:00
Sam Tran	bcd3b61c4b	[SPARK-27347][MESOS] Fix supervised driver retry logic for outdated tasks ## What changes were proposed in this pull request? This patch fixes a bug where `--supervised` Spark jobs would retry multiple times whenever an agent would crash, come back, and re-register even when those jobs had already relaunched on a different agent. That is: ``` - supervised driver is running on agent1 - agent1 crashes - driver is relaunched on another agent as `<task-id>-retry-1` - agent1 comes back online and re-registers with scheduler - spark relaunches the same job as `<task-id>-retry-2` - now there are two jobs running simultaneously ``` This is because when an agent would come back and re-register it would send a status update `TASK_FAILED` for its old driver-task. Previous logic would indiscriminately remove the `submissionId` from Zookeeper's `launchedDrivers` node and add it to `retryList` node. Then, when a new offer came in, it would relaunch another `-retry-` task even though one was previously running. For example logs, scroll to bottom ## How was this patch tested? - Added a unit test to simulate behavior described above - Tested manually on a DC/OS cluster by ``` - launching a --supervised spark job - dcos node ssh <to the agent with the running spark-driver> - systemctl stop dcos-mesos-slave - docker kill <driver-container-id> - [ wait until spark job is relaunched ] - systemctl start dcos-mesos-slave - [ observe spark driver is not relaunched as `-retry-2` ] ``` Log snippets included below. Notice the `-retry-1` task is running when status update for the old task comes in afterward: ``` 19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos: ... [offers] ... 19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001" ... 19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message='' 19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message='' ... 19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED ... 19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1" ... 19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message='' 19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message='' ... 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered' ... 19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED 19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update ... 19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2" ... 19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message='' 19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message='' ``` Closes #24276 from samvantran/SPARK-27347-duplicate-retries. Authored-by: Sam Tran <stran@mesosphere.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-05-10 10:53:31 -07:00
jiafu.zhang@intel.com	fa5dc0a45a	[SPARK-26632][CORE] Separate Thread Configurations of Driver and Executor ## What changes were proposed in this pull request? For the below three thread configuration items applied to both driver and executor, spark.rpc.io.serverThreads spark.rpc.io.clientThreads spark.rpc.netty.dispatcher.numThreads, we separate them to driver specifics and executor specifics. spark.driver.rpc.io.serverThreads < - > spark.executor.rpc.io.serverThreads spark.driver.rpc.io.clientThreads < - > spark.executor.rpc.io.clientThreads spark.driver.rpc.netty.dispatcher.numThreads < - > spark.executor.rpc.netty.dispatcher.numThreads Spark reads these specifics first and fall back to the common configurations. ## How was this patch tested? We ran the SimpleMap app without shuffle for benchmark purpose to test Spark's scalability in HPC with omini-path NIC which has higher bandwidth than normal ethernet NIC. Spark's base version is 2.4.0. Spark ran in the Standalone mode. Driver was in a standalone node. After the separation, the performance is improved a lot in 256 nodes and 512 nodes. see below test results of SimpleMapTask before and after the enhancement. You can view the tables in the [JIRA](https://issues.apache.org/jira/browse/SPARK-26632) too. ds: spark.driver.rpc.io.serverThreads dc: spark.driver.rpc.io.clientThreads dd: spark.driver.rpc.netty.dispatcher.numThreads ed: spark.executor.rpc.netty.dispatcher.numThreads time: Overall Time (s) old time: Overall Time without Separation (s) Before: nodes \| ds \| dc \| dd \| ed \| time -- \|-- \| -- \| -- \| -- \| -- 128 nodes \| 8 \| 8 \| 8 \| 8 \| 108 256 nodes \| 8 \| 8 \| 8 \| 8 \| 196 512 nodes \| 8 \| 8 \| 8 \| 8 \| 377 After: nodes \| ds \| dc \| dd \| ed \| time \| improvement -- \| -- \| -- \| -- \| -- \| -- \| -- 128 nodes \| 15 \| 15 \| 10 \| 30 \| 107 \| 0.9% 256 nodes \| 12 \| 15 \| 10 \| 30 \| 159 \| 18.8% 512 nodes \| 12 \| 15 \| 10 \| 30 \| 283 \| 24.9% Closes #23560 from zjf2012/thread_conf_separation. Authored-by: jiafu.zhang@intel.com <jiafu.zhang@intel.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-05-10 10:42:43 -07:00
Adi Muraru	8ef4da753d	[SPARK-27610][YARN] Shade netty native libraries ## What changes were proposed in this pull request? Fixed the `spark-<version>-yarn-shuffle.jar` artifact packaging to shade the native netty libraries: - shade the `META-INF/native/libnetty_*` native libraries when packagin the yarn shuffle service jar. This is required as netty library loader derives that based on shaded package name. - updated the `org/spark_project` shade package prefix to `org/sparkproject` (i.e. removed underscore) as the former breaks the netty native lib loading. This was causing the yarn external shuffle service to fail when spark.shuffle.io.mode=EPOLL ## How was this patch tested? Manual tests Closes #24502 from amuraru/SPARK-27610_master. Authored-by: Adi Muraru <amuraru@adobe.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-05-07 10:47:36 -07:00
Sean Owen	a6716d3f03	[SPARK-27571][CORE][YARN][EXAMPLES] Avoid scala.language.reflectiveCalls ## What changes were proposed in this pull request? This PR avoids usage of reflective calls in Scala. It removes the import that suppresses the warnings and rewrites code in small ways to avoid accessing methods that aren't technically accessible. ## How was this patch tested? Existing tests. Closes #24463 from srowen/SPARK-27571. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-29 11:16:45 -05:00
Rob Vesse	b1c6b60ce7	[SPARK-26729][K8S] Fix typo with default value for R image name ## What changes were proposed in this pull request? As discovered by users making use of this feature, there is a bug in the declaration of the `R_IMAGE_NAME` variable that causes the default name to not be properly set to `spark-r` but rather to just `-r` ## How was this patch tested? Verified that the image name for the R image is now appropriately populated in the integration test script via Bash debug output. NB - The fact that this wasn't spotted earlier highlights the fact that currently the K8S integration test suite does not have any tests for the R image as if it had this would have failed integration testing in the original PR #23846 Closes #24449 from rvesse/SPARK-26729. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-24 21:08:42 -07:00
gatorsmile	cd4a284030	[SPARK-27460][FOLLOW-UP][TESTS] Fix flaky tests ## What changes were proposed in this pull request? This patch makes several test flakiness fixes. ## How was this patch tested? N/A Closes #24434 from gatorsmile/fixFlakyTest. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-04-24 17:36:29 +08:00
Sean Owen	4ec7f631aa	[SPARK-27404][CORE][SQL][STREAMING][YARN] Fix build warnings for 3.0: postfixOps edition ## What changes were proposed in this pull request? Fix build warnings -- see some details below. But mostly, remove use of postfix syntax where it causes warnings without the `scala.language.postfixOps` import. This is mostly in expressions like "120000 milliseconds". Which, I'd like to simplify to things like "2.minutes" anyway. ## How was this patch tested? Existing tests. Closes #24314 from srowen/SPARK-27404. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-11 13:43:44 -05:00
LantaoJin	52838e74af	[SPARK-13704][CORE][YARN] Reduce rack resolution time ## What changes were proposed in this pull request? When you submit a stage on a large cluster, rack resolving takes a long time when initializing TaskSetManager because a script is invoked to resolve the rack of each host, one by one. Based on current implementation, it takes 30~40 seconds to resolve the racks in our 5000 nodes' cluster. After applied the patch, it decreased to less than 15 seconds. YARN-9332 has added an interface to handle multiple hosts in one invocation to save time. But before upgrading to the newest Hadoop, we could construct the same tool in Spark to resolve this issue. ## How was this patch tested? UT and manually testing on a 5000 node cluster. Closes #24245 from squito/SPARK-13704_update. Lead-authored-by: LantaoJin <jinlantao@gmail.com> Co-authored-by: Imran Rashid <irashid@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-04-08 10:47:06 -05:00
Yuming Wang	13c5c1fb4b	[SPARK-27180][BUILD][YARN] Fix testing issues with yarn module in Hadoop-3 ## What changes were proposed in this pull request? Fix testing issues with `yarn` module in Hadoop-3: 1. Upgrade jersey-1 to `1.19` to fix ```Cause: java.lang.NoClassDefFoundError: com/sun/jersey/spi/container/servlet/ServletContainer```. 2. Copy `ServerSocketUtil` from hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/net/ServerSocketUtil.java to fix ```java.lang.NoClassDefFoundError: org/apache/hadoop/net/ServerSocketUtil```. 3. Adapte `SessionHandler` from jetty-9.3.25.v20180904/jetty-server/src/main/java/org/eclipse/jetty/server/session/SessionHandler.java to fix ```java.lang.NoSuchMethodError: org.eclipse.jetty.server.session.SessionHandler.getSessionManager()Lorg/eclipse/jetty/server/SessionManager```. ## How was this patch tested? manual tests: ```shell build/sbt yarn/test -Pyarn build/sbt yarn/test -Phadoop-3.2 -Pyarn build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.deploy.yarn.YarnClusterSuite -pl resource-managers/yarn test -Pyarn -Phadoop-3.2 ``` Closes #24115 from wangyum/hadoop3-yarn. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-04-02 15:38:26 -05:00
Sean Owen	d4420b455a	[SPARK-27323][CORE][SQL][STREAMING] Use Single-Abstract-Method support in Scala 2.12 to simplify code ## What changes were proposed in this pull request? Use Single Abstract Method syntax where possible (and minor related cleanup). Comments below. No logic should change here. ## How was this patch tested? Existing tests. Closes #24241 from srowen/SPARK-27323. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-04-02 07:37:05 -07:00
Yuming Wang	b670f39fc6	[SPARK-24793][FOLLOW-UP][K8S] Remove duplicate declaration of mockito-core ## What changes were proposed in this pull request? ``` [WARNING] Some problems were encountered while building the effective model for org.apache.spark:spark-kubernetes_2.12🫙3.0.0-SNAPSHOT [WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.mockito:mockito-core:jar -> duplicate declaration of version (?) org.apache.spark:spark-kubernetes_2.12:[unknown-version], /Users/yumwang/spark/resource-managers/kubernetes/core/pom.xml, line 98, column 17 ``` This pr remove duplicate declaration of `mockito-core`. ## How was this patch tested? N/A Closes #24256 from wangyum/SPARK-24793-FOLLOW-UP. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-03-30 21:29:32 -07:00
Stavros Kontopoulos	39577a27a0	[SPARK-24902][K8S] Add PV integration tests ## What changes were proposed in this pull request? - Adds persistent volume integration tests - Adds a custom tag to the test to exclude it if it is run against a cloud backend. - Assumes default fs type for the host, AFAIK that is ext4. ## How was this patch tested? Manually run the tests against minikube as usual: ``` [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) spark-kubernetes-integration-tests_2.12 --- Discovery starting. Discovery completed in 192 milliseconds. Run starting. Expected test count is: 16 KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - Test PVs with local storage ``` Closes #23514 from skonto/pvctests. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2019-03-27 13:00:56 -07:00
Stavros Kontopoulos	05168e725d	[SPARK-24793][K8S] Enhance spark-submit for app management - supports `--kill` & `--status` flags. - supports globs which is useful in general check this long standing [issue](https://github.com/kubernetes/kubernetes/issues/17144#issuecomment-272052461) for kubectl. Manually against running apps. Example output: Submission Id reported at launch time: ``` 2019-01-20 23:47:56 INFO Client:58 - Waiting for application spark-pi with submissionId spark:spark-pi-1548020873671-driver to finish... ``` Killing the app: ``` ./bin/spark-submit --kill spark:spark-pi-1548020873671-driver --master k8s://https://192.168.2.8:8443 2019-01-20 23:48:07 WARN Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0) 2019-01-20 23:48:07 WARN Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address ``` App terminates with 143 (SIGTERM, since we have tiny this should lead to [graceful shutdown](https://cloud.google.com/solutions/best-practices-for-building-containers)): ``` 2019-01-20 23:48:08 INFO LoggingPodStatusWatcherImpl:58 - State changed, new state: pod name: spark-pi-1548020873671-driver namespace: spark labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65 creation time: 2019-01-20T21:47:55Z service account name: spark-sa volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm node name: minikube start time: 2019-01-20T21:47:55Z phase: Running container status: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: running container started at: 2019-01-20T21:48:00Z 2019-01-20 23:48:09 INFO LoggingPodStatusWatcherImpl:58 - State changed, new state: pod name: spark-pi-1548020873671-driver namespace: spark labels: spark-app-selector -> spark-e4730c80e1014b72aa77915a2203ae05, spark-role -> driver pod uid: 0ba9a794-1cfd-11e9-8215-a434d9270a65 creation time: 2019-01-20T21:47:55Z service account name: spark-sa volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm node name: minikube start time: 2019-01-20T21:47:55Z phase: Failed container status: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: terminated container started at: 2019-01-20T21:48:00Z container finished at: 2019-01-20T21:48:08Z exit code: 143 termination reason: Error 2019-01-20 23:48:09 INFO LoggingPodStatusWatcherImpl:58 - Container final statuses: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: terminated container started at: 2019-01-20T21:48:00Z container finished at: 2019-01-20T21:48:08Z exit code: 143 termination reason: Error 2019-01-20 23:48:09 INFO Client:58 - Application spark-pi with submissionId spark:spark-pi-1548020873671-driver finished. 2019-01-20 23:48:09 INFO ShutdownHookManager:58 - Shutdown hook called 2019-01-20 23:48:09 INFO ShutdownHookManager:58 - Deleting directory /tmp/spark-f114b2e0-5605-4083-9203-a4b1c1f6059e ``` Glob scenario: ``` ./bin/spark-submit --status spark:spark-pi* --master k8s://https://192.168.2.8:8443 2019-01-20 22:27:44 WARN Utils:70 - Your hostname, universe resolves to a loopback address: 127.0.0.1; using 192.168.2.8 instead (on interface wlp2s0) 2019-01-20 22:27:44 WARN Utils:70 - Set SPARK_LOCAL_IP if you need to bind to another address Application status (driver): pod name: spark-pi-1547948600328-driver namespace: spark labels: spark-app-selector -> spark-f13f01702f0b4503975ce98252d59b94, spark-role -> driver pod uid: c576e1c6-1c54-11e9-8215-a434d9270a65 creation time: 2019-01-20T01:43:22Z service account name: spark-sa volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm node name: minikube start time: 2019-01-20T01:43:22Z phase: Running container status: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: running container started at: 2019-01-20T01:43:27Z Application status (driver): pod name: spark-pi-1547948792539-driver namespace: spark labels: spark-app-selector -> spark-006d252db9b24f25b5069df357c30264, spark-role -> driver pod uid: 38375b4b-1c55-11e9-8215-a434d9270a65 creation time: 2019-01-20T01:46:35Z service account name: spark-sa volumes: spark-local-dir-1, spark-conf-volume, spark-sa-token-b7wcm node name: minikube start time: 2019-01-20T01:46:35Z phase: Succeeded container status: container name: spark-kubernetes-driver container image: skonto/spark:k8s-3.0.0 container state: terminated container started at: 2019-01-20T01:46:39Z container finished at: 2019-01-20T01:46:56Z exit code: 0 termination reason: Completed ``` Closes #23599 from skonto/submit_ops_extension. Authored-by: Stavros Kontopoulos <stavros.kontopoulos@lightbend.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-26 11:55:03 -07:00
Sean Owen	8bc304f97e	[SPARK-26132][BUILD][CORE] Remove support for Scala 2.11 in Spark 3.0.0 ## What changes were proposed in this pull request? Remove Scala 2.11 support in build files and docs, and in various parts of code that accommodated 2.11. See some targeted comments below. ## How was this patch tested? Existing tests. Closes #23098 from srowen/SPARK-26132. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-25 10:46:42 -05:00
10087686	8204dc1e54	[SPARK-27141][YARN] Use ConfigEntry for hardcoded configs for Yarn ## What changes were proposed in this pull request? There is some hardcode configs in code, I think it best to modify。 ## How was this patch tested? Existing tests Closes #24103 from wangjiaochun/yarnHardCode. Authored-by: 10087686 <wang.jiaochun@zte.com.cn> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-22 05:29:29 -05:00
Rob Vesse	61d99462a0	[SPARK-26729][K8S] Make image names under test configurable ## What changes were proposed in this pull request? Allow specifying system properties to customise the image names for the images used in the integration testing. Useful if your CI/CD pipeline or policy requires using a different naming format. This is one part of addressing SPARK-26729, I plan to have a follow up patch that will also make the names configurable when using `docker-image-tool.sh` ## How was this patch tested? Ran integration tests against custom images generated by our CI/CD pipeline that do not follow Spark's existing hardcoded naming conventions using the new system properties to override the image names appropriately: ``` mvn clean integration-test -pl :spark-kubernetes-integration-tests_${SCALA_VERSION} \ -Pkubernetes -Pkubernetes-integration-tests \ -P${SPARK_HADOOP_PROFILE} -Dhadoop.version=${HADOOP_VERSION} \ -Dspark.kubernetes.test.sparkTgz=${TARBALL} \ -Dspark.kubernetes.test.imageTag=${TAG} \ -Dspark.kubernetes.test.imageRepo=${REPO} \ -Dspark.kubernetes.test.namespace=${K8S_NAMESPACE} \ -Dspark.kubernetes.test.kubeConfigContext=${K8S_CONTEXT} \ -Dspark.kubernetes.test.deployMode=${K8S_TEST_DEPLOY_MODE} \ -Dspark.kubernetes.test.jvmImage=apache-spark \ -Dspark.kubernetes.test.pythonImage=apache-spark-py \ -Dspark.kubernetes.test.rImage=apache-spark-r \ -Dtest.include.tags=k8s ... [INFO] --- scalatest-maven-plugin:1.0:test (integration-test) spark-kubernetes-integration-tests_2.12 --- Discovery starting. Discovery completed in 230 milliseconds. Run starting. Expected test count is: 15 KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template Run completed in 8 minutes, 33 seconds. Total number of tests run: 15 Suites: completed 2, aborted 0 Tests: succeeded 15, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #23846 from rvesse/SPARK-26729. Authored-by: Rob Vesse <rvesse@dotnetrdf.org> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-20 14:28:27 -07:00
Marcelo Vanzin	ec5e34205a	[SPARK-27094][YARN] Work around RackResolver swallowing thread interrupt. To avoid the case where the YARN libraries would swallow the exception and prevent YarnAllocator from shutting down, call the offending code in a separate thread, so that the parent thread can respond appropriately to the shut down. As a safeguard, also explicitly stop the executor launch thread pool when shutting down the application, to prevent new executors from coming up after the application started its shutdown. Tested with unit tests + some internal tests on real cluster. Closes #24017 from vanzin/SPARK-27094. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-03-20 11:48:06 -07:00
shane knapp	5564fe5151	[SPARK-27178][K8S] add nss to the spark/k8s Dockerfile ## What changes were proposed in this pull request? while performing some tests on our existing minikube and k8s infrastructure, i noticed that the integration tests were failing. i dug in and discovered the following message buried at the end of the stacktrace: ``` Caused by: java.io.FileNotFoundException: /usr/lib/libnss3.so at sun.security.pkcs11.Secmod.initialize(Secmod.java:193) at sun.security.pkcs11.SunPKCS11.<init>(SunPKCS11.java:218) ... 81 more ``` after i added the `nss` package to `resource-managers/kubernetes/docker/src/main/dockerfiles/spark/Dockerfile`, everything worked. this is also impacting current builds. see: https://amplab.cs.berkeley.edu/jenkins/job/testing-k8s-prb-make-spark-distribution-unified/8959/console ## How was this patch tested? i tested locally before pushing, and the build system will test the rest. Closes #24111 from shaneknapp/add-nss-package-to-dockerfile. Authored-by: shane knapp <incomplete@gmail.com> Signed-off-by: shane knapp <incomplete@gmail.com>	2019-03-18 16:38:42 -07:00
Ajith	c324e1da9d	[SPARK-27122][CORE] Jetty classes must not be return via getters in org.apache.spark.ui.WebUI ## What changes were proposed in this pull request? When we run YarnSchedulerBackendSuite, the class path seems to be made from the classes folder(resource-managers/yarn/target/scala-2.12/classes) instead of jar (resource-managers/yarn/target/spark-yarn_2.12-3.0.0-SNAPSHOT.jar) . ui.getHandlers is in spark-core and its loaded from spark-core.jar which is shaded and hence refers to org.spark_project.jetty.servlet.ServletContextHandler Here in org.apache.spark.scheduler.cluster.YarnSchedulerBackend, as its not shaded, it expects org.eclipse.jetty.servlet.ServletContextHandler Refer discussion https://issues.apache.org/jira/browse/SPARK-27122?focusedCommentId=16792318&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16792318 Hence as a fix, org.apache.spark.ui.WebUI must only return a wrapper class instance or references so that Jetty classes can be avoided in getters which are accessed outside spark-core ## How was this patch tested? Existing UT can pass Closes #24088 from ajithme/shadebug. Authored-by: Ajith <ajith2489@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-03-17 06:44:02 -05:00

1 2 3 4 5 ...

361 commits