Commit graph

586 commits

Author SHA1 Message Date
Dongjoon Hyun dd5464976f [SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file
### What changes were proposed in this pull request?

This PR aims to unify two K8s version variables in two `pom.xml`s into one. `kubernetes-client.version` is correct because the artifact ID is `kubernetes-client`.

```
kubernetes.client.version (kubernetes/core module)
kubernetes-client.version (kubernetes/integration-test module)
```

### Why are the changes needed?

Having two variables for the same value is confusing and inconvenient when we upgrade K8s versions.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs. (Passing the compilation tests is sufficient.)

Closes #32531 from dongjoon-hyun/SPARK-35394.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-13 00:40:53 -07:00
“attilapiros” 8b94eff1ca [SPARK-34736][K8S][TESTS] Kubernetes and Minikube version upgrade for integration tests
### What changes were proposed in this pull request?

This PR upgrades the Kubernetes and Minikube versions for the integration tests and removes/updates the old code accordingly.

Details of the changes:

- As [discussed in the mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html): updating Minikube version from v0.34.1 to v1.7.3 and kubernetes version from v1.15.12 to v1.17.3.
- checking the Minikube version and failing with an explanation when the tests are started on a version < v1.7.3
- removing Minikube status-checking code related to old Minikube versions
- in the Minikube backend, using fabric8's `Config.autoConfigure()` method to configure the Kubernetes client to use the `minikube` k8s context (as in [one of the kubernetes-client examples](https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/kubectl/equivalents/ConfigUseContext.java#L36)); see the sketch after this list
- introducing a `persistentVolume` test tag: this is a temporary change to skip the PVC tests in the Kubernetes integration test, as currently the PVC tests are blocking the move to Docker as Minikube's driver (for details please check https://issues.apache.org/jira/browse/SPARK-34738).
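
For the Minikube backend change above, a minimal sketch (illustrative names, not the actual test-backend code) of configuring the fabric8 client from the kubeconfig's `minikube` context:

```scala
import io.fabric8.kubernetes.client.{Config, DefaultKubernetesClient}

// Let fabric8 auto-configure the client from the kubeconfig, pinned to the
// `minikube` context (as in the linked kubernetes-client example).
val config = Config.autoConfigure("minikube")
val client = new DefaultKubernetesClient(config)
```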

### Why are the changes needed?

With the currently suggested versions, one can run into several problems without realizing that the Minikube/Kubernetes version is the cause.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It was tested on Mac with [this script](https://gist.github.com/attilapiros/cd58a16bdde833c80c5803c337fffa94#file-check_minikube_versions-zsh), which installs each Minikube version starting from v1.7.2 (included to test the negative case of the version check) and runs the integration tests.

It was started with:
```
./check_minikube_versions.zsh > test_log 2>&1
```

There was only one build failure; the rest were successful:

```
$ grep "BUILD SUCCESS" test_log | wc -l
      26
$ grep "BUILD FAILURE" test_log | wc -l
       1
```

It was for Minikube v1.7.2, and the log is:

```
KubernetesSuite:
*** RUN ABORTED ***
  java.lang.AssertionError: assertion failed: Unsupported Minikube version is detected: minikube version: v1.7.2.For integration testing Minikube version 1.7.3 or greater is expected.
  at scala.Predef$.assert(Predef.scala:223)
  at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:52)
  at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:163)
  at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
  at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
  at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:43)
  at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:273)
  at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:271)
  ...
```

Moreover, I also ran a test with multiple k8s cluster contexts.

Closes #31829 from attilapiros/SPARK-34736.

Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Co-authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
2021-05-10 18:56:52 +02:00
Dongjoon Hyun a0c76a8755 [SPARK-35319][K8S][BUILD] Upgrade K8s client to 5.3.1
### What changes were proposed in this pull request?

This PR aims to upgrade K8s client to 5.3.1.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.3.1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

The K8s IT was manually tested as follows.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 18 minutes, 33 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.2.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.959 s]
[INFO] Spark Project Tags ................................. SUCCESS [  7.830 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  3.457 s]
[INFO] Spark Project Networking ........................... SUCCESS [  5.496 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  3.239 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  9.006 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  2.422 s]
[INFO] Spark Project Core ................................. SUCCESS [02:17 min]
[INFO] Spark Project Kubernetes Integration Tests ......... SUCCESS [21:05 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  23:59 min
[INFO] Finished at: 2021-05-05T11:59:19-07:00
[INFO] ------------------------------------------------------------------------
```

Closes #32443 from dongjoon-hyun/SPARK-35319.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 19:50:37 -07:00
Dongjoon Hyun 4e8701a77d [SPARK-35280][K8S] Promote KubernetesUtils to DeveloperApi
### What changes were proposed in this pull request?

Since SPARK-22757, `KubernetesUtils` has been used as an important utility class by all K8s modules and `ExternalClusterManager`s. This PR aims to promote `KubernetesUtils` to `DeveloperApi` in order to maintain it officially in a backward compatible way at Apache Spark 3.2.0.

### Why are the changes needed?

Apache Spark 3.1.1 makes the `Kubernetes` module GA and provides an extensible external cluster manager framework. To have an `ExternalClusterManager` for the K8s environment, the `KubernetesUtils` class is crucial and needs to be stable. By promoting it to a subset of the K8s developer API, we can maintain it in a more sustainable way and give better, more stable functionality to K8s users.

In this PR, the `Since` annotations denote the last function signature change, because these functions are going to become public in Apache Spark 3.2.0 (a sketch follows the table below).

| Version | Function Name |
|-|-|
| 2.3.0 | parsePrefixedKeyValuePairs |
| 2.3.0 | requireNandDefined |
| 2.3.0 | parsePrefixedKeyValuePairs |
| 2.4.0 | parseMasterUrl |
| 3.0.0 | requireBothOrNeitherDefined |
| 3.0.0 | requireSecondIfFirstIsDefined |
| 3.0.0 | selectSparkContainer |
| 3.0.0 | formatPairsBundle |
| 3.0.0 | formatPodState |
| 3.0.0 | containersDescription |
| 3.0.0 | containerStatusDescription |
| 3.0.0 | formatTime |
| 3.0.0 | uniqueID |
| 3.0.0 | buildResourcesQuantities |
| 3.0.0 | uploadAndTransformFileUris |
| 3.0.0 | uploadFileUri |
| 3.0.0 | requireBothOrNeitherDefined |
| 3.0.0 | buildPodWithServiceAccount |
| 3.0.0 | isLocalAndResolvable |
| 3.1.1 | renameMainAppResource |
| 3.1.1 | addOwnerReference |
| 3.2.0 | loadPodFromTemplate |
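
As a hypothetical sketch (not the real `KubernetesUtils`) of the annotation style this promotion implies, assuming the code lives inside the Spark package so the `Since` annotation is visible:

```scala
package org.apache.spark.deploy.k8s

import org.apache.spark.annotation.{DeveloperApi, Since}

// Hypothetical example: the object is marked @DeveloperApi and each member
// carries a @Since version recording its last signature change, per the table above.
@DeveloperApi
object ExampleK8sUtils {
  @Since("2.4.0")
  def parseMasterUrl(url: String): String = url.stripPrefix("k8s://")
}
```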

### Does this PR introduce _any_ user-facing change?

Yes, but these are new API additions.

### How was this patch tested?

Pass the CIs.

Closes #32406 from dongjoon-hyun/SPARK-35280.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-30 11:39:18 -07:00
Dongjoon Hyun 6ab00488d0 [SPARK-35182][K8S] Support driver-owned on-demand PVC
### What changes were proposed in this pull request?

This PR aims to support driver-owned on-demand PVC(Persistent Volume Claim)s. It means dynamically-created PVCs will have the `ownerReference` to `driver` pod instead of `executor` pod.

### Why are the changes needed?

This allows the K8s backend scheduler to reuse these PVCs later.

**BEFORE**
```
$ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
...
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: tpcds-pvc-exec-1
```

**AFTER**
```
$ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
...
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Pod
    name: tpcds-pvc
```

### Does this PR introduce _any_ user-facing change?

No. (The default is `false`)
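
A hedged sketch of opting in to driver-owned PVCs; the configuration key below is an assumption inferred from this description (only the `false` default is stated here):

```scala
import org.apache.spark.SparkConf

// Assumed config key (not confirmed by this description): a boolean that makes
// dynamically-created PVCs owned by the driver pod instead of the executor pod.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.ownPersistentVolumeClaims", "true")
```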

### How was this patch tested?

Manually check the above and pass K8s IT.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 16 minutes, 40 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32288 from dongjoon-hyun/SPARK-35182.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-22 17:03:19 -07:00
Shardul Mahadik 83f753e4e1 [SPARK-34472][YARN] Ship ivySettings file to driver in cluster mode
### What changes were proposed in this pull request?

In YARN, ship the `spark.jars.ivySettings` file to the driver when using `cluster` deploy mode so that `addJar` is able to find it in order to resolve ivy paths.

### Why are the changes needed?

SPARK-33084 introduced support for Ivy paths in `sc.addJar` or Spark SQL `ADD JAR`. If we use a custom ivySettings file using `spark.jars.ivySettings`, it is loaded at b26e7b510b/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala (L1280). However, this file is only accessible on the client machine. In YARN cluster mode, this file is not available on the driver and so `addJar` fails to find it.
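
As a hedged illustration of the client-side setup involved (the Ivy URI below follows the SPARK-33084 style; treat the exact coordinate syntax as an assumption):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: a custom Ivy settings file plus an Ivy-coordinate addJar. In YARN
// cluster mode, this PR ships the settings file so the driver can resolve it too.
val spark = SparkSession.builder()
  .config("spark.jars.ivySettings", "/path/to/ivysettings.xml")
  .getOrCreate()
spark.sparkContext.addJar("ivy://org.example:example-lib:1.0.0")
```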

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests to verify that the `ivySettings` file is localized by the YARN client and that a YARN cluster mode application is able to find and load the `ivySettings` file.

Closes #31591 from shardulm94/SPARK-34472.

Authored-by: Shardul Mahadik <smahadik@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-04-20 13:35:57 -05:00
SaurabhChawla 1e64b4fa27 [SPARK-34877][CORE][YARN] Add the code change for adding the Spark AM log link in spark UI
### What changes were proposed in this pull request?
When running a Spark job on YARN in client deploy mode, the Spark driver and the Spark Application Master launch in two separate containers. In various scenarios there is a need to see the Spark Application Master logs for resource allocation, decommissioning status, and other information shared between the YARN RM and the Spark Application Master.

In cluster mode, the Spark driver and the Spark AM are in the same container, so the driver's log link is already there in the Spark UI.

This PR adds the Spark AM log link for Spark jobs running in client mode on YARN. Instead of searching for the container ID and then finding the logs, we can check directly in the Spark UI.

This change only shows the AM log links in client mode when the resource manager is YARN.

### Why are the changes needed?
Until now, the only way to check this was to find the container ID of the AM and check the logs using either the YARN utility or the YARN RM Application History Server.

This PR adds the Spark AM log link for Spark jobs running in client mode on YARN, so instead of searching for the container ID to find the logs, we can check directly in the Spark UI.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test and also checked the Spark UI.
**In Yarn Client mode**
Before Change

![image](https://user-images.githubusercontent.com/34540906/112644861-e1733200-8e6b-11eb-939b-c76ca9902a4e.png)

After the Change - The AM info is there

![image](https://user-images.githubusercontent.com/34540906/115264198-b7075280-a153-11eb-98f3-2aed66ffad2a.png)

AM Log

![image](https://user-images.githubusercontent.com/34540906/112645680-c0f7a780-8e6c-11eb-8b82-4ccc0aee927b.png)

**In Yarn Cluster Mode**  - The AM log link will not be there

![image](https://user-images.githubusercontent.com/34540906/112649512-86900980-8e70-11eb-9b37-69d5c4b53ffa.png)

Closes #31974 from SaurabhChawla100/SPARK-34877.

Authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-04-20 08:56:07 -05:00
Dongjoon Hyun 00f06dd267 [SPARK-35131][K8S] Support early driver service clean-up during app termination
### What changes were proposed in this pull request?

This PR aims to support a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, to clean up `Driver Service` resource during app termination.

### Why are the changes needed?

The K8s service is one of the important resources and sometimes it's controlled by quota.
```
$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    1     3
```

Apache Spark creates a service for the driver whose lifecycle is the same as the driver pod's.
This means a new Spark job submission fails once the number of completed Spark jobs reaches the service quota.

**BEFORE**
```
$ k get pod
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver   0/1     Completed   0          31m
org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver   0/1     Completed   0          78s

$ k get svc
NAME                                                            TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                      AGE
kubernetes                                                      ClusterIP   10.96.0.1    <none>        443/TCP                      80m
org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   31m
org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver-svc   ClusterIP   None         <none>        7078/TCP,7079/TCP,4040/TCP   80s

$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    3     3

$ bin/spark-submit...
Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException:
Failure executing: POST at: https://192.168.64.50:8443/api/v1/namespaces/default/services.
Message: Forbidden! User minikube doesn't have permission.
services "org-apache-spark-examples-sparkpi-843f6978e722819c-driver-svc" is forbidden:
exceeded quota: service, requested: services=1, used: services=3, limited: services=3.
```

**AFTER**
```
$ k get pod
NAME                                                        READY   STATUS      RESTARTS   AGE
org-apache-spark-examples-sparkpi-23d5f278e77731a7-driver   0/1     Completed   0          26s
org-apache-spark-examples-sparkpi-d1292278e7768ed4-driver   0/1     Completed   0          67s
org-apache-spark-examples-sparkpi-e5bedf78e776ea9d-driver   0/1     Completed   0          44s

$ k get svc
NAME         TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1    <none>        443/TCP   172m

$ k describe quota
Name:       service
Namespace:  default
Resource    Used  Hard
--------    ----  ----
services    1     3
```

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, and enables it by default.
The change is documented in the migration guide.
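
Since the configuration is enabled by default, a minimal sketch of opting out (keeping the previous behavior of retaining the driver service) would be:

```scala
import org.apache.spark.SparkConf

// Keep the driver service after app termination by disabling the new
// default clean-up behavior.
val conf = new SparkConf()
  .set("spark.kubernetes.driver.service.deleteOnTermination", "false")
```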

### How was this patch tested?

Pass the CIs.

This is tested with K8s IT manually.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 9 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32226 from dongjoon-hyun/SPARK-35131.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-19 12:11:08 -07:00
Dongjoon Hyun 425dc58c02 [SPARK-35125][K8S] Upgrade K8s client to 5.3.0 to support K8s 1.20
### What changes were proposed in this pull request?

Although the current master branch already works with K8s 1.20, this PR aims to upgrade the K8s client to 5.3.0 to support K8s 1.20 officially.
- https://github.com/fabric8io/kubernetes-client#compatibility-matrix

The following are the notable breaking API changes (the `Watcher.onClose` change is sketched after this list).

1. Remove Doneable (5.0+):
    - https://github.com/fabric8io/kubernetes-client/pull/2571
2. Change Watcher.onClose signature (5.0+):
    - https://github.com/fabric8io/kubernetes-client/pull/2616
3. Change Readiness (5.1+)
    - https://github.com/fabric8io/kubernetes-client/pull/2796
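
A hedged sketch of what the `Watcher.onClose` change means for client code on 5.x (the surrounding class and handling are illustrative):

```scala
import io.fabric8.kubernetes.api.model.Pod
import io.fabric8.kubernetes.client.{Watcher, WatcherException}

// In 5.x, onClose takes a WatcherException instead of the previous
// KubernetesClientException parameter.
class ExamplePodWatcher extends Watcher[Pod] {
  override def eventReceived(action: Watcher.Action, pod: Pod): Unit =
    println(s"$action: ${pod.getMetadata.getName}")

  override def onClose(cause: WatcherException): Unit =
    Option(cause).foreach(e => println(s"Watch closed abnormally: ${e.getMessage}"))
}
```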

### Why are the changes needed?

According to the compatibility matrix, this makes Apache Spark and its external cluster manager extension support all K8s 1.20 features officially for Apache Spark 3.2.0.

### Does this PR introduce _any_ user-facing change?

Yes, this is a dev dependency change which affects K8s cluster extension users.

### How was this patch tested?

Pass the CIs.

This is manually tested with K8s IT.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 44 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #32221 from dongjoon-hyun/SPARK-K8S-530.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-19 07:39:38 -07:00
“attilapiros” 8a3815f722 [SPARK-34789][TEST] Introduce Jetty based construct for integration tests where HTTP server is used
### What changes were proposed in this pull request?

Introducing a new test construct:
```
  withHttpServer() { baseURL =>
    ...
  }
```
which starts and stops a Jetty server to serve files via HTTP.
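
A minimal sketch of such a Jetty-backed loan pattern (illustrative only; the actual test utility may differ in signature and resource handling):

```scala
import java.net.URL

import org.eclipse.jetty.server.Server
import org.eclipse.jetty.server.handler.ResourceHandler

// Start Jetty on a random free port, serve files from a directory, run the
// test body against the base URL, and always stop the server afterwards.
def withHttpServer(resourceBase: String)(body: URL => Unit): Unit = {
  val server = new Server(0)
  val handler = new ResourceHandler()
  handler.setResourceBase(resourceBase)
  server.setHandler(handler)
  server.start()
  try {
    body(server.getURI.toURL)
  } finally {
    server.stop()
  }
}
```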

Moreover this PR uses this new construct in the test `Run SparkRemoteFileTest using a remote data file`.

### Why are the changes needed?

Before this PR, GitHub URLs were used, like "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt".
This connects two Spark versions in an unhealthy way: the "master" branch, which is a moving target, and the committed test code, which is fixed (and might even be released).
So a test running for an earlier version of Spark expects something (filename, content, path) from a later release, and what is worse, when the moving version changes, the earlier test will break.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit test.

Closes #31935 from attilapiros/SPARK-34789.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-14 21:22:52 -07:00
HyukjinKwon a153efa643 [SPARK-35002][YARN][TESTS][FOLLOW-UP] Fix java.net.BindException in MiniYARNCluster
### What changes were proposed in this pull request?

This PR fixes two tests below:

https://github.com/apache/spark/runs/2320161984

```
[info] YarnShuffleIntegrationSuite:
[info] org.apache.spark.deploy.yarn.YarnShuffleIntegrationSuite *** ABORTED *** (228 milliseconds)
[info]   org.apache.hadoop.yarn.exceptions.YarnRuntimeException: org.apache.hadoop.yarn.webapp.WebAppException: Error starting http server
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:373)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:128)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:503)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:322)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:95)
...
[info]   Cause: java.net.BindException: Port in use: fv-az186-831:0
[info]   at org.apache.hadoop.http.HttpServer2.constructBindException(HttpServer2.java:1231)
[info]   at org.apache.hadoop.http.HttpServer2.bindForSinglePort(HttpServer2.java:1253)
[info]   at org.apache.hadoop.http.HttpServer2.openListeners(HttpServer2.java:1316)
[info]   at org.apache.hadoop.http.HttpServer2.start(HttpServer2.java:1167)
[info]   at org.apache.hadoop.yarn.webapp.WebApps$Builder.start(WebApps.java:449)
[info]   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.startWepApp(ResourceManager.java:1247)
[info]   at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.serviceStart(ResourceManager.java:1356)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.startResourceManager(MiniYARNCluster.java:365)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.access$300(MiniYARNCluster.java:128)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster$ResourceManagerWrapper.serviceStart(MiniYARNCluster.java:503)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.hadoop.service.CompositeService.serviceStart(CompositeService.java:121)
[info]   at org.apache.hadoop.yarn.server.MiniYARNCluster.serviceStart(MiniYARNCluster.java:322)
[info]   at org.apache.hadoop.service.AbstractService.start(AbstractService.java:194)
[info]   at org.apache.spark.deploy.yarn.BaseYarnClusterSuite.beforeAll(BaseYarnClusterSuite.scala:95)
[info]   at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212)
[info]   at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
[info]   at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
[info]   at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
...
```

https://github.com/apache/spark/runs/2323342094

```
[info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret started
[error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret failed: java.lang.AssertionError: Connecting to /10.1.0.161:39895 timed out (120000 ms), took 120.081 sec
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadSecret(ExternalShuffleSecuritySuite.java:85)
[error]     ...
[info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId started
[error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId failed: java.lang.AssertionError: Connecting to /10.1.0.198:44633 timed out (120000 ms), took 120.08 sec
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testBadAppId(ExternalShuffleSecuritySuite.java:76)
[error]     ...
[info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid started
[error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid failed: java.io.IOException: Connecting to /10.1.0.119:43575 timed out (120000 ms), took 120.089 sec
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
[error]     at org.apache.spark.network.shuffle.ExternalBlockStoreClient.registerWithShuffleServer(ExternalBlockStoreClient.java:211)
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.validate(ExternalShuffleSecuritySuite.java:108)
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testValid(ExternalShuffleSecuritySuite.java:68)
[error]     ...
[info] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption started
[error] Test org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption failed: java.io.IOException: Connecting to /10.1.0.248:35271 timed out (120000 ms), took 120.014 sec
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:285)
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
[error]     at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
[error]     at org.apache.spark.network.shuffle.ExternalBlockStoreClient.registerWithShuffleServer(ExternalBlockStoreClient.java:211)
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.validate(ExternalShuffleSecuritySuite.java:108)
[error]     at org.apache.spark.network.shuffle.ExternalShuffleSecuritySuite.testEncryption(ExternalShu
```

For the YARN cluster suites, it's difficult to fix, so this PR skips them if they fail to bind.
For the shuffle-related suites, it uses localhost.

### Why are the changes needed?

To make the tests stable

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Its tested in GitHub Actions: https://github.com/HyukjinKwon/spark/runs/2340210765

Closes #32126 from HyukjinKwon/SPARK-35002-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-04-14 17:13:48 +08:00
Dongjoon Hyun a42dc93a2a [SPARK-34948][K8S] Add ownerReference to executor configmap to fix leakages
### What changes were proposed in this pull request?

This PR aims to add `ownerReference` to the executor ConfigMap to fix leakage.

### Why are the changes needed?

SPARK-30985 maintains the executor config map explicitly inside Spark. However, this config map can be leaked when Spark drivers die accidentally or are killed by K8s. We need to add an `ownerReference` so that K8s garbage-collects these automatically.

The number of ConfigMaps is one of the resource quotas, so the leaked ConfigMaps currently cause Spark job submission failures.
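
A hedged fabric8 sketch (assumed helper, not the actual Spark code) of attaching an `ownerReference` to a ConfigMap so it is garbage-collected with the driver pod:

```scala
import io.fabric8.kubernetes.api.model.{ConfigMap, ConfigMapBuilder, OwnerReferenceBuilder, Pod}

// Make the driver pod the owner of an executor ConfigMap so K8s deletes the
// ConfigMap automatically when the driver pod goes away.
def configMapOwnedByDriver(driverPod: Pod, name: String): ConfigMap = {
  val owner = new OwnerReferenceBuilder()
    .withApiVersion(driverPod.getApiVersion)
    .withKind(driverPod.getKind)
    .withName(driverPod.getMetadata.getName)
    .withUid(driverPod.getMetadata.getUid)
    .withController(true)
    .build()
  new ConfigMapBuilder()
    .withNewMetadata()
      .withName(name)
      .withOwnerReferences(owner)
    .endMetadata()
    .build()
}
```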

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs and check manually.

K8s IT is tested manually.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 2 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

**BEFORE**
```
$ k get cm spark-exec-450b417895b3b2c7-conf-map -oyaml | grep ownerReferences
```

**AFTER**
```
$ k get cm spark-exec-bb37a27895b1c26c-conf-map -oyaml | grep ownerReferences
        f:ownerReferences:
```

Closes #32042 from dongjoon-hyun/SPARK-34948.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-03 00:00:17 -07:00
Erik Krogen 9f065ff375 [SPARK-34828][YARN] Make shuffle service name configurable on client side and allow for classpath-based config override on server side
### What changes were proposed in this pull request?
Add a new config, `spark.shuffle.service.name`, which allows for Spark applications to look for a YARN shuffle service which is defined at a name other than the default `spark_shuffle`.

Add a new config, `spark.yarn.shuffle.service.metrics.namespace`, which allows for configuring the namespace used when emitting metrics from the shuffle service into the NodeManager's `metrics2` system.

Add a new mechanism by which to override shuffle service configurations independently of the configurations in the NodeManager. When a resource `spark-shuffle-site.xml` is present on the classpath of the shuffle service, the configs present within it will be used to override the configs coming from `yarn-site.xml` (via the NodeManager).
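
For example, a client-side configuration sketch selecting a non-default service instance (the instance name matches the scenario described in the next section):

```scala
import org.apache.spark.SparkConf

// Point the application at a side-by-side YARN shuffle service registered
// under a non-default name.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.name", "spark_shuffle_3.2.0")
```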

### Why are the changes needed?
There are two use cases which can benefit from these changes.

One use case is to run multiple instances of the shuffle service side-by-side in the same NodeManager. This can be helpful, for example, when running a YARN cluster with a mixed workload of applications running multiple Spark versions, since a given version of the shuffle service is not always compatible with other versions of Spark (e.g. see SPARK-27780). With this PR, it is possible to run two shuffle services like `spark_shuffle` and `spark_shuffle_3.2.0`, one of which is "legacy" and one of which is for new applications. This is possible because YARN versions since 2.9.0 support the ability to run shuffle services within an isolated classloader (see YARN-4577), meaning multiple Spark versions can coexist.

Besides this, the separation of shuffle service configs into `spark-shuffle-site.xml` can be useful for administrators who want to change and/or deploy Spark shuffle service configurations independently of the configurations for the NodeManager (e.g., perhaps they are owned by two different teams).

### Does this PR introduce _any_ user-facing change?
Yes. There are two new configurations related to the external shuffle service, and a new mechanism which can optionally be used to configure the shuffle service. `docs/running-on-yarn.md` has been updated to provide user instructions; please see this guide for more details.

### How was this patch tested?
In addition to the new unit tests added, I have deployed this to a live YARN cluster and successfully deployed two Spark shuffle services simultaneously, one running a modified version of Spark 2.3.0 (which supports some of the newer shuffle protocols) and one running Spark 3.1.1. Spark applications of both versions are able to communicate with their respective shuffle services without issue.

Closes #31936 from xkrogen/xkrogen-SPARK-34828-shufflecompat-config-from-classpath.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2021-03-30 10:09:00 -05:00
“attilapiros” c8b7a09d39 [SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods output
### What changes were proposed in this pull request?

Extending "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with `kubectl describe pods` output for the failed test.

### Why are the changes needed?

PR builds frequently fail as the k8s integration tests are currently very flaky in the Amplab Jenkins environment.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Locally, by temporarily making one of the tests fail. The output is:

```
21/03/25 16:55:16.722 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite:

===== EXTRA LOGS FOR THE FAILED TEST

21/03/25 16:55:17.167 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: BEGIN driver DESCRIBE POD
Name:         spark-test-app-a2b03971b7c049e8a2629f6a3198842b
Namespace:    35bdb17e308743afaec17538f89a7c3e
Priority:     0
Node:         minikube/192.168.64.119
Start Time:   Thu, 25 Mar 2021 16:52:10 +0100
Labels:       spark-app-locator=75f695685ae44314a99ec13bb39332bc
              spark-app-selector=spark-150230742d364a77927a08eed0222065
              spark-role=driver
Annotations:  <none>
Status:       Succeeded
IP:           172.17.0.4
Containers:
  spark-kubernetes-driver:
    Container ID:  docker://d6d27b0551060d9b094f12d1e232dfb5ae78ce38559680c7126c548996da4d95
    Image:         docker.io/kubespark/spark:3.2.0-SNAPSHOT_9575B805-9CB0-4A16-8A31-AA2F8DDA8EE5
    Image ID:      docker://sha256:3fc556c73a0d5187b5a14dbdc2f69ef292e60b544b4b4d3715f6749417c20918
    Ports:         7078/TCP, 7079/TCP, 4040/TCP
    Host Ports:    0/TCP, 0/TCP, 0/TCP
    Args:
      driver
      --properties-file
      /opt/spark/conf/spark.properties
      --class
      org.apache.spark.examples.SparkPi
      local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Thu, 25 Mar 2021 16:52:11 +0100
      Finished:     Thu, 25 Mar 2021 16:52:20 +0100
    Ready:          False
    Restart Count:  0
    Limits:
      memory:  1408Mi
    Requests:
      cpu:     1
      memory:  1408Mi
    Environment:
      SPARK_USER:                 attilazsoltpiros
      SPARK_APPLICATION_ID:       spark-150230742d364a77927a08eed0222065
      SPARK_DRIVER_BIND_ADDRESS:   (v1:status.podIP)
      SPARK_LOCAL_DIRS:           /var/data/spark-dab6f1c9-e538-40c8-a7d9-3e88f9b82cfa
      SPARK_CONF_DIR:             /opt/spark/conf
    Mounts:
      /opt/spark/conf from spark-conf-volume-driver (rw)
      /var/data/spark-dab6f1c9-e538-40c8-a7d9-3e88f9b82cfa from spark-local-dir-1 (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-nmfwl (ro)
Conditions:
  Type              Status
  Initialized       True
  Ready             False
  ContainersReady   False
  PodScheduled      True
Volumes:
  spark-local-dir-1:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:
    SizeLimit:  <unset>
  spark-conf-volume-driver:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      spark-drv-c60832786a15ffbe-conf-map
    Optional:  false
  default-token-nmfwl:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-nmfwl
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  3m7s  default-scheduler  Successfully assigned 35bdb17e308743afaec17538f89a7c3e/spark-test-app-a2b03971b7c049e8a2629f6a3198842b to minikube
  Normal  Pulled     3m7s  kubelet, minikube  Container image "docker.io/kubespark/spark:3.2.0-SNAPSHOT_9575B805-9CB0-4A16-8A31-AA2F8DDA8EE5" already present on machine
  Normal  Created    3m7s  kubelet, minikube  Created container spark-kubernetes-driver
  Normal  Started    3m6s  kubelet, minikube  Started container spark-kubernetes-driver
21/03/25 16:55:17.168 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: END driver DESCRIBE POD

```

Closes #31962 from attilapiros/SPARK-34869.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-28 09:44:56 -07:00
hongdongdong 985c653b20 [SPARK-33720][K8S] Support submit to k8s only with token
### What changes were proposed in this pull request?

Support submit to k8s only with token.

### Why are the changes needed?

Currently, submitting to k8s always needs OAuth files.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Before, when submitting a job from outside the k8s cluster without a correct ca.crt, we may get this exception:
```
Caused by: sun.security.validator.ValidatorException: PKIX path building failed: sun.security.provider.certpath.SunCertPathBuilderException: unable to find valid certification path to requested target
        at sun.security.validator.PKIXValidator.doBuild(PKIXValidator.java:439)
        at sun.security.validator.PKIXValidator.engineValidate(PKIXValidator.java:306)
        at sun.security.validator.Validator.validate(Validator.java:271)
        at sun.security.ssl.X509TrustManagerImpl.validate(X509TrustManagerImpl.java:312)
```
When `spark.kubernetes.trust.certificates` is set to `true`, we can submit with only a correct token; there is no need to configure ca.crt in the local environment.
Submit as:
```
 bin/spark-submit \
     --master $master \
     --name pi \
     --deploy-mode cluster \
     --conf spark.kubernetes.container.image=$image \
     --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
     --conf spark.kubernetes.authenticate.submission.oauthToken=$clusterToken \
     --conf spark.kubernetes.trust.certificates=true \
     local:///opt/spark/examples/src/main/python/pi.py 200
```

Closes #30684 from hddong/trust-certs.

Authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-23 22:07:27 -07:00
Yikun Jiang 31da90762e [SPARK-34820][K8S][R] add apt-update before gnupg install
### What changes were proposed in this pull request?
We added the gnupg installation in https://github.com/apache/spark/pull/30130; we should do an apt update before the gnupg installation, otherwise we will get a fetch error when a package has been updated.

See more in:
[1] http://apache-spark-developers-list.1001551.n3.nabble.com/K8s-Integration-test-is-unable-to-run-because-of-the-unavailable-libs-td30986.html

### Why are the changes needed?
Add an apt-update command before the gnupg installation to avoid an invalid package cache list.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
K8s Integration test passed

Closes #31923 from Yikun/SPARK-34820.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-22 10:13:31 -07:00
Dongjoon Hyun 2fa792aa64 [SPARK-34783][K8S] Support remote template files
### What changes were proposed in this pull request?

This PR aims to support remote driver/executor template files.

### Why are the changes needed?

Currently, `KubernetesUtils.loadPodFromTemplate` supports only local files.

With this PR, we can do the following.
```bash
bin/spark-submit \
...
-c spark.kubernetes.driver.podTemplateFile=s3a://dongjoon/driver.yml \
-c spark.kubernetes.executor.podTemplateFile=s3a://dongjoon/executor.yml \
...
```

### Does this PR introduce _any_ user-facing change?

Yes, this is an improvement.

### How was this patch tested?

Manual testing.

Closes #31877 from dongjoon-hyun/SPARK-34783-2.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-19 08:52:42 -07:00
“attilapiros” 124b5af114 [SPARK-34732][K8S][TESTS] Fix IndexOutOfBoundsException in logForFailedTest when driver is not started
### What changes were proposed in this pull request?

Fixing an `IndexOutOfBoundsException` in the `logForFailedTest` method when the driver is not started.

### Why are the changes needed?

Before this PR, when the driver is not started, an `IndexOutOfBoundsException` is thrown as the first item is accessed from an empty list:

```
- PVs with local storage *** FAILED ***
  java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
  at java.util.ArrayList.rangeCheck(ArrayList.java:659)
  at java.util.ArrayList.get(ArrayList.java:435)
  at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.logForFailedTest(KubernetesSuite.scala:83)
  at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:181)
  at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:188)
  at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:200)
  at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
  at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:200)
  at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:182)
  at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
  ...
```
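
An illustrative sketch of the guard such a fix implies (not the actual test code):

```scala
import scala.collection.JavaConverters._

// The driver pod list may be empty when the driver never started, so don't
// index it blindly.
val driverPods = new java.util.ArrayList[String]()

// Before: driverPods.get(0) throws IndexOutOfBoundsException on an empty list.
// After: only log when a driver pod actually exists.
driverPods.asScala.headOption.foreach(pod => println(s"driver pod: $pod"))
```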

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Running integration tests.
After this change, the above error becomes:

```
- PVs with local storage *** FAILED ***
  java.io.IOException: No such file or directory
  at java.io.UnixFileSystem.createFileExclusively(Native Method)
  at java.io.File.createTempFile(File.java:2026)
  at org.apache.spark.deploy.k8s.integrationtest.Utils$.createTempFile(Utils.scala:103)
  at org.apache.spark.deploy.k8s.integrationtest.PVTestsSuite.$anonfun$$init$$1(PVTestsSuite.scala:135)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  at org.scalatest.Transformer.apply(Transformer.scala:20)
  at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:190)
  ...
```

Closes #31824 from attilapiros/SPARK-34732.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-03-13 15:28:02 -08:00
“attilapiros” 6c5322de61 [SPARK-34361][K8S] In case of downscaling avoid killing of executors already known by the scheduler backend in the pod allocator
### What changes were proposed in this pull request?

This PR modifies the POD allocator to use the scheduler backend to get the known executors and remove those from the pending and newly created list.

This is different from the killing of executors requested by the normal `ExecutorAllocationManager`, where `spark.dynamicAllocation.executorIdleTimeout` is used.
In this case the POD allocator kills executors, and it should only be responsible for terminating unsatisfied POD allocations (new requests where no POD state has been received yet, and PODs in pending state).
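
As a rough illustration of the idea (names and types are illustrative only):

```scala
// Before deleting excess pod requests, drop the executor IDs the scheduler
// backend already knows about, so only truly unsatisfied allocations
// (no pod state yet, or still pending) are terminated.
def excessPodsToDelete(
    pendingOrNewlyCreated: Set[Long],
    knownToSchedulerBackend: Set[Long],
    excessCount: Int): Seq[Long] = {
  (pendingOrNewlyCreated -- knownToSchedulerBackend).toSeq.sorted.take(excessCount)
}
```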

### Why are the changes needed?

Because there is a race between the executor POD allocator and the cluster scheduler backend.
Running several experiments during downscaling, we experienced a lot of killed fresh executors which already had running tasks on them.

The pattern in the log was the following (see executor 312 and TID 2079):

```
21/02/01 15:12:03 INFO ExecutorMonitor: New executor 312 has registered (new total is 138)
...
21/02/01 15:12:03 INFO TaskSetManager: Starting task 247.0 in stage 4.0 (TID 2079, 100.100.18.138, executor 312, partition 247, PROCESS_LOCAL, 8777 bytes)
21/02/01 15:12:03 INFO ExecutorPodsAllocator: Deleting 3 excess pod requests (408,312,307).
...
21/02/01 15:12:04 ERROR TaskSchedulerImpl: Lost executor 312 on 100.100.18.138: The executor with id 312 was deleted by a user or the framework.
21/02/01 15:12:04 INFO TaskSetManager: Task 2079 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

#### Manually

With this change there was no executor lost with running task on it.

##### With unit test

A new test is added and existing test is modified to check these cases.

Closes #31513 from attilapiros/SPARK-34361.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-03-02 16:58:29 -08:00
Dongjoon Hyun 4d428a821b Revert "[SPARK-32617][K8S][TESTS] Configure kubernetes client based on kubeconfig settings in kubernetes integration tests"
This reverts commit b17754a8cb.
2021-02-25 17:10:58 -08:00
HyukjinKwon 8a1e172b51 [SPARK-34520][CORE] Remove unused SecurityManager references
### What changes were proposed in this pull request?

This is kind of a followup of https://github.com/apache/spark/pull/24033 and https://github.com/apache/spark/pull/30945.
Many of the references in `SecurityManager` were introduced in SPARK-1189, and the related usages were removed later in https://github.com/apache/spark/pull/24033 and https://github.com/apache/spark/pull/30945. This PR proposes to remove them.

### Why are the changes needed?

For better readability of the code.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually compiled. GitHub Actions and the Jenkins build should test it out as well.

Closes #31636 from HyukjinKwon/SPARK-34520.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-24 20:38:03 -08:00
“attilapiros” b17754a8cb [SPARK-32617][K8S][TESTS] Configure kubernetes client based on kubeconfig settings in kubernetes integration tests
### What changes were proposed in this pull request?

From [minikube version v1.1.0](https://github.com/kubernetes/minikube/blob/v1.1.0/CHANGELOG.md) kubectl is available as a command. So the kubeconfig settings can be accessed like:

```
$ minikube kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority: /Users/attilazsoltpiros/.minikube/ca.crt
    server: https://127.0.0.1:32788
  name: minikube
contexts:
- context:
    cluster: minikube
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt
    client-key: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key
```

Here the vm-driver was docker and the server port (https://127.0.0.1:32788) is different from the hardcoded 8443.

So the main part of this PR introduces Kubernetes client configuration based on the kubeconfig (the output of `minikube kubectl config view`) for Minikube versions after v1.1.0; the old legacy way of configuration is also kept, as Minikube versions back to v0.34.1 should still be supported.

Moreover, the old style of config parsing wasn't sufficient in my case: when `minikube kubectl config view` is called, a kubectl download message might be included before the first key. So I changed the parsing, even for the existing keys, to a consistent pattern in this file.

The old parsing in an example:
```
private val HOST_PREFIX = "host:"

val hostString = statusString.find(_.contains(s"$HOST_PREFIX "))

val status1 = hostString.get.split(HOST_PREFIX)(1)
```

The new parsing:
```
private val HOST_PREFIX = "host: "

val hostString = statusString.find(_.contains(HOST_PREFIX))

hostString.get.split(HOST_PREFIX)(1)
```

So the PREFIX is extended with the extra space at its declaration (this way the two separate string operations are safer and consistent with each other), and the replace is changed to a split, taking the 2nd string from the result (which is guaranteed to contain only the text after the PREFIX when the PREFIX is a contained substring).

Finally, there is a tiny change in `dev-run-integration-tests.sh` to introduce `--skip-building-dependencies`, which switches off building the Maven dependencies of `kubernetes-integration-tests` from the Spark project.
This can be used when only `kubernetes-integration-tests` should be rebuilt because only the tests are modified.

### Why are the changes needed?

Kubernetes client configuration based on kubeconfig settings is more reliable and provides a solution which is minikube version independent.

### Does this PR introduce _any_ user-facing change?

No. This is only test code.

### How was this patch tested?

Tested manually on two Minikube versions.

Minikube  v0.34.1:

```
$ minikube version
minikube version: v0.34.1

$ grep "version\|building" resource-managers/kubernetes/integration-tests/target/integration-tests.log
20/12/12 12:52:25.135 ScalaTest-main-running-DiscoverySuite INFO Minikube: minikube version: v0.34.1
20/12/12 12:52:25.761 ScalaTest-main-running-DiscoverySuite INFO Minikube: building kubernetes config with apiVersion: v1, masterUrl: https://192.168.99.103:8443, caCertFile: /Users/attilazsoltpiros/.minikube/ca.crt, clientCertFile: /Users/attilazsoltpiros/.minikube/apiserver.crt, clientKeyFile: /Users/attilazsoltpiros/.minikube/apiserver.key
```

Minikube v1.15.1
```
$ minikube version

minikube version: v1.15.1
commit: 23f40a012abb52eff365ff99a709501a61ac5876

$ grep "version\|building" resource-managers/kubernetes/integration-tests/target/integration-tests.log

20/12/13 06:25:55.086 ScalaTest-main-running-DiscoverySuite INFO Minikube: minikube version: v1.15.1
20/12/13 06:25:55.597 ScalaTest-main-running-DiscoverySuite INFO Minikube: building kubernetes config with apiVersion: v1, masterUrl: https://192.168.64.4:8443, caCertFile: /Users/attilazsoltpiros/.minikube/ca.crt, clientCertFile: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt, clientKeyFile: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key

$ minikube kubectl config view
apiVersion: v1
clusters:
- cluster:
    certificate-authority: /Users/attilazsoltpiros/.minikube/ca.crt
    server: https://192.168.64.4:8443
  name: minikube
contexts:
- context:
    cluster: minikube
    namespace: default
    user: minikube
  name: minikube
current-context: minikube
kind: Config
preferences: {}
users:
- name: minikube
  user:
    client-certificate: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt
    client-key: /Users/attilazsoltpiros/.minikube/profiles/minikube/client.key
```

Closes #30751 from attilapiros/SPARK-32617.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-24 11:46:27 -08:00
Dongjoon Hyun 9942548c37 [SPARK-34487][K8S][TESTS] Use the runtime Hadoop version in K8s IT
### What changes were proposed in this pull request?

This PR aims to use the runtime Hadoop version in K8s integration test.

### Why are the changes needed?

SPARK-33212 upgrades Hadoop dependency from 3.2.0 to 3.2.2 and we will upgrade to 3.3.x+.
We had better use the runtime Hadoop version instead of having a static string.
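
One way to obtain the runtime Hadoop version (a sketch; the integration test may derive it differently):

```scala
import org.apache.hadoop.util.VersionInfo

// Read the Hadoop version from the runtime classpath instead of hardcoding
// a string like "3.2.0" in the test.
val hadoopVersion: String = VersionInfo.getVersion
```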

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the K8s IT.

This is tested locally like the following.
```
KubernetesSuite:
...
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
...
```

Closes #31604 from dongjoon-hyun/SPARK-34487.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-21 08:57:02 -08:00
Dongjoon Hyun 020e84e92f [SPARK-34486][K8S] Upgrade kubernetes-client to 4.13.2
### What changes were proposed in this pull request?

This PR aims to upgrade `kubernetes-client` library from 4.12.0 to 4.13.2 for Apache Spark 3.2.0.

### Why are the changes needed?

This will bring [K8s 1.19.1](https://github.com/fabric8io/kubernetes-client/pull/2541) models officially and the latest bug fixes.

- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.0
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.1
- https://github.com/fabric8io/kubernetes-client/releases/tag/v4.13.2

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the K8s IT and UT.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 19 minutes, 25 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #31602 from dongjoon-hyun/SPARK-34486.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 18:35:38 +09:00
yi.wu 546d2eb5d4 [SPARK-34384][CORE] Add missing docs for ResourceProfile APIs
### What changes were proposed in this pull request?

This PR adds missing docs for ResourceProfile-related APIs. Besides that, it includes a few minor API changes (see the sketch after this list):

* ResourceProfileBuilder.build -> ResourceProfileBuilder.builder()
* Provides java specific API `allSupportedExecutorResourcesJList`
* private `ResourceAllocator` since it was mistakenly exposed previously
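
A hedged usage sketch of the builder-style API touched by this change (the exact method names and signatures may differ slightly from the final public API):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

// Request 4 cores / 6g per executor and 1 cpu per task, then build the profile.
val execReqs = new ExecutorResourceRequests().cores(4).memory("6g")
val taskReqs = new TaskResourceRequests().cpus(1)

val profile = new ResourceProfileBuilder()
  .require(execReqs)
  .require(taskReqs)
  .build() // build() is now invoked as a method, per the signature change above
```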

### Why are the changes needed?

Add missing API docs

### Does this PR introduce _any_ user-facing change?

No, as Apache Spark 3.1 hasn't officially released.

### How was this patch tested?

Updated unit tests due to the signature change of `build()`.

Closes #31496 from Ngone51/resource-profile-api-cleanup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-21 18:29:44 +09:00
Dongjoon Hyun 484a83e73e [SPARK-34469][K8S] Ignore RegisterExecutor when SparkContext is stopped
### What changes were proposed in this pull request?

This PR aims to make `KubernetesClusterSchedulerBackend` ignore `RegisterExecutor` message when `SparkContext` is stopped already.

### Why are the changes needed?

If `SparkDriver` is terminated, the executors will be removed by K8s automatically.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the newly added test case.

Closes #31587 from dongjoon-hyun/SPARK-34469.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-19 09:36:07 -08:00
“attilapiros” 76e5d75e36 [SPARK-33763] Add metrics for better tracking of dynamic allocation
### What changes were proposed in this pull request?

This PR adds the following metrics to track executor remove reasons during dynamic allocation (see the sketch after this list):
- `numberExecutorsGracefullyDecommissioned`: number of executors which reached the finished decommissioning state and shut themselves down cleanly
- `numberExecutorsDecommissionUnfinished`: number of executors which were asked to decommission but stopped without reaching the finished decommissioning state
- `numberExecutorsKilledByDriver`: number of executors killed by the driver (requested to stop)
- `numberExecutorsExitedUnexpectedly`: number of executors which exited without a driver request
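
A hedged sketch of how counters like these could be exposed through Spark's DropWizard-based metrics system (the actual metric source class and its wiring are not reproduced here):

```scala
import com.codahale.metrics.{Counter, MetricRegistry}

// Illustrative grouping of the four counters; names match the metrics listed above.
class DecommissionMetricsSketch(registry: MetricRegistry) {
  val gracefullyDecommissioned: Counter =
    registry.counter("numberExecutorsGracefullyDecommissioned")
  val decommissionUnfinished: Counter =
    registry.counter("numberExecutorsDecommissionUnfinished")
  val killedByDriver: Counter =
    registry.counter("numberExecutorsKilledByDriver")
  val exitedUnexpectedly: Counter =
    registry.counter("numberExecutorsExitedUnexpectedly")
}
```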

### Why are the changes needed?

To better support monitoring of dynamic allocation with these metrics.

### Does this PR introduce _any_ user-facing change?

Yes. The new metrics will be available for monitoring.

### How was this patch tested?

With unit and integration tests.

Finally manually checked the new metrics in jconsole:
<img width="1054" alt="jmx" src="https://user-images.githubusercontent.com/2017933/107458686-de8adf00-6b54-11eb-86f7-41faf2fb638f.png">

Closes #31450 from attilapiros/SPARK-33763-final.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-17 13:44:36 -08:00
“attilapiros” 5f91245cc2 [SPARK-34426][K8S][TESTS] Add driver and executors POD logs to integration tests log when the test fails
### What changes were proposed in this pull request?

This PR introduces a new protected method in `SparkFunSuite` which is only called when a test fails and can be used to collect logs for the failed test. In this PR it is implemented for the Kubernetes tests in the `KubernetesSuite` class, where it collects all the POD logs and logs them out.

This unfortunately cannot be realized with a simple "after" method, as the test outcome is not available in an "after" method (see the sketch below).
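
A hedged sketch of the ScalaTest hook pattern this relies on; the hook name `onTestFailure` is illustrative, and the method actually added to `SparkFunSuite` may be named differently:

```scala
import org.scalatest.Outcome
import org.scalatest.funsuite.AnyFunSuite

trait FailureLoggingSuite extends AnyFunSuite {
  // Subclasses (e.g. a K8s suite) override this to dump POD logs for the failed test.
  protected def onTestFailure(testName: String): Unit = {}

  // Unlike an "after" block, withFixture sees the test outcome.
  override protected def withFixture(test: NoArgTest): Outcome = {
    val outcome = super.withFixture(test)
    if (outcome.isFailed) {
      onTestFailure(test.name)
    }
    outcome
  }
}
```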

Moreover, this PR removes `appLocator` as a method argument, as it is already available as a member variable.

### Why are the changes needed?

Currently both the driver and executor logs are lost.

In [developer-tools](https://spark.apache.org/developer-tools.html) there is a hint:
"Getting logs from the pods and containers directly is an exercise left to the reader."

But when the test is executed by Jenkins and a failure happens, we really need the POD logs to analyze the problem.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

By integration testing. I checked what happens if one test fails; the output would be:

```
21/02/14 11:05:34.261 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite:

===== EXTRA LOGS FOR THE FAILED TEST

21/02/14 11:05:34.278 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: BEGIN driver POD log
++ id -u
+ myuid=185
++ id -g
+ mygid=0
+ set +e
++ getent passwd 185
+ uidentry=
+ set -e
+ '[' -z '' ']'
+ '[' -w /etc/passwd ']'
+ echo '185185:0:anonymous uid:/opt/spark:/bin/false'
+ SPARK_CLASSPATH=':/opt/spark/jars/*'
+ env
+ grep SPARK_JAVA_OPT_
+ sort -t_ -k4 -n
+ sed 's/[^=]*=\(.*\)/\1/g'
+ readarray -t SPARK_EXECUTOR_JAVA_OPTS
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z ']'
+ '[' -n '' ']'
+ '[' -z ']'
+ '[' -z x ']'
+ SPARK_CLASSPATH='/opt/spark/conf::/opt/spark/jars/*'
+ case "$1" in
+ shift 1
+ CMD=("$SPARK_HOME/bin/spark-submit" --conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS" --deploy-mode client "$")
+ exec /usr/bin/tini -s -- /opt/spark/bin/spark-submit --conf spark.driver.bindAddress=172.17.0.3 --deploy-mode client --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.deploy.PythonRunner local:///opt/spark/tests/decommissioning.py
21/02/14 10:02:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting decom test
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
21/02/14 10:02:29 INFO SparkContext: Running Spark version 3.2.0-SNAPSHOT
21/02/14 10:02:29 INFO ResourceUtils: ==============================================================
21/02/14 10:02:29 INFO ResourceUtils: No custom resources configured for spark.driver.
21/02/14 10:02:29 INFO ResourceUtils: ==============================================================
...
21/02/14 10:03:17 INFO ShutdownHookManager: Deleting directory /var/data/spark-fa6961ed-a2c1-444c-bfeb-20e63ba0b5cf/spark-ab4b0287-6e24-4b39-837e-9b0b62c1f26f
21/02/14 10:03:17 INFO ShutdownHookManager: Deleting directory /tmp/spark-d6b11e7d-6a03-4a1d-8559-37cb853319bf

21/02/14 11:05:34.279 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: END driver POD log
```

Closes #31561 from attilapiros/SPARK-34426.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-02-17 05:49:16 +09:00
Holden Karau 5248ecb5ab [SPARK-34104][SPARK-34105][CORE][K8S] Maximum decommissioning time & allow decommissioning for excludes
### What changes were proposed in this pull request?

Allow users to have Spark attempt to decommission excluded executors.
Since excluded executors may be flaky, this also adds the ability for users to specify a time limit after which a decommissioning executor will be killed by Spark.

### Why are the changes needed?

This may help prevent fetch failures from excluded executors, and also handle the situation in which executors

### Does this PR introduce _any_ user-facing change?

Yes, two new configuration flags for the behaviour.

### How was this patch tested?

Extended unit and integration tests.

Closes #31539 from holdenk/re=enable-SPARK-34104-SPARK-34105.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 18:16:09 -08:00
HyukjinKwon c8628c943c Revert "[SPARK-34104][SPARK-34105][CORE][K8S] Maximum decommissioning time & allow decommissioning for excludes"
This reverts commit 50641d2e3d.
2021-02-10 08:00:03 +09:00
Holden Karau 50641d2e3d [SPARK-34104][SPARK-34105][CORE][K8S] Maximum decommissioning time & allow decommissioning for excludes
### What changes were proposed in this pull request?

Allow users to have Spark attempt to decommission excluded executors.
Since excluded executors may be flaky, this also adds the ability for users to specify a time limit after which a decommissioning executor will be killed by Spark.

### Why are the changes needed?

This may help prevent fetch failures from excluded executors, and also handle the situation in which executors

### Does this PR introduce _any_ user-facing change?

Yes, two new configuration flags for the behaviour.

### How was this patch tested?

Extended unit and integration tests.

Closes #31249 from holdenk/configure-inaccessibleList-kill-to-use-decommissioning.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 14:21:24 -08:00
“attilapiros” b2dc38b654 [SPARK-34334][K8S] Correctly identify timed out pending pod requests as excess request
### What changes were proposed in this pull request?

Fix the identification of timed-out pending pod requests as excess requests to delete when the excess is higher than the number of newly created timed-out requests and there are also some non-timed-out newly created requests.

### Why are the changes needed?

After https://github.com/apache/spark/pull/29981, only timed-out newly created requests and timed-out pending requests are taken as excess requests.

But there is a small bug when the excess is higher than the number of newly created timed-out requests and there are some non-timed-out newly created requests as well: all the newly created requests are counted as excess requests when items are chosen from the timed-out pending pod requests.
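
A hedged sketch of the corrected selection (names and types are illustrative, not the actual allocator code in Spark):

```scala
// Choose up to `excess` outstanding requests to delete, preferring timed-out
// newly created requests and then timed-out pending requests.
def selectExcessToDelete(
    excess: Int,
    timedOutNewlyCreated: Seq[Long],
    timedOutPending: Seq[Long]): Seq[Long] = {
  val fromNewlyCreated = timedOutNewlyCreated.take(excess)
  // Only the requests actually taken above reduce the remaining excess;
  // counting *all* newly created requests here was the bug being fixed.
  val remaining = excess - fromNewlyCreated.size
  fromNewlyCreated ++ timedOutPending.take(remaining)
}
```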

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A new unit test was added: `SPARK-34334: correctly identify timed out pending pod requests as excess`.

Closes #31445 from attilapiros/SPARK-34334.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-02-09 10:06:55 -08:00
Dongjoon Hyun ea339c38b4 [SPARK-34407][K8S] KubernetesClusterSchedulerBackend.stop should clean up K8s resources
### What changes were proposed in this pull request?

This PR aims to fix `KubernetesClusterSchedulerBackend.stop` to wrap `super.stop` with `Utils.tryLogNonFatalError`.

### Why are the changes needed?

[CoarseGrainedSchedulerBackend.stop](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L559) may throw `SparkException` and this causes K8s resource (pod and configmap) leakage.
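
A minimal self-contained sketch of the wrapping pattern; Spark's `Utils.tryLogNonFatalError` is `private[spark]`, so a local stand-in with the same shape is used here:

```scala
object StopSketch {
  // Stand-in for Utils.tryLogNonFatalError: run the block, log (not rethrow) non-fatal errors.
  def tryLogNonFatalError(block: => Unit): Unit =
    try block catch {
      case scala.util.control.NonFatal(e) =>
        System.err.println(s"Ignoring non-fatal error during stop: ${e.getMessage}")
    }

  def stop(): Unit = {
    tryLogNonFatalError {
      // super.stop() in the real scheduler backend; it may throw SparkException.
      throw new RuntimeException("simulated failure in super.stop()")
    }
    // Because the failure above was swallowed, cleanup still runs and
    // executor pods / config maps are not leaked.
    println("cleaning up K8s resources (pods, config maps)")
  }
}
```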

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix.

### How was this patch tested?

Pass the CI with the newly added test case.

Closes #31533 from dongjoon-hyun/SPARK-34407.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-08 21:47:23 -08:00
yangjie01 b344e91368 [SPARK-34375][CORE][K8S][TEST] Replaces 'Mockito.initMocks' with 'Mockito.openMocks'
### What changes were proposed in this pull request?
`Mockito.initMocks(Object)` is a deprecated API; `Mockito.openMocks(Object).close()` should be used instead.
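
A brief before/after sketch in Scala; note that `initMocks`/`openMocks` live on `org.mockito.MockitoAnnotations`:

```scala
import org.mockito.MockitoAnnotations

abstract class SomeMockedSuiteSketch {
  // Before (deprecated):
  //   MockitoAnnotations.initMocks(this)

  // After: openMocks returns an AutoCloseable, closed immediately as in this change.
  MockitoAnnotations.openMocks(this).close()
}
```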

### Why are the changes needed?
Clean up deprecated API usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31487 from LuciferYang/mockito-api.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-02-08 15:13:00 +09:00
Dongjoon Hyun f66e38c963 [SPARK-34316][K8S] Support spark.kubernetes.executor.disableConfigMap
### What changes were proposed in this pull request?

This PR aims to add a new configuration `spark.kubernetes.executor.disableConfigMap`.

### Why are the changes needed?

This can be used to disable config map creation for executor pods (due to https://github.com/apache/spark/pull/27735).
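
A hedged usage sketch showing how a user might opt out via `SparkConf` (the app name and other settings are illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("example-app")
  // Skip creating the executor config map introduced by SPARK-30985.
  .set("spark.kubernetes.executor.disableConfigMap", "true")
```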

### Does this PR introduce _any_ user-facing change?

No. By default, this doesn't change the as-is behavior.
This is a new feature that adds the ability to disable the behavior introduced by SPARK-30985.

### How was this patch tested?

Pass the newly added UT.

Closes #31428 from dongjoon-hyun/SPARK-34316.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-02-01 22:26:07 -08:00
“attilapiros” d3f049cbc2 [SPARK-34154][YARN][FOLLOWUP] Fix flaky LocalityPlacementStrategySuite test
### What changes were proposed in this pull request?

Fixing the flaky `handle large number of containers and tasks (SPARK-18750)` test by avoiding the use of `DNSToSwitchMapping`, as in some situations DNS lookup can be extremely slow.

### Why are the changes needed?

After https://github.com/apache/spark/pull/31363 was merged, the flaky `handle large number of containers and tasks (SPARK-18750)` test failed again in some other PRs, but now we have the exact place where the test is stuck.

It is in the DNS lookup:

```
[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (30 seconds, 4 milliseconds)
[info]   Failed with an exception or a timeout at thread join:
[info]
[info]   java.lang.RuntimeException: Timeout at waiting for thread to stop (its stack trace is added to the exception)
[info]   	at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
[info]   	at java.net.InetAddress$2.lookupAllHostAddr(InetAddress.java:929)
[info]   	at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1324)
[info]   	at java.net.InetAddress.getAllByName0(InetAddress.java:1277)
[info]   	at java.net.InetAddress.getAllByName(InetAddress.java:1193)
[info]   	at java.net.InetAddress.getAllByName(InetAddress.java:1127)
[info]   	at java.net.InetAddress.getByName(InetAddress.java:1077)
[info]   	at org.apache.hadoop.net.NetUtils.normalizeHostName(NetUtils.java:568)
[info]   	at org.apache.hadoop.net.NetUtils.normalizeHostNames(NetUtils.java:585)
[info]   	at org.apache.hadoop.net.CachedDNSToSwitchMapping.resolve(CachedDNSToSwitchMapping.java:109)
[info]   	at org.apache.spark.deploy.yarn.SparkRackResolver.coreResolve(SparkRackResolver.scala:75)
[info]   	at org.apache.spark.deploy.yarn.SparkRackResolver.resolve(SparkRackResolver.scala:66)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.$anonfun$localityOfRequestedContainers$3(LocalityPreferredContainerPlacementStrategy.scala:142)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy$$Lambda$658/1080992036.apply$mcVI$sp(Unknown Source)
[info]   	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:158)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.localityOfRequestedContainers(LocalityPreferredContainerPlacementStrategy.scala:138)
[info]   	at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.org$apache$spark$deploy$yarn$LocalityPlacementStrategySuite$$runTest(LocalityPlacementStrategySuite.scala:94)
[info]   	at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite$$anon$1.run(LocalityPlacementStrategySuite.scala:40)
[info]   	at java.lang.Thread.run(Thread.java:748) (LocalityPlacementStrategySuite.scala:61)
...
```

This could be because the DNS servers used by those build machines are not configured to handle IPv6 queries, so the client has to wait for the IPv6 query to time out before falling back to IPv4.

This even makes the tests more consistent, as when a single host was given to look up via `resolve(hostName: String)` it gave a different answer than calling `resolve(hostNames: Seq[String])` with a `Seq` containing that single host.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #31397 from attilapiros/SPARK-34154-2nd.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-29 23:54:40 +09:00
Dongjoon Hyun 78244bafe8 [SPARK-34281][K8S] Promote spark.kubernetes.executor.podNamePrefix to the public conf
### What changes were proposed in this pull request?

This PR aims to remove `internal()` from `spark.kubernetes.executor.podNamePrefix` in order to make the configuration public.

### Why are the changes needed?

In line with K8s GA, this will officially allow some users to control the full executor pod names.
This is useful when we want a custom executor pod name pattern independent of the app name.

### Does this PR introduce _any_ user-facing change?

No, this has been there since Apache Spark 2.3.0.

### How was this patch tested?

N/A.

Closes #31386 from dongjoon-hyun/SPARK-34281.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-28 13:01:18 -08:00
“attilapiros” 0dedf24cd0 [SPARK-34154][YARN] Extend LocalityPlacementStrategySuite's test with a timeout
### What changes were proposed in this pull request?

This PR extends the `handle large number of containers and tasks (SPARK-18750)` test with a time limit, and in case of a timeout it saves the stack trace of the running thread to provide extra information about why it got stuck.
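
A hedged sketch of the timeout-plus-stack-trace technique (an illustrative helper, not the exact test code):

```scala
def runWithTimeout(timeoutMs: Long)(body: => Unit): Unit = {
  var thrown: Option[Throwable] = None
  val worker = new Thread(() => try body catch { case e: Throwable => thrown = Some(e) })
  worker.setDaemon(true)
  worker.start()
  worker.join(timeoutMs)
  if (worker.isAlive) {
    // Attach the stuck thread's stack trace so the failure shows where it hangs.
    val timeout = new RuntimeException("Timeout at waiting for thread to stop")
    timeout.setStackTrace(worker.getStackTrace)
    throw timeout
  }
  thrown.foreach(e => throw e)
}
```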

### Why are the changes needed?

This is a flaky test which sometimes runs for hours without stopping.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I checked it with a temporary code change: by adding a `Thread.sleep` to `LocalityPreferredContainerPlacementStrategy#expectedHostToContainerCount`.

The stack trace showed the correct method:

```
[info] LocalityPlacementStrategySuite:
[info] - handle large number of containers and tasks (SPARK-18750) *** FAILED *** (30 seconds, 26 milliseconds)
[info]   Failed with an exception or a timeout at thread join:
[info]
[info]   java.lang.RuntimeException: Timeout at waiting for thread to stop (its stack trace is added to the exception)
[info]   	at java.lang.Thread.sleep(Native Method)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.$anonfun$expectedHostToContainerCount$1(LocalityPreferredContainerPlacementStrategy.scala:198)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy$$Lambda$281/381161906.apply(Unknown Source)
[info]   	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
[info]   	at scala.collection.TraversableLike$$Lambda$16/322836221.apply(Unknown Source)
[info]   	at scala.collection.immutable.HashMap$HashMap1.foreach(HashMap.scala:234)
[info]   	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
[info]   	at scala.collection.immutable.HashMap$HashTrieMap.foreach(HashMap.scala:468)
[info]   	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
[info]   	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
[info]   	at scala.collection.AbstractTraversable.map(Traversable.scala:108)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.expectedHostToContainerCount(LocalityPreferredContainerPlacementStrategy.scala:188)
[info]   	at org.apache.spark.deploy.yarn.LocalityPreferredContainerPlacementStrategy.localityOfRequestedContainers(LocalityPreferredContainerPlacementStrategy.scala:112)
[info]   	at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite.org$apache$spark$deploy$yarn$LocalityPlacementStrategySuite$$runTest(LocalityPlacementStrategySuite.scala:94)
[info]   	at org.apache.spark.deploy.yarn.LocalityPlacementStrategySuite$$anon$1.run(LocalityPlacementStrategySuite.scala:40)
[info]   	at java.lang.Thread.run(Thread.java:748) (LocalityPlacementStrategySuite.scala:61)
...
```

Closes #31363 from attilapiros/SPARK-34154.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-28 08:04:25 +09:00
yangjie01 8999e8805d [SPARK-34224][CORE][SQL][SS][DSTREAM][YARN][TEST][EXAMPLES] Ensure all resource opened by Source.fromXXX are closed
### What changes were proposed in this pull request?
Using a function like `.mkString` or `.getLines` directly on a `scala.io.Source` opened by `fromFile`, `fromURL`, or `fromURI` will leak the underlying file handle. This PR uses the `Utils.tryWithResource` method to wrap the `BufferedSource` and ensure these `BufferedSource`s are closed.
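
A hedged before/after sketch of the pattern; Spark's `Utils.tryWithResource` is `private[spark]`, so an equivalent local helper with the same shape is shown (the file path is illustrative):

```scala
import java.io.Closeable
import scala.io.Source

// Same shape as Utils.tryWithResource: create, use, always close.
def tryWithResource[R <: Closeable, T](createResource: => R)(f: R => T): T = {
  val resource = createResource
  try f(resource) finally resource.close()
}

// Before (leaks the underlying file handle):
//   val content = Source.fromFile("conf/spark-defaults.conf").mkString

// After: the BufferedSource is always closed.
val content = tryWithResource(Source.fromFile("conf/spark-defaults.conf"))(_.mkString)
```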

### Why are the changes needed?
Avoid file handle leak.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #31323 from LuciferYang/source-not-closed.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-01-26 19:06:37 +09:00
Dongjoon Hyun 134a7d7eb9 [SPARK-34206][K8S] Make Guava Cache as ExecutorPodsLifecycleManager private field
### What changes were proposed in this pull request?

`KubernetesClusterManager` and `ExecutorPodsLifecycleManager` are private Spark classes.
This PR aims to move the `Guava Cache` from a constructor parameter to a private field of `ExecutorPodsLifecycleManager`.
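
A hedged sketch of the shape of the refactoring (the cache key/value types and expiry are illustrative); since Spark shades Guava, keeping the cache internal avoids exposing a shaded type in a constructor:

```scala
import java.util.concurrent.TimeUnit
import com.google.common.cache.CacheBuilder

class ExecutorPodsLifecycleManagerSketch {
  // Previously injected via the constructor by the cluster manager;
  // now an internal implementation detail of the lifecycle manager.
  private val removedExecutorsCache =
    CacheBuilder.newBuilder()
      .expireAfterWrite(3, TimeUnit.MINUTES)
      .build[java.lang.Long, java.lang.Long]()
}
```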

### Why are the changes needed?

1. Although `KubernetesClusterManager` creates `Guava Cache`, only `ExecutorPodsLifecycleManager` uses it.
2. Although `ExecutorPodsLifecycleManager` is a Spark private class, when some users implement a new cluster manager with `ExternalClusterManager` for K8s, they can reuse `ExecutorPodsLifecycleManager`. In this case, `Guava Cache` is not good as an interface because it's a shaded class.

### Does this PR introduce _any_ user-facing change?

No. This is Spark private.

### How was this patch tested?

Pass the existing UTs.

Closes #31297 from dongjoon-hyun/SPARK-34206.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-22 19:36:07 -08:00
Chao Sun b6f46ca297 [SPARK-33212][BUILD] Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile
### What changes were proposed in this pull request?

This:
1. switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x.
2. upgrade built-in version for Hadoop 3.x to Hadoop 3.2.2

Note that for Hadoop 2.7, we'll still use the same modules such as hadoop-client.

In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:

```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```

which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.

Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).

### Why are the changes needed?

Hadoop 3.2.2 is released with new features and bug fixes, so it's good for the Spark community to adopt it. However, latest Hadoop versions starting from Hadoop 3.2.1 have upgraded to use Guava 27+. In order to resolve Guava conflicts, this takes the approach by switching to shaded client jars provided by Hadoop. This also has the benefits of avoid pulling other 3rd party dependencies from Hadoop side so as to avoid more potential future conflicts.

### Does this PR introduce _any_ user-facing change?

When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.

### How was this patch tested?

Relying on existing tests.

Closes #30701 from sunchao/test-hadoop-3.2.2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-15 14:06:50 -08:00
yangjie01 8b1ba233f1 [SPARK-34068][CORE][SQL][MLLIB][GRAPHX] Remove redundant collection conversion
### What changes were proposed in this pull request?
There are some redundant collection conversions that can be removed; for version compatibility, these are cleaned up with the Scala 2.13 profile.

### Why are the changes needed?
Remove redundant collection conversions

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Pass the Jenkins or GitHub  Action
- Manual tests of `core`, `graphx`, `mllib`, `mllib-local`, `sql`, `yarn`, `kafka-0-10` in Scala 2.13 passed

Closes #31125 from LuciferYang/SPARK-34068.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-01-13 18:07:02 -06:00
“attilapiros” 6bd7a6200f [SPARK-33711][K8S] Avoid race condition between POD lifecycle manager and scheduler backend
### What changes were proposed in this pull request?

Missing POD detection is extended with a timestamp (and time limit) based check to avoid wrongly flagging PODs as missing.

The two new timestamps:
- `fullSnapshotTs` is introduced for the `ExecutorPodsSnapshot` which only updated by the pod polling snapshot source
- `registrationTs` is introduced for the `ExecutorData` and is initialized when the executor registers at the scheduler backend

Moreover a new config `spark.kubernetes.executor.missingPodDetectDelta` is used to specify the accepted delta between the two.
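
A hedged sketch of the resulting guard (illustrative names; the real check lives in the executor POD lifecycle manager):

```scala
// Treat a registered executor as a missing POD only when the *full* snapshot
// was polled sufficiently later than the executor's registration.
def shouldTreatAsMissing(
    fullSnapshotTs: Long,
    registrationTs: Long,
    missingPodDetectDelta: Long): Boolean = {
  fullSnapshotTs - registrationTs > missingPodDetectDelta
}
```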

### Why are the changes needed?

Watching a POD (`ExecutorPodsWatchSnapshotSource`) only informs about single POD changes. This could wrongly lead to detecting missing PODs (PODs known by the scheduler backend but missing from the POD snapshots) by the executor POD lifecycle manager.

A key indicator of this error is seeing this log message:

> "The executor with ID [some_id] was not found in the cluster but we didn't get a reason why. Marking the executor as failed. The executor may have been deleted but the driver missed the deletion event."

So one of the problems is running the missing POD detection check even when only a single POD has changed, without having a full, consistent snapshot of all the PODs (see `ExecutorPodsPollingSnapshotSource`).
The other problem is the race between the executor POD lifecycle manager and the scheduler backend: even in the case of a full snapshot, the registration at the scheduler backend could precede the snapshot polling (and the processing of those polled snapshots).

### Does this PR introduce _any_ user-facing change?

Yes. When the POD is missing, the reason message explaining the executor's exit is extended with both timestamps (the polling time and the executor registration time), and the new config is mentioned as well.

### How was this patch tested?

The existing unit tests are extended.

Closes #30675 from attilapiros/SPARK-33711.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2021-01-11 14:25:12 -08:00
HyukjinKwon 830249284d [SPARK-34059][SQL][CORE] Use for/foreach rather than map to make sure execute it eagerly
### What changes were proposed in this pull request?

This PR is basically a followup of https://github.com/apache/spark/pull/14332.
Calling `map` alone might leave it not executed due to lazy evaluation, e.g.)

```
scala> val foo = Seq(1,2,3)
foo: Seq[Int] = List(1, 2, 3)

scala> foo.map(println)
1
2
3
res0: Seq[Unit] = List((), (), ())

scala> foo.view.map(println)
res1: scala.collection.SeqView[Unit,Seq[_]] = SeqViewM(...)

scala> foo.view.foreach(println)
1
2
3
```

We had better use `foreach` to make sure it's executed when the output is unused or `Unit`.

### Why are the changes needed?

To prevent the potential issues by not executing `map`.

### Does this PR introduce _any_ user-facing change?

No, the current code does not appear to be causing any problems for now.

### How was this patch tested?

I found these items by running the IntelliJ inspection, double-checked them one by one, and fixed them. Ideally these should be all instances across the codebase.

Closes #31110 from HyukjinKwon/SPARK-34059.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-01-10 15:22:24 -08:00
Holden Karau 8e11ce5378 [SPARK-34018][K8S] NPE in ExecutorPodsSnapshot
### What changes were proposed in this pull request?

Label both the statuses and ensure the ExecutorPodSnapshot starts with the default config to match.

### Why are the changes needed?

The current test depends on the order rather than testing the desired property.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Labeled the container statuses, observed the failures, added the default label as the initialization point, and the tests passed again.

Built Spark, ran on K8s cluster verified no NPE in driver log.

Closes #31071 from holdenk/SPARK-34018-finishedExecutorWithRunningSidecar-doesnt-correctly-constructt-the-test-case.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-07 16:47:37 -08:00
Prashant Sharma f64dfa8727 [SPARK-32221][K8S] Avoid possible errors due to incorrect file size or type supplied in spark conf
### What changes were proposed in this pull request?

Skip files if they are binary or too large to fit within the configMap's max size.

### Why are the changes needed?

A config map cannot hold binary files and there is also a limit on how much data a configMap can hold.
This limit can be configured by the k8s cluster admin. This PR skips such files (with a warning) instead of failing with weird runtime errors.
If such files are not skipped, the result would be mount errors or encoding errors (if binary files are submitted).
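
A hedged sketch of the skip-and-warn idea (the size limit parameter and the binary-detection heuristic are assumptions, not the PR's exact logic):

```scala
import java.nio.file.{Files, Path}

// Returns true when the file is safe to propagate via the config map.
def shouldPropagate(file: Path, maxConfigMapBytes: Long): Boolean = {
  if (Files.size(file) > maxConfigMapBytes) {
    System.err.println(s"Skipping $file: larger than $maxConfigMapBytes bytes")
    false
  } else if (Files.readAllBytes(file).contains(0.toByte)) { // crude binary heuristic
    System.err.println(s"Skipping $file: looks like a binary file")
    false
  } else {
    true
  }
}
```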

### Does this PR introduce _any_ user-facing change?

Yes. In simple words, it avoids possible errors due to negligence (for example, placing a large file or a binary file in SPARK_CONF_DIR) and thus improves the user experience.

### How was this patch tested?

Added relevant tests and improved existing tests.

Closes #30472 from ScrapCodes/SPARK-32221/avoid-conf-propagate-errors.

Lead-authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Co-authored-by: Prashant Sharma <prashant@apache.org>
Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com>
2021-01-06 14:55:40 +05:30
Holden Karau 171db85aa2 [SPARK-33874][K8S][FOLLOWUP] Handle long lived sidecars - clean up logging
### What changes were proposed in this pull request?

Switch the log level from warn to debug when the Spark container is not present in the pod's container statuses.

### Why are the changes needed?

There are many non-critical situations where the Spark container may not be present, and the warning log level is too high.

### Does this PR introduce _any_ user-facing change?

Log message change.

### How was this patch tested?

N/A

Closes #31047 from holdenk/SPARK-33874-follow-up.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-05 13:48:52 -08:00
Holden Karau 448494ebcf [SPARK-33874][K8S] Handle long lived sidecars
### What changes were proposed in this pull request?

For the liveness check when `checkAllContainers` is not set, we check the liveness status of the Spark container if we can find it.
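
A hedged sketch of that check using the fabric8 pod model (the Spark container name is an assumption here):

```scala
import io.fabric8.kubernetes.api.model.Pod
import scala.collection.JavaConverters._

// True when the Spark container has terminated, regardless of still-running sidecars.
def sparkContainerExited(pod: Pod, sparkContainerName: String): Boolean = {
  pod.getStatus.getContainerStatuses.asScala
    .find(_.getName == sparkContainerName) match {
      case Some(status) => status.getState.getTerminated != null
      case None         => false // Spark container not found; fall back to other checks
    }
}
```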

### Why are the changes needed?

Some environments may deploy long-lived log-collecting sidecars which outlive the Spark application. Just because they remain alive does not mean the Spark executor should keep running.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Extended the existing pod status tests.

Closes #30892 from holdenk/SPARK-33874-handle-long-lived-sidecars.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-30 14:06:34 +09:00
David McWhorter 87c58367cd [SPARK-22256][MESOS] Introduce spark.mesos.driver.memoryOverhead
### What changes were proposed in this pull request?
This is a simple change to support allocating a specified amount of overhead memory for the driver's Mesos container. This is already supported for executors.

### Why are the changes needed?
This is needed to keep the driver process from exceeding memory limits and being killed off when running on Mesos.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a `spark.mesos.driver.memoryOverhead` configuration option.  Documentation changes for this option are included in the PR.

### How was this patch tested?
Test cases covering allocation of driver memory overhead are included in the changes.

### Other notes
This is a second attempt to get this change reviewed, accepted and merged.  The original pull request was closed as stale back in January: https://github.com/apache/spark/pull/21006.
For this pull request, I took the original change by pmackles, rebased it onto the current master branch, and added a test case that was requested in the original code review.
I'm happy to make any further edits or do anything needed so that this can be included in a future Spark release. I keep having to build custom Spark distributions so that we can use Spark within our Mesos clusters.

Closes #30739 from dmcwhorter/dmcwhorter-SPARK-22256.

Lead-authored-by: David McWhorter <david_mcwhorter@premierinc.com>
Co-authored-by: Paul Mackles <pmackles@adobe.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-12-15 14:00:38 -08:00
HyukjinKwon a99a47ca1d [SPARK-33748][K8S] Respect environment variables and configurations for Python executables
### What changes were proposed in this pull request?

This PR proposes:

- Respect the `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or the `spark.pyspark.python` and `spark.pyspark.driver.python` configurations, in Kubernetes just like other cluster types in Spark.

- Deprecate `spark.kubernetes.pyspark.pythonVersion` and guide users to set the environment variables and configurations for Python executables.
    NOTE that `spark.kubernetes.pyspark.pythonVersion` is already a no-op configuration without this PR. Default is `3` and other values are disallowed.

- In order for Python executable settings to be consistently used, fix the `spark.archives` option to unpack into the current working directory in the driver of Kubernetes' cluster mode. This behaviour is identical to Yarn's cluster mode. By doing this, users can leverage Conda or virtualenv in cluster mode as below:

   ```python
    conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
    conda activate pyspark_conda_env
    conda pack -f -o pyspark_conda_env.tar.gz
    PYSPARK_PYTHON=./environment/bin/python spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
   ```

- Removed several unused or useless pieces of code such as `extractS3Key` and `renameResourcesToLocalFS`

### Why are the changes needed?

- To provide a consistent support of PySpark by using `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, or `spark.pyspark.python` and `spark.pyspark.driver.python` configurations.
- To provide Conda and virtualenv support via `spark.archives` options.

### Does this PR introduce _any_ user-facing change?

Yes:

- `spark.kubernetes.pyspark.pythonVersion` is deprecated.
- `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` environment variables, and `spark.pyspark.python` and `spark.pyspark.driver.python` configurations are respected.

### How was this patch tested?

Manually tested via:

```bash
minikube delete
minikube start --cpus 12 --memory 16384
kubectl create namespace spark-integration-test
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ServiceAccount
metadata:
  name: spark
  namespace: spark-integration-test
EOF
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=spark-integration-test:spark --namespace=spark-integration-test
dev/make-distribution.sh --pip --tgz -Pkubernetes
resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz `pwd`/spark-3.2.0-SNAPSHOT-bin-3.2.0.tgz  --service-account spark --namespace spark-integration-test
```

Unit tests were also added.

Closes #30735 from HyukjinKwon/SPARK-33748.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-12-15 08:56:45 +09:00