ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Gengliang Wang	5d45a415f3	Preparing Spark release v3.2.0-rc7	2021-10-06 11:45:26 +00:00
Gengliang Wang	4bd358474b	Preparing development version 3.2.1-SNAPSHOT	2021-09-28 10:53:42 +00:00
Gengliang Wang	dde73e2e1c	Preparing Spark release v3.2.0-rc6	2021-09-28 10:53:35 +00:00
Gengliang Wang	0c57bb8f7f	Preparing development version 3.2.1-SNAPSHOT	2021-09-27 08:24:50 +00:00
Gengliang Wang	49aea14c5a	Preparing Spark release v3.2.0-rc5	2021-09-27 08:24:44 +00:00
Gengliang Wang	2348cce37e	Preparing development version 3.2.1-SNAPSHOT	2021-09-26 12:28:46 +00:00
Gengliang Wang	2ed8c08c5b	Preparing Spark release v3.2.0-rc5	2021-09-26 12:28:40 +00:00
Gengliang Wang	da722d43cb	Preparing development version 3.2.1-SNAPSHOT	2021-09-24 10:03:23 +00:00
Gengliang Wang	9e35703211	Preparing Spark release v3.2.0-rc5	2021-09-24 10:03:16 +00:00
Gengliang Wang	0fb7127f85	Preparing development version 3.2.1-SNAPSHOT	2021-09-23 08:46:28 +00:00
Gengliang Wang	b609f2fe0c	Preparing Spark release v3.2.0-rc4	2021-09-23 08:46:22 +00:00
Dongjoon Hyun	5d0e51e943	[SPARK-36806][K8S][R] Use R 4.0.4 in K8s R image ### What changes were proposed in this pull request? This PR aims to upgrade R from 3.6.3 to 4.0.4 in K8s R Docker image. ### Why are the changes needed? `openjdk:11-jre-slim` image is upgraded to `Debian 11`. ``` $ docker run -it openjdk:11-jre-slim cat /etc/os-release PRETTY_NAME="Debian GNU/Linux 11 (bullseye)" NAME="Debian GNU/Linux" VERSION_ID="11" VERSION="11 (bullseye)" VERSION_CODENAME=bullseye ID=debian HOME_URL="https://www.debian.org/" SUPPORT_URL="https://www.debian.org/support" BUG_REPORT_URL="https://bugs.debian.org/" ``` It causes `R 3.5` installation failures in our K8s integration test environment. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47953/ ``` The following packages have unmet dependencies: r-base-core : Depends: libicu63 (>= 63.1-1~) but it is not installable Depends: libreadline7 (>= 6.0) but it is not installable E: Unable to correct problems, you have held broken packages. The command '/bin/sh -c apt-get update && apt install -y gnupg && echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list && apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' && apt-get update && apt install -y -t buster-cran35 r-base r-base-dev && rm -rf ``` ### Does this PR introduce _any_ user-facing change? Yes, this will recover the installation. ### How was this patch tested? Succeed to build SparkR docker image in the K8s integration test in Jenkins CI. - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/47959/ ``` Successfully built 32e1a0cd5ff8 Successfully tagged kubespark/spark-r:3.3.0-SNAPSHOT_6e4f7e2d-054d-4978-812f-4f32fc546b51 ``` Closes #34048 from dongjoon-hyun/SPARK-36806. 
Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `a178752540`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-09-20 14:38:32 -07:00
Gengliang Wang	b0249851f6	Preparing development version 3.2.1-SNAPSHOT	2021-09-18 11:30:12 +00:00
Gengliang Wang	96044e9735	Preparing Spark release v3.2.0-rc3	2021-09-18 11:30:06 +00:00
Gengliang Wang	1bad04d028	Preparing development version 3.2.1-SNAPSHOT	2021-08-31 17:04:14 +00:00
Gengliang Wang	03f5d23e96	Preparing Spark release v3.2.0-rc2	2021-08-31 17:04:08 +00:00
Gengliang Wang	69be513c5e	Preparing development version 3.2.1-SNAPSHOT	2021-08-20 12:40:47 +00:00
Gengliang Wang	6bb3523d8e	Preparing Spark release v3.2.0-rc1	2021-08-20 12:40:40 +00:00
Gengliang Wang	fafdc1482b	Revert "Preparing Spark release v3.2.0-rc1" This reverts commit `8e58fafb05`.	2021-08-20 20:07:02 +08:00
Gengliang Wang	c829ed53ff	Revert "Preparing development version 3.2.1-SNAPSHOT" This reverts commit `4f1d21571d`.	2021-08-20 20:07:01 +08:00
Gengliang Wang	4f1d21571d	Preparing development version 3.2.1-SNAPSHOT	2021-08-19 14:08:32 +00:00
Gengliang Wang	8e58fafb05	Preparing Spark release v3.2.0-rc1	2021-08-19 14:08:26 +00:00
attilapiros	eb09be9e68	[SPARK-36052][K8S] Introducing a limit for pending PODs ### What changes were proposed in this pull request? Introducing a limit for pending PODs (newly created/requested executors included). This limit is global for all the resource profiles. So first we have to count all the newly created and pending PODs (minus the ones already requested to be deleted); then we can share the remaining pending POD slots among the resource profiles. ### Why are the changes needed? Without this PR, dynamic allocation could request too many PODs, overloading the K8S scheduler and affecting the scheduling of PODs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? With new unit tests. Closes #33492 from attilapiros/SPARK-36052. Authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org> (cherry picked from commit `1dced492fb`) Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-08-16 16:06:29 -07:00
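The counting-then-sharing step described in SPARK-36052 can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `PendingPodLimit`, `ProfileState`, `allocate`, and the even-split policy are all assumptions made for the example.

```scala
object PendingPodLimit {
  // Hypothetical per-profile snapshot; field names are illustrative, not Spark's.
  final case class ProfileState(
      newlyCreated: Int,       // requested, not yet reported back by K8s
      pending: Int,            // known to K8s, not yet running
      deletionRequested: Int,  // subset of the above already asked to be deleted
      wanted: Int)             // extra executors this profile would like

  // Returns, per resource profile id, how many new pods may be requested now.
  def allocate(maxPendingPods: Int, profiles: Map[Int, ProfileState]): Map[Int, Int] = {
    // Step 1: count everything outstanding across all profiles (the limit is global).
    val outstanding = profiles.values
      .map(p => p.newlyCreated + p.pending - p.deletionRequested).sum
    val freeSlots = math.max(0, maxPendingPods - outstanding)

    // Step 2: share the free slots among the profiles that still want executors.
    // Assumed policy: an even split, in profile-id order.
    val wanting = profiles.toSeq.filter(_._2.wanted > 0).sortBy(_._1)
    val result = wanting.foldLeft((freeSlots, wanting.size, Map.empty[Int, Int])) {
      case ((slots, left, acc), (id, p)) =>
        val fairShare = (slots + left - 1) / left  // ceil(slots / left)
        val granted = math.min(p.wanted, fairShare)
        (slots - granted, left - 1, acc + (id -> granted))
    }
    result._3
  }
}
```

With a limit of 10 and 4 pods already outstanding, the 6 free slots are divided among the profiles that asked for more, and no profile receives more than it wanted.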
Kent Yao	94c1e3c38c	[SPARK-35969][K8S] Make the pod prefix more readable and tallied with K8S DNS Label Names ### What changes were proposed in this pull request? By default, the executor pod prefix is generated by the app name. It handles characters that match [^a-z0-9\\-] differently. The '.' and all whitespaces will be converted to '-', but other ones to empty string. Especially, characters like '_', '\|' are commonly used as a word separator in many languages. According to the K8S DNS Label Names, see https://kubernetes.io/docs/concepts/overview/working-with-objects/names/#dns-label-names, we can convert all special characters to `-`. For example, ``` scala> "xyz_abc_i_am_a_app_name_w/_some_abbrs".replaceAll("[^a-z0-9\\-]", "-").replaceAll("-+", "-") res11: String = xyz-abc-i-am-a-app-name-w-some-abbrs scala> "xyz_abc_i_am_a_app_name_w/_some_abbrs".replaceAll("\\s+", "-").replaceAll("\\.", "-").replaceAll("[^a-z0-9\\-]", "").replaceAll("-+", "-") res12: String = xyzabciamaappnamewsomeabbrs ``` ```scala scala> "time.is%the￥most$valuable_——————thing,it's about time.".replaceAll("[^a-z0-9\\-]", "-").replaceAll("-+", "-") res9: String = time-is-the-most-valuable-thing-it-s-about-time- scala> "time.is%the￥most$valuable_——————thing,it's about time.".replaceAll("\\s+", "-").replaceAll("\\.", "-").replaceAll("[^a-z0-9\\-]", "").replaceAll("-+", "-") res10: String = time-isthemostvaluablethingits-about-time- ``` ### Why are the changes needed? For better UX ### Does this PR introduce _any_ user-facing change? yes, the executor pod name might look better ### How was this patch tested? add new ones Closes #33171 from yaooqinn/SPARK-35969. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-01 08:15:00 -07:00
Dongjoon Hyun	59ec7a20b0	[SPARK-35885][K8S][R] Use keyserver.ubuntu.com as a keyserver for CRAN ### What changes were proposed in this pull request? This PR aims to use `keyserver.ubuntu.com` as a keyserver for CRAN. ### Why are the changes needed? Currently, both servers fail and K8s IT fails at SparkR image building phase. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20K8s%20Builds/job/spark-master-test-k8s/801/console ``` $ docker run -it --rm openjdk:11 /bin/bash root3e89a8d05378:/# echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list root3e89a8d05378:/# (apt-key adv --keyserver keys.gnupg.net --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' \|\| apt-key adv --keyserver keys.openpgp.org --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF') Executing: /tmp/apt-key-gpghome.8lNIiUuhoE/gpg.1.sh --keyserver keys.gnupg.net --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF gpg: keyserver receive failed: No name Executing: /tmp/apt-key-gpghome.stxb8XUlx8/gpg.1.sh --keyserver keys.openpgp.org --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF gpg: key AD5F960A256A04AF: new key but contains no user ID - skipped gpg: Total number processed: 1 gpg: w/o user IDs: 1 root3e89a8d05378:/# apt-get update ... Err:3 http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease The following signatures couldn't be verified because the public key is not available: NO_PUBKEY FCAE2A0E115C3D8A ... W: GPG error: http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease: The following signatures couldn't be verified because the public key is not available: NO_PUBKEY FCAE2A0E115C3D8A E: The repository 'http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease' is not signed. N: Updating from such a repository can't be done securely, and is therefore disabled by default. N: See apt-secure(8) manpage for repository creation and user configuration details. 
``` `keyserver.ubuntu.com` is a recommended backup server in CRAN document. - http://cloud.r-project.org/bin/linux/debian/ ``` $ docker run -it --rm openjdk:11 /bin/bash rootc9b183e45ffe:/# echo "deb http://cloud.r-project.org/bin/linux/debian buster-cran35/" >> /etc/apt/sources.list rootc9b183e45ffe:/# apt-key adv --keyserver keyserver.ubuntu.com --recv-key 'E19F5F87128899B192B1A2C2AD5F960A256A04AF' Executing: /tmp/apt-key-gpghome.P6cxYkOge7/gpg.1.sh --keyserver keyserver.ubuntu.com --recv-key E19F5F87128899B192B1A2C2AD5F960A256A04AF gpg: key AD5F960A256A04AF: public key "Johannes Ranke (Wissenschaftlicher Berater) <johannes.rankejrwb.de>" imported gpg: Total number processed: 1 gpg: imported: 1 rootc9b183e45ffe:/# apt-get update Get:1 http://deb.debian.org/debian buster InRelease [122 kB] Get:2 http://security.debian.org/debian-security buster/updates InRelease [65.4 kB] Get:3 http://cloud.r-project.org/bin/linux/debian buster-cran35/ InRelease [4375 B] Get:4 http://deb.debian.org/debian buster-updates InRelease [51.9 kB] Get:5 http://cloud.r-project.org/bin/linux/debian buster-cran35/ Packages [53.3 kB] Get:6 http://security.debian.org/debian-security buster/updates/main arm64 Packages [287 kB] Get:7 http://deb.debian.org/debian buster/main arm64 Packages [7735 kB] Get:8 http://deb.debian.org/debian buster-updates/main arm64 Packages [14.5 kB] Fetched 8334 kB in 2s (4537 kB/s) Reading package lists... Done ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass K8s IT Jenkins SparkR image building. Or, manually do the following. ``` $ bin/docker-image-tool.sh -R kubernetes/dockerfiles/spark/bindings/R/Dockerfile build ``` Closes #33071 from dongjoon-hyun/SPARK-35885. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-24 19:51:49 -07:00
Kevin Su	765106cb80	[SPARK-35699][K8S] Improve error message when creating k8s pod failed ### What changes were proposed in this pull request? Improve error message when clients use wrong master URL to submit a job to k8s. ### Why are the changes needed? Current error messages are not clear for users. ``` (base) ➜ spark git:(master) ./bin/spark-submit \ --master k8s://https://192.168.49.3:8443 \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=3 \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.container.image=pingsutw/spark:testing \ local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar 21/06/09 20:50:37 WARN Utils: Your hostname, kobe-pc resolves to a loopback address: 127.0.1.1; using 192.168.103.20 instead (on interface ens160) 21/06/09 20:50:37 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address 21/06/09 20:50:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 21/06/09 20:50:38 INFO SparkKubernetesClientFactory: Auto-configuring K8S client using current context from users K8S config file 21/06/09 20:50:39 INFO KerberosConfDriverFeatureStep: You have not specified a krb5.conf file locally or via a ConfigMap. Make sure that you have the krb5.conf locally on the driver image. Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Operation: [create] for kind: [Pod] with name: [null] in namespace: [default] failed. 
at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) at io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:72) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:380) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) at io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:86) ``` Below command to reproduce; ``` ./bin/spark-submit \ --master k8s://https://192.168.49.2:8443 \ --deploy-mode cluster \ --name spark-pi \ --class org.apache.spark.examples.SparkPi \ --conf spark.executor.instances=3 \ --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \ --conf spark.kubernetes.container.image=pingsutw/spark:testing \ local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar ``` ### Does this PR introduce _any_ user-facing change? Yes, users will see more clear error messages. ### How was this patch tested? Pass the CIs. Closes #32874 from pingsutw/SPARK-35699. Authored-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-21 19:10:11 -07:00
Dongjoon Hyun	4f51e0045e	[SPARK-35832][CORE][ML][K8S][TESTS] Add LocalRootDirsTest trait ### What changes were proposed in this pull request? To make the test suite more robust, this PR aims to add a new trait, `LocalRootDirsTest`, by refactoring `SortShuffleSuite`'s helper functions and applying it to the following: - ShuffleNettySuite - ShuffleOldFetchProtocolSuite - ExternalShuffleServiceSuite - KubernetesLocalDiskShuffleDataIOSuite - LocalDirsSuite - RDDCleanerSuite - ALSCleanerSuite In addition, this fixes a UT in `KubernetesLocalDiskShuffleDataIOSuite`. ### Why are the changes needed? `ShuffleSuite` is extended by four classes but only `SortShuffleSuite` does the clean-up correctly. ``` ShuffleSuite - SortShuffleSuite - ShuffleNettySuite - ShuffleOldFetchProtocolSuite - ExternalShuffleServiceSuite ``` Since `KubernetesLocalDiskShuffleDataIOSuite` is looking for the other storage directory, the leftover of `ShuffleSuite` causes flakiness. - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2/2649/testReport/junit/org.apache.spark.shuffle/KubernetesLocalDiskShuffleDataIOSuite/recompute_is_not_blocked_by_the_recovery/ ``` org.apache.spark.SparkException: Job aborted due to stage failure: task 0.0 in stage 1.0 (TID 3) had a not serializable result: org.apache.spark.ShuffleSuite$NonJavaSerializableClass ... org.apache.spark.shuffle.KubernetesLocalDiskShuffleDataIOSuite.$anonfun$new$2(KubernetesLocalDiskShuffleDataIOSuite.scala:52) ``` For the other suites, the clean-up implementation is used but not complete. So, they are refactored to use new trait. ### Does this PR introduce _any_ user-facing change? No, this is a test-only change. ### How was this patch tested? Pass the CIs. Closes #32986 from dongjoon-hyun/SPARK-35832. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-20 10:53:53 -07:00
Dongjoon Hyun	b9d6473e89	[SPARK-35593][K8S][TESTS][FOLLOWUP] Increase timeout in KubernetesLocalDiskShuffleDataIOSuite ### What changes were proposed in this pull request? This increases the timeout from 10 seconds to 60 seconds in KubernetesLocalDiskShuffleDataIOSuite to reduce the flakiness. ### Why are the changes needed? - https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/140003/testReport/ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs Closes #32967 from dongjoon-hyun/SPARK-35593-2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-06-19 15:22:29 +09:00
Kent Yao	1125afd462	[MINOR][K8S] Print the driver pod name instead of Some(name) if absent ### What changes were proposed in this pull request? Print the driver pod name instead of Some(name) if absent ### Why are the changes needed? fix error hint ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new test Closes #32889 from yaooqinn/minork8s. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-13 09:11:14 -07:00
Dongjoon Hyun	cf07036d9b	[SPARK-35593][K8S][CORE] Support shuffle data recovery on the reused PVCs ### What changes were proposed in this pull request? Previously, the following two commits allowed driver-owned on-demand PVC reuse. - SPARK-35182 Support driver-owned on-demand PVC - SPARK-35416 Support PersistentVolumeClaim Reuse This PR aims to recover the shuffle data on those remounted PVCs. The lifecycle of the PVCs is tied to that of the Spark job. Since this is a K8s-specific feature, the `ShuffleDataIO` plugin is used. ### Why are the changes needed? Even if a Pod is killed, we can remount its PVCs and recover some data from them. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the newly added test cases. Closes #32730 from dongjoon-hyun/SPARK-RECOVER-SHUFFLE-DATA. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-10 16:06:58 -07:00
Kent Yao	bc1edba8f6	[SPARK-35692][K8S] Use AtomicInteger for executor id generating ### What changes were proposed in this pull request? AtomicInteger is enough for executor ids. In this PR, we use it to replace AtomicLong, as other cluster managers (e.g. YARN, standalone) do. ### Why are the changes needed? See the discussion here https://github.com/apache/spark/pull/32610#discussion_r648007320 ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? pass CI with existing tests Closes #32837 from yaooqinn/SPARK-35692. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-10 13:42:07 -07:00
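For readers unfamiliar with the pattern in SPARK-35692, an `AtomicInteger`-backed id generator might look like the sketch below. `ExecutorIdGenerator` is a hypothetical name used only for this example; the actual field lives inside Spark's Kubernetes allocator and may be shaped differently.

```scala
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical generator; the class name is illustrative, not Spark's.
final class ExecutorIdGenerator {
  private val counter = new AtomicInteger(0)

  // Thread-safe: each caller observes a distinct, increasing id.
  def next(): Int = counter.incrementAndGet()
}
```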
Kent Yao	b4b78ce265	[SPARK-32975][K8S][FOLLOWUP] Avoid None.get exception ### What changes were proposed in this pull request? A follow-up for SPARK-32975 to avoid the unexpected `None.get` exception. When running SparkPi with Docker Desktop, podName is an Option, so we get: ```logtalk 21/06/09 01:09:12 ERROR Utils: Uncaught exception in thread main java.util.NoSuchElementException: None.get at scala.None$.get(Option.scala:529) at scala.None$.get(Option.scala:527) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.$anonfun$start$1(ExecutorPodsAllocator.scala:110) at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1417) at org.apache.spark.scheduler.cluster.k8s.ExecutorPodsAllocator.start(ExecutorPodsAllocator.scala:111) at org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend.start(KubernetesClusterSchedulerBackend.scala:99) at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:220) at org.apache.spark.SparkContext.<init>(SparkContext.scala:581) at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2686) at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$2(SparkSession.scala:948) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:942) at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:30) at org.apache.spark.examples.SparkPi.main(SparkPi.scala) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52) at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:955) at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180) at 
org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203) at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90) at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1043) at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1052) at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala) ``` ### Why are the changes needed? fix a regression ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Manual. Closes #32830 from yaooqinn/SPARK-32975. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-10 13:39:39 -07:00
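The `None.get` failure above comes from calling `.get` on an empty `Option`. One common way to avoid it, sketched here with hypothetical names (`DriverPodLookup`, `describeStart`) and not taken from the actual patch, is to pattern-match on the optional pod name so the absent case is handled explicitly:

```scala
object DriverPodLookup {
  // Hypothetical stand-in for the start-up step; returns a log line instead of
  // talking to a cluster, so both branches can be exercised without K8s.
  def describeStart(driverPodName: Option[String]): String =
    driverPodName match {
      case Some(name) => s"waiting for driver pod $name to become ready"
      case None       => "no driver pod name configured; skipping readiness wait"
    }
}
```

Matching on the `Option` also keeps the bare name out of the message, whereas interpolating the `Option` itself would print `Some(name)`.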
Chris Wu	497c80a1ad	[SPARK-32975][K8S] Add config for driver readiness timeout before executors start ### What changes were proposed in this pull request? Add a new config that controls the timeout of waiting for driver pod's readiness before allocating executor pods. This wait only happens once on application start. ### Why are the changes needed? The driver's headless service can be resolved by DNS only after the driver pod is ready. If the executor tries to connect to the headless service before the driver pod is ready, it will hit UnknownHostException and get into an error state but will not be restarted. This case usually happens when the driver pod has sidecar containers that have not finished their creation when executors start. So basically there is a race condition. This issue can be mitigated by tweaking this config. ### Does this PR introduce _any_ user-facing change? A new config `spark.kubernetes.allocation.driver.readinessTimeout` is added. ### How was this patch tested? Existing tests. Closes #32752 from cchriswu/SPARK-32975-fix. Lead-authored-by: Chris Wu <wucaowei19@gmail.com> Co-authored-by: Chris Wu <wcaowei@vmware.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-04 06:59:49 -07:00
Dongjoon Hyun	4f0db872a0	[SPARK-35416][K8S][FOLLOWUP] Use Set instead of ArrayBuffer ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/32564 . ### Why are the changes needed? To use Set instead of ArrayBuffer and add a return type. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #32758 from dongjoon-hyun/SPARK-35416-2. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>	2021-06-03 10:41:11 -05:00
Kousuke	e04883880f	[SPARK-35586][K8S][TESTS] Set a default value for spark.kubernetes.test.sparkTgz in pom.xml for Kubernetes integration tests ### What changes were proposed in this pull request? This PR sets a default value for `spark.kubernetes.test.sparkTgz` in `kubernetes/integration-tests/pom.xml` for Kubernetes integration tests. ### Why are the changes needed? In the current master, running the integration tests with the following command will fail because there is no default value set for the property. ``` build/mvn -Dspark.kubernetes.test.namespace=default -Pkubernetes -Pkubernetes-integration-tests -Psparkr -pl resource-managers/kubernetes/integration-tests integration-test ``` ``` + mkdir -p /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked + tar -xzvf --test-exclude-tags --strip-components=1 -C /home/kou/work/oss/spark/resource-managers/kubernetes/integration-tests/target/spark-dist-unpacked tar (child): --test-exclude-tags: Cannot open: No such file or directory tar (child): Error is not recoverable: exiting now tar: Child returned status 2 tar: Error is not recoverable: exiting now [ERROR] Command execution failed. ``` According to `setup-integration-test-env.sh`, `N/A` is intended as the default value, so this PR chooses it. ``` SPARK_TGZ="N/A" MVN="$TEST_ROOT_DIR/build/mvn" EXCLUDE_TAGS="" ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Build and tests successfully finish with the command shown above. Closes #32722 from sarutak/fix-pom-for-kube-integ. Authored-by: Kousuke <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-06-01 00:40:02 -07:00
Kent Yao	96b0548ab6	[SPARK-35493][K8S] make `spark.blockManager.port` fallback for `spark.driver.blockManager.port` as same as other cluster managers ### What changes were proposed in this pull request? `spark.blockManager.port` currently does not work for k8s driver pods; we should make it work as it does for other cluster managers. ### Why are the changes needed? `spark.blockManager.port` should work for the Spark driver pod ### Does this PR introduce _any_ user-facing change? yes, `spark.blockManager.port` will be respected iff it is present and `spark.driver.blockManager.port` is absent ### How was this patch tested? new tests Closes #32639 from yaooqinn/SPARK-35493. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-23 08:07:57 -07:00
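The fallback rule stated in SPARK-35493 can be illustrated with a plain `Map` standing in for the Spark configuration. `BlockManagerPort` and `driverPort` are hypothetical names, and the default of `0` (random port) is an assumption made for the sketch.

```scala
object BlockManagerPort {
  val DriverKey = "spark.driver.blockManager.port"
  val SharedKey = "spark.blockManager.port"

  // The driver-specific key wins; the shared key is used only when the
  // driver-specific one is absent. 0 is an assumed "random port" default.
  def driverPort(conf: Map[String, String]): Int =
    conf.get(DriverKey).orElse(conf.get(SharedKey)).map(_.toInt).getOrElse(0)
}
```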
Kent Yao	d957426351	[SPARK-35482][K8S] Use `spark.blockManager.port` not the wrong `spark.blockmanager.port` in BasicExecutorFeatureStep ### What changes were proposed in this pull request? Most Spark conf keys are case sensitive, including `spark.blockManager.port`, so we cannot get the correct port number with `spark.blockmanager.port`. This PR changes the wrong key to `spark.blockManager.port` in `BasicExecutorFeatureStep`. This PR also ensures a fast fail when the port value is invalid for executor containers. When 0 is specified (it is valid as a random port, but invalid as a k8s request), it should not be put in the `containerPort` field of the executor pod spec. We do not expect executor pods to continuously fail to create because of invalid requests. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? new tests Closes #32621 from yaooqinn/SPARK-35482. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-21 08:27:49 -07:00
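The two behaviours described in SPARK-35482 (case-sensitive key lookup, and leaving port 0 out of the pod spec) might be sketched as below. `ExecutorPort` and `containerPort` are hypothetical names, a plain `Map` stands in for the Spark configuration, and the `0..65535` range check is an assumption; the real validation in `BasicExecutorFeatureStep` may use different bounds.

```scala
object ExecutorPort {
  private val Key = "spark.blockManager.port"  // note the capital 'M'

  // Returns Some(port) to declare as a containerPort, or None when the value
  // is 0 (random port), which must not be sent to K8s as a containerPort.
  def containerPort(conf: Map[String, String]): Option[Int] = {
    val port = conf.getOrElse(Key, "0").toInt
    // Fail fast on values that could never be a valid port (assumed range).
    require(port >= 0 && port <= 65535, s"invalid block manager port: $port")
    if (port == 0) None else Some(port)
  }
}
```

A key spelled `spark.blockmanager.port` is simply not found by the lookup, which is the bug the commit fixes.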
Ashray Jain	de59e01aa4	[SPARK-35443][K8S] Mark K8s ConfigMaps and Secrets created by Spark as immutable Kubernetes supports marking secrets and config maps as immutable to gain performance. https://kubernetes.io/docs/concepts/configuration/configmap/#configmap-immutable https://kubernetes.io/docs/concepts/configuration/secret/#secret-immutable For K8s clusters that run many thousands of Spark applications, this can yield significant reduction in load on the kube-apiserver. From the K8s docs: > For clusters that extensively use Secrets (at least tens of thousands of unique Secret to Pod mounts), preventing changes to their data has the following advantages: > - protects you from accidental (or unwanted) updates that could cause applications outages > - improves performance of your cluster by significantly reducing load on kube-apiserver, by closing watches for secrets marked as immutable. For any secrets and config maps we create in Spark that are immutable, we could mark them as immutable by including the following when building the secret/config map ``` .withImmutable(true) ``` This feature has been supported in K8s as beta since K8s 1.19 and as GA since K8s 1.21 ### What changes were proposed in this pull request? All K8s secrets and config maps created by Spark are marked "immutable". ### Why are the changes needed? See description above. ### Does this PR introduce _any_ user-facing change? Don't think so ### How was this patch tested? Augmented existing unit tests. Closes #32588 from ashrayjain/patch-1. Authored-by: Ashray Jain <ashrayjain@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-19 21:25:33 -07:00
Dongjoon Hyun	4c015555da	[SPARK-35416][K8S] Support PersistentVolumeClaim Reuse ### What changes were proposed in this pull request? This PR aims to add a new configuration, `spark.kubernetes.driver.reusePersistentVolumeClaim`, to reuse driver-owned `PersistentVolumeClaims` of the deleted executor pods. Note also that `driver-owned PersistentVolumeClaims` is controlled by `spark.kubernetes.driver.ownPersistentVolumeClaim`, which was recently added. ### Why are the changes needed? PVC creation takes some time. This feature can reduce that time by reusing existing PVCs. For example, we can start the `Pi` app with two executors with PVCs. ``` $ k logs -f pi | grep ExecutorPodsAllocator 21/05/16 23:36:32 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 0. 21/05/16 23:36:32 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 0 PVCs 21/05/16 23:36:32 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-1-pvc-0 with StorageClass scaleio 21/05/16 23:36:33 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-2-pvc-0 with StorageClass scaleio ``` After killing one executor, Spark tries to look up the reusable PVCs, but the dead executor's PVC may not be returned yet because K8s works asynchronously. In this case, Spark creates a new PVC as a normal operation. ``` 21/05/16 23:38:51 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1. 21/05/16 23:38:51 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 2 PVCs 21/05/16 23:38:51 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-3-pvc-0 with StorageClass scaleio ``` After killing another executor, Spark finds one reusable PVC, `pi-exec-1-pvc-0`, and reuses it. ``` 21/05/16 23:39:18 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1. 
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Found 1 reusable PVCs from 3 PVCs 21/05/16 23:39:18 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim pi-exec-1-pvc-0 ``` In this case, we can easily notice the remounted PVC because `ClaimName`, `pi-exec-1-pvc-0`, doesn't have the prefix of pod name, `pi-exec-4`. ``` $ k describe pod pi-exec-4 \| grep pi-exec-1-pvc-0 ClaimName: pi-exec-1-pvc-0 ``` ### Does this PR introduce _any_ user-facing change? Yes, but this is a new feature which is disabled by the new conf. ### How was this patch tested? Pass the CIs with the newly added test case. K8S IT test also passed. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 17 minutes, 7 seconds. 
Total number of tests run: 26 Suites: completed 2, aborted 0 Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ... [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 24:14 min [INFO] Finished at: 2021-05-16T17:24:40-07:00 [INFO] ------------------------------------------------------------------------ ``` Closes #32564 from dongjoon-hyun/SPARK-35416. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-17 00:20:48 -07:00
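The reuse decision described in SPARK-35416 can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the code in `ExecutorPodsAllocator`: the function names, the `app` parameter, and the idea of passing in the claims that are still mounted are all hypothetical, and the claim naming merely mirrors the log lines above.

```python
from typing import Iterable, List, Tuple


def find_reusable_pvcs(driver_owned: Iterable[str], mounted: Iterable[str]) -> List[str]:
    """Return driver-owned PVC names that no pod currently mounts.

    Hypothetical helper: a claim that K8s has not released yet is simply
    reported as still mounted, so it is not offered for reuse.
    """
    in_use = set(mounted)
    return [name for name in driver_owned if name not in in_use]


def claim_for_new_executor(exec_id: int, reusable: List[str], app: str = "pi") -> Tuple[str, bool]:
    """Pick a claim for a new executor: reuse one if possible, else name a new one.

    Returns (claim_name, reused). The naming pattern is an assumption taken
    from the log output, e.g. pi-exec-3-pvc-0.
    """
    if reusable:
        return reusable.pop(0), True
    return f"{app}-exec-{exec_id}-pvc-0", False


if __name__ == "__main__":
    owned = ["pi-exec-1-pvc-0", "pi-exec-2-pvc-0", "pi-exec-3-pvc-0"]
    still_mounted = ["pi-exec-2-pvc-0", "pi-exec-3-pvc-0"]
    candidates = find_reusable_pvcs(owned, still_mounted)
    print(claim_for_new_executor(4, candidates))
```

With three driver-owned claims of which two are still mounted, executor 4 receives `pi-exec-1-pvc-0`, which is why the remounted claim name lacks the `pi-exec-4` prefix.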
Holden Karau	160b3bee71	[SPARK-34764][CORE][K8S][UI] Propagate reason for exec loss to Web UI ### What changes were proposed in this pull request? Adds the executor loss reason to the Spark web UI and, in doing so, also fixes the Kubernetes integration to pass the executor loss reason into core. UI change: ![image](https://user-images.githubusercontent.com/59893/117045762-b975ba80-acc4-11eb-9679-8edab3cfadc2.png) ### Why are the changes needed? Debugging Spark jobs is hard; making it clearer why executors have exited could help. ### Does this PR introduce _any_ user-facing change? Yes, a new column on the executor page. ### How was this patch tested? K8s unit test updated to validate that exec loss reasons are passed through regardless of exec alive state; manual testing to validate the UI. Closes #32436 from holdenk/SPARK-34764-propegate-reason-for-exec-loss. Lead-authored-by: Holden Karau <hkarau@apple.com> Co-authored-by: Holden Karau <holden@pigscanfly.ca> Signed-off-by: Holden Karau <hkarau@apple.com>	2021-05-13 16:02:31 -07:00
Dongjoon Hyun	dd5464976f	[SPARK-35394][K8S][BUILD] Move kubernetes-client.version to root pom file ### What changes were proposed in this pull request? This PR aims to unify two K8s version variables in two `pom.xml`s into one. `kubernetes-client.version` is correct because the artifact ID is `kubernetes-client`. ``` kubernetes.client.version (kubernetes/core module) kubernetes-client.version (kubernetes/integration-test module) ``` ### Why are the changes needed? Having two variables for the same value is confusing and inconvenient when we upgrade K8s versions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. (The compilation test passes are enough.) Closes #32531 from dongjoon-hyun/SPARK-35394. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-13 00:40:53 -07:00
“attilapiros”	8b94eff1ca	[SPARK-34736][K8S][TESTS] Kubernetes and Minikube version upgrade for integration tests ### What changes were proposed in this pull request? This PR upgrades the Kubernetes and Minikube versions for integration tests and removes/updates the old code for this new version. Details of the changes: - As [discussed in the mailing list](http://apache-spark-developers-list.1001551.n3.nabble.com/minikube-and-kubernetes-cluster-versions-for-integration-testing-td30856.html): updating Minikube version from v0.34.1 to v1.7.3 and kubernetes version from v1.15.12 to v1.17.3. - checking the Minikube version and failing with an explanation when the test is started on a version < v1.7.3. - removing minikube status checking code related to old Minikube versions - in the Minikube backend using fabric8's `Config.autoConfigure()` method to configure the kubernetes client to use the `minikube` k8s context (as in [one of the Minikube examples](https://github.com/fabric8io/kubernetes-client/blob/master/kubernetes-examples/src/main/java/io/fabric8/kubernetes/examples/kubectl/equivalents/ConfigUseContext.java#L36)) - Introducing `persistentVolume` test tag: this would be a temporary change to skip PVC tests in the Kubernetes integration test, as currently the PVC tests are blocking the move to Docker as Minikube's driver (for details please check https://issues.apache.org/jira/browse/SPARK-34738). ### Why are the changes needed? With the current suggestion one can run into several problems without noticing that the Minikube/Kubernetes version is the cause. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? It was tested on Mac with [this script](https://gist.github.com/attilapiros/cd58a16bdde833c80c5803c337fffa94#file-check_minikube_versions-zsh) which installs each Minikube version from v1.7.2 (including this version to test the negative case of the version check) and runs the integration tests.
It was started with: ``` ./check_minikube_versions.zsh > test_log 2>&1 ``` There was only one build failure; the rest were successful: ``` $ grep "BUILD SUCCESS" test_log | wc -l 26 $ grep "BUILD FAILURE" test_log | wc -l 1 ``` It was for Minikube v1.7.2 and the log is: ``` KubernetesSuite: * RUN ABORTED * java.lang.AssertionError: assertion failed: Unsupported Minikube version is detected: minikube version: v1.7.2.For integration testing Minikube version 1.7.3 or greater is expected. at scala.Predef$.assert(Predef.scala:223) at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.Minikube$.getKubernetesClient(Minikube.scala:52) at org.apache.spark.deploy.k8s.integrationtest.backend.minikube.MinikubeTestBackend$.initialize(MinikubeTestBackend.scala:33) at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.beforeAll(KubernetesSuite.scala:163) at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:212) at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210) at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208) at org.apache.spark.deploy.k8s.integrationtest.KubernetesSuite.org$scalatest$BeforeAndAfter$$super$run(KubernetesSuite.scala:43) at org.scalatest.BeforeAndAfter.run(BeforeAndAfter.scala:273) at org.scalatest.BeforeAndAfter.run$(BeforeAndAfter.scala:271) ... ``` Moreover, I also ran a test with multiple k8s cluster contexts. Closes #31829 from attilapiros/SPARK-34736. Lead-authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Co-authored-by: attilapiros <piros.attila.zsolt@gmail.com> Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>	2021-05-10 18:56:52 +02:00
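The minimum-version check introduced by SPARK-34736 can be illustrated with a short Python sketch. It is not the Scala code in `Minikube.scala`; the function names and the regular expression are assumptions, and only the `minikube version: vX.Y.Z` output shape and the 1.7.3 threshold come from the log above.

```python
import re
from typing import Tuple

# Minimum version named in the commit message.
MIN_VERSION = (1, 7, 3)


def parse_minikube_version(output: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from text like 'minikube version: v1.7.2'."""
    match = re.search(r"minikube version: v(\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError(f"Unexpected version output: {output!r}")
    return tuple(int(part) for part in match.groups())


def check_minikube_version(output: str) -> None:
    """Fail with an explanation when the detected version is too old."""
    version = parse_minikube_version(output)
    # Tuples of ints compare numerically, so 1.17.0 is newer than 1.7.3.
    if version < MIN_VERSION:
        wanted = ".".join(str(part) for part in MIN_VERSION)
        raise AssertionError(
            f"Unsupported Minikube version is detected: {output.strip()}. "
            f"For integration testing Minikube version {wanted} or greater is expected."
        )


if __name__ == "__main__":
    check_minikube_version("minikube version: v1.7.3")
    print("version accepted")
```

Comparing integer tuples rather than strings avoids the pitfall where `"1.17.0"` sorts before `"1.7.3"` lexicographically.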
Dongjoon Hyun	a0c76a8755	[SPARK-35319][K8S][BUILD] Upgrade K8s client to 5.3.1 ### What changes were proposed in this pull request? This PR aims to upgrade K8s client to 5.3.1. ### Why are the changes needed? This will bring the latest bug fixes. - https://github.com/fabric8io/kubernetes-client/releases/tag/v5.3.1 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. K8s IT is manually tested like the following. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 18 minutes, 33 seconds. Total number of tests run: 27 Suites: completed 2, aborted 0 Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 All tests passed. 
[INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary for Spark Project Parent POM 3.2.0-SNAPSHOT: [INFO] [INFO] Spark Project Parent POM ........................... SUCCESS [ 3.959 s] [INFO] Spark Project Tags ................................. SUCCESS [ 7.830 s] [INFO] Spark Project Local DB ............................. SUCCESS [ 3.457 s] [INFO] Spark Project Networking ........................... SUCCESS [ 5.496 s] [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 3.239 s] [INFO] Spark Project Unsafe ............................... SUCCESS [ 9.006 s] [INFO] Spark Project Launcher ............................. SUCCESS [ 2.422 s] [INFO] Spark Project Core ................................. SUCCESS [02:17 min] [INFO] Spark Project Kubernetes Integration Tests ......... SUCCESS [21:05 min] [INFO] ------------------------------------------------------------------------ [INFO] BUILD SUCCESS [INFO] ------------------------------------------------------------------------ [INFO] Total time: 23:59 min [INFO] Finished at: 2021-05-05T11:59:19-07:00 [INFO] ------------------------------------------------------------------------ ``` Closes #32443 from dongjoon-hyun/SPARK-35319. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-05-05 19:50:37 -07:00
Dongjoon Hyun	4e8701a77d	[SPARK-35280][K8S] Promote KubernetesUtils to DeveloperApi ### What changes were proposed in this pull request? Since SPARK-22757, `KubernetesUtils` has been used as an important utility class by all K8s modules and `ExternalClusterManager`s. This PR aims to promote `KubernetesUtils` to `DeveloperApi` in order to maintain it officially in a backward compatible way at Apache Spark 3.2.0. ### Why are the changes needed? Apache Spark 3.1.1 makes the `Kubernetes` module GA and provides an extensible external cluster manager framework. To have an `ExternalClusterManager` for the K8s environment, the `KubernetesUtils` class is crucial and needs to be stable. By promoting it to part of the K8s developer API, we can maintain it in a more sustainable way and give better and more stable functionality to K8s users. In this PR, `Since` annotations denote the last function signature changes because these are going to become public at Apache Spark 3.2.0. | Version | Function Name | |-|-| | 2.3.0 | parsePrefixedKeyValuePairs | | 2.3.0 | requireNandDefined | | 2.3.0 | parsePrefixedKeyValuePairs | | 2.4.0 | parseMasterUrl | | 3.0.0 | requireBothOrNeitherDefined | | 3.0.0 | requireSecondIfFirstIsDefined | | 3.0.0 | selectSparkContainer | | 3.0.0 | formatPairsBundle | | 3.0.0 | formatPodState | | 3.0.0 | containersDescription | | 3.0.0 | containerStatusDescription | | 3.0.0 | formatTime | | 3.0.0 | uniqueID | | 3.0.0 | buildResourcesQuantities | | 3.0.0 | uploadAndTransformFileUris | | 3.0.0 | uploadFileUri | | 3.0.0 | requireBothOrNeitherDefined | | 3.0.0 | buildPodWithServiceAccount | | 3.0.0 | isLocalAndResolvable | | 3.1.1 | renameMainAppResource | | 3.1.1 | addOwnerReference | | 3.2.0 | loadPodFromTemplate | ### Does this PR introduce _any_ user-facing change? Yes, but these are new API additions. ### How was this patch tested? Pass the CIs. Closes #32406 from dongjoon-hyun/SPARK-35280.
Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-30 11:39:18 -07:00
Dongjoon Hyun	6ab00488d0	[SPARK-35182][K8S] Support driver-owned on-demand PVC ### What changes were proposed in this pull request? This PR aims to support driver-owned on-demand PVCs (Persistent Volume Claims). It means dynamically-created PVCs will have the `ownerReference` to the `driver` pod instead of the `executor` pod. ### Why are the changes needed? This allows the K8s backend scheduler to reuse these PVCs later. BEFORE ``` $ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml apiVersion: v1 kind: PersistentVolumeClaim metadata: ... ownerReferences: - apiVersion: v1 controller: true kind: Pod name: tpcds-pvc-exec-1 ``` AFTER ``` $ k get pvc tpcds-pvc-exec-1-pvc-0 -oyaml apiVersion: v1 kind: PersistentVolumeClaim metadata: ... ownerReferences: - apiVersion: v1 controller: true kind: Pod name: tpcds-pvc ``` ### Does this PR introduce _any_ user-facing change? No. (The default is `false`) ### How was this patch tested? Manually checked the above and passed K8s IT. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode.
- Start pod creation from template - PVs with local storage - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 16 minutes, 40 seconds. Total number of tests run: 27 Suites: completed 2, aborted 0 Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #32288 from dongjoon-hyun/SPARK-35182. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-22 17:03:19 -07:00
Dongjoon Hyun	00f06dd267	[SPARK-35131][K8S] Support early driver service clean-up during app termination ### What changes were proposed in this pull request? This PR aims to support a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, to clean up the `Driver Service` resource during app termination. ### Why are the changes needed? The K8s service is one of the important resources and sometimes it's controlled by quota. ``` $ k describe quota Name: service Namespace: default Resource Used Hard -------- ---- ---- services 1 3 ``` Apache Spark creates a service for the driver whose lifecycle is the same as that of the driver pod. This means a new Spark job submission fails once the services left behind by completed Spark jobs exhaust the service quota. BEFORE ``` $ k get pod NAME READY STATUS RESTARTS AGE org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver 0/1 Completed 0 31m org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver 0/1 Completed 0 78s $ k get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 80m org-apache-spark-examples-sparkpi-a32c9278e7061b4d-driver-svc ClusterIP None <none> 7078/TCP,7079/TCP,4040/TCP 31m org-apache-spark-examples-sparkpi-a9f1f578e721ef62-driver-svc ClusterIP None <none> 7078/TCP,7079/TCP,4040/TCP 80s $ k describe quota Name: service Namespace: default Resource Used Hard -------- ---- ---- services 3 3 $ bin/spark-submit... Exception in thread "main" io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST at: https://192.168.64.50:8443/api/v1/namespaces/default/services. Message: Forbidden! User minikube doesn't have permission. services "org-apache-spark-examples-sparkpi-843f6978e722819c-driver-svc" is forbidden: exceeded quota: service, requested: services=1, used: services=3, limited: services=3.
``` AFTER ``` $ k get pod NAME READY STATUS RESTARTS AGE org-apache-spark-examples-sparkpi-23d5f278e77731a7-driver 0/1 Completed 0 26s org-apache-spark-examples-sparkpi-d1292278e7768ed4-driver 0/1 Completed 0 67s org-apache-spark-examples-sparkpi-e5bedf78e776ea9d-driver 0/1 Completed 0 44s $ k get svc NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE kubernetes ClusterIP 10.96.0.1 <none> 443/TCP 172m $ k describe quota Name: service Namespace: default Resource Used Hard -------- ---- ---- services 1 3 ``` ### Does this PR introduce _any_ user-facing change? Yes, this PR adds a new configuration, `spark.kubernetes.driver.service.deleteOnTermination`, and enables it by default. The change is documented in the migration guide. ### How was this patch tested? Pass the CIs. This is tested with K8s IT manually. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode.
- Start pod creation from template - PVs with local storage - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 19 minutes, 9 seconds. Total number of tests run: 27 Suites: completed 2, aborted 0 Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #32226 from dongjoon-hyun/SPARK-35131. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-19 12:11:08 -07:00
Dongjoon Hyun	425dc58c02	[SPARK-35125][K8S] Upgrade K8s client to 5.3.0 to support K8s 1.20 ### What changes were proposed in this pull request? Although AS-IS master branch already works with K8s 1.20, this PR aims to upgrade K8s client to 5.3.0 to support K8s 1.20 officially. - https://github.com/fabric8io/kubernetes-client#compatibility-matrix The following are the notable breaking API changes. 1. Remove Doneable (5.0+): - https://github.com/fabric8io/kubernetes-client/pull/2571 2. Change Watcher.onClose signature (5.0+): - https://github.com/fabric8io/kubernetes-client/pull/2616 3. Change Readiness (5.1+) - https://github.com/fabric8io/kubernetes-client/pull/2796 ### Why are the changes needed? According to the compatibility matrix, this makes Apache Spark and its external cluster manager extension support all K8s 1.20 features officially for Apache Spark 3.2.0. ### Does this PR introduce _any_ user-facing change? Yes, this is a dev dependency change which affects K8s cluster extension users. ### How was this patch tested? Pass the CIs. This is manually tested with K8s IT. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode. 
- Start pod creation from template - PVs with local storage - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 17 minutes, 44 seconds. Total number of tests run: 27 Suites: completed 2, aborted 0 Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #32221 from dongjoon-hyun/SPARK-K8S-530. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-19 07:39:38 -07:00
“attilapiros”	8a3815f722	[SPARK-34789][TEST] Introduce Jetty based construct for integration tests where HTTP server is used ### What changes were proposed in this pull request? Introducing a new test construct: ``` withHttpServer() { baseURL => ... } ``` which starts and stops a Jetty server to serve files via HTTP. Moreover, this PR uses this new construct in the test `Run SparkRemoteFileTest using a remote data file`. ### Why are the changes needed? Before this PR, GitHub URLs were used, such as "https://raw.githubusercontent.com/apache/spark/master/data/mllib/pagerank_data.txt". This couples two Spark versions in an unhealthy way: it connects the "master" branch, which is a moving part, with the committed test code, which is non-moving (it might even be released). As a result, a test running for an earlier version of Spark expects something (filename, content, path) from a later release and, what is worse, when the moving version changes, the earlier test will break. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit test. Closes #31935 from attilapiros/SPARK-34789. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-14 21:22:52 -07:00
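The shape of the `withHttpServer() { baseURL => ... }` construct from SPARK-34789 can be approximated in Python with a context manager. This sketch swaps Jetty for the standard library's `http.server`, so it only illustrates the start/serve/stop pattern; `with_http_server` and its `directory` parameter are hypothetical names, not part of Spark's test code.

```python
import contextlib
import functools
import http.server
import threading
from typing import Iterator


@contextlib.contextmanager
def with_http_server(directory: str) -> Iterator[str]:
    """Serve `directory` over HTTP for the duration of the with-block.

    Yields the base URL; the server is always stopped on exit, even if the
    body raises, so tests never depend on an external, moving URL.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # Port 0 asks the OS for any free port.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        host, port = server.server_address
        yield f"http://{host}:{port}"
    finally:
        server.shutdown()
        server.server_close()
        thread.join()
```

A test would then fetch a file it wrote itself, for example `with with_http_server(tmp_dir) as base_url: ...`, instead of downloading from the `master` branch.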
Dongjoon Hyun	a42dc93a2a	[SPARK-34948][K8S] Add ownerReference to executor configmap to fix leakages ### What changes were proposed in this pull request? This PR aims to add `ownerReference` to the executor ConfigMap to fix leakage. ### Why are the changes needed? SPARK-30985 maintains the executor config map explicitly inside Spark. However, this config map can be leaked when Spark drivers die accidentally or are killed by K8s. We need to add `ownerReference` so that K8s garbage-collects these automatically. The number of ConfigMaps is subject to a resource quota, so the leaked ConfigMaps currently cause Spark job submission failures. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and check manually. K8s IT is tested manually. ``` KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark to test a pyfiles example - Run PySpark with memory customization - Run in client mode.
- Start pod creation from template - PVs with local storage - Launcher client dependencies - SPARK-33615: Launcher client archives - SPARK-33748: Launcher python client respecting PYSPARK_PYTHON - SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python - Launcher python client dependencies using a zip file - Test basic decommissioning - Test basic decommissioning with shuffle cleanup - Test decommissioning with dynamic allocation & shuffle cleanups - Test decommissioning timeouts - Run SparkR on simple dataframe.R example Run completed in 19 minutes, 2 seconds. Total number of tests run: 27 Suites: completed 2, aborted 0 Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` BEFORE ``` $ k get cm spark-exec-450b417895b3b2c7-conf-map -oyaml | grep ownerReferences ``` AFTER ``` $ k get cm spark-exec-bb37a27895b1c26c-conf-map -oyaml | grep ownerReferences f:ownerReferences: ``` Closes #32042 from dongjoon-hyun/SPARK-34948. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-03 00:00:17 -07:00
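What "adding an `ownerReference`" means for a resource such as the executor ConfigMap in SPARK-34948 can be shown with a plain manifest dictionary. This is only a sketch of the metadata shape, not Spark's `addOwnerReference` utility; the helper name, the example UID, and the driver pod name are made up for illustration.

```python
import copy
from typing import Any, Dict


def with_owner_reference(resource: Dict[str, Any], owner_name: str, owner_uid: str) -> Dict[str, Any]:
    """Return a copy of a manifest whose metadata points at an owning Pod.

    When the owner is deleted, the K8s garbage collector can then remove
    the dependent resource instead of leaving it behind.
    """
    owned = copy.deepcopy(resource)
    references = owned.setdefault("metadata", {}).setdefault("ownerReferences", [])
    references.append(
        {
            "apiVersion": "v1",
            "kind": "Pod",
            "name": owner_name,
            "uid": owner_uid,  # hypothetical value in the example below
            "controller": True,
        }
    )
    return owned


if __name__ == "__main__":
    config_map = {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": "spark-exec-bb37a27895b1c26c-conf-map"},
    }
    # Owner name and UID are placeholders, not values from the commit.
    print(with_owner_reference(config_map, "example-driver", "0000-example-uid"))
```

The original manifest is left untouched, which keeps the helper safe to call on shared template objects.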
“attilapiros”	c8b7a09d39	[SPARK-34869][K8S][TEST] Extend "EXTRA LOGS FOR THE FAILED TEST" section of k8s integration test log with the describe pods output ### What changes were proposed in this pull request? Extending the "EXTRA LOGS FOR THE FAILED TEST" section of the k8s integration test log with `kubectl describe pods` output for the failed test. ### Why are the changes needed? PR builds frequently fail because the k8s integration tests are currently very flaky in the Amplab Jenkins environment. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Locally, by temporarily making one of the tests fail. The output is: ``` 21/03/25 16:55:16.722 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: ===== EXTRA LOGS FOR THE FAILED TEST 21/03/25 16:55:17.167 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: BEGIN driver DESCRIBE POD Name: spark-test-app-a2b03971b7c049e8a2629f6a3198842b Namespace: 35bdb17e308743afaec17538f89a7c3e Priority: 0 Node: minikube/192.168.64.119 Start Time: Thu, 25 Mar 2021 16:52:10 +0100 Labels: spark-app-locator=75f695685ae44314a99ec13bb39332bc spark-app-selector=spark-150230742d364a77927a08eed0222065 spark-role=driver Annotations: <none> Status: Succeeded IP: 172.17.0.4 Containers: spark-kubernetes-driver: Container ID: docker://d6d27b0551060d9b094f12d1e232dfb5ae78ce38559680c7126c548996da4d95 Image: docker.io/kubespark/spark:3.2.0-SNAPSHOT_9575B805-9CB0-4A16-8A31-AA2F8DDA8EE5 Image ID: docker://sha256:3fc556c73a0d5187b5a14dbdc2f69ef292e60b544b4b4d3715f6749417c20918 Ports: 7078/TCP, 7079/TCP, 4040/TCP Host Ports: 0/TCP, 0/TCP, 0/TCP Args: driver --properties-file /opt/spark/conf/spark.properties --class org.apache.spark.examples.SparkPi local:///opt/spark/examples/jars/spark-examples_2.12-3.2.0-SNAPSHOT.jar State: Terminated Reason: Completed Exit Code: 0 Started: Thu, 25 Mar 2021 16:52:11 +0100 Finished: Thu, 25 Mar 2021 16:52:20 +0100 Ready: False Restart Count: 0 Limits: memory: 1408Mi Requests: cpu: 1 memory:
1408Mi Environment: SPARK_USER: attilazsoltpiros SPARK_APPLICATION_ID: spark-150230742d364a77927a08eed0222065 SPARK_DRIVER_BIND_ADDRESS: (v1:status.podIP) SPARK_LOCAL_DIRS: /var/data/spark-dab6f1c9-e538-40c8-a7d9-3e88f9b82cfa SPARK_CONF_DIR: /opt/spark/conf Mounts: /opt/spark/conf from spark-conf-volume-driver (rw) /var/data/spark-dab6f1c9-e538-40c8-a7d9-3e88f9b82cfa from spark-local-dir-1 (rw) /var/run/secrets/kubernetes.io/serviceaccount from default-token-nmfwl (ro) Conditions: Type Status Initialized True Ready False ContainersReady False PodScheduled True Volumes: spark-local-dir-1: Type: EmptyDir (a temporary directory that shares a pod's lifetime) Medium: SizeLimit: <unset> spark-conf-volume-driver: Type: ConfigMap (a volume populated by a ConfigMap) Name: spark-drv-c60832786a15ffbe-conf-map Optional: false default-token-nmfwl: Type: Secret (a volume populated by a Secret) SecretName: default-token-nmfwl Optional: false QoS Class: Burstable Node-Selectors: <none> Tolerations: node.kubernetes.io/not-ready:NoExecute for 300s node.kubernetes.io/unreachable:NoExecute for 300s Events: Type Reason Age From Message ---- ------ ---- ---- ------- Normal Scheduled 3m7s default-scheduler Successfully assigned 35bdb17e308743afaec17538f89a7c3e/spark-test-app-a2b03971b7c049e8a2629f6a3198842b to minikube Normal Pulled 3m7s kubelet, minikube Container image "docker.io/kubespark/spark:3.2.0-SNAPSHOT_9575B805-9CB0-4A16-8A31-AA2F8DDA8EE5" already present on machine Normal Created 3m7s kubelet, minikube Created container spark-kubernetes-driver Normal Started 3m6s kubelet, minikube Started container spark-kubernetes-driver 21/03/25 16:55:17.168 ScalaTest-main-running-KubernetesSuite INFO KubernetesSuite: END driver DESCRIBE POD ``` Closes #31962 from attilapiros/SPARK-34869. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-03-28 09:44:56 -07:00

342 commits