spark-instrumented-optimizer/resource-managers
attilapiros 03e48c87f5 [SPARK-35334][K8S] Make Spark more resilient to intermittent K8s flakiness
### What changes were proposed in this pull request?

Set `kubernetes.request.retry.backoffLimit` to 3 by default when the user has not specified a value for it.

This way, when the K8s API server returns an HTTP status code >= 500, an exponential backoff is triggered (`kubernetes.request.retry.backoffInterval` is 1000 ms by default).

For details please check https://github.com/fabric8io/kubernetes-client/issues/3087.
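
As a rough illustration of the behavior described above, here is a minimal, self-contained Java sketch. It is not the code from this patch or from the fabric8 client: the class `RetryBackoffDefaults` and its methods are hypothetical, it checks only the JVM system property (not environment variables), and it assumes the delay doubles on each retry. Only the two property names, the default limit of 3, the 1000 ms interval, and the `>= 500` condition come from the description.

```java
import java.util.ArrayList;
import java.util.List;

public class RetryBackoffDefaults {
    // Property names as given in the description above.
    static final String BACKOFF_LIMIT_PROP = "kubernetes.request.retry.backoffLimit";
    static final String BACKOFF_INTERVAL_PROP = "kubernetes.request.retry.backoffInterval";

    /** Sets the retry limit to 3 only when no value has been provided. */
    static void applyDefaultBackoffLimit() {
        if (System.getProperty(BACKOFF_LIMIT_PROP) == null) {
            System.setProperty(BACKOFF_LIMIT_PROP, "3");
        }
    }

    /** Server-side errors (HTTP status >= 500) are the ones that trigger a retry. */
    static boolean isRetryable(int httpStatus) {
        return httpStatus >= 500;
    }

    /** Assumption: the delay starts at the initial interval and doubles per retry. */
    static List<Long> backoffDelaysMillis(int limit, long initialIntervalMillis) {
        List<Long> delays = new ArrayList<>();
        long delay = initialIntervalMillis;
        for (int attempt = 0; attempt < limit; attempt++) {
            delays.add(delay);
            delay *= 2;
        }
        return delays;
    }

    public static void main(String[] args) {
        applyDefaultBackoffLimit();
        int limit = Integer.parseInt(System.getProperty(BACKOFF_LIMIT_PROP));
        long interval = Long.parseLong(System.getProperty(BACKOFF_INTERVAL_PROP, "1000"));
        System.out.println("retry on 500? " + isRetryable(500));
        System.out.println("delays (ms): " + backoffDelaysMillis(limit, interval));
    }
}
```

A value the user sets explicitly, for example `-Dkubernetes.request.retry.backoffLimit=5`, is left untouched by `applyDefaultBackoffLimit`, which matches the "when the user has not specified a value" condition.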

### Why are the changes needed?

We experienced some internal K8s errors. For example, while an `etcdserver` leader election was in progress, the error was propagated to the API client and caused a failure in Spark:

```
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at:
https://kubernetes.default.svc/api/v1/namespaces/dex-app-bl24w4z9/pods/sparkpi-10-fcd3f6781a874212-driver. Message: etcdserver:
leader changed. Received status: Status(apiVersion=v1, code=500, details=null, kind=Status, message=etcdserver: leader changed,
metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=null,
status=Failure, additionalProperties={}).
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Ran the integration tests with `log4j.logger.org.apache.spark.deploy.k8s.SparkKubernetesClientFactory=DEBUG` added to the log4j config. This produced the following log:

```
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: 21/07/08 11:01:14 DEBUG org.apache.spark.deploy.k8s.SparkKubernetesClientFactory: Kubernetes client config: {
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "requestConfig" : {
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "username" : null,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "password" : null,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "oauthToken" : null,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "oauthTokenProvider" : null,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "impersonateUsername" : null,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "impersonateGroups" : [ null ],
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "impersonateExtras" : { },
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "watchReconnectInterval" : 1000,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "watchReconnectLimit" : -1,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "connectionTimeout" : 10000,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "uploadConnectionTimeout" : 10000,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "uploadRequestTimeout" : 120000,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "requestRetryBackoffLimit" : 3,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "requestRetryBackoffInterval" : 1000,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "requestTimeout" : 10000,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "rollingTimeout" : 900000,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "scaleTimeout" : 600000,
21/07/08 11:01:14.873 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "loggingInterval" : 20000,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "websocketTimeout" : 5000,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "websocketPingInterval" : 0,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "maxConcurrentRequests" : 64,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "maxConcurrentRequestsPerHost" : 5,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "impersonateGroup" : null
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   },
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "contexts" : [ {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "context" : {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "cluster" : "talos-default",
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "namespace" : "default",
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "user" : "admintalos-default"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     },
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "name" : "admintalos-default"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   }, {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "context" : {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "cluster" : "arn:aws:eks:us-west-2:392479084068:cluster/mow",
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "user" : "arn:aws:eks:us-west-2:392479084068:cluster/mow"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     },
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "name" : "arn:aws:eks:us-west-2:392479084068:cluster/mow"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   }, {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "context" : {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "cluster" : "minikube",
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "extensions" : [ {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:         "name" : "context_info"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       } ],
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "namespace" : "default",
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "user" : "minikube"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     },
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "name" : "minikube"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   }, {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "context" : {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "cluster" : "",
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "user" : ""
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     },
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "name" : "mow"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   } ],
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "currentContext" : {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "context" : {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "cluster" : "minikube",
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "extensions" : [ {
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:         "name" : "context_info"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       } ],
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "namespace" : "default",
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:       "user" : "minikube"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     },
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "name" : "minikube"
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   },
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "maxConcurrentRequests" : 64,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "maxConcurrentRequestsPerHost" : 5,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "autoConfigure" : false,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "trustCerts" : false,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "disableHostnameVerification" : false,
21/07/08 11:01:14.874 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "masterUrl" : "https://192.168.64.127:8443/",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "apiVersion" : "v1",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "namespace" : "a0993113b8084cd3868b3052e698b17f",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "caCertFile" : "/Users/attilazsoltpiros/.minikube/ca.crt",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "clientCertFile" : "/Users/attilazsoltpiros/.minikube/profiles/minikube/client.crt",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "clientKeyFile" : "/Users/attilazsoltpiros/.minikube/profiles/minikube/client.key",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "clientKeyAlgo" : "RSA",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "clientKeyPassphrase" : "changeit",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "watchReconnectInterval" : 1000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "watchReconnectLimit" : -1,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "connectionTimeout" : 10000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "uploadConnectionTimeout" : 10000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "uploadRequestTimeout" : 120000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "requestRetryBackoffLimit" : 3,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "requestRetryBackoffInterval" : 1000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "requestTimeout" : 10000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "rollingTimeout" : 900000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "scaleTimeout" : 600000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "loggingInterval" : 20000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "websocketTimeout" : 5000,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "websocketPingInterval" : 0,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "impersonateGroups" : [ null ],
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "impersonateExtras" : { },
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "http2Disable" : false,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "noProxy" : [ ],
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "tlsVersions" : [ "TLS_1_2" ],
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "errorMessages" : {
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "401" : "Unauthorized! Token may have expired! Please log-in again.",
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:     "403" : "Forbidden! User minikube doesn't have permission."
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   }
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils: }
```

The log contains the expected values:
```
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "requestRetryBackoffLimit" : 3,
21/07/08 11:01:14.875 ScalaTest-main-running-KubernetesSuite INFO ProcessUtils:   "requestRetryBackoffInterval" : 1000,
```

Closes #33261 from attilapiros/SPARK-35334.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
2021-07-13 13:46:18 +02:00
kubernetes [SPARK-35334][K8S] Make Spark more resilient to intermittent K8s flakiness 2021-07-13 13:46:18 +02:00
mesos [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT 2021-07-02 13:47:36 -07:00
yarn [SPARK-36067][BUILD][TEST][YARN] YarnClusterSuite fails due to NoClassDefFoundError unless hadoop-3.2 profile is activated explicitly 2021-07-09 15:18:52 +09:00