Commit graph

122 commits

Author SHA1 Message Date
beliefer 1cd80fa9fa [SPARK-31109][MESOS][DOC] Add version information to the configuration of Mesos
### What changes were proposed in this pull request?
Add version information to the configuration of `Mesos`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.mesos.$taskType.secret.names | 2.3.0 | SPARK-22131 | 5415963d2caaf95604211419ffc4e29fff38e1d7#diff-91e6e5f871160782dc50d4060d6faea3 |  
spark.mesos.$taskType.secret.values | 2.3.0 | SPARK-22131 | 5415963d2caaf95604211419ffc4e29fff38e1d7#diff-91e6e5f871160782dc50d4060d6faea3 |  
spark.mesos.$taskType.secret.envkeys | 2.3.0 | SPARK-22131 | 5415963d2caaf95604211419ffc4e29fff38e1d7#diff-91e6e5f871160782dc50d4060d6faea3 |  
spark.mesos.$taskType.secret.filenames | 2.3.0 | SPARK-22131 | 5415963d2caaf95604211419ffc4e29fff38e1d7#diff-91e6e5f871160782dc50d4060d6faea3 |  
spark.mesos.principal | 1.5.0 | SPARK-6284 | d86bbb4e286f16f77ba125452b07827684eafeed#diff-02a6d899f7a529eb7cfbb12182a110b0 |  
spark.mesos.principal.file | 2.4.0 | SPARK-16501 | 7f10cf83f311526737fc96d5bb8281d12e41932f#diff-daf48dabbe58afaeed8787751750b01d |  
spark.mesos.secret | 1.5.0 | SPARK-6284 | d86bbb4e286f16f77ba125452b07827684eafeed#diff-02a6d899f7a529eb7cfbb12182a110b0 |  
spark.mesos.secret.file | 2.4.0 | SPARK-16501 | 7f10cf83f311526737fc96d5bb8281d12e41932f#diff-daf48dabbe58afaeed8787751750b01d |  
spark.shuffle.cleaner.interval | 2.0.0 | SPARK-12583 | 310981d49a332bd329303f610b150bbe02cf5f87#diff-2fafefee94f2a2023ea9765536870258 |  
spark.mesos.dispatcher.webui.url | 2.0.0 | SPARK-13492 | a4a0addccffb7cd0ece7947d55ce2538afa54c97#diff-f541460c7a74cee87cbb460b3b01665e |  
spark.mesos.dispatcher.historyServer.url | 2.1.0 | SPARK-16809 | 62e62124419f3fa07b324f5e42feb2c5b4fde715#diff-3779e2035d9a09fa5f6af903925b9512 |  
spark.mesos.driver.labels | 2.3.0 | SPARK-21000 | 8da3f7041aafa71d7596b531625edb899970fec2#diff-91e6e5f871160782dc50d4060d6faea3 |  
spark.mesos.driver.webui.url | 2.0.0 | SPARK-13492 | a4a0addccffb7cd0ece7947d55ce2538afa54c97#diff-e3a5e67b8de2069ce99801372e214b8e |  
spark.mesos.driver.failoverTimeout | 2.3.0 | SPARK-21456 | c42ef953343073a50ef04c5ce848b574ff7f2238#diff-91e6e5f871160782dc50d4060d6faea3 |  
spark.mesos.network.name | 2.1.0 | SPARK-18232 | d89bfc92302424406847ac7a9cfca714e6b742fc#diff-ab5bf34f1951a8f7ea83c9456a6c3ab7 |  
spark.mesos.network.labels | 2.3.0 | SPARK-21694 | ce0d3bb377766bdf4df7852272557ae846408877#diff-91e6e5f871160782dc50d4060d6faea3 |  
spark.mesos.driver.constraints | 2.2.1 | SPARK-19606 | f6ee3d90d5c299e67ae6e2d553c16c0d9759d4b5#diff-91e6e5f871160782dc50d4060d6faea3 |  
spark.mesos.driver.frameworkId | 2.1.0 | SPARK-16809 | 62e62124419f3fa07b324f5e42feb2c5b4fde715#diff-02a6d899f7a529eb7cfbb12182a110b0 |  
spark.executor.uri | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-a885e7df97790e9b59c21c63353e7476 |  
spark.mesos.proxy.baseURL | 2.3.0 | SPARK-13041 | 663f30d14a0c9219e07697af1ab56e11a714d9a6#diff-0b9b4e122eb666155aa189a4321a6ca8 |  
spark.mesos.coarse | 0.6.0 | None | 63051dd2bcc4bf09d413ff7cf89a37967edc33ba#diff-eaf125f56ce786d64dcef99cf446a751 |  
spark.mesos.coarse.shutdownTimeout | 2.0.0 | SPARK-12330 | c756bda477f458ba4aad7fdb2026263507e0ad9b#diff-d425d35aa23c47a62fbb538554f2f2cf |  
spark.mesos.maxDrivers | 1.4.0 | SPARK-5338 | 53befacced828bbac53c6e3a4976ec3f036bae9e#diff-b964c449b99c51f0a5fd77270b2951a4 |  
spark.mesos.retainedDrivers | 1.4.0 | SPARK-5338 | 53befacced828bbac53c6e3a4976ec3f036bae9e#diff-b964c449b99c51f0a5fd77270b2951a4 |  
spark.mesos.cluster.retry.wait.max | 1.4.0 | SPARK-5338 | 53befacced828bbac53c6e3a4976ec3f036bae9e#diff-b964c449b99c51f0a5fd77270b2951a4 |  
spark.mesos.fetcherCache.enable | 2.1.0 | SPARK-15994 | e34b4e12673fb76c92f661d7c03527410857a0f8#diff-772ea7311566edb25f11a4c4f882179a |  
spark.mesos.appJar.local.resolution.mode | 2.4.0 | SPARK-24326 | 22df953f6bb191858053eafbabaa5b3ebca29f56#diff-6e4d0a0445975f03f975fdc1e3d80e49 |  
spark.mesos.rejectOfferDuration | 2.2.0 | SPARK-19702 | 2e30c0b9bcaa6f7757bd85d1f1ec392d5f916f83#diff-daf48dabbe58afaeed8787751750b01d |  
spark.mesos.rejectOfferDurationForUnmetConstraints | 1.6.0 | SPARK-10471 | 74f50275e429e649212928a9f36552941b862edc#diff-02a6d899f7a529eb7cfbb12182a110b0 |  
spark.mesos.rejectOfferDurationForReachedMaxCores | 2.0.0 | SPARK-13001 | 1e7d9bfb5a41f5c2479ab3b4d4081f00bf00bd31#diff-02a6d899f7a529eb7cfbb12182a110b0 |  
spark.mesos.uris | 1.5.0 | SPARK-8798 | a2f805729b401c68b60bd690ad02533b8db57b58#diff-e3a5e67b8de2069ce99801372e214b8e |  
spark.mesos.executor.home | 1.1.1 | SPARK-3264 | 069ecfef02c4af69fc0d3755bd78be321b68b01d#diff-e3a5e67b8de2069ce99801372e214b8e |  
spark.mesos.mesosExecutor.cores | 1.4.0 | SPARK-6350 | 6fbeb82e13db7117d8f216e6148632490a4bc5be#diff-e3a5e67b8de2069ce99801372e214b8e |  
spark.mesos.extra.cores | 0.6.0 | None | 2d761e3353651049f6707c74bb5ffdd6e86f6f35#diff-37af8c6e3634f97410ade813a5172621 |  
spark.mesos.executor.memoryOverhead | 1.1.1 | SPARK-3535 | 6f150978477830bbc14ba983786dd2bce12d1fe2#diff-6b498f5407d10e848acac4a1b182457c |  
spark.mesos.executor.docker.image | 1.4.0 | SPARK-2691 | 8f50a07d2188ccc5315d979755188b1e5d5b5471#diff-e3a5e67b8de2069ce99801372e214b8e |  
spark.mesos.executor.docker.forcePullImage | 2.1.0 | SPARK-15271 | 978cd5f125eb5a410bad2e60bf8385b11cf1b978#diff-0dd025320c7ecda2ea310ed7172d7f5a |  
spark.mesos.executor.docker.portmaps | 1.4.0 | SPARK-7373 | 226033cfffa2f37ebaf8bc2c653f094e91ef0c9b#diff-b964c449b99c51f0a5fd77270b2951a4 |  
spark.mesos.executor.docker.parameters | 2.2.0 | SPARK-19740 | a888fed3099e84c2cf45e9419f684a3658ada19d#diff-4139e6605a8c7f242f65cde538770c99 |  
spark.mesos.executor.docker.volumes | 1.4.0 | SPARK-7373 | 226033cfffa2f37ebaf8bc2c653f094e91ef0c9b#diff-b964c449b99c51f0a5fd77270b2951a4 |  
spark.mesos.gpus.max | 2.1.0 | SPARK-14082 | 29f186bfdf929b1e8ffd8e33ee37b76d5dc5af53#diff-d427ee890b913c5a7056be21eb4f39d7 |  
spark.mesos.task.labels | 2.2.0 | SPARK-20085 | c8fc1f3badf61bcfc4bd8eeeb61f73078ca068d1#diff-387c5d0c916278495fc28420571adf9e |  
spark.mesos.constraints | 1.5.0 | SPARK-6707 | 1165b17d24cdf1dbebb2faca14308dfe5c2a652c#diff-e3a5e67b8de2069ce99801372e214b8e |  
spark.mesos.containerizer | 2.1.0 | SPARK-16637 | 266b92faffb66af24d8ed2725beb80770a2d91f8#diff-0dd025320c7ecda2ea310ed7172d7f5a |  
spark.mesos.role | 1.5.0 | SPARK-6284 | d86bbb4e286f16f77ba125452b07827684eafeed#diff-02a6d899f7a529eb7cfbb12182a110b0 |  
The following appears in the document |   |   |   |  
spark.mesos.driverEnv.[EnvironmentVariableName] | 2.1.0 | SPARK-16194 | 235cb256d06653bcde4c3ed6b081503a94996321#diff-b964c449b99c51f0a5fd77270b2951a4 |  
spark.mesos.dispatcher.driverDefault.[PropertyName] | 2.1.0 | SPARK-16927 and SPARK-16923 | eca58755fbbc11937b335ad953a3caff89b818e6#diff-b964c449b99c51f0a5fd77270b2951a4 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Exists UT

Closes #27863 from beliefer/add-version-to-mesos-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-12 11:02:29 +09:00
gatorsmile 28b8713036 [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.

### Why are the changes needed?
N/A

### Does this PR introduce any user-facing change?
N/A

### How was this patch tested?
N/A

Closes #27698 from gatorsmile/updateVersion.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-25 19:44:31 -08:00
Thomas Graves 496f6ac860 [SPARK-29148][CORE] Add stage level scheduling dynamic allocation and scheduler backend changes
### What changes were proposed in this pull request?

This is another PR for stage level scheduling. In particular this adds changes to the dynamic allocation manager and the scheduler backend to be able to track what executors are needed per ResourceProfile.  Note the api is still private to Spark until the entire feature gets in, so this functionality will be there but only usable by tests for profiles other then the DefaultProfile.

The main changes here are simply tracking things on a ResourceProfile basis as well as sending the executor requests to the scheduler backend for all ResourceProfiles.

I introduce a ResourceProfileManager in this PR that will track all the actual ResourceProfile objects so that we can keep them all in a single place and just pass around and use in datastructures the resource profile id. The resource profile id can be used with the ResourceProfileManager to get the actual ResourceProfile contents.

There are various places in the code that use executor "slots" for things.  The ResourceProfile adds functionality to keep that calculation in it.   This logic is more complex then it should due to standalone mode and mesos coarse grained not setting the executor cores config. They default to all cores on the worker, so calculating slots is harder there.
This PR keeps the functionality to make the cores the limiting resource because the scheduler still uses that for "slots" for a few things.

This PR does also add the resource profile id to the Stage and stage info classes to be able to test things easier.   That full set of changes will come with the scheduler PR that will be after this one.

The PR stops at the scheduler backend pieces for the cluster manager and the real YARN support hasn't been added in this PR, that again will be in a separate PR, so this has a few of the API changes up to the cluster manager and then just uses the default profile requests to continue.

The code for the entire feature is here for reference: https://github.com/apache/spark/pull/27053/files although it needs to be upmerged again as well.

### Why are the changes needed?

Needed for stage level scheduling feature.

### Does this PR introduce any user-facing change?

No user facing api changes added here.

### How was this patch tested?

Lots of unit tests and manually testing. I tested on yarn, k8s, standalone, local modes. Ran both failure and success cases.

Closes #27313 from tgravescs/SPARK-29148.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-02-12 16:45:42 -06:00
Thomas Graves 3d2b8d8b13 [SPARK-30638][CORE] Add resources allocated to PluginContext
### What changes were proposed in this pull request?

Add the allocated resources to parameters to the PluginContext so that any plugins in driver or executor could use this information to initialize devices or use this information in a useful manner.

### Why are the changes needed?

To allow users to initialize/track devices once at the executor level before each task runs to use them.

### Does this PR introduce any user-facing change?

Yes to the people using the Executor/Driver plugin interface.

### How was this patch tested?

Unit tests and manually by writing a plugin that initialized GPU's using this interface.

Closes #27367 from tgravescs/pluginWithResources.

Lead-authored-by: Thomas Graves <tgraves@nvidia.com>
Co-authored-by: Thomas Graves <tgraves@apache.org>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-01-31 08:25:32 -06:00
Thomas Graves 6dbfa2bb9c [SPARK-29306][CORE] Stage Level Sched: Executors need to track what ResourceProfile they are created with
### What changes were proposed in this pull request?

This is the second PR for the Stage Level Scheduling. This is adding in the necessary executor side changes:
1) executors to know what ResourceProfile they should be using
2) handle parsing the resource profile settings - these are not in the global configs
3) then reporting back to the driver what resource profile it was started with.

This PR adds all the piping for YARN to pass the information all the way to executors, but it just uses the default ResourceProfile (which is the global applicatino level configs).

At a high level these changes include:
1) adding a new --resourceProfileId option to the CoarseGrainedExecutorBackend
2) Add the ResourceProfile settings to new internal confs that gets passed into the Executor
3) Executor changes that use the resource profile id passed in to read the corresponding ResourceProfile confs and then parse those requests and discover resources as necessary
4) Executor registers to Driver with the Resource profile id so that the ExecutorMonitor can track how many executor with each profile are running
5) YARN side changes to show that passing the resource profile id and confs actually works. Just uses the DefaultResourceProfile for now.

I also removed a check from the CoarseGrainedExecutorBackend that used to check to make sure there were task requirements before parsing any custom resource executor requests.  With the resource profiles this becomes much more expensive because we would then have to pass the task requests to each executor and the check was just a short cut and not really needed. It was much cleaner just to remove it.

Note there were some changes to the ResourceProfile, ExecutorResourceRequests, and TaskResourceRequests in this PR as well because I discovered some issues with things not being immutable. That api now look like:

val rpBuilder = new ResourceProfileBuilder()
val ereq = new ExecutorResourceRequests()
val treq = new TaskResourceRequests()

ereq.cores(2).memory("6g").memoryOverhead("2g").pysparkMemory("2g").resource("gpu", 2, "/home/tgraves/getGpus")
treq.cpus(2).resource("gpu", 2)

val resourceProfile = rpBuilder.require(ereq).require(treq).build

This makes is so that ResourceProfile is immutable and Spark can use it directly without worrying about the user changing it.

### Why are the changes needed?

These changes are needed for the executor to report which ResourceProfile they are using so that ultimately the dynamic allocation manager can use that information to know how many with a profile are running and how many more it needs to request.  Its also needed to get the resource profile confs to the executor so that it can run the appropriate discovery script if needed.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests and manually on YARN.

Closes #26682 from tgravescs/SPARK-29306.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-01-17 08:15:25 -06:00
Yuming Wang 696288f623 [INFRA] Reverts commit 56dcd79 and c216ef1
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79

2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1

### Why are the changes needed?
Shouldn't change master.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-master

Closes #26915 from wangyum/revert-master.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-12-16 19:57:44 -07:00
Yuming Wang 56dcd79992 Preparing development version 3.0.1-SNAPSHOT 2019-12-17 01:57:27 +00:00
Yuming Wang c216ef1d03 Preparing Spark release v3.0.0-preview2-rc2 2019-12-17 01:57:21 +00:00
Sean Owen 1febd373ea [MINOR][TESTS] Replace JVM assert with JUnit Assert in tests
### What changes were proposed in this pull request?

Use JUnit assertions in tests uniformly, not JVM assert() statements.

### Why are the changes needed?

assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.

### Does this PR introduce any user-facing change?

No. The assertion logic should be identical.

### How was this patch tested?

Existing tests.

Closes #26581 from srowen/assertToJUnit.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-11-20 14:04:15 -06:00
Xingbo Jiang 8207c835b4 Revert "Prepare Spark release v3.0.0-preview-rc2"
This reverts commit 007c873ae3.
2019-10-30 17:45:44 -07:00
Xingbo Jiang 007c873ae3 Prepare Spark release v3.0.0-preview-rc2
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A
2019-10-30 17:42:59 -07:00
Xingbo Jiang b33a58c0c6 Revert "Prepare Spark release v3.0.0-preview-rc1"
This reverts commit 5eddbb5f1d.
2019-10-28 22:32:34 -07:00
Xingbo Jiang 5eddbb5f1d Prepare Spark release v3.0.0-preview-rc1
### What changes were proposed in this pull request?

To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.

Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`

**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**

We shall revert the changes after 3.0.0-preview release passed.

### Why are the changes needed?

To make the maven release repository to accept the built jars.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #26243 from jiangxb1987/3.0.0-preview-prepare.

Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-10-28 22:31:29 -07:00
Sean Owen cc7493fa21 [SPARK-29416][CORE][ML][SQL][MESOS][TESTS] Use .sameElements to compare arrays, instead of .deep (gone in 2.13)
### What changes were proposed in this pull request?

Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.

### Why are the changes needed?

To compile with 2.13.

### Does this PR introduce any user-facing change?

None.

### How was this patch tested?

Existing tests.

Closes #26073 from srowen/SPARK-29416.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-10-09 17:00:48 -07:00
Dongjoon Hyun bd031c2173 [SPARK-29307][BUILD][TESTS] Remove scalatest deprecation warnings
### What changes were proposed in this pull request?

This PR aims to remove `scalatest` deprecation warnings with the following changes.
- `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar`
- `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser`
- `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers`
- `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks`

### Why are the changes needed?

According to the Jenkins logs, there are 118 warnings about this.
```
 grep "is deprecated" ~/consoleText | grep scalatest | wc -l
     118
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

After Jenkins passes, we need to check the Jenkins log.

Closes #25982 from dongjoon-hyun/SPARK-29307.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 21:00:11 -07:00
Sean Owen e1ea806b30 [SPARK-29291][CORE][SQL][STREAMING][MLLIB] Change procedure-like declaration to function + Unit for 2.13
### What changes were proposed in this pull request?

Scala 2.13 emits a deprecation warning for procedure-like declarations:

```
def foo() {
 ...
```

This is equivalent to the following, so should be changed to avoid a warning:

```
def foo(): Unit = {
  ...
```

### Why are the changes needed?

It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files.

Unfortunately, that makes this quite a big change.

### Does this PR introduce any user-facing change?

No behavior change at all.

### How was this patch tested?

Existing tests.

Closes #25968 from srowen/SPARK-29291.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-30 10:03:23 -07:00
turbofei eef5e6d348 [SPARK-29113][DOC] Fix some annotation errors and remove meaningless annotations in project
### What changes were proposed in this pull request?

In this PR, I fix some annotation errors and remove meaningless annotations in project.
### Why are the changes needed?
There are some annotation errors and meaningless annotations in project.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Verified manually.

Closes #25809 from turboFei/SPARK-29113.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-18 13:12:18 +09:00
Anton Kirillov f17f1d01e2 [SPARK-28778][MESOS] Fixed executors advertised address when running in virtual network
### What changes were proposed in this pull request?
Resolves [SPARK-28778: Shuffle jobs fail due to incorrect advertised address when running in a virtual network on Mesos](https://issues.apache.org/jira/browse/SPARK-28778).

This patch fixes a bug which occurs when shuffle jobs are launched by Mesos in a virtual network. Mesos scheduler sets executor `--hostname` parameter to `0.0.0.0` in the case when `spark.mesos.network.name` is provided. This makes executors use `0.0.0.0` as their advertised address and, in the presence of shuffle, executors fail to fetch shuffle blocks from each other using `0.0.0.0` as the origin. When a virtual network is used the hostname or IP address is not known upfront and assigned to a container at its start time so the executor process needs to advertise the correct dynamically assigned address to be reachable by other executors.

Changes:
- added a fallback to `Utils.localHostName()` in Spark Executors when `--hostname` is not provided
- removed setting executor address to `0.0.0.0` from Mesos scheduler
- refactored the code related to building executor command in Mesos scheduler
- added network configuration support to Docker containerizer
- added unit tests

### Why are the changes needed?
The bug described above prevents Mesos users from running any jobs which involve shuffle due to the inability of executors to fetch shuffle blocks because of incorrect advertised address when virtual network is used.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- added unit test to `MesosCoarseGrainedSchedulerBackendSuite` which verifies the absence of `--hostname` parameter  when `spark.mesos.network.name` is provided and its presence otherwise
- added unit test to `MesosSchedulerBackendUtilSuite` which verifies that `MesosSchedulerBackendUtil.buildContainerInfo` sets network-related properties for Docker containerizer
- unit tests from this repo launched with profiles: `./build/mvn test -Pmesos -Pnetlib-lgpl -Psparkr -Phive -Phive-thriftserver`, build log attached: [mvn.test.log](https://github.com/apache/spark/files/3516891/mvn.test.log)
- integration tests from [DCOS Spark repo](https://github.com/mesosphere/spark-build), more specifically - [test_spark_cni.py](https://github.com/mesosphere/spark-build/blob/master/tests/test_spark_cni.py) which runs a specific [shuffle job](https://github.com/mesosphere/spark-build/blob/master/tests/jobs/scala/src/main/scala/ShuffleApp.scala) and verifies its successful completion, Mesos task network configuration, and IP addresses for both Mesos and Docker containerizers

Closes #25500 from akirillov/DCOS-45840-fix-advertised-ip-in-virtual-networks.

Authored-by: Anton Kirillov <akirillov@mesosophere.io>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-23 18:30:05 -07:00
Jungtaek Lim (HeartSaVioR) 128ea37bda [SPARK-28601][CORE][SQL] Use StandardCharsets.UTF_8 instead of "UTF-8" string representation, and get rid of UnsupportedEncodingException
## What changes were proposed in this pull request?

This patch tries to keep consistency whenever UTF-8 charset is needed, as using `StandardCharsets.UTF_8` instead of using "UTF-8". If the String type is needed, `StandardCharsets.UTF_8.name()` is used.

This change also brings the benefit of getting rid of `UnsupportedEncodingException`, as we're providing `Charset` instead of `String` whenever possible.

This also changes some private Catalyst helper methods to operate on encodings as `Charset` objects rather than strings.

## How was this patch tested?

Existing unit tests.

Closes #25335 from HeartSaVioR/SPARK-28601.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-08-05 20:45:54 -07:00
Yuanjian Li db39f45baf [SPARK-28593][CORE] Rename ShuffleClient to BlockStoreClient which more close to its usage
## What changes were proposed in this pull request?

After SPARK-27677, the shuffle client not only handles the shuffle block but also responsible for local persist RDD blocks. For better code scalability and precise semantics(as the [discussion](https://github.com/apache/spark/pull/24892#discussion_r300173331)), here we did several changes:

- Rename ShuffleClient to BlockStoreClient.
- Correspondingly rename the ExternalShuffleClient to ExternalBlockStoreClient, also change the server-side class from ExternalShuffleBlockHandler to ExternalBlockHandler.
- Move MesosExternalBlockStoreClient to Mesos package.

Note, we still keep the name of BlockTransferService, because the `Service` contains both client and server, also the name of BlockTransferService is not referencing shuffle client only.

## How was this patch tested?

Existing UT.

Closes #25327 from xuanyuanking/SPARK-28593.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-08-05 14:54:45 +08:00
Gabor Somogyi f83000597f [SPARK-23472][CORE] Add defaultJavaOptions for driver and executor.
## What changes were proposed in this pull request?

This PR adds two new config properties: `spark.driver.defaultJavaOptions` and `spark.executor.defaultJavaOptions`. These are intended to be set by administrators in a file of defaults for options like JVM garbage collection algorithm. Users will still set `extraJavaOptions` properties, and both sets of JVM options will be added to start a JVM (default options are prepended to extra options).

## How was this patch tested?

Existing + additional unit tests.
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #24804 from gaborgsomogyi/SPARK-23472.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-11 09:37:26 -07:00
Xiangrui Meng 7056e004ee [SPARK-27823][CORE] Refactor resource handling code
## What changes were proposed in this pull request?

Continue the work from https://github.com/apache/spark/pull/24821. Refactor resource handling code to make the code more readable. Major changes:

* Moved resource-related classes to `spark.resource` from `spark`.
* Added ResourceUtils and helper classes so we don't need to directly deal with Spark conf.
 * ResourceID: resource identifier and it provides conf keys
 * ResourceRequest/Allocation: abstraction for requested and allocated resources
* Added `TestResourceIDs` to reference commonly used resource IDs in tests like `spark.executor.resource.gpu`.

cc: tgravescs jiangxb1987 Ngone51

## How was this patch tested?

Unit tests for added utils and existing unit tests.

Closes #24856 from mengxr/SPARK-27823.

Lead-authored-by: Xiangrui Meng <meng@databricks.com>
Co-authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-06-18 17:18:17 -07:00
Xingbo Jiang ac808e2a02 [SPARK-27366][CORE] Support GPU Resources in Spark job scheduling
## What changes were proposed in this pull request?

This PR adds support to schedule tasks with extra resource requirements (eg. GPUs) on executors with available resources. It also introduce a new method `TaskContext.resources()` so tasks can access available resource addresses allocated to them.

## How was this patch tested?

* Added new end-to-end test cases in `SparkContextSuite`;
* Added new test case in `CoarseGrainedSchedulerBackendSuite`;
* Added new test case in `CoarseGrainedExecutorBackendSuite`;
* Added new test case in `TaskSchedulerImplSuite`;
* Added new test case in `TaskSetManagerSuite`;
* Updated existing tests.

Closes #24374 from jiangxb1987/gpu.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
2019-06-04 16:57:47 -07:00
wenxuanguan e7443d6412 [SPARK-27774][CORE][MLLIB] Avoid hardcoded configs
## What changes were proposed in this pull request?

avoid hardcoded configs in `SparkConf` and `SparkSubmit` and test

## How was this patch tested?

N/A

Closes #24631 from wenxuanguan/minor-fix.

Authored-by: wenxuanguan <choose_home@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-05-22 10:45:11 +09:00
Thomas Graves db2e3c4341 [SPARK-27024] Executor interface for cluster managers to support GPU and other resources
## What changes were proposed in this pull request?

Add in GPU and generic resource type allocation to the executors.

Note this is part of a bigger feature for gpu-aware scheduling and is just how the executor find the resources. The general flow :

   - users ask for a certain set of resources, for instance number of gpus - each cluster manager has a specific way to do this.
  -  cluster manager allocates a container or set of resources (standalone mode)
-    When spark launches the executor in that container, the executor either has to be told what resources it has or it has to auto discover them.
  -  Executor has to register with Driver and tell the driver the set of resources it has so the scheduler can use that to schedule tasks that requires a certain amount of each of those resources

In this pr I added configs and arguments to the executor to be able discover resources. The argument to the executor is intended to be used by standalone mode or other cluster managers that don't have isolation so that it can assign specific resources to specific executors in case there are multiple executors on a node. The argument is a file contains JSON Array of ResourceInformation objects.

The discovery script is meant to be used in an isolated environment where the executor only sees the resources it should use.

Note that there will be follow on PRs to add other parts like the scheduler part. See the epic high level jira: https://issues.apache.org/jira/browse/SPARK-24615

## How was this patch tested?

Added unit tests and manually tested.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #24406 from tgravescs/gpu-sched-executor-clean.

Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2019-05-14 08:41:41 -05:00
Sam Tran bcd3b61c4b [SPARK-27347][MESOS] Fix supervised driver retry logic for outdated tasks
## What changes were proposed in this pull request?

This patch fixes a bug where `--supervised` Spark jobs would retry multiple times whenever an agent would crash, come back, and re-register even when those jobs had already relaunched on a different agent.

That is:
```
- supervised driver is running on agent1
- agent1 crashes
- driver is relaunched on another agent as `<task-id>-retry-1`
- agent1 comes back online and re-registers with scheduler
- spark relaunches the same job as `<task-id>-retry-2`
- now there are two jobs running simultaneously
```

This is because when an agent would come back and re-register it would send a status update `TASK_FAILED` for its old driver-task. Previous logic would indiscriminately remove the `submissionId` from Zookeeper's `launchedDrivers` node and add it to `retryList` node. Then, when a new offer came in, it would relaunch another `-retry-`  task even though one was previously running.

For example logs, scroll to bottom

## How was this patch tested?

- Added a unit test to simulate behavior described above
- Tested manually on a DC/OS cluster by
  ```
  - launching a --supervised spark job
  - dcos node ssh <to the agent with the running spark-driver>
  - systemctl stop dcos-mesos-slave
  - docker kill <driver-container-id>
  - [ wait until spark job is relaunched ]
  - systemctl start dcos-mesos-slave
  - [ observe spark driver is not relaunched as `-retry-2` ]
  ```

Log snippets included below. Notice the `-retry-1` task is running when status update for the old task comes in afterward:
```
19/01/15 19:21:38 TRACE MesosClusterScheduler: Received offers from Mesos:
... [offers] ...
19/01/15 19:21:39 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2532 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001"
...
19/01/15 19:21:42 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_STARTING message=''
19/01/15 19:21:43 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_RUNNING message=''
...
19/01/15 19:29:12 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_LOST message='health check timed out' reason=REASON_SLAVE_REMOVED
...
19/01/15 19:31:12 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2681 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-1"
...
19/01/15 19:31:15 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_STARTING message=''
19/01/15 19:31:16 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-1 state=TASK_RUNNING message=''
...
19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Unreachable agent re-reregistered'
...
19/01/15 19:33:45 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001 state=TASK_FAILED message='Abnormal executor termination: unknown container' reason=REASON_EXECUTOR_TERMINATED
19/01/15 19:33:45 ERROR MesosClusterScheduler: Unable to find driver with driver-20190115192138-0001 in status update
...
19/01/15 19:33:47 TRACE MesosClusterScheduler: Using offer 5d421001-0630-4214-9ecb-d5838a2ec149-O2729 to launch driver driver-20190115192138-0001 with taskId: value: "driver-20190115192138-0001-retry-2"
...
19/01/15 19:33:50 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_STARTING message=''
19/01/15 19:33:51 INFO MesosClusterScheduler: Received status update: taskId=driver-20190115192138-0001-retry-2 state=TASK_RUNNING message=''
```

Closes #24276 from samvantran/SPARK-27347-duplicate-retries.

Authored-by: Sam Tran <stran@mesosphere.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-10 10:53:31 -07:00
jiafu.zhang@intel.com fa5dc0a45a [SPARK-26632][CORE] Separate Thread Configurations of Driver and Executor
## What changes were proposed in this pull request?

For the below three thread configuration items applied to both driver and executor,
spark.rpc.io.serverThreads
spark.rpc.io.clientThreads
spark.rpc.netty.dispatcher.numThreads,
we separate them to driver specifics and executor specifics.
spark.driver.rpc.io.serverThreads                     < - > spark.executor.rpc.io.serverThreads
spark.driver.rpc.io.clientThreads                      < - > spark.executor.rpc.io.clientThreads
spark.driver.rpc.netty.dispatcher.numThreads < - > spark.executor.rpc.netty.dispatcher.numThreads

Spark reads these specifics first and fall back to the common configurations.

## How was this patch tested?
We ran the SimpleMap app without shuffle for benchmark purpose to test Spark's scalability in HPC with omini-path NIC which has higher bandwidth than normal ethernet NIC.

Spark's base version is 2.4.0.
Spark ran in the Standalone mode. Driver was in a standalone node.
After the separation, the performance is improved a lot in 256 nodes and 512 nodes. see below test results of SimpleMapTask before and after the enhancement. You can view the tables in the  [JIRA](https://issues.apache.org/jira/browse/SPARK-26632) too.

ds: spark.driver.rpc.io.serverThreads
dc: spark.driver.rpc.io.clientThreads
dd: spark.driver.rpc.netty.dispatcher.numThreads
ed: spark.executor.rpc.netty.dispatcher.numThreads
time: Overall Time (s)
old time: Overall Time without Separation (s)

**Before:**

 nodes | ds | dc | dd | ed | time
-- |-- | -- | -- | -- | --
128 nodes | 8 | 8 | 8 | 8 | 108
256 nodes | 8 | 8 | 8 | 8 | 196
512 nodes | 8 | 8 | 8 | 8 | 377

**After:**

nodes | ds | dc | dd | ed | time | improvement
-- | -- | -- | -- | -- | -- | --
128 nodes | 15 | 15 | 10 | 30 | 107 | 0.9%
256 nodes | 12 | 15 | 10 | 30 | 159 | 18.8%
512 nodes | 12 | 15 | 10 | 30 | 283 | 24.9%

Closes #23560 from zjf2012/thread_conf_separation.

Authored-by: jiafu.zhang@intel.com <jiafu.zhang@intel.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-05-10 10:42:43 -07:00
Sean Owen a6716d3f03 [SPARK-27571][CORE][YARN][EXAMPLES] Avoid scala.language.reflectiveCalls
## What changes were proposed in this pull request?

This PR avoids usage of reflective calls in Scala. It removes the import that suppresses the warnings and rewrites code in small ways to avoid accessing methods that aren't technically accessible.

## How was this patch tested?

Existing tests.

Closes #24463 from srowen/SPARK-27571.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-04-29 11:16:45 -05:00
Yuming Wang d70b6a39e1 [MINOR][BUILD] Add 2 maven properties(hive.classifier and hive.parquet.group)
## What changes were proposed in this pull request?

This pr adds 2 maven properties to help us upgrade the built-in Hive.

| Property Name | Default | In future |
| ------ | ------ | ------ |
| hive.classifier | (none) | core |
| hive.parquet.group | com.twitter | org.apache.parquet |

## How was this patch tested?

existing tests

Closes #23996 from wangyum/add_2_maven_properties.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-07 16:46:07 -06:00
mwlon 0ba19543d2 [SPARK-27015][MESOS] properly escape mesos scheduler arguments
## What changes were proposed in this pull request?

Escape arguments for submissions sent to a Mesos dispatcher; analogous change to https://issues.apache.org/jira/browse/SPARK-24380 for confs.

Since this changes behavior than some users are undoubtedly already working around, probably best to only only merge into master.

## How was this patch tested?

Added a new unit test, covering some existing behavior as well.

Closes #23967 from mwlon/SPARK-27015.

Authored-by: mwlon <mloncaric@hmc.edu>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-05 13:05:37 -08:00
mwlon 5fd4d7499c [SPARK-26192][MESOS] Retrieve enableFetcherCache option from submission for driver URIs
## What changes were proposed in this pull request?

Retrieve enableFetcherCache option from submission conf rather than dispatcher conf. This resolves some confusing behavior where Spark drivers currently get this conf from the dispatcher, whereas Spark executors get this conf from the submission. After this change, the conf will only need to be specified once.

## How was this patch tested?

With (updated) existing tests.

Closes #23924 from mwlon/SPARK-26192.

Authored-by: mwlon <mloncaric@hmc.edu>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-03-04 12:10:48 -08:00
SongYadong 86b25c4350 [SPARK-26967][CORE] Put MetricsSystem instance names together for clearer management
## What changes were proposed in this pull request?

`MetricsSystem` instance creations have a scattered distribution in the project code. So do their names. It may cause some inconvenience for browsing and management.
This PR tries to put them together. In this way, we can have a uniform location for adding or removing them, and have a overall view of `MetircsSystem `instances in current project.
It's also helpful for maintaining user documents by avoiding missing something.

## How was this patch tested?

Existing unit tests.

Closes #23869 from SongYadong/metrics_system_inst_manage.

Lead-authored-by: SongYadong <song.yadong1@zte.com.cn>
Co-authored-by: walter2001 <ydsong2007@163.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-01 11:49:43 -06:00
Maxim Gekk a829234df3 [SPARK-26817][CORE] Use System.nanoTime to measure time intervals
## What changes were proposed in this pull request?

In the PR, I propose to use `System.nanoTime()` instead of `System.currentTimeMillis()` in measurements of time intervals.

`System.currentTimeMillis()` returns current wallclock time and will follow changes to the system clock. Thus, negative wallclock adjustments can cause timeouts to "hang" for a long time (until wallclock time has caught up to its previous value again). This can happen when ntpd does a "step" after the network has been disconnected for some time. The most canonical example is during system bootup when DHCP takes longer than usual. This can lead to failures that are really hard to understand/reproduce. `System.nanoTime()` is guaranteed to be monotonically increasing irrespective of wallclock changes.

## How was this patch tested?

By existing test suites.

Closes #23727 from MaxGekk/system-nanotime.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <maxim.gekk@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-13 13:12:16 -06:00
Jungtaek Lim (HeartSaVioR) af4c59c0fb [SPARK-26843][MESOS] Use ConfigEntry for hardcoded configs for "mesos" resource manager
## What changes were proposed in this pull request?

This patch makes hardcoded configs in "mesos" module to use ConfigEntry, avoiding issues on mistake like SPARK-26082.

Please note that there're some changes on type while migrating to ConfigEntry: specifically "comma-separated list on a string" becomes "sequence of strings". While SparkConf smoothly handles on the change (comma-separated list on a string is still supported so backward compatible), there're some methods in utility class (`mesos` package private) to depend on the type change, so this patch also modifies the method signature for them a bit.

## How was this patch tested?

Existing tests.

Closes #23743 from HeartSaVioR/SPARK-26843.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-02-10 16:34:33 -08:00
Jungtaek Lim (HeartSaVioR) b8d666940b [SPARK-26082][MESOS][FOLLOWUP] Fix Scala-2.11 build
## What changes were proposed in this pull request?

#23744 added a UT to prevent a future regression. However, it breaks Scala-2.11 build. This fixes that.

## How was this patch tested?

Manual test with Scala-2.11 profile.

Closes #23755 from HeartSaVioR/SPARK-26082-FOLLOW-UP-V2.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-02-09 13:46:52 -08:00
Jungtaek Lim (HeartSaVioR) b4e1d14513 [SPARK-26082][MESOS][FOLLOWUP] Add UT on fetcher cache option on MesosClusterScheduler
## What changes were proposed in this pull request?

This patch adds UT on testing SPARK-26082 to avoid regression. While #23743 reduces the possibility to make a similar mistake, the needed lines of code for adding tests are not that huge, so I guess it might be worth to add them.

## How was this patch tested?

Newly added UTs. Test "supports setting fetcher cache" fails when #23743 is not applied and succeeds when #23743 is applied.

Closes #23744 from HeartSaVioR/SPARK-26082-add-unit-test.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-02-07 08:51:55 -08:00
mwlon c0811e8b4d [SPARK-26082][MESOS] Fix mesos fetch cache config name
## What changes were proposed in this pull request?

* change MesosClusterScheduler to use correct argument name for Mesos fetch cache (spark.mesos.fetchCache.enable -> spark.mesos.fetcherCache.enable)

## How was this patch tested?

Not sure this requires a test, since it's just a string change.

Closes #23734 from mwlon/SPARK-26082.

Authored-by: mwlon <mloncaric@hmc.edu>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-02-07 01:21:31 -08:00
Jungtaek Lim (HeartSaVioR) ae5b2a6a92 [SPARK-26311][CORE] New feature: apply custom log URL pattern for executor log URLs in SHS
## What changes were proposed in this pull request?

This patch proposes adding a new configuration on SHS: custom executor log URL pattern. This will enable end users to replace executor logs to other than RM provide, like external log service, which enables to serve executor logs when NodeManager becomes unavailable in case of YARN.

End users can build their own of custom executor log URLs with pre-defined patterns which would be vary on each resource manager. This patch adds some patterns to YARN resource manager. (For others, there's even no executor log url available so cannot define patterns as well.)

Please refer the doc change as well as added UTs in this patch to see how to set up the feature.

## How was this patch tested?

Added UT, as well as manual test with YARN cluster

Closes #23260 from HeartSaVioR/SPARK-26311.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-30 11:52:30 -08:00
Marcelo Vanzin 2a67dbfbd3 [SPARK-26595][CORE] Allow credential renewal based on kerberos ticket cache.
This change addes a new mode for credential renewal that does not require
a keytab; it uses the local ticket cache instead, so it works while the
user keeps the cache valid.

This can be useful for, e.g., people running long spark-shell sessions where
their kerberos login is kept up-to-date.

The main change to enable this behavior is in HadoopDelegationTokenManager,
with a small change in the HDFS token provider. The other changes are to avoid
creating duplicate tokens when submitting the application to YARN; they allow
the tokens from the scheduler to be sent to the YARN AM, reducing the round trips
to HDFS.

For that, the scheduler initialization code was changed a little bit so that
the tokens are available when the YARN client is initialized. That basically
takes care of a long-standing TODO that was in the code to clean up configuration
propagation to the driver's RPC endpoint (in CoarseGrainedSchedulerBackend).

Tested with an app designed to stress this functionality, with both keytab and
cache-based logins. Some basic kerberos tests on k8s also.

Closes #23525 from vanzin/SPARK-26595.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-28 13:32:34 -08:00
Kazuaki Ishizaki 7bf0794651 [SPARK-26463][CORE] Use ConfigEntry for hardcoded configs for scheduler categories.
## What changes were proposed in this pull request?

The PR makes hardcoded `spark.dynamicAllocation`, `spark.scheduler`, `spark.rpc`, `spark.task`, `spark.speculation`, and `spark.cleaner` configs to use `ConfigEntry`.

## How was this patch tested?

Existing tests

Closes #23416 from kiszk/SPARK-26463.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-22 07:44:36 -06:00
Jungtaek Lim (HeartSaVioR) 38f030725c [SPARK-26466][CORE] Use ConfigEntry for hardcoded configs for submit categories.
## What changes were proposed in this pull request?

The PR makes hardcoded configs below to use `ConfigEntry`.

* spark.kryo
* spark.kryoserializer
* spark.serializer
* spark.jars
* spark.files
* spark.submit
* spark.deploy
* spark.worker

This patch doesn't change configs which are not relevant to SparkConf (e.g. system properties).

## How was this patch tested?

Existing tests.

Closes #23532 from HeartSaVioR/SPARK-26466-v2.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-16 20:57:21 -06:00
Devaraj K 1b75f3bcff [SPARK-17928][MESOS] No driver.memoryOverhead setting for mesos cluster mode
## What changes were proposed in this pull request?

Added a new configuration 'spark.mesos.driver.memoryOverhead' for providing the driver memory overhead in mesos cluster mode.

## How was this patch tested?
Verified it manually, Resource Scheduler allocates (drivermemory+ driver memoryOverhead) for driver in mesos cluster mode.

Closes #17726 from devaraj-kavali/SPARK-17928.

Authored-by: Devaraj K <devaraj@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-01-15 15:45:20 -06:00
Marcelo Vanzin 2783e4c45f [SPARK-24522][UI] Create filter to apply HTTP security checks consistently.
Currently there is code scattered in a bunch of places to do different
things related to HTTP security, such as access control, setting
security-related headers, and filtering out bad content. This makes it
really easy to miss these things when writing new UI code.

This change creates a new filter that does all of those things, and
makes sure that all servlet handlers that are attached to the UI get
the new filter and any user-defined filters consistently. The extent
of the actual features should be the same as before.

The new filter is added at the end of the filter chain, because authentication
is done by custom filters and thus needs to happen first. This means that
custom filters see unfiltered HTTP requests - which is actually the current
behavior anyway.

As a side-effect of some of the code refactoring, handlers added after
the initial set also get wrapped with a GzipHandler, which didn't happen
before.

Tested with added unit tests and in a history server with SPNEGO auth
configured.

Closes #23302 from vanzin/SPARK-24522.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-01-08 11:25:33 -06:00
Marco Gaido 1a641525e6 [SPARK-26491][CORE][TEST] Use ConfigEntry for hardcoded configs for test categories
## What changes were proposed in this pull request?

The PR makes hardcoded `spark.test` and `spark.testing` configs to use `ConfigEntry` and put them in the config package.

## How was this patch tested?

existing UTs

Closes #23413 from mgaido91/SPARK-26491.

Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-01-07 15:35:33 -08:00
Marcelo Vanzin 669e8a1559 [SPARK-25689][YARN] Make driver, not AM, manage delegation tokens.
This change modifies the behavior of the delegation token code when running
on YARN, so that the driver controls the renewal, in both client and cluster
mode. For that, a few different things were changed:

* The AM code only runs code that needs DTs when DTs are available.

In a way, this restores the AM behavior to what it was pre-SPARK-23361, but
keeping the fix added in that bug. Basically, all the AM code is run in a
"UGI.doAs()" block; but code that needs to talk to HDFS (basically the
distributed cache handling code) was delayed to the point where the driver
is up and running, and thus when valid delegation tokens are available.

* SparkSubmit / ApplicationMaster now handle user login, not the token manager.

The previous AM code was relying on the token manager to keep the user
logged in when keytabs are used. This required some odd APIs in the token
manager and the AM so that the right UGI was exposed and used in the right
places.

After this change, the logged in user is handled separately from the token
manager, so the API was cleaned up, and, as explained above, the whole AM
runs under the logged in user, which also helps with simplifying some more code.

* Distributed cache configs are sent separately to the AM.

Because of the delayed initialization of the cached resources in the AM, it
became easier to write the cache config to a separate properties file instead
of bundling it with the rest of the Spark config. This also avoids having
to modify the SparkConf to hide things from the UI.

* Finally, the AM doesn't manage the token manager anymore.

The above changes allow the token manager to be completely handled by the
driver's scheduler backend code also in YARN mode (whether client or cluster),
making it similar to other RMs. To maintain the fix added in SPARK-23361 also
in client mode, the AM now sends an extra message to the driver on initialization
to fetch delegation tokens; and although it might not really be needed, the
driver also keeps the running AM updated when new tokens are created.

Tested in a kerberized cluster with the same tests used to validate SPARK-23361,
in both client and cluster mode. Also tested with a non-kerberized cluster.

Closes #23338 from vanzin/SPARK-25689.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-01-07 14:40:08 -06:00
Dongjoon Hyun e15a319ccd
[SPARK-26536][BUILD][TEST] Upgrade Mockito to 2.23.4
## What changes were proposed in this pull request?

This PR upgrades Mockito from 1.10.19 to 2.23.4. The following changes are required.

- Replace `org.mockito.Matchers` with `org.mockito.ArgumentMatchers`
- Replace `anyObject` with `any`
- Replace `getArgumentAt` with `getArgument` and add type annotation.
- Use `isNull` matcher in case of `null` is invoked.
```scala
     saslHandler.channelInactive(null);
-    verify(handler).channelInactive(any(TransportClient.class));
+    verify(handler).channelInactive(isNull());
```

- Make and use `doReturn` wrapper to avoid [SI-4775](https://issues.scala-lang.org/browse/SI-4775)
```scala
private def doReturn(value: Any) = org.mockito.Mockito.doReturn(value, Seq.empty: _*)
```

## How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #23452 from dongjoon-hyun/SPARK-26536.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2019-01-04 19:23:38 -08:00
Takuya UESHIN 4419e1daca [SPARK-26445][CORE] Use ConfigEntry for hardcoded configs for driver/executor categories.
## What changes were proposed in this pull request?

The PR makes hardcoded spark.driver, spark.executor, and spark.cores.max configs to use `ConfigEntry`.

Note that some config keys are from `SparkLauncher` instead of defining in the config package object because the string is already defined in it and it does not depend on core module.

## How was this patch tested?

Existing tests.

Closes #23415 from ueshin/issues/SPARK-26445/hardcoded_driver_executor_configs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-01-04 22:12:35 +08:00
DB Tsai ad853c5678
[SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0
## What changes were proposed in this pull request?

This PR makes Spark's default Scala version as 2.12, and Scala 2.11 will be the alternative version. This implies that Scala 2.12 will be used by our CI builds including pull request builds.

We'll update the Jenkins to include a new compile-only jobs for Scala 2.11 to ensure the code can be still compiled with Scala 2.11.

## How was this patch tested?

existing tests

Closes #22967 from dbtsai/scala2.12.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2018-11-14 16:22:23 -08:00
Takuya UESHIN 78fa1be29b [SPARK-25926][CORE] Move config entries in core module to internal.config.
## What changes were proposed in this pull request?

Currently definitions of config entries in `core` module are in several files separately. We should move them into `internal/config` to be easy to manage.

## How was this patch tested?

Existing tests.

Closes #22928 from ueshin/issues/SPARK-25926/single_config_file.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2018-11-06 09:18:17 +08:00
Marcelo Vanzin 68dde3481e [SPARK-23781][CORE] Merge token renewer functionality into HadoopDelegationTokenManager.
This avoids having two classes to deal with tokens; now the above
class is a one-stop shop for dealing with delegation tokens. The
YARN backend extends that class instead of doing composition like
before, resulting in a bit less code there too.

The renewer functionality is basically the same code that used to
be in YARN's AMCredentialRenewer. That is also the reason why the
public API of HadoopDelegationTokenManager is a little bit odd;
the YARN AM has some odd requirements for how this all should be
initialized, and the weirdness is needed currently to support that.

Tested:
- YARN with stress app for DT renewal
- Mesos and K8S with basic kerberos tests (both tgt and keytab)

Closes #22624 from vanzin/SPARK-23781.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2018-10-31 13:00:10 -05:00