Commit graph

247 commits

Author SHA1 Message Date
Marcelo Vanzin 397d3aae5b Bumping version to 1.3.0-SNAPSHOT.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3277 from vanzin/version-1.3 and squashes the following commits:

7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
5f404ff [Marcelo Vanzin] Add another exclusion.
19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
2014-11-18 21:24:18 -08:00
Kousuke Saruta 7f3718842c [SPARK-4282][YARN] Stopping flag in YarnClientSchedulerBackend should be volatile
In YarnClientSchedulerBackend, a variable "stopping" is used as a flag and it's accessed by some threads so it should be volatile.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #3143 from sarutak/stopping-flag-volatile and squashes the following commits:

58fdcc9 [Kousuke Saruta] Marked stoppig flag as volatile
2014-11-11 12:33:53 -06:00
Andrew Or 96136f222a [SPARK-3797] Minor addendum to Yarn shuffle service
I did not realize there was a `network.util.JavaUtils` when I wrote this code. This PR moves the `ByteBuffer` string conversion to the appropriate place. I tested the changes on a stable yarn cluster.

Author: Andrew Or <andrew@databricks.com>

Closes #3144 from andrewor14/yarn-shuffle-util and squashes the following commits:

b6c08bf [Andrew Or] Remove unused import
94e205c [Andrew Or] Use netty Unpooled
85202a5 [Andrew Or] Use guava Charsets
057135b [Andrew Or] Reword comment
adf186d [Andrew Or] Move byte buffer String conversion logic to JavaUtils
2014-11-06 17:18:49 -08:00
Andrew Or 61a5cced04 [SPARK-3797] Run external shuffle service in Yarn NM
This creates a new module `network/yarn` that depends on `network/shuffle` recently created in #3001. This PR introduces a custom Yarn auxiliary service that runs the external shuffle service. As of the changes here this shuffle service is required for using dynamic allocation with Spark.

This is still WIP mainly because it doesn't handle security yet. I have tested this on a stable Yarn cluster.

Author: Andrew Or <andrew@databricks.com>

Closes #3082 from andrewor14/yarn-shuffle-service and squashes the following commits:

ef3ddae [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
0ee67a2 [Andrew Or] Minor wording suggestions
1c66046 [Andrew Or] Remove unused provided dependencies
0eb6233 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
6489db5 [Andrew Or] Try catch at the right places
7b71d8f [Andrew Or] Add detailed java docs + reword a few comments
d1124e4 [Andrew Or] Add security to shuffle service (INCOMPLETE)
5f8a96f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
9b6e058 [Andrew Or] Address various feedback
f48b20c [Andrew Or] Fix tests again
f39daa6 [Andrew Or] Do not make network-yarn an assembly module
761f58a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-shuffle-service
15a5b37 [Andrew Or] Fix build for Hadoop 1.x
baff916 [Andrew Or] Fix tests
5bf9b7e [Andrew Or] Address a few minor comments
5b419b8 [Andrew Or] Add missing license header
804e7ff [Andrew Or] Include the Yarn shuffle service jar in the distribution
cd076a4 [Andrew Or] Require external shuffle service for dynamic allocation
ea764e0 [Andrew Or] Connect to Yarn shuffle service only if it's enabled
1bf5109 [Andrew Or] Use the shuffle service port specified through hadoop config
b4b1f0c [Andrew Or] 4 tabs -> 2 tabs
43dcb96 [Andrew Or] First cut integration of shuffle service with Yarn aux service
b54a0c4 [Andrew Or] Initial skeleton for Yarn shuffle service
2014-11-05 15:42:05 -08:00
andrewor14 087e31a7a0 [HOT FIX] Yarn stable tests don't compile
This is caused by this commit: acd4ac7c9a

Author: andrewor14 <andrew@databricks.com>
Author: Andrew Or <andrew@databricks.com>

Closes #3041 from andrewor14/yarn-hot-fix and squashes the following commits:

e5deba1 [andrewor14] Add new line at the end (minor)
aa998e8 [Andrew Or] Compilation hot fix
2014-10-31 14:36:55 -07:00
Sandy Ryza acd4ac7c9a SPARK-3837. Warn when YARN kills containers for exceeding memory limits
I triggered the issue and verified the message gets printed on a pseudo-distributed cluster.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2744 from sryza/sandy-spark-3837 and squashes the following commits:

858a268 [Sandy Ryza] Review feedback
c937f00 [Sandy Ryza] SPARK-3837. Warn when YARN kills containers for exceeding memory limits
2014-10-31 08:43:06 -05:00
Andrew Or 5231a3f228 [Minor] A few typos in comments and log messages
Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #3021 from andrewor14/typos and squashes the following commits:

daaf417 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
4838ae4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into typos
026d426 [Andrew Or] Merge branch 'master' of github.com:andrewor14/spark into typos
a81ae8f [Andrew Or] Some typos
2014-10-30 15:32:11 -07:00
Andrew Or 26f092d4e3 [SPARK-4138][SPARK-4139] Improve dynamic allocation settings
This should be merged after #2746 (SPARK-3795).

**SPARK-4138**. If the user sets both the number of executors and `spark.dynamicAllocation.enabled`, we should throw an exception.

**SPARK-4139**. If the user sets `spark.dynamicAllocation.enabled`, we should use the max number of executors as the starting number of executors because the first job is likely to run immediately after application startup. If the latter is not set, throw an exception.

Author: Andrew Or <andrew@databricks.com>

Closes #3002 from andrewor14/yarn-set-executors and squashes the following commits:

c528fce [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-set-executors
55d4699 [Andrew Or] Bug fix: `isDynamicAllocationEnabled` was always false
2b0ccec [Andrew Or] Start the number of executors at the max
022bfde [Andrew Or] Guard against incompatible settings of number of executors
2014-10-30 15:31:23 -07:00
GuoQiang Li cd739bd756 [SPARK-1720][SPARK-1719] use LD_LIBRARY_PATH instead of -Djava.library.path
- [X] Standalone
- [X] YARN
- [X] Mesos
- [X]  Mac OS X
- [X] Linux
- [ ]  Windows

This is another implementation about #1031

Author: GuoQiang Li <witgo@qq.com>

Closes #2711 from witgo/SPARK-1719 and squashes the following commits:

c7b26f6 [GuoQiang Li] review commits
4488e41 [GuoQiang Li] Refactoring CommandUtils
a444094 [GuoQiang Li] review commits
40c0b4a [GuoQiang Li] Add buildLocalCommand method
c1a0ddd [GuoQiang Li] fix comments
156ce88 [GuoQiang Li] review commit
38aa377 [GuoQiang Li] Refactor CommandUtils.scala
4269e00 [GuoQiang Li] Refactor SparkSubmitDriverBootstrapper.scala
7a1d634 [GuoQiang Li] use LD_LIBRARY_PATH instead of -Djava.library.path
2014-10-29 23:02:58 -07:00
Andrew Or 8d59b37b02 [SPARK-3795] Heuristics for dynamically scaling executors
This is part of a bigger effort to provide elastic scaling of executors within a Spark application ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This PR does not provide any functionality by itself; it is a skeleton that is missing a mechanism to be added later in [SPARK-3822](https://issues.apache.org/jira/browse/SPARK-3822).

Comments and feedback are most welcome. For those of you reviewing this in detail, I highly recommend doing it through your favorite IDE instead of through the diff here.

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Or <andrew@databricks.com>

Closes #2746 from andrewor14/scaling-heuristics and squashes the following commits:

8a4fdaa [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
e045df8 [Andrew Or] Add warning message (minor)
dfa31ec [Andrew Or] Fix tests
c0becc4 [Andrew Or] Merging with SPARK-3822
4784f93 [Andrew Or] Reword an awkward log message
181f27f [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
c79e907 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
4672b90 [Andrew Or] It's nano time.
a6a30f2 [Andrew Or] Do not allow min/max executors of 0
c60ec33 [Andrew Or] Rewrite test logic with clocks
b00b680 [Andrew Or] Fix style
c3caa65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
7f9da14 [Andrew Or] Factor out logic to verify bounds on # executors (minor)
f279019 [Andrew Or] Add time mocking tests for polling loop
685e347 [Andrew Or] Factor out clock in polling loop to facilitate testing
3cea7f7 [Andrew Or] Use PrivateMethodTester to keep original class private
3156d81 [Andrew Or] Update comments and exception messages
92f36f9 [Andrew Or] Address minor review comments
abdea61 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
2aefd09 [Andrew Or] Correct listener behavior
9fe6e44 [Andrew Or] Rename variables and configs + update comments and log messages
149cc32 [Andrew Or] Fix style
254c958 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5ff829b [Andrew Or] Add tests for ExecutorAllocationManager
19c6c4b [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5896515 [Andrew Or] Move ExecutorAllocationManager out of scheduler package
9ca8945 [Andrew Or] Rewrite callbacks through the listener interface
5e336b9 [Andrew Or] Remove code from backend to avoid conflict with SPARK-3822
092d1fd [Andrew Or] Remove timeout logic for pending requests
1309fab [Andrew Or] Request executors by specifying the number pending
8bc0e9d [Andrew Or] Add logic to expire pending requests after timeouts
b750ee1 [Andrew Or] Express timers in terms of expiration times + remove retry logic
7f8dd47 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
9d516cc [Andrew Or] Bug fix: Actually trigger the add timer / add retry timer
44f1832 [Andrew Or] Rename configs to include time units
eaae7ef [Andrew Or] Address various review comments
6f8be6c [Andrew Or] Beef up comments on what each of the timers mean
baaa403 [Andrew Or] Simplify variable names (minor)
42beec8 [Andrew Or] Reset whether the add threshold is crossed on cancellation
9bcc0bc [Andrew Or] ExecutorScalingManager -> ExecutorAllocationManager
2784398 [Andrew Or] Merge branch 'master' of github.com:apache/spark into scaling-heuristics
5a97d9e [Andrew Or] Log retry attempts in INFO + clean up logging
2f55c9f [Andrew Or] Do not keep requesting executors even after max attempts
0acd1cb [Andrew Or] Rewrite timer logic with polling
b3c7d44 [Andrew Or] Start the retry timer for adding executors at the right time
9b5f2ea [Andrew Or] Wording changes in comments and log messages
c2203a5 [Andrew Or] Simplify code to access the scheduler backend
e519d08 [Andrew Or] Simplify initialization code
2cc87a7 [Andrew Or] Add retry logic for removing executors
d0b34a6 [Andrew Or] Add retry logic for adding executors
9cc4649 [Andrew Or] Simplifying synchronization logic
67c03c7 [Andrew Or] Correct semantics of adding executors + update comments
6c48ab0 [Andrew Or] Update synchronization comment
8901900 [Andrew Or] Simplify remove policy + change the semantics of add policy
1cc8444 [Andrew Or] Minor wording change
ae5b64a [Andrew Or] Add synchronization
20ec6b9 [Andrew Or] First cut implementation of removing executors dynamically
4077ae2 [Andrew Or] Minor code re-organization
6f1fa66 [Andrew Or] First cut implementation of adding executors dynamically
b2e6dcc [Andrew Or] Add skeleton interface for requesting / killing executors
2014-10-29 17:48:59 -07:00
Andrew Or 1df05a40eb [SPARK-3822] Executor scaling mechanism for Yarn
This is part of a broader effort to enable dynamic scaling of executors ([SPARK-3174](https://issues.apache.org/jira/browse/SPARK-3174)). This is intended to work alongside SPARK-3795 (#2746), SPARK-3796 and SPARK-3797, but is functionally independently of these other issues.

The logic is built on top of PraveenSeluka's changes at #2798. This is different from the changes there in a few major ways: (1) the mechanism is implemented within the existing scheduler backend framework rather than in new `Actor` classes. This also introduces a parent abstract class `YarnSchedulerBackend` to encapsulate common logic to communicate with the Yarn `ApplicationMaster`. (2) The interface of requesting executors exposed to the `SparkContext` is the same, but the communication between the scheduler backend and the AM uses total number executors desired instead of an incremental number. This is discussed in #2746 and explained in the comments in the code.

I have tested this significantly on a stable Yarn cluster.

------------
A remaining task for this issue is to tone down the error messages emitted when an executor is removed.
Currently, `SparkContext` and its components react as if the executor has failed, resulting in many scary error messages and eventual timeouts. While it's not strictly necessary to fix this as of the first-cut implementation of this mechanism, it would be good to add logic to distinguish this case. I prefer to address this in a separate PR. I have filed a separate JIRA for this task at SPARK-4134.

Author: Andrew Or <andrew@databricks.com>
Author: Andrew Or <andrewor14@gmail.com>

Closes #2840 from andrewor14/yarn-scaling-mechanism and squashes the following commits:

485863e [Andrew Or] Minor log message changes
4920be8 [Andrew Or] Clarify that public API is only for Yarn mode for now
1c57804 [Andrew Or] Reword a few comments + other review comments
6321140 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
02836c0 [Andrew Or] Limit scope of synchronization
4e2ed7f [Andrew Or] Fix bug: keep track of removed executors properly
73ade46 [Andrew Or] Wording changes (minor)
2a7a6da [Andrew Or] Add `sc.killExecutor` as a shorthand (minor)
665f229 [Andrew Or] Mima excludes
79aa2df [Andrew Or] Simplify the request interface by asking for a total
04f625b [Andrew Or] Fix race condition that causes over-allocation of executors
f4783f8 [Andrew Or] Change the semantics of requesting executors
005a124 [Andrew Or] Fix tests
4628b16 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
db4a679 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
572f5c5 [Andrew Or] Unused import (minor)
f30261c [Andrew Or] Kill multiple executors rather than one at a time
de260d9 [Andrew Or] Simplify by skipping useless null check
9c52542 [Andrew Or] Simplify by skipping the TaskSchedulerImpl
97dd1a8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-scaling-mechanism
d987b3e [Andrew Or] Move addWebUIFilters to Yarn scheduler backend
7b76d0a [Andrew Or] Expose mechanism in SparkContext as developer API
47466cd [Andrew Or] Refactor common Yarn scheduler backend logic
c4dfaac [Andrew Or] Avoid thrashing when removing executors
53e8145 [Andrew Or] Start yarn actor early to listen for AM registration message
bbee669 [Andrew Or] Add mechanism in yarn client mode
2014-10-29 14:01:00 -07:00
Kousuke Saruta 247c529b35 [SPARK-3657] yarn alpha YarnRMClientImpl throws NPE appMasterRequest.setTrackingUrl starting spark-shell
tgravescs reported this issue.

Following is quoted from tgravescs' report.

YarnRMClientImpl.registerApplicationMaster can throw null pointer exception when setting the trackingurl if its empty:

    appMasterRequest.setTrackingUrl(new URI(uiAddress).getAuthority())

I hit this just start spark-shell without the tracking url set.

14/09/23 16:18:34 INFO yarn.YarnRMClientImpl: Connecting to ResourceManager at kryptonitered-jt1.red.ygrid.yahoo.com/98.139.154.99:8030
Exception in thread "main" java.lang.NullPointerException
        at org.apache.hadoop.yarn.proto.YarnServiceProtos$RegisterApplicationMasterRequestProto$Builder.setTrackingUrl(YarnServiceProtos.java:710)
        at org.apache.hadoop.yarn.api.protocolrecords.impl.pb.RegisterApplicationMasterRequestPBImpl.setTrackingUrl(RegisterApplicationMasterRequestPBImpl.java:132)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.registerApplicationMaster(YarnRMClientImpl.scala:102)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:55)
        at org.apache.spark.deploy.yarn.YarnRMClientImpl.register(YarnRMClientImpl.scala:38)
        at org.apache.spark.deploy.yarn.ApplicationMaster.registerAM(ApplicationMaster.scala:168)
        at org.apache.spark.deploy.yarn.ApplicationMaster.runExecutorLauncher(ApplicationMaster.scala:206)
        at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:120)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2981 from sarutak/SPARK-3657-2 and squashes the following commits:

e2fd6bc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3657
70b8882 [Kousuke Saruta] Fixed NPE thrown
2014-10-28 12:37:09 -07:00
WangTaoTheTonic 1ea3e3dc9d [SPARK-4096][YARN]let ApplicationMaster accept executor memory argument in same format as JVM memory strings
Here `ApplicationMaster` accept executor memory argument only in number format, we should let it accept JVM style memory strings as well.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2955 from WangTaoTheTonic/modifyDesc and squashes the following commits:

ab98c70 [WangTaoTheTonic] append parameter passed in
3779767 [WangTaoTheTonic] Update executor memory description in the help message
2014-10-28 12:31:42 -07:00
WangTaoTheTonic 0ac52e3055 [SPARK-4098][YARN]use appUIAddress instead of appUIHostPort in yarn-client mode
https://issues.apache.org/jira/browse/SPARK-4098

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2958 from WangTaoTheTonic/useAddress and squashes the following commits:

29236e6 [WangTaoTheTonic] use appUIAddress instead of appUIHostPort in yarn-cluster mode
2014-10-28 09:51:44 -05:00
WangTaoTheTonic e8813be653 [SPARK-4095][YARN][Minor]extract val isLaunchingDriver in ClientBase
Instead of checking if `args.userClass` is null repeatedly, we extract it to an global val as in `ApplicationMaster`.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2954 from WangTaoTheTonic/MemUnit and squashes the following commits:

13bda20 [WangTaoTheTonic] extract val isLaunchingDriver in ClientBase
2014-10-28 08:53:10 -05:00
WangTaoTheTonic 47346cd029 [SPARK-4116][YARN]Delete the abandoned log4j-spark-container.properties
Since its name reduced at https://github.com/apache/spark/pull/560, the log4j-spark-container.properties was never used again.
And I have searched its name globally in code and found no cite.

Author: WangTaoTheTonic <barneystinson@aliyun.com>

Closes #2977 from WangTaoTheTonic/delLog4j and squashes the following commits:

fb2729f [WangTaoTheTonic] delete the log4j file obsoleted
2014-10-28 08:46:31 -05:00
Prashant Sharma c9e05ca27c [SPARK-4032] Deprecate YARN alpha support in Spark 1.2
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2878 from ScrapCodes/SPARK-4032/deprecate-yarn-alpha and squashes the following commits:

17e9857 [Prashant Sharma] added deperecated comment to Client and ExecutorRunnable.
3a34b1e [Prashant Sharma] Updated docs...
4608dea [Prashant Sharma] [SPARK-4032] Deprecate YARN alpha support in Spark 1.2
2014-10-27 10:02:48 -07:00
Kousuke Saruta d2987e8f7a [SPARK-3900][YARN] ApplicationMaster's shutdown hook fails and IllegalStateException is thrown.
ApplicationMaster registers a shutdown hook and it calls ApplicationMaster#cleanupStagingDir.

cleanupStagingDir invokes FileSystem.get(yarnConf) and it invokes FileSystem.getInternal. FileSystem.getInternal also registers shutdown hook.
In FileSystem of hadoop 0.23, the shutdown hook registration does not consider whether shutdown is in progress or not (In 2.2, it's considered).

    // 0.23
    if (map.isEmpty() ) {
      ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY);
    }

    // 2.2
    if (map.isEmpty()
                && !ShutdownHookManager.get().isShutdownInProgress()) {
       ShutdownHookManager.get().addShutdownHook(clientFinalizer, SHUTDOWN_HOOK_PRIORITY);
    }

Thus, in 0.23, another shutdown hook can be registered when ApplicationMaster's shutdown hook run.

This issue cause IllegalStateException as follows.

    java.lang.IllegalStateException: Shutdown in progress, cannot add a shutdownHook
            at org.apache.hadoop.util.ShutdownHookManager.addShutdownHook(ShutdownHookManager.java:152)
            at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2306)
            at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2278)
            at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:316)
            at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:162)
            at org.apache.spark.deploy.yarn.ApplicationMaster.org$apache$spark$deploy$yarn$ApplicationMaster$$cleanupStagingDir(ApplicationMaster.scala:307)
            at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$3.run(ApplicationMaster.scala:118)
            at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2924 from sarutak/SPARK-3900-2 and squashes the following commits:

9112817 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900-2
97018fa [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
2c2850e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
ee52db2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
a7d6c9b [Kousuke Saruta] Merge branch 'SPARK-3900' of github.com:sarutak/spark into SPARK-3900
1cdf03c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
a5f6443 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3900
57b397d [Kousuke Saruta] Fixed IllegalStateException caused by shutdown hook registration in another shutdown hook
2014-10-24 08:51:08 -05:00
zsxwing 137d942353 [SPARK-3877][YARN] Throw an exception when application is not successful so that the exit code wil be set to 1
When an yarn application fails (yarn-cluster mode), the exit code of spark-submit is still 0. It's hard for people to write some automatic scripts to run spark jobs in yarn because the failure can not be detected in these scripts.

This PR added a status checking after `monitorApplication`. If an application is not successful, `run()` will throw an `SparkException`, so that Client.scala will exit with code 1. Therefore, people can use the exit code of `spark-submit` to write some automatic scripts.

Author: zsxwing <zsxwing@gmail.com>

Closes #2732 from zsxwing/SPARK-3877 and squashes the following commits:

1f89fa5 [zsxwing] Fix the unit test
a0498e1 [zsxwing] Update the docs and the error message
e1cb9ef [zsxwing] Fix the hacky way of calling Client
ff16fec [zsxwing] Remove System.exit in Client.scala and add a test
6a2c103 [zsxwing] [SPARK-3877] Throw an exception when application is not successful so that the exit code wil be set to 1
2014-10-22 15:04:41 -07:00
Marcelo Vanzin 803e7f0877 [SPARK-3979] [yarn] Use fs's default replication.
This avoids issues when HDFS is configured in a way that would not
allow the hardcoded default replication of "3".

Note: getDefaultReplication(Path) was added in 0.23.3, and the oldest
one available on Maven Central is 0.23.7, so I chose to not add code
to access that method via reflection.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2831 from vanzin/SPARK-3979 and squashes the following commits:

b0e3a97 [Marcelo Vanzin] [SPARK-3979] [yarn] Use fs's default replication.
2014-10-17 13:45:10 -05:00
Andrew Or c86c976037 [HOTFIX] Fix compilation error for Yarn 2.0.*-alpha
This was reported in https://issues.apache.org/jira/browse/SPARK-3445. There are API differences between the 0.23.* and the 2.0.*-alpha branches that are not accounted for when this code was introduced.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2776 from andrewor14/fix-yarn-alpha and squashes the following commits:

ec94752 [Andrew Or] Fix compilation error for 2.0.*-alpha
2014-10-12 15:41:27 -07:00
Sean Owen 363baacade SPARK-3811 [CORE] More robust / standard Utils.deleteRecursively, Utils.createTempDir
I noticed a few issues with how temp directories are created and deleted:

*Minor*

* Guava's `Files.createTempDir()` plus `File.deleteOnExit()` is used in many tests to make a temp dir, but `Utils.createTempDir()` seems to be the standard Spark mechanism
* Call to `File.deleteOnExit()` could be pushed into `Utils.createTempDir()` as well, along with this replacement
* _I messed up the message in an exception in `Utils` in SPARK-3794; fixed here_

*Bit Less Minor*

* `Utils.deleteRecursively()` fails immediately if any `IOException` occurs, instead of trying to delete any remaining files and subdirectories. I've observed this leave temp dirs around. I suggest changing it to continue in the face of an exception and throw one of the possibly several exceptions that occur at the end.
* `Utils.createTempDir()` will add a JVM shutdown hook every time the method is called. Even if the subdir is the parent of another parent dir, since this check is inside the hook. However `Utils` manages a set of all dirs to delete on shutdown already, called `shutdownDeletePaths`. A single hook can be registered to delete all of these on exit. This is how Tachyon temp paths are cleaned up in `TachyonBlockManager`.

I noticed a few other things that might be changed but wanted to ask first:

* Shouldn't the set of dirs to delete be `File`, not just `String` paths?
* `Utils` manages the set of `TachyonFile` that have been registered for deletion, but the shutdown hook is managed in `TachyonBlockManager`. Should this logic not live together, and not in `Utils`? it's more specific to Tachyon, and looks a slight bit odd to import in such a generic place.

Author: Sean Owen <sowen@cloudera.com>

Closes #2670 from srowen/SPARK-3811 and squashes the following commits:

071ae60 [Sean Owen] Update per @vanzin's review
da0146d [Sean Owen] Make Utils.deleteRecursively try to delete all paths even when an exception occurs; use one shutdown hook instead of one per method call to delete temp dirs
3a0faa4 [Sean Owen] Standardize on Utils.createTempDir instead of Files.createTempDir
2014-10-09 18:21:59 -07:00
Kousuke Saruta f18dd5962e [SPARK-3848] yarn alpha doesn't build on master
yarn alpha build was broken by #2432
as it added an argument to YarnAllocator but not to yarn/alpha YarnAllocationHandler
commit 79e45c9323

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2715 from sarutak/SPARK-3848 and squashes the following commits:

bafb8d1 [Kousuke Saruta] Fixed parameters for the default constructor of alpha/YarnAllocatorHandler.
2014-10-08 11:53:43 -05:00
Marcelo Vanzin 7fca8f41c8 [SPARK-3788] [yarn] Fix compareFs to do the right thing for HDFS namespaces.
HA and viewfs use namespaces instead of host names, so you can't
resolve them since that will fail. So be smarter to avoid doing
unnecessary work.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2649 from vanzin/SPARK-3788 and squashes the following commits:

fedbc73 [Marcelo Vanzin] Update comment.
c938845 [Marcelo Vanzin] Use Objects.equal() to avoid issues with ==.
9f7b571 [Marcelo Vanzin] [SPARK-3788] [yarn] Fix compareFs to do the right thing for HA, federation.
2014-10-08 08:48:55 -05:00
Marcelo Vanzin 35afdfd624 [SPARK-3710] Fix Yarn integration tests on Hadoop 2.2.
It seems some dependencies are not declared when pulling the 2.2
test dependencies, so we need to add them manually for the Yarn
cluster to come up.

These don't seem to be necessary for 2.3 and beyond, so restrict
them to the hadoop-2.2 profile.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2682 from vanzin/SPARK-3710 and squashes the following commits:

701d4fb [Marcelo Vanzin] Add comment.
0540bdf [Marcelo Vanzin] [SPARK-3710] Fix Yarn integration tests on Hadoop 2.2.
2014-10-07 23:26:24 -07:00
Thomas Graves 70e824f750 [SPARK-3627] - [yarn] - fix exit code and final status reporting to RM
See the description and whats handled in the jira comment: https://issues.apache.org/jira/browse/SPARK-3627?focusedCommentId=14150013&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14150013

This does not handle yarn client mode reporting of the driver to the AM.   I think that should be handled when we make it an unmanaged AM.

Author: Thomas Graves <tgraves@apache.org>

Closes #2577 from tgravescs/SPARK-3627 and squashes the following commits:

9c2efbf [Thomas Graves] review comments
e8cc261 [Thomas Graves] fix accidental typo during fixing comment
24c98e3 [Thomas Graves] rework
85f1901 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into SPARK-3627
fab166d [Thomas Graves] update based on review comments
32f4dfa [Thomas Graves] switch back
f0b6519 [Thomas Graves] change order of cleanup staging dir
d3cc800 [Thomas Graves] SPARK-3627 - yarn - fix exit code and final status reporting to RM
2014-10-07 09:51:37 -05:00
Kousuke Saruta 79e45c9323 [SPARK-3377] [SPARK-3610] Metrics can be accidentally aggregated / History server log name should not be based on user input
This PR is another solution for #2250

I'm using codahale base MetricsSystem of Spark with JMX or Graphite, and I saw following 2 problems.

(1) When applications which have same spark.app.name run on cluster at the same time, some metrics names are mixed. For instance, if 2+ application is running on the cluster at the same time, each application emits the same named metric like "SparkPi.DAGScheduler.stage.failedStages" and Graphite cannot distinguish the metrics is for which application.

(2) When 2+ executors run on the same machine, JVM metrics of each executors are mixed. For instance, 2+ executors running on the same node can emit the same named metric "jvm.memory" and Graphite cannot distinguish the metrics is from which application.

And there is an similar issue. The directory for event logs is named using application name.
Application name is defined by user and the name can includes illegal character for path names.
Further more, the directory name consists of application name and System.currentTimeMillis even though each application has unique Application ID so if we run jobs which have same name, it's difficult to identify which directory is for which application.

Closes #2250
Closes #1067

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2432 from sarutak/metrics-structure-improvement2 and squashes the following commits:

3288b2b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
39169e4 [Kousuke Saruta] Fixed style
6570494 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
817e4f0 [Kousuke Saruta] Simplified MetricsSystem#buildRegistryName
67fa5eb [Kousuke Saruta] Unified MetricsSystem#registerSources and registerSinks in start
10be654 [Kousuke Saruta] Fixed style.
990c078 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
f0c7fba [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
59cc2cd [Kousuke Saruta] Modified SparkContextSchedulerCreationSuite
f9b6fb3 [Kousuke Saruta] Modified style.
2cf8a0f [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
389090d [Kousuke Saruta] Replaced taskScheduler.applicationId() with getApplicationId in SparkContext#postApplicationStart
ff45c89 [Kousuke Saruta] Added some test cases to MetricsSystemSuite
69c46a6 [Kousuke Saruta] Added warning logging logic to MetricsSystem#buildRegistryName
5cca0d2 [Kousuke Saruta] Added Javadoc comment to SparkContext#getApplicationId
16a9f01 [Kousuke Saruta] Added data types to be returned to some methods
6434b06 [Kousuke Saruta] Reverted changes related to ApplicationId
0413b90 [Kousuke Saruta] Deleted ApplicationId.java and ApplicationIdSuite.java
a42300c [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
0fc1b09 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
42bea55 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
248935d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
f6af132 [Kousuke Saruta] Modified SchedulerBackend and TaskScheduler to return System.currentTimeMillis as an unique Application Id
1b8b53e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
97cb85c [Kousuke Saruta] Modified confliction of MimExcludes
2cdd009 [Kousuke Saruta] Modified defailt implementation of applicationId
9aadb0b [Kousuke Saruta] Modified NetworkReceiverSuite to ensure "executor.start()" is finished in test "network receiver life cycle"
3011efc [Kousuke Saruta] Added ApplicationIdSuite.scala
d009c55 [Kousuke Saruta] Modified ApplicationId#equals to compare appIds
dfc83fd [Kousuke Saruta] Modified ApplicationId to implement Serializable
9ff4851 [Kousuke Saruta] Modified MimaExcludes.scala to ignore createTaskScheduler method in SparkContext
4567ffc [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
6a91b14 [Kousuke Saruta] Modified SparkContextSchedulerCreationSuite, ExecutorRunnerTest and EventLoggingListenerSuite
0325caf [Kousuke Saruta] Added ApplicationId.scala
0a2fc14 [Kousuke Saruta] Modified style
eabda80 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
0f890e6 [Kousuke Saruta] Modified SparkDeploySchedulerBackend and Master to pass baseLogDir instead f eventLogDir
bcf25bf [Kousuke Saruta] Modified directory name for EventLogs
28d4d93 [Kousuke Saruta] Modified SparkContext and EventLoggingListener so that the directory for EventLogs is named same for Application ID
203634e [Kousuke Saruta] Modified comment in SchedulerBackend#applicationId and TaskScheduler#applicationId
424fea4 [Kousuke Saruta] Modified  the subclasses of TaskScheduler and SchedulerBackend so that they can return non-optional Unique Application ID
b311806 [Kousuke Saruta] Swapped last 2 arguments passed to CoarseGrainedExecutorBackend
8a2b6ec [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
086ee25 [Kousuke Saruta] Merge branch 'metrics-structure-improvement2' of github.com:sarutak/spark into metrics-structure-improvement2
e705386 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
36d2f7a [Kousuke Saruta] Added warning message for the situation we cannot get application id for the prefix for the name of metrics
eea6e19 [Kousuke Saruta] Modified CoarseGrainedMesosSchedulerBackend and MesosSchedulerBackend so that we can get Application ID
c229fbe [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
e719c39 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
4a93c7f [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement2
4776f9e [Kousuke Saruta] Modified MetricsSystemSuite.scala
efcb6e1 [Kousuke Saruta] Modified to add application id to metrics name
2ec848a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
3ea7896 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
ead8966 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
08e627e [Kousuke Saruta] Revert "tmp"
7b67f5a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
45bd33d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
93e263a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
848819c [Kousuke Saruta] Merge branch 'metrics-structure-improvement' of github.com:sarutak/spark into metrics-structure-improvement
912a637 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
e4a4593 [Kousuke Saruta] tmp
3e098d8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
4603a39 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
fa7175b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
15f88a3 [Kousuke Saruta] Modified MetricsSystem#buildRegistryName because conf.get does not return null when correspondin entry is absent
6f7dcd4 [Kousuke Saruta] Modified constructor of DAGSchedulerSource and BlockManagerSource because the instance of SparkContext is no longer used
6fc5560 [Kousuke Saruta] Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource
4e057c9 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into metrics-structure-improvement
85ffc02 [Kousuke Saruta] Revert "Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource"
868e326 [Kousuke Saruta] Modified MetricsSystem to set registry name with unique application-id and driver/executor-id
71609f5 [Kousuke Saruta] Modified sourceName of ExecutorSource, DAGSchedulerSource and BlockManagerSource
55debab [Kousuke Saruta] Modified SparkContext and Executor to set spark.executor.id to identifiers
4180993 [Kousuke Saruta] Modified SparkContext to retain spark.unique.app.name property in SparkConf
2014-10-03 13:48:56 -07:00
Marcelo Vanzin 30abef1547 [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA.
The existing code only considered one of the RMs when running in
Yarn HA mode, so it was possible to get errors if the active RM
was not registered in the filter.

The change makes use of a new API added to Yarn that returns all
proxy addresses, and falls back to the old behavior if the API
is not present. While there, I also made a change to look for the
scheme (http or https) being used by Yarn when building the proxy
URIs.

Since, in the case of multiple RMs, Yarn uses commas as a separator,
it was not possible anymore to use spark.filter.params to propagate
this information (which used commas to delimit different config params).
Instead, I added a new param (spark.filter.jsonParams) which expects
a JSON string containing a map with the config data. I chose not to
add it to the documentation at this point since I don't believe users
will use it directly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2469 from vanzin/SPARK-3606 and squashes the following commits:

aeb458a [Marcelo Vanzin] Undelete needed import.
65e400d [Marcelo Vanzin] Remove unused import.
d121883 [Marcelo Vanzin] Use separate config for each param instead of json.
04bc156 [Marcelo Vanzin] Review feedback.
4d4d6b9 [Marcelo Vanzin] [SPARK-3606] [yarn] Correctly configure AmIpFilter for Yarn HA.
2014-10-03 13:18:35 -07:00
Marcelo Vanzin fbe8e9856b [SPARK-2778] [yarn] Add workaround for race in MiniYARNCluster.
Sometimes the cluster's start() method returns before the configuration
having been updated, which is done by ClientRMService in, I assume, a
separate thread (otherwise there would be no race). That can cause tests
to fail if the old configuration data is read, since it will contain
the wrong RM address.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2605 from vanzin/SPARK-2778 and squashes the following commits:

8d02ce0 [Marcelo Vanzin] Minor cleanup.
5bebee7 [Marcelo Vanzin] [SPARK-2778] [yarn] Add workaround for race in MiniYARNCluster.
2014-10-03 11:36:24 -07:00
Nishkam Ravi b4fb7b80a0 Modify default YARN memory_overhead-- from an additive constant to a multiplier
Redone against the recent master branch (https://github.com/apache/spark/pull/1391)

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>

Closes #2485 from nishkamravi2/master_nravi and squashes the following commits:

636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2014-10-02 13:48:35 -05:00
Reynold Xin 3888ee2f38 [SPARK-3748] Log thread name in unit test logs
Thread names are useful for correlating failures.

Author: Reynold Xin <rxin@apache.org>

Closes #2600 from rxin/log4j and squashes the following commits:

83ffe88 [Reynold Xin] [SPARK-3748] Log thread name in unit test logs
2014-10-01 01:03:49 -07:00
Patrick Wendell 157e7d0f62 HOTFIX: Ignore flaky tests in YARN 2014-09-30 09:46:58 -07:00
Andrew Or 8da10bf146 [SPARK-3476] Remove outdated memory checks in Yarn
See description in [JIRA](https://issues.apache.org/jira/browse/SPARK-3476).

Author: Andrew Or <andrewor14@gmail.com>

Closes #2528 from andrewor14/yarn-memory-checks and squashes the following commits:

c5400cd [Andrew Or] Simplify checks
e30ffac [Andrew Or] Remove outdated memory checks
2014-09-26 11:50:48 -07:00
Marcelo Vanzin b8487713d3 [SPARK-2778] [yarn] Add yarn integration tests.
This patch adds a couple of, currently, very simple integration tests
to make sure both client and cluster modes are working. The tests don't
do much yet other than run a simple job, but the plan is to enhance
them after we get the framework in.

The cluster tests are noisy, so redirect all log output to a file
like other tests do. Copying the conf around sucks but it's less
work than messing with maven/sbt and having to clean up other
projects.

Note the test is only added for yarn-stable. The code compiles
against yarn-alpha but there are two issues I ran into that I
could not overcome:
- an old netty dependency kept creeping into the classpath and
  causing akka to not work, when using sbt; the old netty was
  correctly suppressed under maven.
- MiniYARNCluster kept failing to execute containers because it
  did not create the NM's local dir itself; this is apparently
  a known behavior, but I'm not sure how to work around it.

None of those issues are present with the stable Yarn.

Also, these tests are a little slow to run. Apparently Spark doesn't
yet tag tests (so that these could be isolated in a "slow" batch),
so this is something to keep in mind.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2257 from vanzin/yarn-tests and squashes the following commits:

6d5b84e [Marcelo Vanzin] Fix wrong system property being set.
8b0933d [Marcelo Vanzin] Merge branch 'master' into yarn-tests
5c2b56f [Marcelo Vanzin] Use custom log4j conf for Yarn containers.
ec73f17 [Marcelo Vanzin] More review feedback.
67f5b02 [Marcelo Vanzin] Review feedback.
f01517c [Marcelo Vanzin] Review feedback.
68fbbbf [Marcelo Vanzin] Use older constructor available in older Hadoop releases.
d07ef9a [Marcelo Vanzin] Merge branch 'master' into yarn-tests
add8416 [Marcelo Vanzin] [SPARK-2778] [yarn] Add yarn integration tests.
2014-09-24 23:10:26 -07:00
Kousuke Saruta 11c10df825 [SPARK-3304] [YARN] ApplicationMaster's Finish status is wrong when uncaught exception is thrown from ReporterThread
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2198 from sarutak/SPARK-3304 and squashes the following commits:

2696237 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
5b80363 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
4eb0a3e [Kousuke Saruta] Remoed the description about spark.yarn.scheduler.reporterThread.maxFailure
9741597 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
f7538d4 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
358ef8d [Kousuke Saruta] Merge branch 'SPARK-3304' of github.com:sarutak/spark into SPARK-3304
0d138c6 [Kousuke Saruta] Revert "tmp"
f8da10a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
b6e9879 [Kousuke Saruta] tmp
8d256ed [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
13b2652 [Kousuke Saruta] Merge branch 'SPARK-3304' of github.com:sarutak/spark into SPARK-3304
2711e15 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
c081f8e [Kousuke Saruta] Modified ApplicationMaster to handle exception in ReporterThread itself
0bbd3a6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3304
a6982ad [Kousuke Saruta] Added ability handling uncaught exception thrown from Reporter thread
2014-09-23 11:40:14 -05:00
Andrew Or c4022dd52b [SPARK-3477] Clean up code in Yarn Client / ClientBase
This is part of a broader effort to clean up the Yarn integration code after #2020.

The high-level changes in this PR include:
- Removing duplicate code, especially across the alpha and stable APIs
- Simplify unnecessarily complex method signatures and hierarchies
- Rename unclear variable and method names
- Organize logging output produced when the user runs Spark on Yarn
- Extensively add documentation
- Privatize classes where possible

I have tested the stable API on a Hadoop 2.4 cluster. I tested submitting a jar that references classes in other jars in both client and cluster mode. I also made changes in the alpha API, though I do not have access to an alpha cluster. I have verified that it compiles, but it would be ideal if others can help test it.

For those interested in some examples in detail, please read on.

--------------------------------------------------------------------------------------------------------

***Appendix***

- The loop to `getApplicationReport` from the RM is duplicated in 4 places: in the stable `Client`, alpha `Client`, and twice in `YarnClientSchedulerBackend`. We should not have different loops for client and cluster deploy modes.
- There are many fragmented small helper methods that are only used once and should just be inlined. For instance, `ClientBase#getLocalPath` returns `null` on certain conditions, and its only caller `ClientBase#addFileToClasspath` checks whether the value returned is `null`. We could just have the caller check on that same condition to avoid passing `null`s around.
- In `YarnSparkHadoopUtil#addToEnvironment`, we take in an argument `classpathSeparator` that always has the same value upstream (i.e. `File.pathSeparator`). This argument is now removed from the signature and all callers of this method upstream.
- `ClientBase#copyRemoteFile` is now renamed to `copyFileToRemote`. It was unclear whether we are copying a remote file to our local file system, or copying a locally visible file to a remote file system. Also, even the content of the method has inaccurately named variables. We use `val remoteFs` to signify the file system of the locally visible file and `val fs` to signify the remote, destination file system. These are now renamed `srcFs` and `destFs` respectively.
- We currently log the AM container's environment and resource mappings directly as Scala collections. This is incredibly hard to read and probably too verbose for the average Spark user. In other modes (e.g. standalone), we also don't log the launch commands by default, so the logging level of these information is now set to `DEBUG`.
- None of these classes (`Client`, `ClientBase`, `YarnSparkHadoopUtil` etc.) is intended to be used by a Spark application (the user should go through Spark submit instead). At the very least they should be `private[spark]`.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2350 from andrewor14/yarn-cleanup and squashes the following commits:

39e8c7b [Andrew Or] Address review comments
6619f9b [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
2ca6d64 [Andrew Or] Improve logging in application monitor
a3b9693 [Andrew Or] Minor changes
7dd6298 [Andrew Or] Simplify ClientBase#monitorApplication
547487c [Andrew Or] Provide default values for null application report entries
a0ad1e9 [Andrew Or] Fix class not found error
1590141 [Andrew Or] Address review comments
45ccdea [Andrew Or] Remove usages of getAMMemory
d8e33b6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
ed0b42d [Andrew Or] Fix alpha compilation error
c0587b4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-cleanup
6d74888 [Andrew Or] Minor comment changes
6573c1d [Andrew Or] Clean up, simplify and document code for setting classpaths
e4779b6 [Andrew Or] Clean up log messages + variable naming in ClientBase
8766d37 [Andrew Or] Heavily add documentation to Client* classes + various clean-ups
6c94d79 [Andrew Or] Various cleanups in ClientBase and ClientArguments
ef7069a [Andrew Or] Clean up YarnClientSchedulerBackend more
6de9072 [Andrew Or] Guard against potential NPE in debug logging mode
fabe4c4 [Andrew Or] Reuse more code in YarnClientSchedulerBackend
3f941dc [Andrew Or] First cut at simplifying the Client (stable and alpha)
2014-09-23 11:20:52 -05:00
peng.zhang 14f8c34040 [YARN] SPARK-2668: Add variable of yarn log directory for reference from the log4j configuration
Assign value of yarn container log directory to java opts "spark.yarn.app.container.log.dir", So user defined log4j.properties can reference this value and write log to YARN container's log directory.
Otherwise, user defined file appender will only write to container's CWD, and log files in CWD will not be displayed on YARN UI,and either cannot be aggregated to HDFS log directory after job finished.

User defined log4j.properties reference example:
log4j.appender.rolling_file.File = ${spark.yarn.app.container.log.dir}/spark.log

Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #1573 from renozhang/yarn-log-dir and squashes the following commits:

16c5cb8 [peng.zhang] Update doc
f2b5e2a [peng.zhang] Change variable's name, and update running-on-yarn.md
503ea2d [peng.zhang] Support log4j log to yarn container dir
2014-09-23 08:45:56 -05:00
chesterxgchen 7d1a37239c SPARK-3177 (on Master Branch)
The JIRA and PR was original created for branch-1.1, and move to master branch now.
Chester

The Issue is due to that yarn-alpha and yarn have different APIs for certain class fields. In this particular case,  the ClientBase using reflection to to address this issue, and we need to different way to test the ClientBase's method.  Original ClientBaseSuite using getFieldValue() method to do this. But it doesn't work for yarn-alpha as the API returns an array of String instead of just String (which is the case for Yarn-stable API).

 To fix the test, I add a new method

  def getFieldValue2[A: ClassTag, A1: ClassTag, B](clazz: Class[_], field: String,
                                                      defaults: => B)
                              (mapTo:  A => B)(mapTo1: A1 => B) : B =
    Try(clazz.getField(field)).map(_.get(null)).map {
      case v: A => mapTo(v)
      case v1: A1 => mapTo1(v1)
      case _ => defaults
    }.toOption.getOrElse(defaults)

to handle the cases where the field type can be either type A or A1. In this new method the type A or A1 is pattern matched and corresponding mapTo function (mapTo or mapTo1) is used.

Author: chesterxgchen <chester@alpinenow.com>

Closes #2204 from chesterxgchen/SPARK-3177-master and squashes the following commits:

e72a6ea [chesterxgchen]  The Issue is due to that yarn-alpha and yarn have different APIs for certain class fields. In this particular case,  the ClientBase using reflection to to address this issue, and we need to different way to test the ClientBase's method.  Original ClientBaseSuite using getFieldValue() method to do this. But it doesn't work for yarn-alpha as the API returns an array of String instead of just String (which is the case for Yarn-stable API).
2014-09-17 10:25:52 -05:00
Kousuke Saruta cc14644460 [SPARK-3410] The priority of shutdownhook for ApplicationMaster should not be integer literal
I think, it need to keep the priority of shutdown hook for ApplicationMaster than the priority of shutdown hook for o.a.h.FileSystem depending on changing the priority for FileSystem.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2283 from sarutak/SPARK-3410 and squashes the following commits:

1d44fef [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3410
bd6cc53 [Kousuke Saruta] Modified style
ee6f1aa [Kousuke Saruta] Added constant "SHUTDOWN_HOOK_PRIORITY" to ApplicationMaster
54eb68f [Kousuke Saruta] Changed Shutdown hook priority to 20
2f0aee3 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3410
4c5cb93 [Kousuke Saruta] Modified the priority for AM's shutdown hook
217d1a4 [Kousuke Saruta] Removed unused import statements
717aba2 [Kousuke Saruta] Modified ApplicationMaster to make to keep the priority of shutdown hook for ApplicationMaster higher than the priority of shutdown hook for HDFS
2014-09-15 08:53:58 -05:00
Prashant Sharma f493f7982b [SPARK-3452] Maven build should skip publishing artifacts people shouldn...
...'t depend on

Publish local in maven term is `install`

and publish otherwise is `deploy`

So disabled both for following projects.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2329 from ScrapCodes/SPARK-3452/maven-skip-install and squashes the following commits:

257b79a [Prashant Sharma] [SPARK-3452] Maven build should skip publishing artifacts people shouldn't depend on
2014-09-14 21:17:29 -07:00
Thomas Graves 25311c2c54 [SPARK-3456] YarnAllocator on alpha can lose container requests to RM
Author: Thomas Graves <tgraves@apache.org>

Closes #2373 from tgravescs/SPARK-3456 and squashes the following commits:

77e9532 [Thomas Graves] [SPARK-3456] YarnAllocator on alpha can lose container requests to RM
2014-09-12 20:31:11 -05:00
Sandy Ryza 1d767967e9 SPARK-3014. Log a more informative messages in a couple failure scenario...
...s

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1934 from sryza/sandy-spark-3014 and squashes the following commits:

ae19cc1 [Sandy Ryza] SPARK-3014. Log a more informative messages in a couple failure scenarios
2014-09-12 16:48:28 -05:00
Andrew Or 6324eb7b5b [Spark-3490] Disable SparkUI for tests
We currently open many ephemeral ports during the tests, and as a result we occasionally can't bind to new ones. This has caused the `DriverSuite` and the `SparkSubmitSuite` to fail intermittently.

By disabling the `SparkUI` when it's not needed, we already cut down on the number of ports opened significantly, on the order of the number of `SparkContexts` ever created. We must keep it enabled for a few tests for the UI itself, however.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2363 from andrewor14/disable-ui-for-tests and squashes the following commits:

332a7d5 [Andrew Or] No need to set spark.ui.port to 0 anymore
30c93a2 [Andrew Or] Simplify streaming UISuite
a431b84 [Andrew Or] Fix streaming test failures
8f5ae53 [Andrew Or] Fix no new line at the end
29c9b5b [Andrew Or] Disable SparkUI for tests
2014-09-11 17:18:46 -07:00
Chris Cope ed1980ffa9 [SPARK-2140] Updating heap memory calculation for YARN stable and alpha.
Updated pull request, reflecting YARN stable and alpha states. I am getting intermittent test failures on my own test infrastructure. Is that tracked anywhere yet?

Author: Chris Cope <ccope@resilientscience.com>

Closes #2253 from copester/master and squashes the following commits:

5ad89da [Chris Cope] [SPARK-2140] Removing calculateAMMemory functions since they are no longer needed.
52b4e45 [Chris Cope] [SPARK-2140] Updating heap memory calculation for YARN stable and alpha.
2014-09-11 08:13:07 -05:00
Sandy Ryza 1f4a648d4e SPARK-1713. Use a thread pool for launching executors.
This patch copies the approach used in the MapReduce application master for launching containers.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #663 from sryza/sandy-spark-1713 and squashes the following commits:

036550d [Sandy Ryza] SPARK-1713. [YARN] Use a threadpool for launching executor containers
2014-09-10 14:34:24 -05:00
Josh Rosen 26503fdf20 [HOTFIX] Fix scala style issue introduced by #2276. 2014-09-10 12:02:23 -07:00
Benoy Antony 6f7a76838f [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme i...
...s https

Author: Benoy Antony <benoy@apache.org>

Closes #2276 from benoyantony/SPARK-3286 and squashes the following commits:

c3d51ee [Benoy Antony] Use address with scheme, but Allpha version removes the scheme
e82f94e [Benoy Antony] Use address with scheme, but Allpha version removes the scheme
92127c9 [Benoy Antony] rebasing from master
450c536 [Benoy Antony] [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme is https
f060c02 [Benoy Antony] [SPARK-3286] - Cannot view ApplicationMaster UI when Yarn’s url scheme is https
2014-09-10 11:59:39 -05:00
GuoQiang Li 607ae39c22 [SPARK-3397] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
Author: GuoQiang Li <witgo@qq.com>

Closes #2268 from witgo/SPARK-3397 and squashes the following commits:

eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
2014-09-06 15:04:50 -07:00
Thomas Graves 62c5576099 [SPARK-3375] spark on yarn container allocation issues
If yarn doesn't get the containers immediately it stops asking for them and the yarn application hangs with never getting any executors.

The issue here is that we are sending the number of containers as 0 after we send the original one of X. on the yarn side this clears out the original request.

For a ping we should just send empty asks.

Author: Thomas Graves <tgraves@apache.org>

Closes #2275 from tgravescs/SPARK-3375 and squashes the following commits:

74b6820 [Thomas Graves] send empty resource requests when we aren't asking for containers
2014-09-05 09:56:22 -05:00
Thomas Graves 51b53a758c [SPARK-3260] yarn - pass acls along with executor launch
Pass along the acl settings when we launch a container so that they can be applied to viewing the logs on a running NodeManager.

Author: Thomas Graves <tgraves@apache.org>

Closes #2185 from tgravescs/SPARK-3260 and squashes the following commits:

6f94b5a [Thomas Graves] make unit test more robust
28b9dd3 [Thomas Graves] yarn - pass acls along with executor launch
2014-09-05 09:54:40 -05:00