Commit graph

467 commits

Author SHA1 Message Date
Dongjoon Hyun 761c2d1b6e [MINOR][DOCS] Add proper periods and spaces for CLI help messages and config doc.
## What changes were proposed in this pull request?

This PR adds some proper periods and spaces to Spark CLI help messages and SQL/YARN conf docs for consistency.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11848 from dongjoon-hyun/add_proper_period_and_space.
2016-03-21 08:00:09 +00:00
jerryshao 3537782168 [SPARK-13885][YARN] Fix attempt id regression for Spark running on Yarn
## What changes were proposed in this pull request?

This regression is introduced in #9182, previously attempt id is simply as counter "1" or "2". With the change of #9182, it is changed to full name as "appattemtp-xxx-00001", this will affect all the parts which uses this attempt id, like event log file name, history server app url link. So here change it back to the counter to keep consistent with previous code.

Also revert back this patch #11518, this patch fix the url link of history log according to the new way of attempt id, since here we change back to the previous way, so this patch is not necessary, here to revert it.

Also clean "spark.yarn.app.id" and "spark.yarn.app.attemptId", since it is useless now.

## How was this patch tested?

Test it with unit test and manually test different scenario:

1. application running in yarn-client mode.
2. application running in yarn-cluster mode.
3. application running in yarn-cluster mode with multiple attempts.

Checked both the event log file name and url link.

CC vanzin tgravescs , please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #11721 from jerryshao/SPARK-13885.
2016-03-18 12:39:49 -07:00
Wenchen Fan 8ef3399aff [SPARK-13928] Move org.apache.spark.Logging into org.apache.spark.internal.Logging
## What changes were proposed in this pull request?

Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11764 from cloud-fan/logger.
2016-03-17 19:23:38 +08:00
Jeff Zhang eacd9d8eda [SPARK-13360][PYSPARK][YARN] PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON…
… is not picked up in yarn-cluster mode

Author: Jeff Zhang <zjffdu@apache.org>

Closes #11238 from zjffdu/SPARK-13360.
2016-03-16 10:21:01 -07:00
Carson Wang 496d2a2b40 [SPARK-13889][YARN] Fix integer overflow when calculating the max number of executor failure
## What changes were proposed in this pull request?
The max number of executor failure before failing the application is default to twice the maximum number of executors if dynamic allocation is enabled. The default value for "spark.dynamicAllocation.maxExecutors" is Int.MaxValue. So this causes an integer overflow and a wrong result. The calculated value of the default max number of executor failure is 3. This PR adds a check to avoid the overflow.

## How was this patch tested?
It tests if the value is greater that Int.MaxValue / 2 to avoid the overflow when it multiplies 2.

Author: Carson Wang <carson.wang@intel.com>

Closes #11713 from carsonwang/IntOverflow.
2016-03-16 10:56:01 +00:00
jerryshao d89c71417c [SPARK-13642][YARN] Changed the default application exit state to failed for yarn cluster mode
## What changes were proposed in this pull request?

Changing the default exit state to `failed` for any application running on yarn cluster mode.

## How was this patch tested?

Unit test is done locally.

CC tgravescs and vanzin .

Author: jerryshao <sshao@hortonworks.com>

Closes #11693 from jerryshao/SPARK-13642.
2016-03-15 10:20:38 -07:00
Josh Rosen 07cb323e7a [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
2016-03-14 12:22:02 -07:00
Ryan Blue 63f642aea3 [SPARK-13779][YARN] Avoid cancelling non-local container requests.
To maximize locality, the YarnAllocator would cancel any requests with a
stale locality preference or no locality preference. This assumed that
the majority of tasks had locality preferences, but may not be the case
when scanning S3. This caused container requests for S3 tasks to be
constantly cancelled and resubmitted.

This changes the allocator's logic to cancel only stale requests and
just enough requests without locality preferences to submit requests
with locality preferences. This avoids cancelling requests without
locality preferences that would be resubmitted without locality
preferences.

We've deployed this patch on our clusters and verified that jobs that couldn't get executors because they kept canceling and resubmitting requests are fixed. Large jobs are running fine.

Author: Ryan Blue <blue@apache.org>

Closes #11612 from rdblue/SPARK-13779-fix-yarn-allocator-requests.
2016-03-14 11:18:37 -07:00
Dongjoon Hyun acdf219703 [MINOR][DOCS] Fix more typos in comments/strings.
## What changes were proposed in this pull request?

This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11689 from dongjoon-hyun/fix_more_typos.
2016-03-14 09:07:39 +00:00
Sean Owen 1840852841 [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
## What changes were proposed in this pull request?

- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #11657 from srowen/SPARK-13823.
2016-03-13 21:03:49 -07:00
Marcelo Vanzin 07f1c54477 [SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive.
In preparation for the demise of assemblies, this change allows the
YARN backend to use multiple jars and globs as the "Spark jar". The
config option has been renamed to "spark.yarn.jars" to reflect that.

A second option "spark.yarn.archive" was also added; if set, this
takes precedence and uploads an archive expected to contain the jar
files with the Spark code and its dependencies.

Existing deployments should keep working, mostly. This change drops
support for the "SPARK_JAR" environment variable, and also does not
fall back to using "jarOfClass" if no configuration is set, falling
back to finding files under SPARK_HOME instead. This should be fine
since "jarOfClass" probably wouldn't work unless you were using
spark-submit anyway.

Tested with the unit tests, and trying the different config options
on a YARN cluster.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11500 from vanzin/SPARK-13577.
2016-03-11 07:54:57 -06:00
jerryshao ca1a7b9d6a [HOTFIX][YARN] Fix yarn cluster mode fire and forget regression
## What changes were proposed in this pull request?

Fire and forget is disabled by default, with this patch #10205 it is enabled by default, so this is a regression should be fixed.

## How was this patch tested?

Manually verified this change.

Author: jerryshao <sshao@hortonworks.com>

Closes #11577 from jerryshao/hot-fix-yarn-cluster.
2016-03-08 09:43:35 -08:00
Marcelo Vanzin e1fb857992 [SPARK-529][CORE][YARN] Add type-safe config keys to SparkConf.
This is, in a way, the basics to enable SPARK-529 (which was closed as
won't fix but I think is still valuable). In fact, Spark SQL created
something for that, and this change basically factors out that code
and inserts it into SparkConf, with some extra bells and whistles.

To showcase the usage of this pattern, I modified the YARN backend
to use the new config keys (defined in the new `config` package object
under `o.a.s.deploy.yarn`). Most of the changes are mechanic, although
logic had to be slightly modified in a handful of places.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10205 from vanzin/conf-opts.
2016-03-07 14:13:44 -08:00
Dongjoon Hyun b5f02d6743 [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule
## What changes were proposed in this pull request?

After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.

## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11438 from dongjoon-hyun/SPARK-13583.
2016-03-03 10:12:32 +00:00
Sean Owen e97fc7f176 [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x
## What changes were proposed in this pull request?

Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:

- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map

The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.

## How was the this patch tested?

Existing Jenkins unit tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11292 from srowen/SPARK-13423.
2016-03-03 09:54:09 +00:00
Dongjoon Hyun 9c274ac4a6 [SPARK-13627][SQL][YARN] Fix simple deprecation warnings.
## What changes were proposed in this pull request?

This PR aims to fix the following deprecation warnings.
  * MethodSymbolApi.paramss--> paramLists
  * AnnotationApi.tpe -> tree.tpe
  * BufferLike.readOnly -> toList.
  * StandardNames.nme -> termNames
  * scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader
  * TypeApi.declarations-> decls

## How was this patch tested?

Check the compile build log and pass the tests.
```
./build/sbt
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11479 from dongjoon-hyun/SPARK-13627.
2016-03-02 20:34:22 -08:00
Marcelo Vanzin c7fccb56cd [SPARK-13478][YARN] Use real user when fetching delegation tokens.
The Hive client library is not smart enough to notice that the current
user is a proxy user; so when using a proxy user, it fails to fetch
delegation tokens from the metastore because of a missing kerberos
TGT for the current user.

To fix it, just run the code that fetches the delegation token as the
real logged in user.

Tested on a kerberos cluster both submitting normally and with a proxy
user; Hive and HBase tokens are retrieved correctly in both cases.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11358 from vanzin/SPARK-13478.
2016-02-29 13:01:27 -08:00
huangzhaowei 5c3912e5c9 [SPARK-12523][YARN] Support long-running of the Spark On HBase and hive meta store.
Obtain the hive metastore and hbase token as well as hdfs token in DelegationToeknRenewer to supoort long-running application of spark on hbase or thriftserver.

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #10645 from SaintBacchus/SPARK-12523.
2016-02-26 07:32:07 -06:00
hushan 7a6ee8a8fe [SPARK-12009][YARN] Avoid to re-allocating yarn container while driver want to stop all Executors
Author: hushan <hushan@xiaomi.com>

Closes #9992 from suyanNone/tricky.
2016-02-25 16:57:41 -08:00
huangzhaowei 5fcf4c2bfc [SPARK-12316] Wait a minutes to avoid cycle calling.
When application end, AM will clean the staging dir.
But if the driver trigger to update the delegation token, it will can't find the right token file and then it will endless cycle call the method 'updateCredentialsIfRequired'.
Then it lead driver StackOverflowError.
https://issues.apache.org/jira/browse/SPARK-12316

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #10475 from SaintBacchus/SPARK-12316.
2016-02-25 09:14:19 -06:00
Terence Yim fae88af184 [SPARK-13441][YARN] Fix NPE in yarn Client.createConfArchive method
## What changes were proposed in this pull request?

Instead of using result of File.listFiles() directly, which may throw NPE, check for null first. If it is null, log a warning instead

## How was the this patch tested?

Ran the ./dev/run-tests locally
Tested manually on a cluster

Author: Terence Yim <terence@cask.co>

Closes #11337 from chtyim/fixes/SPARK-13441-null-check.
2016-02-25 13:29:30 +00:00
jerryshao e99d017098 [SPARK-13220][CORE] deprecate yarn-client and yarn-cluster mode
Author: jerryshao <sshao@hortonworks.com>

Closes #11229 from jerryshao/SPARK-13220.
2016-02-23 12:30:57 +00:00
Dongjoon Hyun 024482bf51 [MINOR][DOCS] Fix all typos in markdown files of doc and similar patterns in other comments
## What changes were proposed in this pull request?

This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.

## How was the this patch tested?

manual tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11300 from dongjoon-hyun/minor_fix_typos.
2016-02-22 09:52:07 +00:00
Josh Rosen 289373b28c [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).

The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).

After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10608 from JoshRosen/SPARK-6363.
2016-01-30 00:20:28 -08:00
Shixiong Zhu bc1babd63d [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult`  depends on it.
- Update comments and docs

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10854 from zsxwing/remove-akka.
2016-01-22 21:20:04 -08:00
Shixiong Zhu 4f60651cbe [SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
  - Still keep the change that only reading checkpoint once. This is a manual change and worth to take a look carefully. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10692 from zsxwing/py4j-0.9.1.
2016-01-12 14:27:05 -08:00
Kousuke Saruta 112abf9100 [SPARK-12692][BUILD][YARN] Scala style: Fix the style violation (Space before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10686 from sarutak/SPARK-12692-followup-yarn.
2016-01-11 21:37:54 -08:00
Marcelo Vanzin b3ba1be3b7 [SPARK-3873][TESTS] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10582 from vanzin/SPARK-3873-tests.
2016-01-05 19:07:39 -08:00
Shixiong Zhu 4f5a24d7e7 [SPARK-7995][SPARK-6280][CORE] Remove AkkaRpcEnv and remove systemName from setupEndpointRef
### Remove AkkaRpcEnv

Keep `SparkEnv.actorSystem` because Streaming still uses it. Will remove it and AkkaUtils after refactoring Streaming actorStream API.

### Remove systemName
There are 2 places using `systemName`:
* `RpcEnvConfig.name`. Actually, although it's used as `systemName` in `AkkaRpcEnv`, `NettyRpcEnv` uses it as the service name to output the log `Successfully started service *** on port ***`. Since the service name in log is useful, I keep `RpcEnvConfig.name`.
* `def setupEndpointRef(systemName: String, address: RpcAddress, endpointName: String)`. Each `ActorSystem` has a `systemName`. Akka requires `systemName` in its URI and will refuse a connection if `systemName` is not matched. However, `NettyRpcEnv` doesn't use it. So we can remove `systemName` from `setupEndpointRef` since we are removing `AkkaRpcEnv`.

### Remove RpcEnv.uriOf

`uriOf` exists because Akka uses different URI formats for with and without authentication, e.g., `akka.ssl.tcp...` and `akka.tcp://...`. But `NettyRpcEnv` uses the same format. So it's not necessary after removing `AkkaRpcEnv`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10459 from zsxwing/remove-akka-rpc-env.
2015-12-31 00:15:55 -08:00
Marcelo Vanzin fd33333138 [SPARK-3873][YARN] Fix import ordering.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10536 from vanzin/SPARK-3873-yarn.
2015-12-30 18:20:27 -08:00
Kazuaki Ishizaki 3920466118 [SPARK-12311][CORE] Restore previous value of "os.arch" property in test suites after forcing to set specific value to "os.arch" property
Restore the original value of os.arch property after each test

Since some of tests forced to set the specific value to os.arch property, we need to set the original value.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #10289 from kiszk/SPARK-12311.
2015-12-24 13:37:28 +00:00
Reynold Xin f496031bd2 Bump master version to 2.0.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #10387 from rxin/version-bump.
2015-12-19 15:13:05 -08:00
Devaraj K ca0690b5ef [SPARK-4117][YARN] Spark on Yarn handle AM being told command from RM
Spark on Yarn handle AM being told command from RM

When RM throws ApplicationAttemptNotFoundException for allocate
invocation, making the ApplicationMaster to finish immediately without any
retries.

Author: Devaraj K <devaraj@apache.org>

Closes #10129 from devaraj-kavali/SPARK-4117.
2015-12-15 18:30:59 -08:00
Steve Loughran 442a7715a5 [SPARK-12241][YARN] Improve failure reporting in Yarn client obtainTokenForHBase()
This lines up the HBase token logic with that done for Hive in SPARK-11265: reflection with only CFNE being swallowed.

There is a test, one which doesn't try to put HBase on the yarn/test class and really do the reflection (the way the hive introspection does). If people do want that then it could be added with careful POM work

+also: cut an incorrect comment from the Hive test case before copying it, and a couple of imports that may have been related to the hive test in the past.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #10227 from steveloughran/stevel/patches/SPARK-12241-obtainTokenForHBase.
2015-12-09 10:25:38 -08:00
jerryshao 6900f01737 [SPARK-10582][YARN][CORE] Fix AM failure situation for dynamic allocation
Because of AM failure, the target executor number between driver and AM will be different, which will lead to unexpected behavior in dynamic allocation. So when AM is re-registered with driver, state in `ExecutorAllocationManager` and `CoarseGrainedSchedulerBacked` should be reset.

This issue is originally addressed in #8737 , here re-opened again. Thanks a lot KaiXinXiaoLei for finding this issue.

andrewor14 and vanzin would you please help to review this, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9963 from jerryshao/SPARK-10582.
2015-12-09 09:52:03 -08:00
meiyoula bbfc16ec9d [SPARK-12142][CORE]Reply false when container allocator is not ready and reset target
Using Dynamic Allocation function, when a new AM is starting, and ExecutorAllocationManager send RequestExecutor message to AM. If the container allocator is not ready, the whole app will hang on

Author: meiyoula <1039320815@qq.com>

Closes #10138 from XuTingjun/patch-1.
2015-12-04 16:50:43 -08:00
Shixiong Zhu 649be4fa45 [SPARK-12101][CORE] Fix thread pools that cannot cache tasks in Worker and AppClient
`SynchronousQueue` cannot cache any task. This issue is similar to #9978. It's an easy fix. Just use the fixed `ThreadUtils.newDaemonCachedThreadPool`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10108 from zsxwing/fix-threadpool.
2015-12-03 11:06:25 -08:00
Steve Loughran 8fa3e474a8 [SPARK-11314][YARN] add service API and test service for Yarn Cluster schedulers
This is purely the yarn/src/main and yarn/src/test bits of the YARN ATS integration: the extension model to load and run implementations of `SchedulerExtensionService` in the yarn cluster scheduler process —and to stop them afterwards.

There's duplication between the two schedulers, yarn-client and yarn-cluster, at least in terms of setting everything up, because the common superclass, `YarnSchedulerBackend` is in spark-core, and the extension services need the YARN app/attempt IDs.

If you look at how the the extension services are loaded, the case class `SchedulerExtensionServiceBinding` is used to pass in config info -currently just the spark context and the yarn IDs, of which one, the attemptID, will be null when running client-side. I'm passing in a case class to ensure that it would be possible in future to add extra arguments to the binding class, yet, as the method signature will not have changed, still be able to load existing services.

There's no functional extension service here, just one for testing. The real tests come in the bigger pull requests. At the same time, there's no restriction of this extension service purely to the ATS history publisher. Anything else that wants to listen to the spark context and publish events could use this, and I'd also consider writing one for the YARN-913 registry service, so that the URLs of the web UI would be locatable through that (low priority; would make more sense if integrated with a REST client).

There's no minicluster test. Given the test execution overhead of setting up minicluster tests, it'd  probably be better to add an extension service into one of the existing tests.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #9182 from steveloughran/stevel/feature/SPARK-1537-service.
2015-12-03 10:33:06 -08:00
Cheng Lian 69dbe6b40d [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues
This PR backports PR #10039 to master

Author: Cheng Lian <lian@databricks.com>

Closes #10063 from liancheng/spark-12046.doc-fix.master.
2015-12-01 10:21:31 -08:00
jerryshao 5fd86e4fc2 [SPARK-7173][YARN] Add label expression support for application master
Add label expression support for AM to restrict it runs on the specific set of nodes. I tested it locally and works fine.

sryza and vanzin please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9800 from jerryshao/SPARK-7173.
2015-11-23 10:41:17 -08:00
Marcelo Vanzin 880128f37e [SPARK-4134][CORE] Lower severity of some executor loss logs.
Don't log ERROR messages when executors are explicitly killed or when
the exit reason is not yet known.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9780 from vanzin/SPARK-11789.
2015-11-19 16:49:18 -08:00
Holden Karau 52c734b589 [SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two params have both in error msg
When we exceed the max memory tell users to increase both params instead of just the one.

Author: Holden Karau <holden@us.ibm.com>

Closes #9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.
2015-11-17 15:51:03 -08:00
jerryshao 24477d2705 [SPARK-11718][YARN][CORE] Fix explicitly killed executor dies silently issue
Currently if dynamic allocation is enabled, explicitly killing executor will not get response, so the executor metadata is wrong in driver side. Which will make dynamic allocation on Yarn fail to work.

The problem is  `disableExecutor` returns false for pending killing executors when `onDisconnect` is detected, so no further implementation is done.

One solution is to bypass these explicitly killed executors to use `super.onDisconnect` to remove executor. This is simple.

Another solution is still querying the loss reason for these explicitly kill executors. Since executor may get killed and informed in the same AM-RM communication, so current way of adding pending loss reason request is not worked (container complete is already processed), here we should store this loss reason for later query.

Here this PR chooses solution 2.

Please help to review. vanzin I think this part is changed by you previously, would you please help to review? Thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9684 from jerryshao/SPARK-11718.
2015-11-16 11:43:18 -08:00
tedyu 9009175416 [SPARK-11615] Drop @VisibleForTesting annotation
See http://search-hadoop.com/m/q3RTtjpe8r1iRbTj2 for discussion.

Summary: addition of VisibleForTesting annotation resulted in spark-shell malfunctioning.

Author: tedyu <yuzhihong@gmail.com>

Closes #9585 from tedyu/master.
2015-11-10 16:52:59 -08:00
Thomas Graves f6680cdc5d [SPARK-11555] spark on yarn spark-class --num-workers doesn't work
I tested the various options with both spark-submit and spark-class of specifying number of executors in both client and cluster mode where it applied.

--num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, default nothing supplied

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>

Closes #9523 from tgravescs/SPARK-11555.
2015-11-06 15:24:33 -08:00
Marcelo Vanzin 8790ee6d69 [SPARK-10622][CORE][YARN] Differentiate dead from "mostly dead" executors.
In YARN mode, when preemption is enabled, we may leave executors in a
zombie state while we wait to retrieve the reason for which the executor
exited. This is so that we don't account for failed tasks that were
running on a preempted executor.

The issue is that while we wait for this information, the scheduler
might decide to schedule tasks on the executor, which will never be
able to run them. Other side effects include the block manager still
considering the executor available to cache blocks, for example.

So, when we know that an executor went down but we don't know why,
stop everything related to the executor, except its running tasks.
Only when we know the reason for the exit (or give up waiting for
it) we do update the running tasks.

This is achieved by a new `disableExecutor()` method in the
`Schedulable` interface. For managers that do not behave like this
(i.e. every one but YARN), the existing `executorLost()` method
will behave the same way it did before.

On top of that change, a few minor changes that made debugging easier,
and fixed some other minor issues:
- The cluster-mode AM was printing a misleading log message every
  time an executor disconnected from the driver (because the akka
  actor system was shared between driver and AM).
- Avoid sending unnecessary requests for an executor's exit reason
  when we already know it was explicitly disabled / killed. This
  avoids both multiple requests, and unnecessary requests that would
  just cause warning messages on the AM (in the explicit kill case).
- Tone down a log message about the executor being lost when it
  exited normally (e.g. preemption)
- Wake up the AM monitor thread when requests for executor loss
  reasons arrive too, so that we can more quickly remove executors
  from this zombie state.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8887 from vanzin/SPARK-10622.
2015-11-04 09:07:22 -08:00
Marcelo Vanzin 71d1c907de [SPARK-10997][CORE] Add "client mode" to netty rpc env.
"Client mode" means the RPC env will not listen for incoming connections.
This allows certain processes in the Spark stack (such as Executors or
tha YARN client-mode AM) to act as pure clients when using the netty-based
RPC backend, reducing the number of sockets needed by the app and also the
number of open ports.

Client connections are also preferred when endpoints that actually have
a listening socket are involved; so, for example, if a Worker connects
to a Master and the Master needs to send a message to a Worker endpoint,
that client connection will be used, even though the Worker is also
listening for incoming connections.

With this change, the workaround for SPARK-10987 isn't necessary anymore, and
is removed. The AM connects to the driver in "client mode", and that connection
is used for all driver <-> AM communication, and so the AM is properly notified
when the connection goes down.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9210 from vanzin/SPARK-10997.
2015-11-02 10:26:36 -08:00
jerryshao a930e624eb [SPARK-9817][YARN] Improve the locality calculation of containers by taking pending container requests into consideraion
This is a follow-up PR to further improve the locality calculation by considering the pending container's request. Since the locality preferences of tasks may be shifted from time to time, current localities of pending container requests may not fully match the new preferences, this PR improve it by removing outdated, unmatched container requests and replace with new requests.

sryza please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #8100 from jerryshao/SPARK-9817.
2015-11-02 10:23:30 -08:00
Marcelo Vanzin f8d93edec8 [SPARK-11073][CORE][YARN] Remove akka dependency in secret key generation.
Use standard JDK APIs for that (with a little help from Guava). Most of the
changes here are in test code, since there were no tests specific to that
part of the code.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9257 from vanzin/SPARK-11073.
2015-11-01 15:57:42 -08:00
Steve Loughran 40d3c6797a [SPARK-11265][YARN] YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster
This is a fix for SPARK-11265; the introspection code to get Hive delegation tokens failing on Spark 1.5.1+, due to changes in the Hive codebase

Author: Steve Loughran <stevel@hortonworks.com>

Closes #9232 from steveloughran/stevel/patches/SPARK-11265-hive-tokens.
2015-10-31 18:23:15 -07:00