## What changes were proposed in this pull request?
<copy form JIRA>
Currently if neither `spark.yarn.jars` nor `spark.yarn.archive` is set (by default), Spark on yarn code will upload all the jars in the folder separately into distributed cache, this is quite time consuming, and very verbose, instead of upload jars separately into distributed cache, here changes to zip all the jars first, and then put into distributed cache.
This will significantly improve the speed of starting time.
## How was this patch tested?
Unit test and local integrated test is done.
Verified with SparkPi both in spark cluster and client mode.
Author: jerryshao <sshao@hortonworks.com>
Closes#12597 from jerryshao/SPARK-14836.
## What changes were proposed in this pull request?
This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.
## How was this patch tested?
Scala-style checks passed.
Author: Pravin Gadakh <prgadakh@in.ibm.com>
Closes#12416 from pravingadakh/SPARK-14613.
This work is based on twinkle-sachdeva 's proposal. In parallel to such mechanism for AM failures, here add similar mechanism for executor failure tracking, this is useful for long running Spark service to mitigate the executor failure problems.
Please help to review, tgravescs sryza and vanzin
Author: jerryshao <sshao@hortonworks.com>
Closes#10241 from jerryshao/SPARK-6735.
## What changes were proposed in this pull request?
With the addition of ExternalClusterManager(ECM) interface in PR #11723, any cluster manager can now be integrated with Spark. It was suggested in ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface but as a first step yarn will now use the ECM interface. This PR refactors YARN code from SparkContext.createTaskScheduler function into YarnClusterManager that implements ECM interface.
## How was this patch tested?
Since this is refactoring, no new tests has been added. Existing tests have been run. Basic manual testing with YARN was done too.
Author: Hemant Bhanawat <hemant@snappydata.io>
Closes#12641 from hbhanawat/yarnClusterMgr.
## What changes were proposed in this pull request?
Use Long.parseLong which returns a primative.
Use a series of appends() reduces the creation of an extra StringBuilder type
## How was this patch tested?
Unit tests
Author: Azeem Jiva <azeemj@gmail.com>
Closes#12520 from javawithjiva/minor.
## What changes were proposed in this pull request?
SPARK-12130 make 2.0 shuffle service incompatible with 1.x. So from discussion: [http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html](url) we should maintain compatibility between Spark 1.x and Spark 2.x's shuffle service.
I put string comparison into executor's register at first avoid string comparison in getBlockData every time.
## How was this patch tested?
N/A
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#12568 from lianhuiwang/SPARK-14731.
## What changes were proposed in this pull request?
This is a follow-up to #12557, with the following changes:
1. Fixes some of the style issues.
2. Merges Signaling and SignalLogger into a new class called SignalUtils. It was pretty confusing to have Signaling and Signal in one file, and it was also confusing to have two classes named Signaling and one called the other.
3. Made logging registration idempotent.
## How was this patch tested?
N/A.
Author: Reynold Xin <rxin@databricks.com>
Closes#12605 from rxin/SPARK-10001.
## What changes were proposed in this pull request?
Implement some `hashCode` and `equals` together in order to enable the scalastyle.
This is a first batch, I will continue to implement them but I wanted to know your thoughts.
Author: Joan <joan@goyeau.com>
Closes#12157 from joan38/SPARK-6429-HashCode-Equals.
This change avoids using the environment to pass this information, since
with many jars it's easy to hit limits on certain OSes. Instead, it encodes
the information into the Spark configuration propagated to the AM.
The first problem that needed to be solved is a chicken & egg issue: the
config file is distributed using the cache, and it needs to contain information
about the files that are being distributed. To solve that, the code now treats
the config archive especially, and uses slightly different code to distribute
it, so that only its cache path needs to be saved to the config file.
The second problem is that the extra information would show up in the Web UI,
which made the environment tab even more noisy than it already is when lots
of jars are listed. This is solved by two changes: the list of cached files
is now read only once in the AM, and propagated down to the ExecutorRunnable
code (which actually sends the list to the NMs when starting containers). The
second change is to unset those config entries after the list is read, so that
the SparkContext never sees them.
Tested with both client and cluster mode by running "run-example SparkPi". This
uploads a whole lot of files when run from a build dir (instead of a distribution,
where the list is cleaned up), and I verified that the configs do not show
up in the UI.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#12487 from vanzin/SPARK-14602.
## What changes were proposed in this pull request?
In SPARK-13063, It makes the SPARK YARN STAGING DIR as configurable. But it only support default FileSystem. If there are many clusters, It can be different FileSystem for different cluster in our spark.
## How was this patch tested?
I have tested it successfully with following commands:
MASTER=yarn-client ./bin/spark-shell --conf spark.yarn.stagingDir=hdfs:namenode2/temp
$SPARK_HOME/bin/spark-submit --conf spark.yarn.stagingDir=hdfs:namenode2/temp
cc tgravescs vanzin andrewor14
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#12473 from lianhuiwang/SPARK-14705.
## What changes were proposed in this pull request?
In the current implementation of assembly-free spark deployment, jars under `assembly/target/scala-xxx/jars` will be uploaded to distributed cache by default, there's a chance these jars' name will be conflicted with name of jars specified in `--jars`, this will introduce exception when starting application:
```
client token: N/A
diagnostics: Application application_1459907402325_0004 failed 2 times due to AM Container for appattempt_1459907402325_0004_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://hw12100.local:8088/proxy/application_1459907402325_0004/Then, click on links to logs of each attempt.
Diagnostics: Resource hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar changed on src filesystem (expected 1459909780508, was 1459909782590
java.io.IOException: Resource hdfs://localhost:8020/user/sshao/.sparkStaging/application_1459907402325_0004/avro-mapred-1.7.7-hadoop2.jar changed on src filesystem (expected 1459909780508, was 1459909782590
at org.apache.hadoop.yarn.util.FSDownload.copy(FSDownload.java:253)
at org.apache.hadoop.yarn.util.FSDownload.access$000(FSDownload.java:61)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:359)
at org.apache.hadoop.yarn.util.FSDownload$2.run(FSDownload.java:357)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1628)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:356)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
```
So here by checking the name of file to avoid same name files uploaded again.
## How was this patch tested?
Unit test and manual integrated test is done locally.
Author: jerryshao <sshao@hortonworks.com>
Closes#12203 from jerryshao/SPARK-14423.
## What changes were proposed in this pull request?
According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule.
```
case: Always omit braces in case clauses.
```
This PR makes a new ScalaStyle rule, 'OmitBracesInCase', and enforces it to the code.
## How was this patch tested?
Pass the Jenkins tests (including Scala style checking)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12280 from dongjoon-hyun/SPARK-14508.
## What changes were proposed in this pull request?
Currently Spark clients are started with the same memory setting for Xms and Xms leading to reserving unnecessary higher amounts of memory.
This behavior is changed and the clients can now specify an initial heap size using the extraJavaOptions in the config for driver,executor and am individually.
Note, that only -Xms can be provided through this config option, if the client wants to set the max size(-Xmx), this has to be done via the *.memory configuration knobs which are currently supported.
## How was this patch tested?
Monitored executor and yarn logs in debug mode to verify the commands through which they are being launched in client and cluster mode. The driver memory was verified locally using jps -v. Setting up -Xmx parameter in the javaExtraOptions raises exception with the info provided.
Author: Dhruve Ashar <dhruveashar@gmail.com>
Closes#12115 from dhruve/impr/SPARK-12384.
The current package name uses a dash, which is a little weird but seemed
to work. That is, until a new test tried to mock a class that references
one of those shaded types, and then things started failing.
Most changes are just noise to fix the logging configs.
For reference, SPARK-8815 also raised this issue, although at the time it
did not cause any issues in Spark, so it was not addressed.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11941 from vanzin/SPARK-14134.
## What changes were proposed in this pull request?
While I was reviewing #12078, I found most of CoarseGrainedSchedulerBackend's mutable fields doesn't have any comments about the thread-safe assumptions and it's hard for people to figure out which part of codes should be protected by the lock. This PR just added comments/annotations for them and also added strict access modifiers for some fields.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#12188 from zsxwing/comments.
## What changes were proposed in this pull request?
This PR fixes the following line.
```
private[spark] val STAGING_DIR = ConfigBuilder("spark.yarn.stagingDir")
.doc("Staging directory used while submitting applications.")
.stringConf
- .optional
+ .createOptional
```
## How was this patch tested?
Pass the build.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12187 from dongjoon-hyun/hotfix.
Because SQL keeps track of all known configs, some customization was
needed in SQLConf to allow that, since the core API does not have that
feature.
Tested via existing (and slightly updated) unit tests.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11570 from vanzin/SPARK-529-sql.
## What changes were proposed in this pull request?
Made the SPARK YARN STAGING DIR as configurable with the configuration as 'spark.yarn.staging-dir'.
## How was this patch tested?
I have verified it manually by running applications on yarn, If the 'spark.yarn.staging-dir' is configured then the value used as staging directory otherwise uses the default value i.e. file system’s home directory for the user.
Author: Devaraj K <devaraj@apache.org>
Closes#12082 from devaraj-kavali/SPARK-13063.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).
I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).
Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11796 from vanzin/SPARK-13579.
## What changes were proposed in this pull request?
This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes.
(All comment-only changes over 77 files: +786 lines, −747 lines)
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#12130 from dongjoon-hyun/use_multiine_javadoc_comments.
Currently, when max number of executor failures reached the `maxNumExecutorFailures`, `ApplicationMaster` will be killed and re-register another one.This time, `YarnAllocator` will be created a new instance.
But, the value of property `executorIdCounter` in `YarnAllocator` will reset to `0`. Then the Id of new executor will starting from `1`. This will confuse with the executor has already created before, which will cause FetchFailedException.
This situation is just in yarn client mode, so this is an issue in yarn client mode. For more details, [link to jira issues SPARK-12864](https://issues.apache.org/jira/browse/SPARK-12864)
This PR introduce a mechanism to initialize `executorIdCounter` after `ApplicationMaster` killed.
Author: zhonghaihua <793507405@qq.com>
Closes#10794 from zhonghaihua/initExecutorIdCounterAfterAMKilled.
## What changes were proposed in this pull request?
Currently in Spark on YARN, configurations can be passed through SparkConf, env and command arguments, some parts are duplicated, like client argument and SparkConf. So here propose to simplify the command arguments.
## How was this patch tested?
This patch is tested manually with unit test.
CC vanzin tgravescs , please help to suggest this proposal. The original purpose of this JIRA is to remove `ClientArguments`, through refactoring some arguments like `--class`, `--arg` are not so easy to replace, so here I remove the most part of command line arguments, only keep the minimal set.
Author: jerryshao <sshao@hortonworks.com>
Closes#11603 from jerryshao/SPARK-12343.
## What changes were proposed in this pull request?
1. Currently log4j which uses distributed cache only adds to AM's classpath, not executor's, this is introduced in #9118, which breaks the original meaning of that PR, so here add log4j file to the classpath of both AM and executors.
2. Automatically upload metrics.properties to distributed cache, so that it could be used by remote driver and executors implicitly.
## How was this patch tested?
Unit test and integration test is done.
Author: jerryshao <sshao@hortonworks.com>
Closes#11885 from jerryshao/SPARK-14062.
Move the logic to find Spark jars to CommandBuilderUtils and make it
available for YARN code, so that it's possible to easily launch Spark
on YARN from a build directory.
Tested by running SparkPi from the build directory on YARN.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11970 from vanzin/SPARK-13955.
## What changes were proposed in this pull request?
This PR adds some proper periods and spaces to Spark CLI help messages and SQL/YARN conf docs for consistency.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11848 from dongjoon-hyun/add_proper_period_and_space.
## What changes were proposed in this pull request?
This regression is introduced in #9182, previously attempt id is simply as counter "1" or "2". With the change of #9182, it is changed to full name as "appattemtp-xxx-00001", this will affect all the parts which uses this attempt id, like event log file name, history server app url link. So here change it back to the counter to keep consistent with previous code.
Also revert back this patch #11518, this patch fix the url link of history log according to the new way of attempt id, since here we change back to the previous way, so this patch is not necessary, here to revert it.
Also clean "spark.yarn.app.id" and "spark.yarn.app.attemptId", since it is useless now.
## How was this patch tested?
Test it with unit test and manually test different scenario:
1. application running in yarn-client mode.
2. application running in yarn-cluster mode.
3. application running in yarn-cluster mode with multiple attempts.
Checked both the event log file name and url link.
CC vanzin tgravescs , please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#11721 from jerryshao/SPARK-13885.
## What changes were proposed in this pull request?
Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <wenchen@databricks.com>
Closes#11764 from cloud-fan/logger.
## What changes were proposed in this pull request?
The max number of executor failure before failing the application is default to twice the maximum number of executors if dynamic allocation is enabled. The default value for "spark.dynamicAllocation.maxExecutors" is Int.MaxValue. So this causes an integer overflow and a wrong result. The calculated value of the default max number of executor failure is 3. This PR adds a check to avoid the overflow.
## How was this patch tested?
It tests if the value is greater that Int.MaxValue / 2 to avoid the overflow when it multiplies 2.
Author: Carson Wang <carson.wang@intel.com>
Closes#11713 from carsonwang/IntOverflow.
## What changes were proposed in this pull request?
Changing the default exit state to `failed` for any application running on yarn cluster mode.
## How was this patch tested?
Unit test is done locally.
CC tgravescs and vanzin .
Author: jerryshao <sshao@hortonworks.com>
Closes#11693 from jerryshao/SPARK-13642.
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.
In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.
Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2
/cc zsxwing tdas davies brkyvz
Author: Josh Rosen <joshrosen@databricks.com>
Closes#11687 from JoshRosen/py4j-0.9.2.
To maximize locality, the YarnAllocator would cancel any requests with a
stale locality preference or no locality preference. This assumed that
the majority of tasks had locality preferences, but may not be the case
when scanning S3. This caused container requests for S3 tasks to be
constantly cancelled and resubmitted.
This changes the allocator's logic to cancel only stale requests and
just enough requests without locality preferences to submit requests
with locality preferences. This avoids cancelling requests without
locality preferences that would be resubmitted without locality
preferences.
We've deployed this patch on our clusters and verified that jobs that couldn't get executors because they kept canceling and resubmitting requests are fixed. Large jobs are running fine.
Author: Ryan Blue <blue@apache.org>
Closes#11612 from rdblue/SPARK-13779-fix-yarn-allocator-requests.
## What changes were proposed in this pull request?
This PR fixes 135 typos over 107 files:
* 121 typos in comments
* 11 typos in testcase name
* 3 typos in log messages
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11689 from dongjoon-hyun/fix_more_typos.
## What changes were proposed in this pull request?
- Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
- Same for `InputStreamReader` and `OutputStreamWriter` constructors
- Standardizes on UTF-8 everywhere
- Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`)
- (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit 1deecd8d9c )
## How was this patch tested?
Jenkins tests
Author: Sean Owen <sowen@cloudera.com>
Closes#11657 from srowen/SPARK-13823.
In preparation for the demise of assemblies, this change allows the
YARN backend to use multiple jars and globs as the "Spark jar". The
config option has been renamed to "spark.yarn.jars" to reflect that.
A second option "spark.yarn.archive" was also added; if set, this
takes precedence and uploads an archive expected to contain the jar
files with the Spark code and its dependencies.
Existing deployments should keep working, mostly. This change drops
support for the "SPARK_JAR" environment variable, and also does not
fall back to using "jarOfClass" if no configuration is set, falling
back to finding files under SPARK_HOME instead. This should be fine
since "jarOfClass" probably wouldn't work unless you were using
spark-submit anyway.
Tested with the unit tests, and trying the different config options
on a YARN cluster.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11500 from vanzin/SPARK-13577.
## What changes were proposed in this pull request?
Fire and forget is disabled by default, with this patch #10205 it is enabled by default, so this is a regression should be fixed.
## How was this patch tested?
Manually verified this change.
Author: jerryshao <sshao@hortonworks.com>
Closes#11577 from jerryshao/hot-fix-yarn-cluster.
This is, in a way, the basics to enable SPARK-529 (which was closed as
won't fix but I think is still valuable). In fact, Spark SQL created
something for that, and this change basically factors out that code
and inserts it into SparkConf, with some extra bells and whistles.
To showcase the usage of this pattern, I modified the YARN backend
to use the new config keys (defined in the new `config` package object
under `o.a.s.deploy.yarn`). Most of the changes are mechanic, although
logic had to be slightly modified in a handful of places.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#10205 from vanzin/conf-opts.
## What changes were proposed in this pull request?
After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
This issue aims remove unused imports from Java/Scala code and add `UnusedImports` checkstyle rule to help developers.
## How was this patch tested?
```
./dev/lint-java
./build/sbt compile
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11438 from dongjoon-hyun/SPARK-13583.
## What changes were proposed in this pull request?
Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
- Inner class should be static
- Mismatched hashCode/equals
- Overflow in compareTo
- Unchecked warnings
- Misuse of assert, vs junit.assert
- get(a) + getOrElse(b) -> getOrElse(a,b)
- Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
- Dead code
- tailrec
- exists(_ == ) -> contains find + nonEmpty -> exists filter + size -> count
- reduce(_+_) -> sum map + flatten -> map
The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
## How was the this patch tested?
Existing Jenkins unit tests.
Author: Sean Owen <sowen@cloudera.com>
Closes#11292 from srowen/SPARK-13423.
## What changes were proposed in this pull request?
This PR aims to fix the following deprecation warnings.
* MethodSymbolApi.paramss--> paramLists
* AnnotationApi.tpe -> tree.tpe
* BufferLike.readOnly -> toList.
* StandardNames.nme -> termNames
* scala.tools.nsc.interpreter.AbstractFileClassLoader -> scala.reflect.internal.util.AbstractFileClassLoader
* TypeApi.declarations-> decls
## How was this patch tested?
Check the compile build log and pass the tests.
```
./build/sbt
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11479 from dongjoon-hyun/SPARK-13627.
The Hive client library is not smart enough to notice that the current
user is a proxy user; so when using a proxy user, it fails to fetch
delegation tokens from the metastore because of a missing kerberos
TGT for the current user.
To fix it, just run the code that fetches the delegation token as the
real logged in user.
Tested on a kerberos cluster both submitting normally and with a proxy
user; Hive and HBase tokens are retrieved correctly in both cases.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#11358 from vanzin/SPARK-13478.
Obtain the hive metastore and hbase token as well as hdfs token in DelegationToeknRenewer to supoort long-running application of spark on hbase or thriftserver.
Author: huangzhaowei <carlmartinmax@gmail.com>
Closes#10645 from SaintBacchus/SPARK-12523.
When application end, AM will clean the staging dir.
But if the driver trigger to update the delegation token, it will can't find the right token file and then it will endless cycle call the method 'updateCredentialsIfRequired'.
Then it lead driver StackOverflowError.
https://issues.apache.org/jira/browse/SPARK-12316
Author: huangzhaowei <carlmartinmax@gmail.com>
Closes#10475 from SaintBacchus/SPARK-12316.
## What changes were proposed in this pull request?
Instead of using result of File.listFiles() directly, which may throw NPE, check for null first. If it is null, log a warning instead
## How was the this patch tested?
Ran the ./dev/run-tests locally
Tested manually on a cluster
Author: Terence Yim <terence@cask.co>
Closes#11337 from chtyim/fixes/SPARK-13441-null-check.
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was the this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes#11300 from dongjoon-hyun/minor_fix_typos.
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
Author: Josh Rosen <joshrosen@databricks.com>
Closes#10608 from JoshRosen/SPARK-6363.
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10854 from zsxwing/remove-akka.
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
- Still keep the change that only reading checkpoint once. This is a manual change and worth to take a look carefully. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10692 from zsxwing/py4j-0.9.1.
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes#10686 from sarutak/SPARK-12692-followup-yarn.
### Remove AkkaRpcEnv
Keep `SparkEnv.actorSystem` because Streaming still uses it. Will remove it and AkkaUtils after refactoring Streaming actorStream API.
### Remove systemName
There are 2 places using `systemName`:
* `RpcEnvConfig.name`. Actually, although it's used as `systemName` in `AkkaRpcEnv`, `NettyRpcEnv` uses it as the service name to output the log `Successfully started service *** on port ***`. Since the service name in log is useful, I keep `RpcEnvConfig.name`.
* `def setupEndpointRef(systemName: String, address: RpcAddress, endpointName: String)`. Each `ActorSystem` has a `systemName`. Akka requires `systemName` in its URI and will refuse a connection if `systemName` is not matched. However, `NettyRpcEnv` doesn't use it. So we can remove `systemName` from `setupEndpointRef` since we are removing `AkkaRpcEnv`.
### Remove RpcEnv.uriOf
`uriOf` exists because Akka uses different URI formats for with and without authentication, e.g., `akka.ssl.tcp...` and `akka.tcp://...`. But `NettyRpcEnv` uses the same format. So it's not necessary after removing `AkkaRpcEnv`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10459 from zsxwing/remove-akka-rpc-env.
Restore the original value of os.arch property after each test
Since some of tests forced to set the specific value to os.arch property, we need to set the original value.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes#10289 from kiszk/SPARK-12311.
Spark on Yarn handle AM being told command from RM
When RM throws ApplicationAttemptNotFoundException for allocate
invocation, making the ApplicationMaster to finish immediately without any
retries.
Author: Devaraj K <devaraj@apache.org>
Closes#10129 from devaraj-kavali/SPARK-4117.
This lines up the HBase token logic with that done for Hive in SPARK-11265: reflection with only CFNE being swallowed.
There is a test, one which doesn't try to put HBase on the yarn/test class and really do the reflection (the way the hive introspection does). If people do want that then it could be added with careful POM work
+also: cut an incorrect comment from the Hive test case before copying it, and a couple of imports that may have been related to the hive test in the past.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#10227 from steveloughran/stevel/patches/SPARK-12241-obtainTokenForHBase.
Because of AM failure, the target executor number between driver and AM will be different, which will lead to unexpected behavior in dynamic allocation. So when AM is re-registered with driver, state in `ExecutorAllocationManager` and `CoarseGrainedSchedulerBacked` should be reset.
This issue is originally addressed in #8737 , here re-opened again. Thanks a lot KaiXinXiaoLei for finding this issue.
andrewor14 and vanzin would you please help to review this, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#9963 from jerryshao/SPARK-10582.
Using Dynamic Allocation function, when a new AM is starting, and ExecutorAllocationManager send RequestExecutor message to AM. If the container allocator is not ready, the whole app will hang on
Author: meiyoula <1039320815@qq.com>
Closes#10138 from XuTingjun/patch-1.
`SynchronousQueue` cannot cache any task. This issue is similar to #9978. It's an easy fix. Just use the fixed `ThreadUtils.newDaemonCachedThreadPool`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes#10108 from zsxwing/fix-threadpool.
This is purely the yarn/src/main and yarn/src/test bits of the YARN ATS integration: the extension model to load and run implementations of `SchedulerExtensionService` in the yarn cluster scheduler process —and to stop them afterwards.
There's duplication between the two schedulers, yarn-client and yarn-cluster, at least in terms of setting everything up, because the common superclass, `YarnSchedulerBackend` is in spark-core, and the extension services need the YARN app/attempt IDs.
If you look at how the the extension services are loaded, the case class `SchedulerExtensionServiceBinding` is used to pass in config info -currently just the spark context and the yarn IDs, of which one, the attemptID, will be null when running client-side. I'm passing in a case class to ensure that it would be possible in future to add extra arguments to the binding class, yet, as the method signature will not have changed, still be able to load existing services.
There's no functional extension service here, just one for testing. The real tests come in the bigger pull requests. At the same time, there's no restriction of this extension service purely to the ATS history publisher. Anything else that wants to listen to the spark context and publish events could use this, and I'd also consider writing one for the YARN-913 registry service, so that the URLs of the web UI would be locatable through that (low priority; would make more sense if integrated with a REST client).
There's no minicluster test. Given the test execution overhead of setting up minicluster tests, it'd probably be better to add an extension service into one of the existing tests.
Author: Steve Loughran <stevel@hortonworks.com>
Closes#9182 from steveloughran/stevel/feature/SPARK-1537-service.
Add label expression support for AM to restrict it runs on the specific set of nodes. I tested it locally and works fine.
sryza and vanzin please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#9800 from jerryshao/SPARK-7173.
Don't log ERROR messages when executors are explicitly killed or when
the exit reason is not yet known.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9780 from vanzin/SPARK-11789.
When we exceed the max memory tell users to increase both params instead of just the one.
Author: Holden Karau <holden@us.ibm.com>
Closes#9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.
Currently if dynamic allocation is enabled, explicitly killing executor will not get response, so the executor metadata is wrong in driver side. Which will make dynamic allocation on Yarn fail to work.
The problem is `disableExecutor` returns false for pending killing executors when `onDisconnect` is detected, so no further implementation is done.
One solution is to bypass these explicitly killed executors to use `super.onDisconnect` to remove executor. This is simple.
Another solution is still querying the loss reason for these explicitly kill executors. Since executor may get killed and informed in the same AM-RM communication, so current way of adding pending loss reason request is not worked (container complete is already processed), here we should store this loss reason for later query.
Here this PR chooses solution 2.
Please help to review. vanzin I think this part is changed by you previously, would you please help to review? Thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#9684 from jerryshao/SPARK-11718.
See http://search-hadoop.com/m/q3RTtjpe8r1iRbTj2 for discussion.
Summary: addition of VisibleForTesting annotation resulted in spark-shell malfunctioning.
Author: tedyu <yuzhihong@gmail.com>
Closes#9585 from tedyu/master.
I tested the various options with both spark-submit and spark-class of specifying number of executors in both client and cluster mode where it applied.
--num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, default nothing supplied
Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Closes#9523 from tgravescs/SPARK-11555.
In YARN mode, when preemption is enabled, we may leave executors in a
zombie state while we wait to retrieve the reason for which the executor
exited. This is so that we don't account for failed tasks that were
running on a preempted executor.
The issue is that while we wait for this information, the scheduler
might decide to schedule tasks on the executor, which will never be
able to run them. Other side effects include the block manager still
considering the executor available to cache blocks, for example.
So, when we know that an executor went down but we don't know why,
stop everything related to the executor, except its running tasks.
Only when we know the reason for the exit (or give up waiting for
it) we do update the running tasks.
This is achieved by a new `disableExecutor()` method in the
`Schedulable` interface. For managers that do not behave like this
(i.e. every one but YARN), the existing `executorLost()` method
will behave the same way it did before.
On top of that change, a few minor changes that made debugging easier,
and fixed some other minor issues:
- The cluster-mode AM was printing a misleading log message every
time an executor disconnected from the driver (because the akka
actor system was shared between driver and AM).
- Avoid sending unnecessary requests for an executor's exit reason
when we already know it was explicitly disabled / killed. This
avoids both multiple requests, and unnecessary requests that would
just cause warning messages on the AM (in the explicit kill case).
- Tone down a log message about the executor being lost when it
exited normally (e.g. preemption)
- Wake up the AM monitor thread when requests for executor loss
reasons arrive too, so that we can more quickly remove executors
from this zombie state.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8887 from vanzin/SPARK-10622.
"Client mode" means the RPC env will not listen for incoming connections.
This allows certain processes in the Spark stack (such as Executors or
tha YARN client-mode AM) to act as pure clients when using the netty-based
RPC backend, reducing the number of sockets needed by the app and also the
number of open ports.
Client connections are also preferred when endpoints that actually have
a listening socket are involved; so, for example, if a Worker connects
to a Master and the Master needs to send a message to a Worker endpoint,
that client connection will be used, even though the Worker is also
listening for incoming connections.
With this change, the workaround for SPARK-10987 isn't necessary anymore, and
is removed. The AM connects to the driver in "client mode", and that connection
is used for all driver <-> AM communication, and so the AM is properly notified
when the connection goes down.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9210 from vanzin/SPARK-10997.
This is a follow-up PR to further improve the locality calculation by considering the pending container's request. Since the locality preferences of tasks may be shifted from time to time, current localities of pending container requests may not fully match the new preferences, this PR improve it by removing outdated, unmatched container requests and replace with new requests.
sryza please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes#8100 from jerryshao/SPARK-9817.
Use standard JDK APIs for that (with a little help from Guava). Most of the
changes here are in test code, since there were no tests specific to that
part of the code.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9257 from vanzin/SPARK-11073.
This is a fix for SPARK-11265; the introspection code to get Hive delegation tokens failing on Spark 1.5.1+, due to changes in the Hive codebase
Author: Steve Loughran <stevel@hortonworks.com>
Closes#9232 from steveloughran/stevel/patches/SPARK-11265-hive-tokens.
Commit af3bc59d1f introduced new
functionality so that if an executor dies for a reason that's not
caused by one of the tasks running on the executor (e.g., due to
pre-emption), Spark doesn't count the failure towards the maximum
number of failures for the task. That commit introduced some vague
naming that this commit attempts to fix; in particular:
(1) The variable "isNormalExit", which was used to refer to cases where
the executor died for a reason unrelated to the tasks running on the
machine, has been renamed (and reversed) to "exitCausedByApp". The problem
with the existing name is that it's not clear (at least to me!) what it
means for an exit to be "normal"; the new name is intended to make the
purpose of this variable more clear.
(2) The variable "shouldEventuallyFailJob" has been renamed to
"countTowardsTaskFailures". This variable is used to determine whether
a task's failure should be counted towards the maximum number of failures
allowed for a task before the associated Stage is aborted. The problem
with the existing name is that it can be confused with implying that
the task's failure should immediately cause the stage to fail because it
is somehow fatal (this is the case for a fetch failure, for example: if
a task fails because of a fetch failure, there's no point in retrying,
and the whole stage should be failed).
Author: Kay Ousterhout <kayousterhout@gmail.com>
Closes#9164 from kayousterhout/SPARK-11178.
Currently log4j.properties file is not uploaded to executor's which is leading them to use the default values. This fix will make sure that file is always uploaded to distributed cache so that executor will use the latest settings.
If user specifies log configurations through --files then executors will be picking configs from --files instead of $SPARK_CONF_DIR/log4j.properties
Author: vundela <vsr@cloudera.com>
Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
Closes#9118 from vundela/master.
I also added some information to container-failure error msgs about what host they failed on, which would have helped me identify the problem that lead me to this JIRA and PR sooner.
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes#9147 from ryan-williams/dyn-exec-failures.
when spark.yarn.user.classpath.first=true and using 'spark-submit --jars hdfs://user/foo.jar', it can not put foo.jar to system classpath. so we need to put yarn's linkNames of jars to the system classpath. vanzin tgravescs
Author: Lianhui Wang <lianhuiwang09@gmail.com>
Closes#9045 from lianhuiwang/spark-11026.
Add application attempt window for Spark on Yarn to ignore old out of window failures, this is useful for long running applications to recover from failures.
Author: jerryshao <sshao@hortonworks.com>
Closes#8857 from jerryshao/SPARK-10739 and squashes the following commits:
36eabdc [jerryshao] change the doc
7f9b77d [jerryshao] Style change
1c9afd0 [jerryshao] Address the comments
caca695 [jerryshao] Add application attempt window for Spark on Yarn
The issue is that local paths on Windows, when provided with drive
letters or backslashes, are not valid URIs.
Instead of trying to figure out whether paths are URIs or not, use
Utils.resolveURI() which does that for us.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9049 from vanzin/SPARK-11023 and squashes the following commits:
77021f2 [Marcelo Vanzin] [SPARK-11023] [yarn] Avoid creating URIs from local paths directly.
This change adds an API that encapsulates information about an app
launched using the library. It also creates a socket-based communication
layer for apps that are launched as child processes; the launching
application listens for connections from launched apps, and once
communication is established, the channel can be used to send updates
to the launching app, or to send commands to the child app.
The change also includes hooks for local, standalone/client and yarn
masters.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#7052 from vanzin/SPARK-8673.
In YARN client mode, when the AM connects to the driver, it may be the case
that the driver never needs to send a message back to the AM (i.e., no
dynamic allocation or preemption). This triggers an issue in the netty rpc
backend where no disconnection event is sent to endpoints, and the AM never
exits after the driver shuts down.
The real fix is too complicated, so this is a quick hack to unblock YARN
client mode until we can work on the real fix. It forces the driver to
send a message to the AM when the AM registers, thus establishing that
connection and enabling the disconnection event when the driver goes
away.
Also, a minor side issue: when the executor is shutting down, it needs
to send an "ack" back to the driver when using the netty rpc backend; but
that "ack" wasn't being sent because the handler was shutting down the rpc
env before returning. So added a change to delay the shutdown a little bit,
allowing the ack to be sent back.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9021 from vanzin/SPARK-10987.
The `self` method returns null when called from the constructor;
instead, registration should happen in the `onStart` method, at
which point the `self` reference has already been initialized.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#9005 from vanzin/SPARK-10964.
A recent change to fix the referenced bug caused this exception in
the `SparkContext.stop()` path:
org.apache.spark.SparkException: YarnSparkHadoopUtil is not available in non-YARN mode!
at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:167)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:182)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:440)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1579)
at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1730)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1729)
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8996 from vanzin/SPARK-10812.
This should go into 1.5.2 also.
The issue is we were no longer adding the __app__.jar to the system classpath.
Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
Author: Tom Graves <tgraves@yahoo-inc.com>
Closes#8959 from tgravescs/SPARK-10901.
This makes YARN containers behave like all other processes launched by
Spark, which launch with a default perm gen size of 256m unless
overridden by the user (or not needed by the vm).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8970 from vanzin/SPARK-10916.
This PR just reverted 02144d6745 to remerge #6457 and also included the commits in #8905.
Author: zsxwing <zsxwing@gmail.com>
Closes#8944 from zsxwing/SPARK-6028.
This bug is introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092), `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead will meet the problem as mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790).
Also consolidate and simplify some similar code snippets to keep the consistent semantics.
Author: jerryshao <sshao@hortonworks.com>
Closes#8910 from jerryshao/SPARK-10790.
While this is likely not a huge issue for real production systems, for test systems which may setup a Spark Context and tear it down and stand up a Spark Context with a different master (e.g. some local mode & some yarn mode) tests this cane be an issue. Discovered during work on spark-testing-base on Spark 1.4.1, but seems like the logic that triggers it is present in master (see SparkHadoopUtil object). A valid work around for users encountering this issue is to fork a different JVM, however this can be heavy weight.
```
[info] SampleMiniClusterTest:
[info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED ***
[info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
[info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163)
[info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257)
[info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
[info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
[info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
[info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
[info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
[info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186)
[info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26)
[info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103)
```
Author: Holden Karau <holden@pigscanfly.ca>
Closes#8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
This change does two things:
- tag a few tests and adds the mechanism in the build to be able to disable those tags,
both in maven and sbt, for both junit and scalatest suites.
- add some logic to run-tests.py to disable some tags depending on what files have
changed; that's used to disable expensive tests when a module hasn't explicitly
been changed, to speed up testing for changes that don't directly affect those
modules.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes#8437 from vanzin/test-tags.
`ApplicationMaster` no longer has the `--num-executors` flag, and had an undocumented `--properties-file` configuration option.
cc srowen
Author: Erick Tryzelaar <erick.tryzelaar@gmail.com>
Closes#8754 from erickt/master.