Commit graph

535 commits

Author SHA1 Message Date
Kishor Patil ff6e4cbdc8 [SPARK-17511] Yarn Dynamic Allocation: Avoid marking released container as Failed
## What changes were proposed in this pull request?

Due to race conditions, the ` assert(numExecutorsRunning <= targetNumExecutors)` can fail causing `AssertionError`. So removed the assertion, instead moved the conditional check before launching new container:
```
java.lang.AssertionError: assertion failed
        at scala.Predef$.assert(Predef.scala:156)
        at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489)
        at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
```
## How was this patch tested?
This was manually tested using a large ForkAndJoin job with Dynamic Allocation enabled to validate the failing job succeeds, without any such exception.

Author: Kishor Patil <kpatil@yahoo-inc.com>

Closes #15069 from kishorvpatil/SPARK-17511.
2016-09-14 14:19:35 -05:00
Thomas Graves a3981c28c9 [SPARK-17433] YarnShuffleService doesn't handle moving credentials levelDb
The secrets leveldb isn't being moved if you run spark shuffle services without yarn nm recovery on and then turn it on.  This fixes that.  I unfortunately missed this when I ported the patch from our internal branch 2 to master branch due to the changes for the recovery path.  Note this only applies to master since it is the only place the yarn nm recovery dir is used.

Unit tests ran and tested on 8 node cluster.  Fresh startup with NM recovery, fresh startup no nm recovery, switching between no nm recovery and recovery.  Also tested running applications to make sure wasn't affected by rolling upgrade.

Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
Author: Tom Graves <tgraves@apache.org>

Closes #14999 from tgravescs/SPARK-17433.
2016-09-09 13:43:32 -05:00
Liwei Lin 3ce3a282c8 [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
## What changes were proposed in this pull request?

We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14914 from lw-lin/append_to_plus_eq_v2.
2016-09-07 10:04:00 +01:00
Marcelo Vanzin 0bd00ff245 [SPARK-15891][YARN] Clean up some logging in the YARN AM.
To make the log file more readable, rework some of the logging done
by the AM:

- log executor command / env just once, since they're all almost the same;
  the information that changes, such as executor ID, is already available
  in other log messages.
- avoid printing logs when nothing happens, especially when updating the
  container requests in the allocator.
- print fewer log messages when requesting many unlocalized executors,
  instead of repeating the same message multiple times.
- removed some logs that seemed unnecessary.

In the process, I slightly fixed up the wording in a few log messages, and
did some minor clean up of method arguments that were redundant.

Tested by running existing unit tests, and analyzing the logs of an
application that exercises dynamic allocation by forcing executors
to be allocated and be killed in waves.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14943 from vanzin/SPARK-15891.
2016-09-06 15:54:54 -07:00
Thomas Graves e79962f2f3 [SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling upgrade
The Spark Yarn Shuffle Service doesn't re-initialize the application credentials early enough which causes any other spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled".  Right now the spark shuffle service relies on the Yarn nodemanager to re-register the applications, unfortunately this is after we open the port for other executors to connect. If other executors connected before the re-register they get a null pointer exception which isn't a re-tryable exception and cause them to fail pretty quickly. To solve this I added another leveldb file so that it can save and re-initialize all the applications before opening the port for other executors to connect to it.  Adding another leveldb was simpler from the code structure point of view.

Most of the code changes are moving things to common util class.

Patch was tested manually on a Yarn cluster with rolling upgrade was happing while spark job was running. Without the patch I consistently get the NullPointerException, with the patch the job gets a few Connection refused exceptions but the retries kick in and the it succeeds.

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>

Closes #14718 from tgravescs/SPARK-16711.
2016-09-02 10:42:13 -07:00
Angus Gerry a0aac4b775 [SPARK-16533][CORE] resolve deadlocking in driver when executors die
## What changes were proposed in this pull request?
This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

## How was this patch tested?
Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <angolon@gmail.com>

Closes #14710 from angolon/SPARK-16533.
2016-09-01 10:35:31 -07:00
Ferdinand Xu 4b4e329e49 [SPARK-5682][CORE] Add encrypted shuffle in spark
This patch is using Apache Commons Crypto library to enable shuffle encryption support.

Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>

Closes #8880 from winningsix/SPARK-10771.
2016-08-30 09:15:31 -07:00
Sean Owen 0b3a4be92c [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment
## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14748 from srowen/SPARK-16781.
2016-08-24 20:04:09 +01:00
Steve Loughran cc97ea188e [SPARK-16736][CORE][SQL] purge superfluous fs calls
A review of the code, working back from Hadoop's `FileSystem.exists()` and `FileSystem.isDirectory()` code, then removing uses of the calls when superfluous.

1. delete is harmless if called on a nonexistent path, so don't do any checks before deletes
1. any `FileSystem.exists()`  check before `getFileStatus()` or `open()` is superfluous as the operation itself does the check. Instead the `FileNotFoundException` is caught and triggers the downgraded path. When a `FileNotFoundException` was thrown before, the code still creates a new FNFE with the error messages. Though now the inner exceptions are nested, for easier diagnostics.

Initially, relying on Jenkins test runs.

One troublespot here is that some of the codepaths are clearly error situations; it's not clear that they have coverage anyway. Trying to create the failure conditions in tests would be ideal, but it will also be hard.

Author: Steve Loughran <stevel@apache.org>

Closes #14371 from steveloughran/cloud/SPARK-16736-superfluous-fs-calls.
2016-08-17 11:43:01 -07:00
Marcelo Vanzin e3fec51fa1 [SPARK-16930][YARN] Fix a couple of races in cluster app initialization.
There are two narrow races that could cause the ApplicationMaster to miss
when the user application instantiates the SparkContext, which could cause
app failures when nothing was wrong with the app. It was also possible for
a failing application to get stuck in the loop that waits for the context
for a long time, instead of failing quickly.

The change uses a promise to track the SparkContext instance, which gets
rid of the races and allows for some simplification of the code.

Tested with existing unit tests, and a new one being added to test the
timeout code.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14542 from vanzin/SPARK-16930.
2016-08-17 11:12:21 -07:00
WangTaoTheTonic ea0bf91b4a [SPARK-17022][YARN] Handle potential deadlock in driver handling messages
## What changes were proposed in this pull request?

We directly send RequestExecutors to AM instead of transfer it to yarnShedulerBackend first, to avoid potential deadlock.

## How was this patch tested?

manual tests

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #14605 from WangTaoTheTonic/lock.
2016-08-11 15:09:23 -07:00
jerryshao ab648c0004 [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN
## What changes were proposed in this pull request?

Add a configurable token manager for Spark on running on yarn.

### Current Problems ###

1. Supported token provider is hard-coded, currently only hdfs, hbase and hive are supported and it is impossible for user to add new token provider without code changes.
2. Also this problem exits in timely token renewer and updater.

### Changes In This Proposal ###

In this proposal, to address the problems mentioned above and make the current code more cleaner and easier to understand, mainly has 3 changes:

1. Abstract a `ServiceTokenProvider` as well as `ServiceTokenRenewable` interface for token provider. Each service wants to communicate with Spark through token way needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the register token providers, also token renewer and updater. Also this class offers the API for other modules to obtain tokens, get renewal interval and so on.
3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load in these built-in token providers is controlled by configuration "spark.yarn.security.tokens.${service}.enabled", by default for all the built-in token providers are loaded.

### Behavior Changes ###

For the end user there's no behavior change, we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).

For user implemented token provider (assume the name of token provider is "test") needs to add into this class should have two configurations:

1. `spark.yarn.security.tokens.test.enabled` to true
2. `spark.yarn.security.tokens.test.class` to the full qualified class name.

So we still keep the same semantics as current code while add one new configuration.

### Current Status ###

- [x] token provider interface and management framework.
- [x] implement built-in token providers (hdfs, hbase, hive).
- [x] Coverage of unit test.
- [x] Integrated test with security cluster.

## How was this patch tested?

Unit test and integrated test.

Please suggest and review, any comment is greatly appreciated.

Author: jerryshao <sshao@hortonworks.com>

Closes #14065 from jerryshao/SPARK-16342.
2016-08-10 15:39:30 -07:00
Holden Karau 9216901d52 [SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add much and remove whitelisting
## What changes were proposed in this pull request?

Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability.

## How was this patch tested?

Existing tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #14407 from holdenk/SPARK-16779.
2016-08-08 15:54:03 -07:00
KevinGrealish b14d7b5cf4 [SPARK-16110][YARN][PYSPARK] Fix allowing python version to be specified per submit for cluster mode.
## What changes were proposed in this pull request?

This fix allows submit of pyspark jobs to specify python 2 or 3.

Change ordering in setup for application master environment so env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be overridden by spark.yarn.appMasterEnv.* conf settings. This applies to YARN in cluster mode. This allows them to be set per submission without needing the unset the env vars (which is not always possible - e.g. batch submit with LIVY only exposes the arguments to spark-submit)

## How was this patch tested?
Manual and existing unit tests.

Author: KevinGrealish <KevinGre@microsoft.com>

Closes #13824 from KevinGrealish/SPARK-16110.
2016-07-27 13:50:31 -07:00
Xin Ren 21a6dd2aef [SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant definition and inherited from the parent
https://issues.apache.org/jira/browse/SPARK-16535

## What changes were proposed in this pull request?

When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot
```
Definition of groupId is redundant, because it's inherited from the parent
```
![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)

I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok.
```
<groupId>org.apache.spark</groupId>
```
As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1).

ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762

## How was this patch tested?

I've tested by re-building the project, and build succeeded.

Author: Xin Ren <iamshrek@126.com>

Closes #14189 from keypointt/SPARK-16535.
2016-07-19 11:59:46 +01:00
Marcelo Vanzin b7b5e17876 [SPARK-16505][YARN] Optionally propagate error during shuffle service startup.
This prevents the NM from starting when something is wrong, which would
lead to later errors which are confusing and harder to debug.

Added a unit test to verify startup fails if something is wrong.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14162 from vanzin/SPARK-16505.
2016-07-14 09:42:32 -05:00
sharkd 3d6f679cfe [MINOR][YARN] Fix code error in yarn-cluster unit test
## What changes were proposed in this pull request?

Fix code error in yarn-cluster unit test.

## How was this patch tested?

Use exist tests

Author: sharkd <sharkd.tu@gmail.com>

Closes #14166 from sharkdtu/master.
2016-07-13 11:36:02 +01:00
sharkd d513c99c19 [SPARK-16414][YARN] Fix bugs for "Can not get user config when calling SparkHadoopUtil.get.conf on yarn cluser mode"
## What changes were proposed in this pull request?

The `SparkHadoopUtil` singleton was instantiated before `ApplicationMaster` in `ApplicationMaster.main` when deploying spark on yarn cluster mode, the `conf` in the `SparkHadoopUtil` singleton didn't include user's configuration.

So, we should load the properties file with the Spark configuration and set entries as system properties before `SparkHadoopUtil` first instantiate.

## How was this patch tested?

Add a test case

Author: sharkd <sharkd.tu@gmail.com>
Author: sharkdtu <sharkdtu@tencent.com>

Closes #14088 from sharkdtu/master.
2016-07-12 10:10:35 -07:00
Reynold Xin ffcb6e055a [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT
## What changes were proposed in this pull request?
After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #14130 from rxin/SPARK-16477.
2016-07-11 09:42:56 -07:00
peng.zhang bad0f7dbba [SPARK-16095][YARN] Yarn cluster mode should report correct state to SparkLauncher
## What changes were proposed in this pull request?
Yarn cluster mode should return correct state for SparkLauncher

## How was this patch tested?
unit test

Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #13962 from renozhang/SPARK-16095-spark-launcher-wrong-state.
2016-07-01 15:51:21 -07:00
jerryshao 272a2f78f3 [SPARK-15990][YARN] Add rolling log aggregation support for Spark on yarn
## What changes were proposed in this pull request?

Yarn supports rolling log aggregation since 2.6, previously log will only be aggregated to HDFS after application is finished, it is quite painful for long running applications like Spark Streaming, thriftserver. Also out of disk problem will be occurred when log file is too large. So here propose to add support of rolling log aggregation for Spark on yarn.

One limitation for this is that log4j should be set to change to file appender, now in Spark itself uses console appender by default, in which file will not be created again once removed after aggregation. But I think lots of production users should have changed their log4j configuration instead of default on, so this is not a big problem.

## How was this patch tested?

Manually verified with Hadoop 2.7.1.

Author: jerryshao <sshao@hortonworks.com>

Closes #13712 from jerryshao/SPARK-15990.
2016-06-29 08:17:27 -05:00
peng.zhang f4fd7432fb [SPARK-16125][YARN] Fix not test yarn cluster mode correctly in YarnClusterSuite
## What changes were proposed in this pull request?

Since SPARK-13220(Deprecate "yarn-client" and "yarn-cluster"), YarnClusterSuite doesn't test "yarn cluster" mode correctly.
This pull request fixes it.

## How was this patch tested?
Unit test

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: peng.zhang <peng.zhang@xiaomi.com>

Closes #13836 from renozhang/SPARK-16125-test-yarn-cluster-mode.
2016-06-24 08:28:32 +01:00
Ryan Blue 738f134bf4 [SPARK-13723][YARN] Change behavior of --num-executors with dynamic allocation.
## What changes were proposed in this pull request?

This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, it uses the value for the initial number of executors.

This changes was discussed on [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend using it while we can change the behavior for 2.0.0. In practice, the 1.x behavior causes unexpected behavior for users (it is not clear that it disables dynamic allocation) and wastes cluster resources because users rarely notice the log message.

## How was this patch tested?

This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.

Author: Ryan Blue <blue@apache.org>

Closes #13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
2016-06-23 14:03:46 -05:00
Ryan Blue a410814c87 [SPARK-15725][YARN] Ensure ApplicationMaster sleeps for the min interval.
## What changes were proposed in this pull request?

Update `ApplicationMaster` to sleep for at least the minimum allocation interval before calling `allocateResources`. This prevents overloading the `YarnAllocator` that is happening because the thread is triggered when an executor is killed and its connections die. In YARN, this prevents the app from overloading the allocator and becoming unstable.

## How was this patch tested?

Tested that this allows the an app to recover instead of hanging. It is still possible for the YarnAllocator to be overwhelmed by requests, but this prevents the issue for the most common cause.

Author: Ryan Blue <blue@apache.org>

Closes #13482 from rdblue/SPARK-15725-am-sleep-work-around.
2016-06-23 13:54:37 -05:00
Peter Ableda 5bf2889bfc [SPARK-16138] Try to cancel executor requests only if we have at least 1
## What changes were proposed in this pull request?
Adding additional check to if statement

## How was this patch tested?
I built and deployed to internal cluster to observe behaviour. After the change the invalid logging is gone:

```
16/06/22 08:46:36 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s).
16/06/22 08:46:36 INFO yarn.YarnAllocator: Canceling requests for 1 executor container(s) to have a new desired total 1 executors.
16/06/22 08:46:36 INFO yarn.YarnAllocator: Driver requested a total number of 0 executor(s).
16/06/22 08:47:36 INFO yarn.ApplicationMaster$AMEndpoint: Driver requested to kill executor(s) 1.
```

Author: Peter Ableda <abledapeter@gmail.com>

Closes #13850 from peterableda/patch-2.
2016-06-23 09:00:31 -05:00
Marcelo Vanzin bcb0258ae6 [SPARK-16080][YARN] Set correct link name for conf archive in executors.
This makes sure the files are in the executor's classpath as they're
expected to be. Also update the unit test to make sure the files are
there as expected.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13792 from vanzin/SPARK-16080.
2016-06-21 12:48:06 -05:00
Marcelo Vanzin 40eeef9525 [SPARK-15046][YARN] Parse value of token renewal interval correctly.
Use the config variable definition both to set and parse the value,
avoiding issues with code expecting the value in a different format.

Tested by running spark-submit with --principal / --keytab.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13669 from vanzin/SPARK-15046.
2016-06-15 09:09:21 -05:00
Sean Owen 6151d2641f [MINOR] Clean up several build warnings, mostly due to internal use of old accumulators
## What changes were proposed in this pull request?

Another PR to clean up recent build warnings. This particularly cleans up several instances of the old accumulator API usage in tests that are straightforward to update. I think this qualifies as "minor".

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13642 from srowen/BuildWarnings.
2016-06-14 09:40:07 -07:00
Peter Ableda d681742b2d [SPARK-15813] Improve Canceling log message to make it less ambiguous
## What changes were proposed in this pull request?
Add new desired executor number to make the log message less ambiguous.

## How was this patch tested?
This is a trivial change

Author: Peter Ableda <abledapeter@gmail.com>

Closes #13552 from peterableda/patch-1.
2016-06-13 09:40:17 +01:00
jerryshao aa03645107 [SPARK-12447][YARN] Only update the states when executor is successfully launched
The details is described in https://issues.apache.org/jira/browse/SPARK-12447.

vanzin Please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #10412 from jerryshao/SPARK-12447.
2016-06-09 17:31:19 -07:00
Subroto Sanyal 61d729abda [SPARK-15754][YARN] Not letting the credentials containing hdfs delegation tokens to be added in current user credential.
## What changes were proposed in this pull request?
The credentials are not added to the credentials of UserGroupInformation.getCurrentUser(). Further if the client has possibility to login using keytab then the updateDelegationToken thread is not started on client.

## How was this patch tested?
ran dev/run-tests

Author: Subroto Sanyal <ssanyal@datameer.com>

Closes #13499 from subrotosanyal/SPARK-15754-save-ugi-from-changing.
2016-06-03 16:50:00 -07:00
Sean Owen ce1572d16f [MINOR] Resolve a number of miscellaneous build warnings
## What changes were proposed in this pull request?

This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #13377 from srowen/BuildWarnings.
2016-05-29 16:48:14 -05:00
Steve Loughran 01b350a4f7 [SPARK-13148][YARN] document zero-keytab Oozie application launch; add diagnostics
This patch provides detail on what to do for keytabless Oozie launches of spark apps, and adds some debug-level diagnostics of what credentials have been submitted

Author: Steve Loughran <stevel@hortonworks.com>
Author: Steve Loughran <stevel@apache.org>

Closes #11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
2016-05-26 13:55:22 -05:00
Marcelo Vanzin a313a5ae74 [SPARK-15405][YARN] Remove unnecessary upload of config archive.
We only need one copy of it. The client code that was uploading the
second copy just needs to be modified to update the metadata in the
cache, so that the AM knows where to find the configuration.

Tested by running app on YARN and verifying in the logs only one archive
is uploaded.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13232 from vanzin/SPARK-15405.
2016-05-24 10:26:55 -07:00
tedyu 06c9f52071 [SPARK-15273] YarnSparkHadoopUtil#getOutOfMemoryErrorArgument should respect OnOutOfMemoryError parameter given by user
## What changes were proposed in this pull request?

As Nirav reported in this thread:
http://search-hadoop.com/m/q3RTtdF3yNLMd7u

YarnSparkHadoopUtil#getOutOfMemoryErrorArgument previously specified 'kill %p' unconditionally.
We should respect the parameter given by user.

## How was this patch tested?

Existing tests

Author: tedyu <yuzhihong@gmail.com>

Closes #13057 from tedyu/master.
2016-05-20 18:13:18 -05:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
Holden Karau 382dbc12bb [SPARK-15061][PYSPARK] Upgrade to Py4J 0.10.1
## What changes were proposed in this pull request?

This upgrades to Py4J 0.10.1 which reduces syscal overhead in Java gateway ( see https://github.com/bartdag/py4j/issues/201 ). Related https://issues.apache.org/jira/browse/SPARK-6728 .

## How was this patch tested?

Existing doctests & unit tests pass

Author: Holden Karau <holden@us.ibm.com>

Closes #13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.
2016-05-13 08:59:18 +01:00
bomeng 81bf870848 [SPARK-14897][SQL] upgrade to jetty 9.2.16
## What changes were proposed in this pull request?

Since Jetty 8 is EOL (end of life) and has critical security issue [http://www.securityweek.com/critical-vulnerability-found-jetty-web-server], I think upgrading to 9 is necessary. I am using latest 9.2 since 9.3 requires Java 8+.

`javax.servlet` and `derby` were also upgraded since Jetty 9.2 needs corresponding version.

## How was this patch tested?

Manual test and current test cases should cover it.

Author: bomeng <bmeng@us.ibm.com>

Closes #12916 from bomeng/SPARK-14897.
2016-05-12 20:07:44 +01:00
Marcelo Vanzin 0b9cae4242 [SPARK-11249][LAUNCHER] Throw error if app resource is not provided.
Without this, the code would build an invalid spark-submit command line,
and a more cryptic error would be presented to the user. Also, expose
a constant that allows users to set a dummy resource in cases where
they don't need an actual resource file; for backwards compatibility,
that uses the same "spark-internal" resource that Spark itself uses.

Tested via unit tests, run-example, spark-shell, and running the
thrift server with mixed spark and hive command line arguments.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12909 from vanzin/SPARK-11249.
2016-05-10 10:35:54 -07:00
jerryshao aab99d31a9 [SPARK-14963][YARN] Using recoveryPath if NM recovery is enabled
## What changes were proposed in this pull request?

From Hadoop 2.5+, Yarn NM supports NM recovery which using recovery path for auxiliary services such as spark_shuffle, mapreduce_shuffle. So here change to use this path install of NM local dir if NM recovery is enabled.

## How was this patch tested?

Unit test + local test.

Author: jerryshao <sshao@hortonworks.com>

Closes #12994 from jerryshao/SPARK-14963.
2016-05-10 10:28:36 -05:00
Jacek Laskowski bbb7773437 [SPARK-15152][DOC][MINOR] Scaladoc and Code style Improvements
## What changes were proposed in this pull request?

Minor doc and code style fixes

## How was this patch tested?

local build

Author: Jacek Laskowski <jacek@japila.pl>

Closes #12928 from jaceklaskowski/SPARK-15152.
2016-05-05 16:34:27 -07:00
mcheah b7fdc23ccc [SPARK-12154] Upgrade to Jersey 2
## What changes were proposed in this pull request?

Replace com.sun.jersey with org.glassfish.jersey. Changes to the Spark Web UI code were required to compile. The changes were relatively standard Jersey migration things.

## How was this patch tested?

I did a manual test for the standalone web APIs. Although I didn't test the functionality of the security filter itself, the code that changed non-trivially is how we actually register the filter. I attached a debugger to the Spark master and verified that the SecurityFilter code is indeed invoked upon hitting /api/v1/applications.

Author: mcheah <mcheah@palantir.com>

Closes #12715 from mccheah/feature/upgrade-jersey.
2016-05-05 10:51:03 +01:00
Dhruve Ashar a45647746d [SPARK-4224][CORE][YARN] Support group acls
## What changes were proposed in this pull request?
Currently only a list of users can be specified for view and modify acls. This change enables a group of admins/devs/users to be provisioned for viewing and modifying Spark jobs.

**Changes Proposed in the fix**
Three new corresponding config entries have been added where the user can specify the groups to be given access.

```
spark.admin.acls.groups
spark.modify.acls.groups
spark.ui.view.acls.groups
```

New config entries were added because specifying the users and groups explicitly is a better and cleaner way compared to specifying them in the existing config entry using a delimiter.

A generic trait has been introduced to provide the user to group mapping which makes it pluggable to support a variety of mapping protocols - similar to the one used in hadoop. A default unix shell based implementation has been provided.
Custom user to group mapping protocol can be specified and configured by the entry ```spark.user.groups.mapping```

**How the patch was Tested**
We ran different spark jobs setting the config entries in combinations of admin, modify and ui acls. For modify acls we tried killing the job stages from the ui and using yarn commands. For view acls we tried accessing the UI tabs and the logs. Headless accounts were used to launch these jobs and different users tried to modify and view the jobs to ensure that the groups mapping applied correctly.

Additional Unit tests have been added without modifying the existing ones. These test for different ways of setting the acls through configuration and/or API and validate the expected behavior.

Author: Dhruve Ashar <dhruveashar@gmail.com>

Closes #12760 from dhruve/impr/SPARK-4224.
2016-05-04 08:45:43 -05:00
Yin Huai 9c7c42bc6a Revert "[SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local"
This reverts commit dae538a4d7.
2016-04-28 19:57:41 -07:00
jerryshao 2398e3d69c [SPARK-14836][YARN] Zip all the jars before uploading to distributed cache
## What changes were proposed in this pull request?

<copy form JIRA>

Currently if neither `spark.yarn.jars` nor `spark.yarn.archive` is set (by default), Spark on yarn code will upload all the jars in the folder separately into distributed cache, this is quite time consuming, and very verbose, instead of upload jars separately into distributed cache, here changes to zip all the jars first, and then put into distributed cache.

This will significantly improve the speed of starting time.

## How was this patch tested?

Unit test and local integrated test is done.

Verified with SparkPi both in spark cluster and client mode.

Author: jerryshao <sshao@hortonworks.com>

Closes #12597 from jerryshao/SPARK-14836.
2016-04-28 16:39:49 -07:00
Pravin Gadakh dae538a4d7 [SPARK-14613][ML] Add @Since into the matrix and vector classes in spark-mllib-local
## What changes were proposed in this pull request?

This PR adds `since` tag into the matrix and vector classes in spark-mllib-local.

## How was this patch tested?

Scala-style checks passed.

Author: Pravin Gadakh <prgadakh@in.ibm.com>

Closes #12416 from pravingadakh/SPARK-14613.
2016-04-28 15:59:18 -07:00
jerryshao 8b44bd52fa [SPARK-6735][YARN] Add window based executor failure tracking mechanism for long running service
This work is based on twinkle-sachdeva 's proposal. In parallel to such mechanism for AM failures, here add similar mechanism for executor failure tracking, this is useful for long running Spark service to mitigate the executor failure problems.

Please help to review, tgravescs sryza and vanzin

Author: jerryshao <sshao@hortonworks.com>

Closes #10241 from jerryshao/SPARK-6735.
2016-04-28 12:38:19 -05:00
Hemant Bhanawat e4d439c831 [SPARK-14729][SCHEDULER] Refactored YARN scheduler creation code to use newly added ExternalClusterManager
## What changes were proposed in this pull request?
With the addition of ExternalClusterManager(ECM) interface in PR #11723, any cluster manager can now be integrated with Spark. It was suggested in  ExternalClusterManager PR that one of the existing cluster managers should start using the new interface to ensure that the API is correct. Ideally, all the existing cluster managers should eventually use the ECM interface but as a first step yarn will now use the ECM interface. This PR refactors YARN code from SparkContext.createTaskScheduler function  into YarnClusterManager that implements ECM interface.

## How was this patch tested?
Since this is refactoring, no new tests has been added. Existing tests have been run. Basic manual testing with YARN was done too.

Author: Hemant Bhanawat <hemant@snappydata.io>

Closes #12641 from hbhanawat/yarnClusterMgr.
2016-04-27 10:59:23 -07:00
Azeem Jiva de6e633420 [SPARK-14756][CORE] Use parseLong instead of valueOf
## What changes were proposed in this pull request?

Use Long.parseLong which returns a primative.
Use a series of appends() reduces the creation of an extra StringBuilder type

## How was this patch tested?

Unit tests

Author: Azeem Jiva <azeemj@gmail.com>

Closes #12520 from javawithjiva/minor.
2016-04-26 11:49:04 +01:00
Lianhui Wang 6bfe42a3be [SPARK-14731][shuffle]Revert SPARK-12130 to make 2.0 shuffle service compatible with 1.x
## What changes were proposed in this pull request?
SPARK-12130 make 2.0 shuffle service incompatible with 1.x. So from discussion: [http://apache-spark-developers-list.1001551.n3.nabble.com/YARN-Shuffle-service-and-its-compatibility-td17222.html](url) we should maintain compatibility between Spark 1.x and Spark 2.x's shuffle service.
I put string comparison into executor's register at first avoid string comparison in getBlockData every time.

## How was this patch tested?
N/A

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #12568 from lianhuiwang/SPARK-14731.
2016-04-25 12:33:32 -07:00