Commit graph

5680 commits

Yanbo Liang d3b8869763 [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively
## What changes were proposed in this pull request?
In some cases users would like to add a directory as a dependency. In Scala they can use ```SparkContext.addFile``` with the argument ```recursive=true``` to recursively add all files under the directory, but Python users can only add a single file, not a directory. We should support this in PySpark as well.
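
For reference, here is a minimal sketch of the existing Scala usage that this change mirrors in Python (the directory path is hypothetical):

```scala
// Hypothetical dependency directory; recursive = true ships every file under it.
sc.addFile("/path/to/dep-dir", recursive = true)
```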

## How was this patch tested?
Unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15140 from yanboliang/spark-17585.
2016-09-21 01:37:03 -07:00
wm624@hotmail.com 61876a4279
[CORE][DOC] Fix errors in comments
## What changes were proposed in this pull request?
While reading the source code of CORE and SQL core, I found some minor errors in comments, such as extra spaces, missing blank lines, and grammar errors.

I fixed these minor errors and may find more as my source code study continues.

## How was this patch tested?
Manual build.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15151 from wangmiao1981/mem.
2016-09-21 09:33:29 +01:00
Weiqing Yang 1ea49916ac [MINOR][BUILD] Fix CheckStyle Error
## What changes were proposed in this pull request?
This PR fixes the code style errors before the 2.0.1 release.

## How was this patch tested?
Manual.

Before:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks failed at following occurrences:
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[153] (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[196] (sizes) LineLength: Line is longer than 100 characters (found 108).
[ERROR] src/main/java/org/apache/spark/network/client/TransportClient.java:[239] (sizes) LineLength: Line is longer than 100 characters (found 115).
[ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[119] (sizes) LineLength: Line is longer than 100 characters (found 107).
[ERROR] src/main/java/org/apache/spark/network/server/TransportRequestHandler.java:[129] (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] src/main/java/org/apache/spark/network/util/LevelDBProvider.java:[124,11] (modifier) ModifierOrder: 'static' modifier out of order with the JLS suggestions.
[ERROR] src/main/java/org/apache/spark/network/util/TransportConf.java:[26] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[33] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[38] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[43] (sizes) LineLength: Line is longer than 100 characters (found 106).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/PrefixComparators.java:[48] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeInMemorySorter.java:[0] (misc) NewlineAtEndOfFile: File does not end with a newline.
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeSorterSpillReader.java:[67] (sizes) LineLength: Line is longer than 100 characters (found 106).
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[200] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[309] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[332] (regexp) RegexpSingleline: No trailing whitespace allowed.
[ERROR] src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java:[348] (regexp) RegexpSingleline: No trailing whitespace allowed.
```
After:
```
./dev/lint-java
Using `mvn` from path: /usr/local/bin/mvn
Checkstyle checks passed.
```

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #15170 from Sherry302/fixjavastyle.
2016-09-20 21:48:25 -07:00
Shixiong Zhu 80d6655921 [SPARK-17438][WEBUI] Show Application.executorLimit in the application page
## What changes were proposed in this pull request?

This PR adds `Application.executorLimit` to the application page.

## How was this patch tested?

Checked the UI manually.

Screenshots:

1. Dynamic allocation is disabled

<img width="484" alt="screen shot 2016-09-07 at 4 21 49 pm" src="https://cloud.githubusercontent.com/assets/1000778/18332029/210056ea-7518-11e6-9f52-76d96046c1c0.png">

2. Dynamic allocation is enabled.

<img width="466" alt="screen shot 2016-09-07 at 4 25 30 pm" src="https://cloud.githubusercontent.com/assets/1000778/18332034/2c07700a-7518-11e6-8fce-aebe25014902.png">

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15001 from zsxwing/fix-core-info.
2016-09-19 14:00:42 -04:00
hyukjinkwon 7151011b38
[SPARK-17586][BUILD] Do not call static member via instance reference
## What changes were proposed in this pull request?

This PR fixes a warning message as below:

```
[WARNING] .../UnsafeInMemorySorter.java:284: warning: [static] static method should be qualified by type name, TaskMemoryManager, instead of by an expression
[WARNING]       currentPageNumber = memoryManager.decodePageNumber(recordPointer)
```

by referencing the static member via the class rather than via an instance reference.

## How was this patch tested?

Existing tests should cover this - Jenkins tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15141 from HyukjinKwon/SPARK-17586.
2016-09-18 19:18:49 +01:00
Josh Rosen 8faa5217b4 [SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes()
## What changes were proposed in this pull request?

`MemoryStore.putIteratorAsBytes()` may silently lose values when used with `KryoSerializer` because it does not properly close the serialization stream before attempting to deserialize the already-serialized values, which may cause values buffered in Kryo's internal buffers to not be read.

This is the root cause behind a user-reported "wrong answer" bug in PySpark caching reported by bennoleslie on the Spark user mailing list in a thread titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's automatic use of KryoSerializer for "safe" types (such as byte arrays, primitives, etc.) this misuse of serializers manifested itself as silent data corruption rather than a StreamCorrupted error (which you might get from JavaSerializer).

The minimal fix, implemented here, is to close the serialization stream before attempting to deserialize written values. In addition, this patch adds several additional assertions / precondition checks to prevent misuse of `PartiallySerializedBlock` and `ChunkedByteBufferOutputStream`.
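
The failure mode can be illustrated with plain JDK streams; a minimal sketch (not Spark's MemoryStore code) in which a buffered stream holds written bytes until it is closed:

```scala
import java.io._

val buf = new ByteArrayOutputStream()
val out = new ObjectOutputStream(new BufferedOutputStream(buf))
out.writeObject("record-1")
// Without close() (or at least flush()), "record-1" may still sit in the
// buffer, so a reader of buf would see a truncated stream.
out.close()

val in = new ObjectInputStream(new ByteArrayInputStream(buf.toByteArray))
assert(in.readObject() == "record-1")
```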

## How was this patch tested?

The original bug was masked by an invalid assert in the memory store test cases: the old assert compared two results record-by-record with `zip` but didn't first check that the lengths of the two collections were equal, causing missing records to go unnoticed. The updated test case reproduced this bug.

In addition, I added a new `PartiallySerializedBlockSuite` to unit test that component.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15043 from JoshRosen/partially-serialized-block-values-iterator-bugfix.
2016-09-17 11:46:15 -07:00
David Navas 9dbd4b864e
[SPARK-17529][CORE] Implement BitSet.clearUntil and use it during merge joins
## What changes were proposed in this pull request?

Add a clearUntil() method on BitSet (adapted from the pre-existing setUntil() method).
Use this method to clear the subset of the BitSet which needs to be used during merge joins.

## How was this patch tested?

dev/run-tests, as well as performance tests on skewed data as described in the JIRA.

I expect there to be a small local performance hit from using BitSet.clearUntil rather than BitSet.clear for normally shaped (unskewed) joins (an additional read of the last long). This is expected to be de minimis and was not specifically tested.

Author: David Navas <davidn@clearstorydata.com>

Closes #15084 from davidnavas/bitSet.
2016-09-17 16:22:23 +01:00
Xin Ren f15d41be3c
[SPARK-17567][DOCS] Use valid url to Spark RDD paper
https://issues.apache.org/jira/browse/SPARK-17567

## What changes were proposed in this pull request?

Documentation (http://spark.apache.org/docs/latest/api/scala/#org.apache.spark.rdd.RDD) contains broken link to Spark paper (http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf).

I found it elsewhere (https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf) and I hope it is the same one. It should be uploaded to, and linked from, some Apache-controlled storage so it won't break again.

## How was this patch tested?

Tested manually on local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #15121 from keypointt/SPARK-17567.
2016-09-17 12:30:25 +01:00
Josh Rosen 1202075c95 [SPARK-17484] Prevent invalid block locations from being reported after put() exceptions
## What changes were proposed in this pull request?

If a BlockManager `put()` call failed after the BlockManagerMaster was notified of a block's availability then incomplete cleanup logic in a `finally` block would never send a second block status message to inform the master of the block's unavailability. This, in turn, leads to fetch failures and used to be capable of causing complete job failures before #15037 was fixed.

This patch addresses this issue via multiple small changes:

- The `finally` block now calls `removeBlockInternal` when cleaning up from a failed `put()`; in addition to removing the `BlockInfo` entry (which was _all_ that the old cleanup logic did), this code (redundantly) tries to remove the block from the memory and disk stores (as an added layer of defense against bugs lower down in the stack) and optionally notifies the master of block removal (which now happens during exception-triggered cleanup).
- When a BlockManager receives a request for a block that it does not have it will now notify the master to update its block locations. This ensures that bad metadata pointing to non-existent blocks will eventually be fixed. Note that I could have implemented this logic in the block manager client (rather than in the remote server), but that would introduce the problem of distinguishing between transient and permanent failures; on the server, however, we know definitively that the block isn't present.
- Catch `NonFatal` instead of `Exception` to avoid swallowing `InterruptedException`s thrown from synchronous block replication calls.

This patch depends upon the refactorings in #15036, so that other patch will also have to be backported when backporting this fix.

For more background on this issue, including example logs from a real production failure, see [SPARK-17484](https://issues.apache.org/jira/browse/SPARK-17484).

## How was this patch tested?

Two new regression tests in BlockManagerSuite.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15085 from JoshRosen/SPARK-17484.
2016-09-15 11:54:17 -07:00
Josh Rosen 5b8f7377d5 [SPARK-17547] Ensure temp shuffle data file is cleaned up after error
SPARK-8029 (#9610) modified shuffle writers to first stage their data to a temporary file in the same directory as the final destination file and then to atomically rename this temporary file at the end of the write job. However, this change introduced the potential for the temporary output file to be leaked if an exception occurs during the write because the shuffle writers' existing error cleanup code doesn't handle deletion of the temp file.

This patch avoids this potential cause of disk-space leaks by adding `finally` blocks to ensure that temp files are always deleted if they haven't been renamed.
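
A minimal sketch of the cleanup pattern, where `outputDir`, `outputFile`, and `writePartitionedData` are hypothetical placeholders:

```scala
import java.io.{File, IOException}

val tmp = File.createTempFile("shuffle", ".data", outputDir)
try {
  writePartitionedData(tmp)
  if (!tmp.renameTo(outputFile)) {
    throw new IOException(s"Failed to rename $tmp to $outputFile")
  }
} finally {
  // A successful rename removes tmp; in every other case, delete it here.
  if (tmp.exists() && !tmp.delete()) {
    System.err.println(s"Error deleting ${tmp.getPath}")
  }
}
```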

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15104 from JoshRosen/cleanup-tmp-data-file-in-shuffle-writer.
2016-09-15 11:22:58 -07:00
Tejas Patil b479278142 [SPARK-17451][CORE] CoarseGrainedExecutorBackend should inform driver before self-kill
## What changes were proposed in this pull request?

Jira : https://issues.apache.org/jira/browse/SPARK-17451

In some failure cases, `CoarseGrainedExecutorBackend` exits the JVM. While this is not itself a problem, the driver UI captures no specific reason for the exit. In this PR, I am adding functionality to `exitExecutor` to notify the driver that the executor is exiting.

## How was this patch tested?

Ran the change in a test environment and took down the shuffle service before the executor could register with it. In the driver logs, where the job failure reason is mentioned (i.e. `Job aborted due to stage ...`), it gives the correct reason:

Before:
`ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.`

After:
`ExecutorLostFailure (executor ZZZZZZZZZ exited caused by one of the running tasks) Reason: Unable to create executor due to java.util.concurrent.TimeoutException: Timeout waiting for task.`

Author: Tejas Patil <tejasp@fb.com>

Closes #15013 from tejasapatil/SPARK-17451_inform_driver.
2016-09-15 10:23:41 -07:00
cenyuhai ad79fc0a84 [SPARK-17406][WEB UI] limit timeline executor events
## What changes were proposed in this pull request?
The job page is too slow to open when there are thousands of executor events (added or removed). I found that in the ExecutorsTab file, executorIdToData never removes elements, so it grows over time. Before this PR, it looks like [timeline1.png](https://issues.apache.org/jira/secure/attachment/12827112/timeline1.png); after this PR, it looks like [timeline2.png](https://issues.apache.org/jira/secure/attachment/12827113/timeline2.png) (we can set how many executor events will be displayed).

Author: cenyuhai <cenyuhai@didichuxing.com>

Closes #14969 from cenyuhai/SPARK-17406.
2016-09-15 09:58:53 +01:00
codlife 647ee05e58 [SPARK-17521] Error when I use sparkContext.makeRDD(Seq())
## What changes were proposed in this pull request?

When I use sc.makeRDD as below:
```
val data3 = sc.makeRDD(Seq())
println(data3.partitions.length)
```
I got an error:
```
Exception in thread "main" java.lang.IllegalArgumentException: Positive number of slices required
```

We can fix this bug by modifying the last line to check seq.size:
```
  def makeRDD[T: ClassTag](seq: Seq[(T, Seq[String])]): RDD[T] = withScope {
    assertNotStopped()
    val indexToPrefs = seq.zipWithIndex.map(t => (t._2, t._1._2)).toMap
    new ParallelCollectionRDD[T](this, seq.map(_._1), math.max(seq.size, defaultParallelism), indexToPrefs)
  }
```

## How was this patch tested?

Manual tests.

Author: codlife <1004910847@qq.com>
Author: codlife <wangjianfei15@otcaix.iscas.ac.cn>

Closes #15077 from codlife/master.
2016-09-15 09:38:13 +01:00
Xing SHI bb32294362 [SPARK-17465][SPARK CORE] Inappropriate memory management in org.apache.spark.storage.MemoryStore may lead to memory leak
The expression `if (memoryMap(taskAttemptId) == 0) memoryMap.remove(taskAttemptId)` in the methods `releaseUnrollMemoryForThisTask` and `releasePendingUnrollMemoryForThisTask` should be evaluated after the release-memory operation, whether `memoryToRelease` is greater than 0 or not.

If the memory of a task has already been set to 0 when `releaseUnrollMemoryForThisTask` or `releasePendingUnrollMemoryForThisTask` is called, the key corresponding to that task will never be removed from the memory map.
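
A simplified sketch of the intended ordering (the real maps live in MemoryStore; this is not the actual code):

```scala
import scala.collection.mutable

val memoryMap = mutable.Map[Long, Long]().withDefaultValue(0L)

def releaseUnrollMemoryForThisTask(taskAttemptId: Long, memoryToRelease: Long): Unit = {
  if (memoryToRelease > 0) {
    memoryMap(taskAttemptId) -= memoryToRelease
  }
  // This check must run after the release above, whether or not
  // memoryToRelease was positive; otherwise a zero entry is never removed.
  if (memoryMap(taskAttemptId) == 0) {
    memoryMap.remove(taskAttemptId)
  }
}
```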

See the details in [SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465).

Author: Xing SHI <shi-kou@indetail.co.jp>

Closes #15022 from saturday-shi/SPARK-17465.
2016-09-14 13:59:57 -07:00
Shixiong Zhu e33bfaed3b [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value can be read thread-safely
## What changes were proposed in this pull request?

Make the values of CollectionAccumulator and SetAccumulator readable in a thread-safe way, fixing the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463).
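
A minimal sketch of the thread-safety idea, using a simplified stand-in rather than Spark's actual accumulator class:

```scala
import java.util.{ArrayList, Collections, List => JList}

class CollectionAccumulatorSketch[T] {
  private val list: JList[T] = Collections.synchronizedList(new ArrayList[T]())
  def add(v: T): Unit = list.add(v)
  // Return a copied, unmodifiable snapshot so a reader can never hit a
  // ConcurrentModificationException while tasks keep adding elements.
  def value: JList[T] = list.synchronized {
    Collections.unmodifiableList(new ArrayList[T](list))
  }
}
```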

## How was this patch tested?

Existing tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #15063 from zsxwing/SPARK-17463.
2016-09-14 13:33:51 -07:00
Xin Wu 040e46979d [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in ORDER BY
## What changes were proposed in this pull request?
Currently, the ORDER BY clause returns null values according to the sorting order (ASC|DESC), treating a null value as always smaller than non-null values.
However, the SQL:2003 standard supports NULLS FIRST and NULLS LAST, allowing users to specify whether null values should be returned first or last, regardless of the sorting order (ASC|DESC).

This PR is to support this new feature.
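
A sketch of the new syntax from spark-shell, where `spark` is the SparkSession and the `people` table is hypothetical:

```scala
// NULLS FIRST / NULLS LAST override the default null ordering
// independently of ASC | DESC.
spark.sql("SELECT name, age FROM people ORDER BY age DESC NULLS LAST").show()
spark.sql("SELECT name, age FROM people ORDER BY age ASC NULLS FIRST").show()
```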

## How was this patch tested?
New test cases are added to test NULLS FIRST|LAST for regular select queries and windowing queries.

Author: Xin Wu <xinwu@us.ibm.com>

Closes #14842 from xwu0226/SPARK-10747.
2016-09-14 21:14:29 +02:00
wm624@hotmail.com 18b4f035f4 [CORE][DOC] remove redundant comment
## What changes were proposed in this pull request?
In the comment, there is a redundant `the estimated`.

This PR simply removes the redundant phrase and adjusts the formatting.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #15091 from wangmiao1981/comment.
2016-09-14 09:49:15 +01:00
Jagadeesan def7c265f5 [SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and…
## What changes were proposed in this pull request?

The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the documentation.
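
A sketch of the constraint the documentation should spell out, with illustrative values:

```scala
import org.apache.spark.SparkConf

// spark.executor.heartbeatInterval should be significantly less than
// spark.network.timeout, so executors are not marked lost between heartbeats.
val conf = new SparkConf()
  .set("spark.executor.heartbeatInterval", "10s")
  .set("spark.network.timeout", "120s")
```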

… network timeout]

Author: Jagadeesan <as2@us.ibm.com>

Closes #15042 from jagadeesanas2/SPARK-17449.
2016-09-14 09:03:16 +01:00
Josh Rosen f9c580f110 [SPARK-17485] Prevent failed remote reads of cached blocks from failing entire job
## What changes were proposed in this pull request?

In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring).

In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable.

Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`.
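
A sketch of the caller-side contract after this change (`deserialize` and `recompute` are hypothetical helpers):

```scala
blockManager.getRemoteBytes(blockId) match {
  case Some(bytes) => deserialize(bytes)   // use the cached copy
  case None        => recompute(partition) // fall back instead of failing the task
}
```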

## How was this patch tested?

Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method.

I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15037 from JoshRosen/SPARK-17485.
2016-09-12 15:43:57 -07:00
Josh Rosen 3d40896f41 [SPARK-17483] Refactoring in BlockManager status reporting and block removal
This patch makes three minor refactorings to the BlockManager:

- Move the `if (info.tellMaster)` check out of `reportBlockStatus`; this fixes an issue where a debug logging message would incorrectly claim to have reported a block status to the master even though no message had been sent (in case `info.tellMaster == false`). This also makes it easier to write code which unconditionally sends block statuses to the master (which is necessary in another patch of mine).
- Split  `removeBlock()` into two methods, the existing method and an internal `removeBlockInternal()` method which is designed to be called by internal code that already holds a write lock on the block. This is also needed by a followup patch.
- Instead of calling `getCurrentBlockStatus()` in `removeBlock()`, just pass `BlockStatus.empty`; the block status should always be empty following complete removal of a block.

These changes were originally authored as part of a bug fix patch which is targeted at branch-2.0 and master; I've split them out here into their own separate PR in order to make them easier to review and so that the behavior-changing parts of my other patch can be isolated to their own PR.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15036 from JoshRosen/cache-failure-race-conditions-refactorings-only.
2016-09-12 13:09:33 -07:00
Sean Zhong 1742c3ab86 [SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache the whole RDD in memory
## What changes were proposed in this pull request?

MemoryStore may throw OutOfMemoryError when trying to cache a super big RDD that cannot fit in memory.
```
scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count()

java.lang.OutOfMemoryError: Java heap space
	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24)
	at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232)
	at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683)
	at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43)
	at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
	at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
	at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
	at org.apache.spark.scheduler.Task.run(Task.scala:86)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

Spark's MemoryStore uses a SizeTrackingVector as a temporary unrolling buffer to store all input values it has read so far before transferring the values to the storage memory cache. The problem is that when the input RDD is too big to cache in memory, the temporary unrolling buffer (the SizeTrackingVector) is not garbage collected in time. As the SizeTrackingVector can occupy all available storage memory, it may cause the executor JVM to run out of memory quickly.

More info can be found at https://issues.apache.org/jira/browse/SPARK-17503

## How was this patch tested?

Unit test and manual test.

### Before change

Heap memory consumption
<img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png">

Heap dump
<img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png">

### After change

Heap memory consumption
<img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png">

Author: Sean Zhong <seanzhong@databricks.com>

Closes #15056 from clockfly/memory_store_leak.
2016-09-12 11:30:06 -07:00
WeichenXu 8087ecf8da [SPARK CORE][MINOR] fix "default partitioner cannot partition array keys" error message in PairRDDfunctions
## What changes were proposed in this pull request?

To avoid confusing users, the error message `Default partitioner cannot partition array keys.` in `PairRDDfunctions` is updated: the instance in `partitionBy` is replaced with `Specified partitioner cannot partition array keys.`, and the others are replaced with `Specified or default partitioner cannot partition array keys.`

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15045 from WeichenXu123/fix_partitionBy_error_message.
2016-09-12 12:23:16 +01:00
codlife 4efcdb7fea [SPARK-17447] Performance improvement in Partitioner.defaultPartitioner without sortBy
## What changes were proposed in this pull request?

If there are many RDDs, the sort can severely hurt performance in some situations. We don't actually need to sort the RDDs; we can scan them once to achieve the same goal.
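
A sketch of the single-pass scan, with hypothetical RDDs:

```scala
val rdds: Seq[org.apache.spark.rdd.RDD[_]] = Seq(rdd1, rdd2, rdd3)
val largest = rdds.maxBy(_.partitions.length) // one O(n) pass, no sort
```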

## How was this patch tested?

manual tests

Author: codlife <1004910847@qq.com>

Closes #15039 from codlife/master.
2016-09-12 12:10:46 +01:00
cenyuhai cc87280fcd [SPARK-17171][WEB UI] DAG will list all partitions in the graph
## What changes were proposed in this pull request?
The DAG view lists all partitions in the graph; this is too slow to render and makes the whole graph hard to read.
Usually we don't want to see all partitions; we just want to see the relations in the DAG graph.
So I show only 2 root nodes for RDDs.

Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png), [dag3.png](https://issues.apache.org/jira/secure/attachment/12825456/dag3.png), after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png),[dag4.png](https://issues.apache.org/jira/secure/attachment/12825457/dag4.png)

Author: cenyuhai <cenyuhai@didichuxing.com>
Author: 岑玉海 <261810726@qq.com>

Closes #14737 from cenyuhai/SPARK-17171.
2016-09-12 11:52:56 +01:00
Josh Rosen 72eec70bdb [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses field
The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never read, increasing the memory consumption of the web UI. We should remove this field.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15038 from JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.
2016-09-11 21:51:22 -07:00
Ryan Blue 6ea5055fa7 [SPARK-17396][CORE] Share the task support between UnionRDD instances.
## What changes were proposed in this pull request?

Share the ForkJoinTaskSupport between UnionRDD instances to avoid creating a huge number of threads if lots of RDDs are created at the same time.
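
A sketch of the sharing pattern, assuming Scala 2.11's forkjoin package (not necessarily the PR's exact code):

```scala
import scala.collection.parallel.ForkJoinTaskSupport
import scala.concurrent.forkjoin.ForkJoinPool

object SharedUnionTaskSupport {
  // One pool (and its threads) reused by every UnionRDD,
  // instead of a fresh pool per instance.
  private val pool = new ForkJoinPool(8) // hypothetical parallelism
  val instance = new ForkJoinTaskSupport(pool)
}
```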

## How was this patch tested?

This uses existing UnionRDD tests.

Author: Ryan Blue <blue@apache.org>

Closes #14985 from rdblue/SPARK-17396-use-shared-pool.
2016-09-10 10:18:53 +01:00
Joseph K. Bradley 65b814bf50 [SPARK-17456][CORE] Utility for parsing Spark versions
## What changes were proposed in this pull request?

This patch adds methods for extracting major and minor versions as Int types in Scala from a Spark version string.

Motivation: There are many hacks within Spark's codebase to identify and compare Spark versions. We should add a simple utility to standardize these code paths, especially since there have been mistakes made in the past. This will let us add unit tests as well.  Currently, I want this functionality to check Spark versions to provide backwards compatibility for ML model persistence.
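
A sketch of such a utility, assuming version strings like "2.0.1" or "2.1.0-SNAPSHOT" (not necessarily the PR's exact implementation):

```scala
def majorMinorVersion(sparkVersion: String): (Int, Int) = {
  val versionRegex = """^(\d+)\.(\d+)(\..*)?$""".r
  versionRegex.findFirstMatchIn(sparkVersion.stripSuffix("-SNAPSHOT")) match {
    case Some(m) => (m.group(1).toInt, m.group(2).toInt)
    case None =>
      throw new IllegalArgumentException(s"Cannot parse Spark version: $sparkVersion")
  }
}
```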

## How was this patch tested?

Unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #15017 from jkbradley/version-parsing.
2016-09-09 05:35:10 -07:00
Gurvinder Singh 92ce8d4849 [SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and Workers UI
## What changes were proposed in this pull request?

This pull request adds the functionality to access the worker and application UIs through the master UI itself. This helps in accessing the Spark UI when running a Spark cluster on closed networks, e.g. Kubernetes. The cluster admin needs to expose only the Spark master UI; the rest of the UIs can stay on the private network, and the master UI will reverse-proxy connection requests to the corresponding resource. It adds the following paths for the worker/application UIs:

WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/
ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/

This makes it easy for users to protect access to the Spark master cluster by putting a reverse proxy in front of it, e.g. https://github.com/bitly/oauth2_proxy
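
A sketch of enabling the feature (config keys as introduced by this PR, to the best of my understanding; verify against the docs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.reverseProxy", "true")                         // proxy worker/app UIs through the master
  .set("spark.ui.reverseProxyUrl", "https://spark.example.com") // hypothetical public URL of the proxy
```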

## How was this patch tested?

The functionality has been tested manually and there is a unit test too for testing access to worker UI with reverse proxy address.

pwendell bomeng BryanCutler can you please review it, thanks.

Author: Gurvinder Singh <gurvinder.singh@uninett.no>

Closes #13950 from gurvindersingh/rproxy.
2016-09-08 17:20:20 -07:00
Eric Liang 649fa4bf1d [SPARK-17370] Shuffle service files not invalidated when a slave is lost
## What changes were proposed in this pull request?

DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime.

However, it also doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files.

The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event.

## How was this patch tested?

Unit tests, also verified on an actual cluster that slave loss invalidates shuffle files immediately as expected.

cc mateiz

Author: Eric Liang <ekl@databricks.com>

Closes #14931 from ericl/sc-4439.
2016-09-07 12:33:50 -07:00
hyukjinkwon 6b41195bca [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in SparkContext for Windows paths in SparkR
## What changes were proposed in this pull request?

This PR fixes the Windows path issues in several APIs. Please refer https://issues.apache.org/jira/browse/SPARK-17339 for more details.

## How was this patch tested?

Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r

Also, manually,

![2016-09-06 3 14 38](https://cloud.githubusercontent.com/assets/6477701/18263406/b93a98be-7444-11e6-9521-b28ee65a4771.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14960 from HyukjinKwon/SPARK-17339.
2016-09-07 19:24:03 +09:00
Liwei Lin 3ce3a282c8 [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths
## What changes were proposed in this pull request?

We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14914 from lw-lin/append_to_plus_eq_v2.
2016-09-07 10:04:00 +01:00
Eric Liang c07cbb3534 [SPARK-17371] Resubmitted shuffle outputs can get deleted by zombie map tasks
## What changes were proposed in this pull request?

It seems that old shuffle map tasks hanging around after a stage resubmit will delete intended shuffle output files on stop(), causing downstream stages to fail even after successful resubmit completion. This can happen easily if the prior map task is waiting for a network timeout when its stage is resubmitted.

This can cause unnecessary stage resubmits, sometimes multiple times as fetch fails cause a cascade of shuffle file invalidations, and confusing FetchFailure messages that report shuffle index files missing from the local disk.

Given that IndexShuffleBlockResolver commits data atomically, it seems unnecessary to ever delete committed task output: even in the rare case that a task is failed after it finishes committing shuffle output, it should be safe to retain that output.

## How was this patch tested?

Prior to the fix proposed in https://github.com/apache/spark/pull/14931, I was able to reproduce this behavior by killing slaves in the middle of a large shuffle. After this patch, stages were no longer resubmitted multiple times due to shuffle index loss.

cc JoshRosen vanzin

Author: Eric Liang <ekl@databricks.com>

Closes #14932 from ericl/dont-remove-committed-files.
2016-09-06 16:55:22 -07:00
Shixiong Zhu 175b434411 [SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor'
## What changes were proposed in this pull request?

Fix the 'ask' type parameter in 'removeExecutor' to eliminate a lot of error logs like `Cannot cast java.lang.Boolean to scala.runtime.Nothing$`.
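
A sketch of the fix: make the expected reply type explicit so it is not inferred as `Nothing` (names follow the surrounding Spark code):

```scala
// Before: driverEndpoint.ask(RemoveExecutor(executorId, reason)) inferred Nothing
// as the reply type, so the Boolean reply failed to cast.
driverEndpoint.ask[Boolean](RemoveExecutor(executorId, reason))
```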

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14983 from zsxwing/SPARK-17316-3.
2016-09-06 16:49:06 -07:00
Josh Rosen 29cfab3f15 [SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues()
## What changes were proposed in this pull request?

This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java to be inappropriately used instead of Kryo, leading to the StreamCorruptedException.

We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded. Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long.

This patch addresses the bug by modifying `BlockManager`'s `get()` and `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded through the `getOrElseUpdate` code path (which is used by `rdd.iterator`).

## How was this patch tested?

Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14952 from JoshRosen/SPARK-17110.
2016-09-06 15:07:28 -07:00
Shivansh e75c162e9e [SPARK-17308] Improved the spark core code by replacing all pattern match on boolean value by if/else block.
## What changes were proposed in this pull request?
Improved the code quality of Spark core by replacing all pattern matches on boolean values with if/else blocks.
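
A sketch of the transformation (`doA`/`doB` are placeholders):

```scala
// Before:
flag match {
  case true  => doA()
  case false => doB()
}
// After:
if (flag) doA() else doB()
```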

## How was this patch tested?

By running the tests

Author: Shivansh <shiv4nsh@gmail.com>

Closes #14873 from shiv4nsh/SPARK-17308.
2016-09-04 12:39:26 +01:00
wm624@hotmail.com e9b58e9ef8 [SPARK-16829][SPARKR] sparkR sc.setLogLevel doesn't work
## What changes were proposed in this pull request?

```
./bin/sparkR
Launching java with spark-submit command /Users/mwang/spark_ws_0904/bin/spark-submit "sparkr-shell" /var/folders/s_/83b0sgvj2kl2kwq4stvft_pm0000gn/T//RtmpQxJGiZ/backend_porte9474603ed1e
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel).

> sc.setLogLevel("INFO")
Error: could not find function "sc.setLogLevel"
```

sc.setLogLevel doesn't exist.

R has a function setLogLevel.

I renamed the setLogLevel function to sc.setLogLevel.

## How was this patch tested?
Changed the unit test and ran the unit tests.
Manually tested it in the sparkR shell.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #14433 from wangmiao1981/sc.
2016-09-03 13:56:20 -07:00
Robert Kruszewski 806d8a8e98 [SPARK-16984][SQL] don't try whole dataset immediately when first partition doesn't have…
## What changes were proposed in this pull request?

Try an increasing number of partitions so we don't immediately fall back to scanning all of them.

## How was this patch tested?

Empirically. This is a common-case optimization.

Author: Robert Kruszewski <robertk@palantir.com>

Closes #14573 from robert3005/robertk/execute-take-backoff.
2016-09-02 17:14:43 +02:00
Kousuke Saruta 7ee24dac8e [SPARK-17352][WEBUI] Executor computing time can be negative-number because of calculation error
## What changes were proposed in this pull request?

In StagePage, executor computing time is calculated, but a calculation error can potentially occur because it is computed by subtracting floating-point numbers.

The following capture shows an example.

<img width="949" alt="capture-timeline" src="https://cloud.githubusercontent.com/assets/4736016/18152359/43f07a28-7030-11e6-8cbd-8e73bf4c4c67.png">

## How was this patch tested?

Manual tests.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #14908 from sarutak/SPARK-17352.
2016-09-02 10:26:43 +01:00
Kousuke Saruta 2ab8dbddaa [SPARK-17342][WEBUI] Style of event timeline is broken
## What changes were proposed in this pull request?

SPARK-15373 (#13158) updated the version of vis.js to 4.16.1. As of 4.0.0, some classes were renamed (e.g. 'timeline' to 'vis-timeline'), but that ticket didn't account for the renaming, so the style is now broken.

In this PR, I've restored the style by modifying `timeline-view.css` and `timeline-view.js`.

## How was this patch tested?

Manual tests.

* Before
<img width="1258" alt="2016-09-01 1 38 31" src="https://cloud.githubusercontent.com/assets/4736016/18141311/fddf1bac-6ff3-11e6-935f-28b389073b39.png">

* After
<img width="1256" alt="2016-09-01 3 30 19" src="https://cloud.githubusercontent.com/assets/4736016/18141394/49af65dc-6ff4-11e6-8640-70e20300f3c3.png">

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #14900 from sarutak/SPARK-17342.
2016-09-02 08:46:15 +01:00
Sean Owen 3893e8c576 [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays
## What changes were proposed in this pull request?

Avoid allocating some 0-length arrays, especially in UTF8String, by using Array.empty in Scala instead of Array[T]().
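
A small illustration of the preferred form:

```scala
val preferred = Array.empty[Int] // direct empty array
val avoided   = Array[Int]()     // goes through the varargs apply, extra overhead
```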

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14895 from srowen/SPARK-17331.
2016-09-01 12:13:07 -07:00
Angus Gerry a0aac4b775 [SPARK-16533][CORE] resolve deadlocking in driver when executors die
## What changes were proposed in this pull request?
This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

## How was this patch tested?
Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <angolon@gmail.com>

Closes #14710 from angolon/SPARK-16533.
2016-09-01 10:35:31 -07:00
Sean Owen 5d84c7fd83 [SPARK-17332][CORE] Make Java Loggers static members
## What changes were proposed in this pull request?

Make all Java Loggers static members

## How was this patch tested?

Jenkins

Author: Sean Owen <sowen@cloudera.com>

Closes #14896 from srowen/SPARK-17332.
2016-08-31 11:09:14 -07:00
Shixiong Zhu 9bcb33c541 [SPARK-17316][CORE] Make CoarseGrainedSchedulerBackend.removeExecutor non-blocking
## What changes were proposed in this pull request?

StandaloneSchedulerBackend.executorRemoved is currently a blocking call. It may cause deadlocks, since it's called inside StandaloneAppClient.ClientEndpoint.

This PR just changes CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #14882 from zsxwing/SPARK-17316.
2016-08-31 10:56:02 -07:00
Alex Bozarth f7beae6da0 [SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large application history
## What changes were proposed in this pull request?

With the new History Server, the summary page loads the application list via the REST API; this makes it very slow, or even impossible, to load with a large (10K+) application history. This PR fixes that by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` API. (Note this only applies to what the summary page displays; all the application UIs are still accessible if the user knows the App ID and goes to the application UI directly.)

I've also added a new test for the `limit` param in `HistoryServerSuite.scala`.

## How was this patch tested?

Manual testing and dev/run-tests

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #14835 from ajbozarth/spark17243.
2016-08-30 16:33:54 -05:00
Josh Rosen fb20084313 [SPARK-17304] Fix perf. issue caused by TaskSetManager.abortIfCompletelyBlacklisted
This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run

```
sc.parallelize(1 to 100000, 100000).map(identity).count()
```

then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`:

![image](https://cloud.githubusercontent.com/assets/50748/18071032/428732b0-6e07-11e6-88b2-c9423cd61f53.png)

When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels:

```scala
   for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) {
      do {
        launchedTask = resourceOfferSingleTaskSet(
            taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      } while (launchedTask)
    }
```

In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a `taskSet.abortIfCompletelyBlacklisted` call inside of  `resourceOfferSingleTaskSet`; if a call to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted` checks whether the tasks are completely blacklisted in order to figure out whether they will ever be schedulable. The problem with this placement of the call is that the last call to `resourceOfferSingleTaskSet` in the `while` loop will return `false`, implying that  `resourceOfferSingleTaskSet` will call `abortIfCompletelyBlacklisted`, so almost every call to `resourceOffers` will trigger the `abortIfCompletelyBlacklisted` check for every task set.

Instead, I think that this call should be moved out of the innermost loop and should be called _at most_ once per task set in case none of the task set's tasks can be scheduled at any locality level.
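
A sketch of the proposed placement, adapted from the loop above (the abort check's exact signature may differ):

```scala
for (taskSet <- sortedTaskSets) {
  var launchedAnyTask = false
  for (maxLocality <- taskSet.myLocalityLevels) {
    do {
      launchedTask = resourceOfferSingleTaskSet(
        taskSet, maxLocality, shuffledOffers, availableCpus, tasks)
      launchedAnyTask = launchedAnyTask || launchedTask
    } while (launchedTask)
  }
  if (!launchedAnyTask) {
    // At most one blacklist check per task set, and only when nothing
    // could be scheduled at any locality level.
    taskSet.abortIfCompletelyBlacklisted(hostToExecutors)
  }
}
```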

Before this patch's changes, the microbenchmark example that I posted above took 35 seconds to run, but it now only takes 15 seconds after this change.

/cc squito and kayousterhout for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14871 from JoshRosen/bail-early-if-no-cpus.
2016-08-30 13:15:21 -07:00
Ferdinand Xu 4b4e329e49 [SPARK-5682][CORE] Add encrypted shuffle in spark
This patch uses the Apache Commons Crypto library to enable shuffle encryption support.

Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>

Closes #8880 from winningsix/SPARK-10771.
2016-08-30 09:15:31 -07:00
Xin Ren 27209252f0 [MINOR][MLLIB][SQL] Clean up unused variables and unused import
## What changes were proposed in this pull request?

Clean up unused variables and unused import statements, unnecessary `return` and `toArray` calls, and make some other style improvements noticed while walking through the code examples.

## How was this patch tested?

Tested manually on a local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #14836 from keypointt/codeWalkThroughML.
2016-08-30 11:24:55 +01:00
Xin Ren 2d76cb11f5 [SPARK-17276][CORE][TEST] Stop env params output on Jenkins job page
https://issues.apache.org/jira/browse/SPARK-17276

## What changes were proposed in this pull request?

When trying to find error messages in a failed Jenkins build job, I'm annoyed by the huge env output.
The env parameter output should be muted.

![screen shot 2016-08-26 at 10 52 07 pm](https://cloud.githubusercontent.com/assets/3925641/18025581/b8d567ba-6be2-11e6-9eeb-6aec223f1730.png)

## How was this patch tested?

Tested manually on local laptop.

Author: Xin Ren <iamshrek@126.com>

Closes #14848 from keypointt/SPARK-17276.
2016-08-30 11:18:29 +01:00
Robert Kruszewski 9fbced5b25 [SPARK-17216][UI] fix event timeline bars length
## What changes were proposed in this pull request?

Make the event timeline bar expand to the full length of the bar (which is the total time).

This issue occurs only on Chrome; Firefox looks fine. I haven't tested other browsers.

## How was this patch tested?
Inspection in browsers

Before
![screen shot 2016-08-24 at 3 38 24 pm](https://cloud.githubusercontent.com/assets/512084/17935104/0d6cda74-6a12-11e6-9c66-e00cfa855606.png)

After
![screen shot 2016-08-24 at 3 36 39 pm](https://cloud.githubusercontent.com/assets/512084/17935114/15740ea4-6a12-11e6-83a1-7c06eef6abb8.png)

Author: Robert Kruszewski <robertk@palantir.com>

Closes #14791 from robert3005/robertk/event-timeline.
2016-08-27 08:47:15 +01:00
Yin Huai a6bca3ad02 [SPARK-17266][TEST] Add empty strings to the regressionTests of PrefixComparatorsSuite
## What changes were proposed in this pull request?
This PR adds a regression test to PrefixComparatorsSuite's "String prefix comparator" because this test failed on jenkins once (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/).

I could not reproduce it locally, but let's keep this test case in the regressionTests.

Author: Yin Huai <yhuai@databricks.com>

Closes #14837 from yhuai/SPARK-17266.
2016-08-26 19:38:52 -07:00