Commit graph

5567 commits

Author SHA1 Message Date
felixcheung 0df3cfb8ab [SPARK-12790][CORE] Remove HistoryServer old multiple files format
Removed isLegacyLogDirectory code path and updated tests
andrewor14

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #10860 from felixcheung/historyserverformat.
2016-02-01 16:55:21 -08:00
Sean Owen 715a19d56f [SPARK-12637][CORE] Print stage info of finished stages properly
Improve printing of StageInfo in onStageCompleted

See also https://github.com/apache/spark/pull/10585

Author: Sean Owen <sowen@cloudera.com>

Closes #10922 from srowen/SPARK-12637.
2016-02-01 16:23:17 -08:00
Iulian Dragos c9b89a0a09 [SPARK-12979][MESOS] Don’t resolve paths on the local file system in Mesos scheduler
The driver filesystem is likely different from where the executors will run, so resolving paths (and symlinks, etc.) will lead to invalid paths on executors.

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10923 from dragos/issue/canonical-paths.
2016-02-01 13:38:38 -08:00
Nilanjan Raychaudhuri a41b68b954 [SPARK-12265][MESOS] Spark calls System.exit inside driver instead of throwing exception
This takes over #10729 and makes sure that `spark-shell` fails with a proper error message. There is a slight behavioral change: before this change `spark-shell` would exit, while now the REPL is still there, but `sc` and `sqlContext` are not defined and the error is visible to the user.

Author: Nilanjan Raychaudhuri <nraychaudhuri@gmail.com>
Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10921 from dragos/pr/10729.
2016-02-01 13:33:24 -08:00
Timothy Chen 51b03b71ff [SPARK-12463][SPARK-12464][SPARK-12465][SPARK-10647][MESOS] Fix zookeeper dir with mesos conf and add docs.
Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings.

Author: Timothy Chen <tnachen@gmail.com>

Closes #10057 from tnachen/fix_mesos_dir.
2016-02-01 12:45:02 -08:00
Shixiong Zhu 6075573a93 [SPARK-6847][CORE][STREAMING] Fix stack overflow issue when updateStateByKey is followed by a checkpointed dstream
Add a local property that indicates whether to checkpoint all RDDs marked with the checkpoint flag, and enable it in Streaming.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10934 from zsxwing/recursive-checkpoint.
2016-02-01 11:02:17 -08:00
Josh Rosen 289373b28c [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version
This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).

The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).

After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10608 from JoshRosen/SPARK-6363.
2016-01-30 00:20:28 -08:00
Andrew Or 12252d1da9 [SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics
This issue is causing tests to fail consistently in master with Hadoop 2.6 / 2.7. This is because for Hadoop 2.5+ we overwrite existing values of `InputMetrics#bytesRead` in each call to `HadoopRDD#compute`. In the case of coalesce, e.g.
```
sc.textFile(..., 4).coalesce(2).count()
```
we will call `compute` multiple times in the same task, overwriting `bytesRead` values from previous calls to `compute`.
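
A minimal Python sketch of the overwrite problem described above. It only models one task calling `compute` once per coalesced parent partition; the names `InputMetrics`, `compute_overwriting` and `compute_accumulating` are hypothetical and are not Spark's API.

```python
class InputMetrics:
    """Hypothetical stand-in for a task's input metrics."""

    def __init__(self):
        self.bytes_read = 0


def compute_overwriting(metrics, partition_bytes):
    # Problematic pattern: each call replaces what earlier calls recorded.
    metrics.bytes_read = partition_bytes


def compute_accumulating(metrics, partition_bytes):
    # Safer pattern: add to the value recorded by earlier calls.
    metrics.bytes_read += partition_bytes


# One coalesced task reads two parent partitions of 100 and 250 bytes.
parent_partition_sizes = [100, 250]

overwritten = InputMetrics()
accumulated = InputMetrics()
for size in parent_partition_sizes:
    compute_overwriting(overwritten, size)
    compute_accumulating(accumulated, size)
```

With overwriting, only the last partition's bytes survive; with accumulation, the task reports the sum of both reads.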

For a regression test, see `InputOutputMetricsSuite.input metrics for old hadoop with coalesce`. I did not add a new regression test because it's impossible without significant refactoring; there's a lot of existing duplicate code in this corner of Spark.

This was caused by #10835.

Author: Andrew Or <andrew@databricks.com>

Closes #10973 from andrewor14/fix-input-metrics-coalesce.
2016-01-29 18:03:08 -08:00
Andrew Or 70e69fc4dd [SPARK-13088] Fix DAG viz in latest version of chrome
Apparently chrome removed `SVGElement.prototype.getTransformToElement`, which is used by our JS library dagre-d3 when creating edges. The real diff can be found here: 7d6c0002e4, which is taken from the fix in the main repo: 1ef067f1c6

Upstream issue: https://github.com/cpettitt/dagre-d3/issues/202

Author: Andrew Or <andrew@databricks.com>

Closes #10986 from andrewor14/fix-dag-viz.
2016-01-29 18:00:49 -08:00
Andrew Or e6ceac49a3 [SPARK-13096][TEST] Fix flaky verifyPeakExecutionMemorySet
Previously we would assert things before all events are guaranteed to have been processed. To fix this, just block until all events are actually processed, i.e. until the listener queue is empty.
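
The idea of waiting for the queue to drain before asserting can be sketched in plain Python. This is an illustration under assumptions, not the test's actual code: the queue, the `listener_loop` function and the `None` sentinel are all hypothetical.

```python
import queue
import threading

events = queue.Queue()
processed = []


def listener_loop():
    # Drain events on a background thread, like an asynchronous listener bus.
    while True:
        event = events.get()
        try:
            if event is None:  # sentinel: stop the loop
                return
            processed.append(event)
        finally:
            events.task_done()


worker = threading.Thread(target=listener_loop, daemon=True)
worker.start()

for i in range(100):
    events.put(i)

# Block until every posted event has been handled; asserting before this
# point would race with the listener thread.
events.join()
snapshot = list(processed)

events.put(None)  # shut the listener down
worker.join()
```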

https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/79/testReport/junit/org.apache.spark.util.collection/ExternalAppendOnlyMapSuite/spilling/

Author: Andrew Or <andrew@databricks.com>

Closes #10990 from andrewor14/accum-suite-less-flaky.
2016-01-29 17:59:41 -08:00
Andrew Or e38b0baa38 [SPARK-13055] SQLHistoryListener throws ClassCastException
This is an existing issue uncovered recently by #10835. The reason for the exception was because the `SQLHistoryListener` gets all sorts of accumulators, not just the ones that represent SQL metrics. For example, the listener gets the `internal.metrics.shuffleRead.remoteBlocksFetched`, which is an Int, then it proceeds to cast the Int to a Long, which fails.

The fix is to mark accumulators representing SQL metrics using some internal metadata. Then we can identify which ones are SQL metrics and only process those in the `SQLHistoryListener`.
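
A rough Python sketch of the tagging idea, assuming a simple string marker. `AccumulatorInfo`, `SQL_METRIC_TAG` and `sql_metric_values` are hypothetical names chosen for illustration; the real metadata format is not described here.

```python
from dataclasses import dataclass
from typing import Optional

SQL_METRIC_TAG = "sql"  # hypothetical metadata marker


@dataclass
class AccumulatorInfo:
    name: str
    value: object
    metadata: Optional[str] = None


def sql_metric_values(accumulators):
    # Only accumulators explicitly tagged as SQL metrics are processed;
    # everything else is skipped instead of being cast blindly.
    return {
        a.name: int(a.value)
        for a in accumulators
        if a.metadata == SQL_METRIC_TAG
    }


updates = [
    AccumulatorInfo("internal.metrics.shuffleRead.remoteBlocksFetched", 3),
    AccumulatorInfo("number of rows", 42, metadata=SQL_METRIC_TAG),
]
```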

Author: Andrew Or <andrew@databricks.com>

Closes #10971 from andrewor14/fix-sql-history.
2016-01-29 13:45:03 -08:00
zhuol e4c1162b6b [SPARK-10873] Support column sort and search for History Server.
[SPARK-10873] Support column sort and search for History Server using jQuery DataTable and REST API. Before this commit, the history server generated hard-coded HTML and could not support search; in addition, sorting was disabled if any application had more than one attempt. Supporting search and sort (over all applications rather than the 20 entries in the current page) in any case will greatly improve user experience.

1. Create the historypage-template.html for displaying application information in datatables.
2. historypage.js uses jQuery to access the data from the /api/v1/applications REST API, and uses DataTable to display each application's information. For applications that have more than one attempt, RowsGroup is used to merge such entries while still supporting sort and search.
3. "duration" and "lastUpdated" are added to the application's "attempts" in the REST API.
4. External JavaScript and CSS files for datatables, RowsGroup and jQuery plugins are added with licenses clarified.

Screenshots of how it looks now:

History page view:
![historypage](https://cloud.githubusercontent.com/assets/11683054/12184383/89bad774-b55a-11e5-84e4-b0276172976f.png)

Search:
![search](https://cloud.githubusercontent.com/assets/11683054/12184385/8d3b94b0-b55a-11e5-869a-cc0ef0a4242a.png)

Sort by started time:
![sort-by-started-time](https://cloud.githubusercontent.com/assets/11683054/12184387/8f757c3c-b55a-11e5-98c8-577936366566.png)

Author: zhuol <zhuol@yahoo-inc.com>

Closes #10648 from zhuoliu/10873.
2016-01-29 11:54:58 -06:00
Andrew Or d702f0c170 [HOTFIX] Fix Scala 2.11 compilation
by explicitly marking annotated parameters as vals (SI-8813).

Caused by #10835.

Author: Andrew Or <andrew@databricks.com>

Closes #10955 from andrewor14/fix-scala211.
2016-01-27 14:01:55 -08:00
Josh Rosen 32f741115b [SPARK-13021][CORE] Fail fast when custom RDDs violate RDD.partition's API contract
Spark's `Partition` and `RDD.partitions` APIs have a contract which requires custom implementations of `RDD.partitions` to ensure that for all `x`, `rdd.partitions(x).index == x`; in other words, the `index` reported by a partition needs to match its position in the partitions array.

If a custom RDD implementation violates this contract, then Spark has the potential to become stuck in an infinite recomputation loop when recomputing a subset of an RDD's partitions, since the tasks that are actually run will not correspond to the missing output partitions that triggered the recomputation. Here's a link to a notebook which demonstrates this problem: 5e8a5aa8d2/Violating%2520RDD.partitions%2520contract.html

In order to guard against this infinite loop behavior, this patch modifies Spark so that it fails fast and refuses to compute RDDs whose `partitions` violate the API contract.
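
The contract check can be sketched in a few lines of Python. This is an illustration of the fail-fast idea only; `Partition` and `check_partitions` are hypothetical names, and the error type is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    index: int


def check_partitions(partitions):
    # Fail fast if any partition's index differs from its array position.
    for position, partition in enumerate(partitions):
        if partition.index != position:
            raise ValueError(
                f"partitions({position}).index == {partition.index}, "
                f"but it should equal {position}"
            )
    return partitions
```

Validating once, when the partitions are first obtained, turns a potential infinite recomputation loop into an immediate and descriptive error.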

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10932 from JoshRosen/SPARK-13021.
2016-01-27 13:27:32 -08:00
Andrew Or 87abcf7df9 [SPARK-12895][SPARK-12896] Migrate TaskMetrics to accumulators
The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:

**SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.

**SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.

While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here.

Note: This was once part of #10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master.

Author: Andrew Or <andrew@databricks.com>

Closes #10835 from andrewor14/task-metrics-use-accums.
2016-01-27 11:15:48 -08:00
Nishkam Ravi bae3c9a4eb [SPARK-12967][NETTY] Avoid NettyRpc error message during sparkContext shutdown
If there's an RPC issue while sparkContext is alive but stopped (which would happen only when executing SparkContext.stop), log a warning instead. This is a common occurrence.

vanzin

Author: Nishkam Ravi <nishkamravi@gmail.com>
Author: nishkamravi2 <nishkamravi@gmail.com>

Closes #10881 from nishkamravi2/master_netty.
2016-01-26 21:14:39 -08:00
Shixiong Zhu 22662b2416 [SPARK-12614][CORE] Don't throw non fatal exception from ask
Right now RpcEndpointRef.ask may throw an exception in some corner cases, such as calling ask after stopping RpcEnv. It's better to avoid throwing exceptions from RpcEndpointRef.ask. We can send the exception to the future for `ask`.
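
A small Python sketch of delivering the failure through the future instead of raising at the call site. The `ask` function, the `env_stopped` flag and `StoppedError` are hypothetical; only `concurrent.futures.Future` is a real library class.

```python
from concurrent.futures import Future


class StoppedError(RuntimeError):
    """Hypothetical error for an endpoint whose environment has stopped."""


def ask(env_stopped, message):
    # Always return a future; failures are delivered through it rather than
    # raised at the call site.
    future = Future()
    if env_stopped:
        future.set_exception(StoppedError("RpcEnv already stopped"))
    else:
        future.set_result(f"reply to {message}")
    return future
```

Callers then handle success and failure in one place, for example by inspecting `future.exception()` or by catching the error when calling `future.result()`.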

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10568 from zsxwing/send-ask-fail.
2016-01-26 17:24:40 -08:00
zhuol ae0309a881 [SPARK-10911] Executors should System.exit on clean shutdown.
Call `System.exit` explicitly to make sure non-daemon user threads terminate. Without this, user applications might live forever if the cluster manager does not appropriately kill them. E.g., YARN had this bug: HADOOP-12441.

Author: zhuol <zhuol@yahoo-inc.com>

Closes #9946 from zhuoliu/10911.
2016-01-26 09:40:02 -06:00
Sean Owen 649e9d0f5b [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.

CC rxin pwendell for API change; tdas since it also touches streaming.

Author: Sean Owen <sowen@cloudera.com>

Closes #10413 from srowen/SPARK-3369.
2016-01-26 11:55:28 +00:00
Liang-Chi Hsieh 5936bf9fa8 [SPARK-12961][CORE] Prevent snappy-java memory leak
JIRA: https://issues.apache.org/jira/browse/SPARK-12961

To prevent a memory leak in snappy-java, just call the method once and cache the result. After the library releases a new version, we can remove this object.
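
The call-once-and-cache pattern, sketched in Python under assumptions: `_load_native_version` is a hypothetical stand-in for the method whose repeated calls are the problem, and the returned string is a placeholder.

```python
import functools

native_calls = []


def _load_native_version():
    # Stand-in for the native call that should not be repeated.
    native_calls.append(1)
    return "placeholder-version"


@functools.lru_cache(maxsize=None)
def snappy_version():
    # Cached: the underlying call runs once, later callers reuse the result.
    return _load_native_version()


versions = [snappy_version() for _ in range(1000)]
```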

JoshRosen

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #10875 from viirya/prevent-snappy-memory-leak.
2016-01-26 11:36:00 +00:00
Davies Liu 7d877c3439 [SPARK-12902] [SQL] visualization for generated operators
This PR brings back visualization for generated operators, which looks like this:

![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png)

![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png)

Note: SQL metrics are not supported right now because they are very slow; they will be supported once we have batch mode.

Author: Davies Liu <davies@databricks.com>

Closes #10828 from davies/viz_codegen.
2016-01-25 12:44:20 -08:00
Alex Bozarth c037d25482 [SPARK-12149][WEB UI] Executor UI improvement suggestions - Color UI
Added color coding to the Executors page for Active Tasks, Failed Tasks, Completed Tasks and Task Time.

Active Tasks is shaded blue with its range based on percentage of total cores used.
Failed Tasks is shaded red ranging over the first 10% of total tasks failed.
Completed Tasks is shaded green ranging over 10% of total tasks including failed and active tasks, but only when there are active or failed tasks on that executor.
Task Time is shaded red when GC Time goes over 10% of total time with its range directly corresponding to the percent of total time.

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #10154 from ajbozarth/spark12149.
2016-01-25 14:42:44 -06:00
Michael Allman 4ee8191e57 [SPARK-12755][CORE] Stop the event logger before the DAG scheduler
[SPARK-12755][CORE] Stop the event logger before the DAG scheduler to avoid a race condition where the standalone master attempts to build the app's history UI before the event log is stopped.

This contribution is my original work, and I license this work to the Spark project under the project's open source license.

Author: Michael Allman <michael@videoamp.com>

Closes #10700 from mallman/stop_event_logger_first.
2016-01-25 09:51:41 +00:00
Shixiong Zhu ea5c38fe75 [HOTFIX]Remove rpcEnv.awaitTermination to avoid dead-lock in some test
It looks like rpcEnv.awaitTermination may block some tests forever. Just remove it and investigate the tests.
2016-01-22 22:14:47 -08:00
Shixiong Zhu bc1babd63d [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth keeping this config because the choice between `DirectTaskResult` and `IndirectTaskResult` depends on it.
- Update comments and docs

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10854 from zsxwing/remove-akka.
2016-01-22 21:20:04 -08:00
Shixiong Zhu 944fdadf77 [SPARK-12847][CORE][STREAMING] Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
Including the following changes:

1. Add StreamingListenerForwardingBus to WrappedStreamingListenerEvent process events in `onOtherEvent` to StreamingListener
2. Remove StreamingListenerBus
3. Merge AsynchronousListenerBus and LiveListenerBus to the same class LiveListenerBus
4. Add `logEvent` method to SparkListenerEvent so that EventLoggingListener can use it to ignore WrappedStreamingListenerEvents

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10779 from zsxwing/streaming-listener.
2016-01-20 11:57:53 -08:00
scwf 43f1d59e17 [SPARK-2750][WEB UI] Add https support to the Web UI
Author: scwf <wangfei1@huawei.com>
Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: WangTaoTheTonic <wangtao111@huawei.com>
Author: w00228970 <wangfei1@huawei.com>

Closes #10238 from vanzin/SPARK-2750.
2016-01-19 14:49:55 -08:00
Andrew Or b122c861cd [SPARK-12887] Do not expose var's in TaskMetrics
This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators.

TaskMetrics has a bunch of var's, some are fully public, some are `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics. This has happened a few times in the past and caused bugs that were difficult to debug.

Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them.
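
Get-or-create semantics can be illustrated with a short Python sketch. The class and method names below (`TaskMetrics`, `register_input_metrics`, `InputMetrics`) are hypothetical and only mirror the idea, not the actual Spark signatures.

```python
class InputMetrics:
    def __init__(self):
        self.bytes_read = 0


class TaskMetrics:
    def __init__(self):
        self._input_metrics = None  # not exposed for direct reassignment

    def register_input_metrics(self):
        # Get-or-create: the first caller creates the object, later callers
        # get the same one, so nobody can replace previously set metrics.
        if self._input_metrics is None:
            self._input_metrics = InputMetrics()
        return self._input_metrics


tm = TaskMetrics()
tm.register_input_metrics().bytes_read += 10
tm.register_input_metrics().bytes_read += 5
```

Because every caller increments the same object, it does not matter which component touches the metrics first.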

Parent PR: #10717

Author: Andrew Or <andrew@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: andrewor14 <andrew@databricks.com>

Closes #10815 from andrewor14/get-or-create-metrics.
2016-01-19 10:58:51 -08:00
Andrew Or 2b5d11f34d [SPARK-12885][MINOR] Rename 3 fields in ShuffleWriteMetrics
This is a small step in implementing SPARK-10620, which migrates TaskMetrics to accumulators. This patch is strictly a cleanup patch and introduces no change in functionality. It literally just renames 3 fields for consistency. Today we have:

```
inputMetrics.recordsRead
outputMetrics.bytesWritten
shuffleReadMetrics.localBlocksFetched
...
shuffleWriteMetrics.shuffleRecordsWritten
shuffleWriteMetrics.shuffleBytesWritten
shuffleWriteMetrics.shuffleWriteTime
```

The shuffle write ones are kind of redundant. We can drop the `shuffle` part in the method names. I added backward compatible (but deprecated) methods with the old names.
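
A Python sketch of keeping the old name as a deprecated alias of the renamed field. The snake_case names are hypothetical translations of the fields above, and the warning mechanism is Python's, not the one used in the patch.

```python
import warnings


class ShuffleWriteMetrics:
    def __init__(self):
        self.bytes_written = 0

    @property
    def shuffle_bytes_written(self):
        # Old name kept as a deprecated alias for the renamed field.
        warnings.warn(
            "shuffle_bytes_written is deprecated; use bytes_written",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.bytes_written
```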

Parent PR: #10717

Author: Andrew Or <andrew@databricks.com>

Closes #10811 from andrewor14/rename-things.
2016-01-18 19:22:29 -08:00
Josh Rosen b8cb548a43 [SPARK-10985][CORE] Avoid passing evicted blocks throughout BlockManager
This patch refactors portions of the BlockManager and CacheManager in order to avoid having to pass `evictedBlocks` lists throughout the code. It appears that these lists were only consumed by `TaskContext.taskMetrics`, so the new code now directly updates the metrics from the lower-level BlockManager methods.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10776 from JoshRosen/SPARK-10985.
2016-01-18 13:34:12 -08:00
Andrew Or 302bb569f3 [SPARK-12884] Move classes to their own files for readability
This is a small step in implementing SPARK-10620, which migrates `TaskMetrics` to accumulators. This patch is strictly a cleanup patch and introduces no change in functionality. It literally just moves classes to their own files to avoid having single monolithic ones that contain 10 different classes.

Parent PR: #10717

Author: Andrew Or <andrew@databricks.com>

Closes #10810 from andrewor14/move-things.
2016-01-18 13:27:18 -08:00
Nong Li 9039333c0a [SPARK-12644][SQL] Update parquet reader to be vectorized.
This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch.
There are a few particulars in the Parquet encodings that make this much more efficient. In
particular, RLE encodings are very well suited for batch decoding. The Parquet 2.0 encodings are
also very suited for this.

This is a work in progress and does not affect the current execution. In subsequent patches, we will
support more encodings and types before enabling this.

Simple benchmarks indicate this can decode single ints more than 3x faster.

Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>

Closes #10593 from nongli/spark-12644.
2016-01-15 17:40:26 -08:00
Alex Bozarth 61c45876fb [SPARK-12716][WEB UI] Add a TOTALS row to the Executors Web UI
Added a Totals table to the top of the page to display the totals of each applicable column in the executors table.

Old Description:
~~Created a TOTALS row containing the totals of each column in the executors UI. By default the TOTALS row appears at the top of the table. When a column is sorted the TOTALS row will always sort to either the top or bottom of the table.~~

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #10668 from ajbozarth/spark12716.
2016-01-15 16:03:21 -06:00
Reynold Xin ad1503f92e [SPARK-12667] Remove block manager's internal "external block store" API
This pull request removes the external block store API. This is rarely used, and the file system interface is actually a better, more standard way to interact with external storage systems.

There are some other things to remove also, as pointed out by JoshRosen. We will do those as follow-up pull requests.

Author: Reynold Xin <rxin@databricks.com>

Closes #10752 from rxin/remove-offheap.
2016-01-15 12:03:28 -08:00
Koyo Yoshida 32cca93354 [SPARK-12708][UI] Sorting task error in Stages Page when yarn mode.
If the sort column contains a slash (e.g. "Executor ID / Host") in YARN mode, sorting fails with the following message.

![spark-12708](https://cloud.githubusercontent.com/assets/6679275/12193320/80814f8c-b62a-11e5-9914-7bf3907029df.png)

It's similar to SPARK-4313.

Author: root <root@R520T1.(none)>
Author: Koyo Yoshida <koyo0615@gmail.com>

Closes #10663 from yoshidakuy/SPARK-12708.
2016-01-15 13:32:47 +09:00
Josh Rosen 25782981cf [SPARK-12174] Speed up BlockManagerSuite getRemoteBytes() test
This patch significantly speeds up the BlockManagerSuite's "SPARK-9591: getRemoteBytes from another location when Exception throw" test, reducing the test time from 45s to ~250ms. The key change was to set `spark.shuffle.io.maxRetries` to 0 (the code previously set `spark.network.timeout` to `2s`, but this didn't make a difference because the slowdown was not due to this timeout).

Along the way, I also cleaned up the way that we handle SparkConf in BlockManagerSuite: previously, each test would mutate a shared SparkConf instance, while now each test gets a fresh SparkConf.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10759 from JoshRosen/SPARK-12174.
2016-01-14 17:37:27 -08:00
Shixiong Zhu 501e99ef0f [SPARK-12784][UI] Fix Spark UI IndexOutOfBoundsException with dynamic allocation
Add `listener.synchronized` to get `storageStatusList` and `execInfo` atomically.
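
Reading two related collections under one lock can be sketched as follows. This is a Python analogue of the `synchronized` block, with a hypothetical `ExecutorsListener` class; the field names are borrowed from the description above.

```python
import threading


class ExecutorsListener:
    """Hypothetical listener holding two collections that must stay in step."""

    def __init__(self):
        self._lock = threading.Lock()
        self.storage_status_list = []
        self.exec_info = {}

    def on_executor_added(self, exec_id):
        with self._lock:
            self.storage_status_list.append(exec_id)
            self.exec_info[exec_id] = {"active_tasks": 0}

    def on_executor_removed(self, exec_id):
        with self._lock:
            self.storage_status_list.remove(exec_id)
            del self.exec_info[exec_id]

    def snapshot(self):
        # Copy both collections under one lock so they describe the same moment.
        with self._lock:
            return list(self.storage_status_list), dict(self.exec_info)


listener = ExecutorsListener()
listener.on_executor_added("1")
listener.on_executor_added("2")
listener.on_executor_removed("1")
statuses, info = listener.snapshot()
```

Without the shared lock, a reader could see the list after a removal but the map before it, and then index past the end of the shorter one.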

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10728 from zsxwing/SPARK-12784.
2016-01-14 09:50:57 -08:00
Bryan Cutler 56cdbd654d [SPARK-9844][CORE] File appender race condition during shutdown
When an Executor process is destroyed, the FileAppender that is asynchronously reading the stderr stream of the process can throw an IOException during read because the stream is closed. Before the ExecutorRunner destroys the process, the FileAppender thread is flagged to stop. This PR wraps the inputStream.read call of the FileAppender in a try/catch block so that if an IOException is thrown and the thread has been flagged to stop, it will safely ignore the exception. Additionally, the FileAppender thread was changed to use Utils.tryWithSafeFinally to better log any exceptions that do occur. Added unit tests to verify that an IOException is thrown and logged if the FileAppender is not flagged to stop, and that no IOException is thrown when the flag is set.
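
The stop-flag handling can be sketched in Python, where `OSError` plays the role of `IOException`. The `FileAppender` class below is a hypothetical, simplified model and not Spark's implementation.

```python
import io
import threading


class FileAppender:
    """Hypothetical, simplified appender; not Spark's FileAppender."""

    def __init__(self, stream):
        self.stream = stream
        self.stop_requested = threading.Event()
        self.chunks = []

    def stop(self):
        self.stop_requested.set()

    def run(self):
        while not self.stop_requested.is_set():
            try:
                chunk = self.stream.read(4)
            except OSError:
                if self.stop_requested.is_set():
                    break  # expected: stream closed during shutdown
                raise  # not stopping: let the error surface
            if not chunk:
                break
            self.chunks.append(chunk)


appender = FileAppender(io.BytesIO(b"abcdefgh"))
appender.run()
```

The error is swallowed only when a stop was requested; any other read failure still propagates so that it can be logged.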

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #10714 from BryanCutler/file-appender-read-ioexception-SPARK-9844.
2016-01-14 10:59:02 +00:00
Josh Rosen e2ae7bd046 [SPARK-12819] Deprecate TaskContext.isRunningLocally()
We've already removed local execution but didn't deprecate `TaskContext.isRunningLocally()`; we should deprecate it for 2.0.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10751 from JoshRosen/remove-local-exec-from-taskcontext.
2016-01-13 21:02:54 -08:00
jerryshao cd81fc9e86 [SPARK-12400][SHUFFLE] Avoid generating temp shuffle files for empty partitions
This problem lies in `BypassMergeSortShuffleWriter`: an empty partition also generates a temp shuffle file of several bytes. This change creates the file only when the partition is not empty.

The problem exists only here; there is no such issue in `HashShuffleWriter`.
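
A Python sketch of creating a file only for non-empty partitions. The function `write_partitions` and the file naming are hypothetical; the sketch does not reproduce the writer's real on-disk format.

```python
import os
import tempfile


def write_partitions(partitions, out_dir):
    # Returns one path per partition, or None where nothing was written.
    paths = []
    for i, records in enumerate(partitions):
        if not records:
            paths.append(None)  # empty partition: skip file creation entirely
            continue
        path = os.path.join(out_dir, f"temp_shuffle_{i}")
        with open(path, "wb") as f:
            for record in records:
                f.write(record)
        paths.append(path)
    return paths


with tempfile.TemporaryDirectory() as out_dir:
    written = write_partitions([[b"a"], [], [b"bc"]], out_dir)
    created = sorted(os.listdir(out_dir))
```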

Please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #10376 from jerryshao/SPARK-12400.
2016-01-13 16:34:23 -08:00
Carson Wang eabc7b8ee7 [SPARK-12690][CORE] Fix NPE in UnsafeInMemorySorter.free()
I hit the exception below. The `UnsafeKVExternalSorter` does pass `null` as the consumer when creating an `UnsafeInMemorySorter`. Normally the NPE doesn't occur because the `inMemSorter` is set to null later and the `free()` method is not called. It happens when there is another exception like OOM thrown before setting `inMemSorter` to null. Anyway, we can add the null check to avoid it.

```
ERROR spark.TaskContextImpl: Error in TaskCompletionListener
java.lang.NullPointerException
        at org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.free(UnsafeInMemorySorter.java:110)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.cleanupResources(UnsafeExternalSorter.java:288)
        at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter$1.onTaskCompletion(UnsafeExternalSorter.java:141)
        at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:79)
        at org.apache.spark.TaskContextImpl$$anonfun$markTaskCompleted$1.apply(TaskContextImpl.scala:77)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:77)
        at org.apache.spark.scheduler.Task.run(Task.scala:91)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
        at java.lang.Thread.run(Thread.java:722)
```
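
A Python sketch of the null check, with `None` standing in for `null`. `MemoryConsumer`, `InMemorySorter` and `free_array` are hypothetical names used only to show the guard.

```python
class MemoryConsumer:
    def __init__(self):
        self.freed = []

    def free_array(self, array):
        self.freed.append(array)


class InMemorySorter:
    def __init__(self, consumer, array):
        self.consumer = consumer  # may be None, as in the case described above
        self.array = array

    def free(self):
        # Guard against a missing consumer instead of dereferencing it.
        if self.consumer is not None:
            self.consumer.free_array(self.array)
        self.array = None


InMemorySorter(None, [0] * 4).free()  # no error despite the missing consumer
```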

Author: Carson Wang <carson.wang@intel.com>

Closes #10637 from carsonwang/FixNPE.
2016-01-13 13:28:39 -08:00
Kousuke Saruta f14922cff8 [SPARK-12692][BUILD][CORE] Scala style: Fix the style violation (Space before ",")
Fix the style violation (space before , and :).
This PR is a followup for #10643

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10719 from sarutak/SPARK-12692-followup-core.
2016-01-12 19:24:50 -08:00
Shixiong Zhu 4f60651cbe [SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
  - Still keep the change that reads the checkpoint only once. This is a manual change and is worth reviewing carefully. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10692 from zsxwing/py4j-0.9.1.
2016-01-12 14:27:05 -08:00
Yucai Yu 7e15044d9d [SPARK-12582][TEST] IndexShuffleBlockResolverSuite fails in windows

* IndexShuffleBlockResolverSuite fails on Windows because a file is not closed.
* mv IndexShuffleBlockResolverSuite.scala from "test/java" to "test/scala".

https://issues.apache.org/jira/browse/SPARK-12582

Author: Yucai Yu <yucai.yu@intel.com>

Closes #10526 from yucai/master.
2016-01-12 13:23:23 +00:00
Tommy YU 9f0995bb0d [SPARK-12638][API DOC] Parameter explanation not very accurate for rdd function "aggregate"
Currently, the parameters of the RDD function aggregate are not explained well, especially the parameter "zeroValue".
It's helpful to let junior Scala users know that "zeroValue" participates in both the "seqOp" and "combOp" phases.
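
The effect of "zeroValue" taking part in both phases can be shown with a local Python model. This is a sketch of the documented behavior, not Spark's implementation; the `aggregate` function below works on plain lists that stand in for partitions.

```python
from functools import reduce


def aggregate(partitions, zero_value, seq_op, comb_op):
    # seqOp phase: fold each partition, starting from zero_value every time.
    per_partition = [reduce(seq_op, part, zero_value) for part in partitions]
    # combOp phase: merge the partition results, again starting from zero_value.
    return reduce(comb_op, per_partition, zero_value)


def add(a, b):
    return a + b


partitions = [[1, 2], [3, 4]]
with_neutral_zero = aggregate(partitions, 0, add, add)
with_zero_of_ten = aggregate(partitions, 10, add, add)
```

With two partitions, a "zeroValue" of 10 is added three times (once per partition and once when combining), which is why it should normally be the neutral element of the operation.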

Author: Tommy YU <tummyyu@163.com>

Closes #10587 from Wenpei/rdd_aggregate_doc.
2016-01-12 13:20:04 +00:00
Reynold Xin b23c4521f5 [SPARK-12340] Fix overflow in various take functions.
This is a follow-up for the original patch #10562.

Author: Reynold Xin <rxin@databricks.com>

Closes #10670 from rxin/SPARK-12340.
2016-01-09 11:21:58 -08:00
Josh Rosen 1fdf9bbd67 [SPARK-12730][TESTS] De-duplicate some test code in BlockManagerSuite
This patch deduplicates some test code in BlockManagerSuite. I'm splitting this change off from a larger PR in order to make things easier to review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10667 from JoshRosen/block-mgr-tests-cleanup.
2016-01-08 20:50:08 -08:00
Sean Owen 659fd9d04b [SPARK-4819] Remove Guava's "Optional" from public API
Replace Guava `Optional` with (an API clone of) Java 8 `java.util.Optional` (edit: and a clone of Guava `Optional`)

See also https://github.com/apache/spark/pull/10512

Author: Sean Owen <sowen@cloudera.com>

Closes #10513 from srowen/SPARK-4819.
2016-01-08 13:02:30 -08:00
Thomas Graves 553fd7b912 [SPARK-12654] sc.wholeTextFiles with spark.hadoop.cloneConf=true fails on secure Hadoop

https://issues.apache.org/jira/browse/SPARK-12654

So the bug here is that WholeTextFileRDD.getPartitions has:
`val conf = getConf`
In getConf, if cloneConf=true, it creates a new Hadoop Configuration, which is then used to create a new newJobContext.
The newJobContext will copy credentials around, but credentials are only present in a JobConf, not in a Hadoop Configuration. So basically, when it clones the Hadoop configuration, it changes it from a JobConf to a Configuration and drops the credentials that were there. NewHadoopRDD just uses the conf passed in for getPartitions (not getConf), which is why it works.
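
A toy Python model of how cloning into the base type drops the credentials. The classes below only borrow the names `Configuration` and `JobConf`; their fields and copy behavior are assumptions made for illustration and do not reflect Hadoop's real classes.

```python
class Configuration:
    def __init__(self, other=None):
        # Copy-constructor style: only plain key/value settings are copied.
        self.settings = dict(other.settings) if other else {}


class JobConf(Configuration):
    def __init__(self, other=None):
        super().__init__(other)
        # Credentials live only on the subclass in this model.
        self.credentials = list(getattr(other, "credentials", []))


original = JobConf()
original.settings["fs.defaultFS"] = "hdfs://example"  # placeholder value
original.credentials.append("token")

cloned_as_base = Configuration(original)  # credentials are lost
cloned_as_job = JobConf(original)         # credentials are preserved
```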

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>

Closes #10651 from tgravescs/SPARK-12654.
2016-01-08 14:38:19 -06:00
Bryan Cutler ea104b8f1c [SPARK-12701][CORE] FileAppender should use join to ensure writing thread completion
Changed Logging FileAppender to use join in `awaitTermination` to ensure that thread is properly finished before returning.
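
Waiting on the writer with `join` can be sketched in Python. The `FileAppender` below is a hypothetical, simplified model; the sleep merely simulates slow output.

```python
import threading
import time


class FileAppender:
    def __init__(self):
        self.lines = []
        self._thread = threading.Thread(target=self._write)

    def _write(self):
        time.sleep(0.05)  # simulate slow output
        self.lines.append("done")

    def start(self):
        self._thread.start()

    def await_termination(self):
        # join() returns only after the writing thread has finished.
        self._thread.join()


appender = FileAppender()
appender.start()
appender.await_termination()
```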

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #10654 from BryanCutler/fileAppender-join-thread-SPARK-12701.
2016-01-08 11:08:45 -08:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
Shixiong Zhu 28e0e500a2 [SPARK-12591][STREAMING] Register OpenHashMapBasedStateMap for Kryo
The default serializer in Kryo is FieldSerializer and it ignores transient fields and never calls `writeObject` or `readObject`. So we should register OpenHashMapBasedStateMap using `DefaultSerializer` to make it work with Kryo.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10609 from zsxwing/SPARK-12591.
2016-01-07 17:46:24 -08:00
Sean Owen 5a4021998a [SPARK-12604][CORE] Addendum - use casting vs mapValues for countBy{Key,Value}
Per rxin, let's use the casting for countByKey and countByValue as well. Let's see if this passes.

Author: Sean Owen <sowen@cloudera.com>

Closes #10641 from srowen/SPARK-12604.2.
2016-01-07 17:21:03 -08:00
Darek Blasiak 8346518357 [SPARK-12598][CORE] bug in setMinPartitions
There is a bug in the calculation of `maxSplitSize`. The `totalLen` should be divided by `minPartitions` and not by `files.size`.
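
The corrected arithmetic, sketched in Python. The function name, the rounding up and the guard against a zero partition count are assumptions of this sketch rather than details stated above.

```python
import math


def max_split_size(file_lengths, min_partitions):
    total_len = sum(file_lengths)
    # Divide by the requested partition count, not by the number of files.
    # Guarding against zero is an assumption of this sketch.
    return math.ceil(total_len / max(min_partitions, 1))


file_lengths = [100, 100, 100, 100]  # four files, 400 bytes in total
size_for_two = max_split_size(file_lengths, 2)
size_for_eight = max_split_size(file_lengths, 8)
```

Dividing by the number of files would yield 100 here regardless of `min_partitions`, so the requested minimum would have no effect on the split size.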

Author: Darek Blasiak <darek.blasiak@640labs.com>

Closes #10546 from datafarmer/setminpartitionsbug.
2016-01-07 21:15:40 +00:00
Jacek Laskowski 1b2c2162af [STREAMING][MINOR] More contextual information in logs + minor code improvements

Please review and merge at your convenience. Thanks!

Author: Jacek Laskowski <jacek@japila.pl>

Closes #10595 from jaceklaskowski/streaming-minor-fixes.
2016-01-07 21:12:57 +00:00
Davies Liu 6a1c864ab6 [SPARK-12295] [SQL] external spilling for window functions
This PR manages the memory used by window functions (buffered rows) and also enables external spilling.

After this PR, we can run window functions on a partition with hundreds of millions of rows with only 1G.

Author: Davies Liu <davies@databricks.com>

Closes #10605 from davies/unsafe_window.
2016-01-06 23:21:52 -08:00
Guillaume Poulin b673852037 [SPARK-12678][CORE] MapPartitionsRDD clearDependencies
MapPartitionsRDD was keeping a reference to `prev` after a call to
`clearDependencies` which could lead to memory leak.
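
Why a lingering reference matters can be shown with a small Python sketch. The class names are hypothetical, and Python's reference counting only approximates JVM garbage collection, so this illustrates the idea rather than the actual change.

```python
import gc
import weakref


class ParentSketch:
    """Hypothetical stand-in for the upstream RDD."""


class MapPartitionsSketch:
    def __init__(self, prev):
        self.prev = prev

    def clear_dependencies(self):
        # Dropping the reference lets the parent be garbage collected.
        self.prev = None


parent = ParentSketch()
parent_ref = weakref.ref(parent)
child = MapPartitionsSketch(parent)

del parent
gc.collect()
alive_before_clear = parent_ref() is not None  # child.prev still pins the parent

child.clear_dependencies()
gc.collect()
alive_after_clear = parent_ref() is not None
print(alive_before_clear, alive_after_clear)
```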

Author: Guillaume Poulin <poulin.guillaume@gmail.com>

Closes #10623 from gpoulin/map_partition_deps.
2016-01-06 21:34:46 -08:00
jerryshao 174e72ceca [SPARK-12673][UI] Add missing uri prepending for job description
Otherwise the URL fails to proxy to the right one in YARN mode. Here is the screenshot:

![screen shot 2016-01-06 at 5 28 26 pm](https://cloud.githubusercontent.com/assets/850797/12139632/bbe78ecc-b49c-11e5-8932-94e8b3622a09.png)

Author: jerryshao <sshao@hortonworks.com>

Closes #10618 from jerryshao/SPARK-12673.
2016-01-06 21:28:29 -08:00
Josh Rosen 8e19c7663a [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0
This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.

Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard to diagnose bugs.

For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
2016-01-06 20:50:31 -08:00
Nong Li a74d743cc7 [SPARK-12640][SQL] Add simple benchmarking utility class and add Parquet scan benchmarks.
We've run benchmarks ad hoc to measure the scanner performance. We will continue to invest in this
and it makes sense to get these benchmarks into code. This adds a simple benchmarking utility to do
this.

Author: Nong Li <nong@databricks.com>
Author: Nong <nongli@gmail.com>

Closes #10589 from nongli/spark-12640.
2016-01-06 19:20:43 -08:00
Sean Owen ac56cf605b [SPARK-12604][CORE] Java count(ApproxDistinct)ByKey methods return Scala Long not Java
Change the Java countByKey and countApproxDistinctByKey return types to use Java Long, not Scala Long; update similar methods for consistency to use java.lang.Long.valueOf, with no API change

Author: Sean Owen <sowen@cloudera.com>

Closes #10554 from srowen/SPARK-12604.
2016-01-06 17:17:32 -08:00
Kousuke Saruta 94c202c7d2 [SPARK-12665][CORE][GRAPHX] Remove Vector, VectorSuite and GraphKryoRegistrator which are deprecated and no longer used
The code in Vector.scala, VectorSuite.scala and GraphKryoRegistrator.scala is no longer used, so it's time to remove these files in Spark 2.0.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10613 from sarutak/SPARK-12665.
2016-01-06 10:19:41 -08:00
QiangCai 5d871ea43e [SPARK-12340][SQL] fix Int overflow in the SparkPlan.executeTake, RDD.take and AsyncRDDActions.takeAsync
I have closed pull request https://github.com/apache/spark/pull/10487 and created this pull request to resolve the problem.

spark jira
https://issues.apache.org/jira/browse/SPARK-12340

Author: QiangCai <david.caiq@gmail.com>

Closes #10562 from QiangCai/bugfix.
2016-01-06 18:13:07 +09:00
Marcelo Vanzin b3ba1be3b7 [SPARK-3873][TESTS] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10582 from vanzin/SPARK-3873-tests.
2016-01-05 19:07:39 -08:00
Marcelo Vanzin 7a375bb87a [SPARK-3873][CORE] Import ordering fixes.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10578 from vanzin/SPARK-3873-core.
2016-01-05 19:02:25 -08:00
Davies Liu 70fe6ce52f [SPARK-12659] fix NPE in UnsafeExternalSorter (used by cartesian product)
Cartesian product uses UnsafeExternalSorter without a comparator to do spilling, so it throws an NPE if spilling happens.

This bug was also hit by #10605.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #10606 from davies/fix_spilling.
2016-01-05 18:46:52 -08:00
Reynold Xin 8ce645d4ee [SPARK-12615] Remove some deprecated APIs in RDD/SparkContext
I looked at each case individually and it looks like they can all be removed. The only one that I had to think twice about was toArray (I even thought about un-deprecating it, until I realized it was a problem in Java to have toArray returning java.util.List).

Author: Reynold Xin <rxin@databricks.com>

Closes #10569 from rxin/SPARK-12615.
2016-01-05 11:10:14 -08:00
Kousuke Saruta 8eb2dc7133 [SPARK-12641] Remove unused code related to Hadoop 0.23
Currently we don't support Hadoop 0.23, but there is still some code related to it, so let's clean it up.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #10590 from sarutak/SPARK-12641.
2016-01-05 00:39:50 -08:00
Nong Li 8f659393b2 [SPARK-12486] Worker should kill the executors more forcefully if possible.
This patch updates the ExecutorRunner's terminate path to use the new Java 8 API
to terminate processes more forcefully if possible. If the executor is unhealthy,
it would previously ignore the destroy() call. Presumably, the new Java API was
added to handle cases like this.

We could update the termination path in the future to use OS-specific commands
for older Java versions.

Author: Nong Li <nong@databricks.com>

Closes #10438 from nongli/spark-12486-executors.
2016-01-04 10:37:56 -08:00
Sean Owen 15bd73627e [SPARK-12481][CORE][STREAMING][SQL] Remove usage of Hadoop deprecated APIs and reflection that supported 1.x
Remove use of deprecated Hadoop APIs now that 2.2+ is required

Author: Sean Owen <sowen@cloudera.com>

Closes #10446 from srowen/SPARK-12481.
2016-01-02 13:15:53 +00:00
Shixiong Zhu 4f5a24d7e7 [SPARK-7995][SPARK-6280][CORE] Remove AkkaRpcEnv and remove systemName from setupEndpointRef
### Remove AkkaRpcEnv

Keep `SparkEnv.actorSystem` because Streaming still uses it. Will remove it and AkkaUtils after refactoring Streaming actorStream API.

### Remove systemName
There are 2 places using `systemName`:
* `RpcEnvConfig.name`. Actually, although it's used as `systemName` in `AkkaRpcEnv`, `NettyRpcEnv` uses it as the service name to output the log `Successfully started service *** on port ***`. Since the service name in log is useful, I keep `RpcEnvConfig.name`.
* `def setupEndpointRef(systemName: String, address: RpcAddress, endpointName: String)`. Each `ActorSystem` has a `systemName`. Akka requires `systemName` in its URI and will refuse a connection if `systemName` is not matched. However, `NettyRpcEnv` doesn't use it. So we can remove `systemName` from `setupEndpointRef` since we are removing `AkkaRpcEnv`.

### Remove RpcEnv.uriOf

`uriOf` exists because Akka uses different URI formats with and without authentication, e.g., `akka.ssl.tcp...` and `akka.tcp://...`. But `NettyRpcEnv` uses the same format in both cases, so `uriOf` is not necessary after removing `AkkaRpcEnv`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10459 from zsxwing/remove-akka-rpc-env.
2015-12-31 00:15:55 -08:00
Reynold Xin be33a0cd3d [SPARK-12561] Remove JobLogger in Spark 2.0.
It was research code and has been deprecated since 1.0.0. No one really uses it since they can just use event logging.

Author: Reynold Xin <rxin@databricks.com>

Closes #10530 from rxin/SPARK-12561.
2015-12-30 18:28:08 -08:00
Reynold Xin ee8f8d3184 [SPARK-12588] Remove HttpBroadcast in Spark 2.0.
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.

Author: Reynold Xin <rxin@databricks.com>

Closes #10531 from rxin/SPARK-12588.
2015-12-30 18:07:07 -08:00
Carson Wang b244297966 [SPARK-12399] Display correct error message when accessing REST API with an unknown app Id
I got an exception when accessing the below REST API with an unknown application Id.
`http://<server-url>:18080/api/v1/applications/xxx/jobs`
Instead of an exception, I expect an error message "no such app: xxx" which is a similar error message when I access `/api/v1/applications/xxx`
```
org.spark-project.guava.util.concurrent.UncheckedExecutionException: java.util.NoSuchElementException: no app with key xxx
	at org.spark-project.guava.cache.LocalCache$Segment.get(LocalCache.java:2263)
	at org.spark-project.guava.cache.LocalCache.get(LocalCache.java:4000)
	at org.spark-project.guava.cache.LocalCache.getOrLoad(LocalCache.java:4004)
	at org.spark-project.guava.cache.LocalCache$LocalLoadingCache.get(LocalCache.java:4874)
	at org.apache.spark.deploy.history.HistoryServer.getSparkUI(HistoryServer.scala:116)
	at org.apache.spark.status.api.v1.UIRoot$class.withSparkUI(ApiRootResource.scala:226)
	at org.apache.spark.deploy.history.HistoryServer.withSparkUI(HistoryServer.scala:46)
	at org.apache.spark.status.api.v1.ApiRootResource.getJobs(ApiRootResource.scala:66)
```

Author: Carson Wang <carson.wang@intel.com>

Closes #10352 from carsonwang/unknownAppFix.
2015-12-30 13:49:10 -08:00
Neelesh Srinivas Salian 932cf44248 [SPARK-12263][DOCS] IllegalStateException: Memory can't be 0 for SPARK_WORKER_MEMORY without unit
Updated the Worker Unit IllegalStateException message to indicate no values less than 1MB instead of 0 to help solve this.
Requesting review

Author: Neelesh Srinivas Salian <nsalian@cloudera.com>

Closes #10483 from nssalian/SPARK-12263.
2015-12-30 11:14:13 +00:00
Shixiong Zhu 7ab0e2289d [SPARK-12490][CORE] Limit the css style scope to fix the Streaming UI
#10441 broke the Streaming UI because of the new CSS style.

<img width="503" alt="screen shot 2015-12-29 at 4 49 04 pm" src="https://cloud.githubusercontent.com/assets/1000778/12044763/1efce0fe-ae4c-11e5-9f8b-39df08426bf8.png">

This PR just added a class for the new style and only applied them to the paged tables.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10517 from zsxwing/fix-streaming-ui.
2015-12-29 19:54:10 -08:00
Josh Rosen 124a3a5e4e [SPARK-12490] Don't use Javascript for web UI's paginated table controls
The web UI's paginated table uses Javascript to implement certain navigation controls, such as table sorting and the "go to page" form. This is unnecessary and should be simplified to use plain HTML form controls and links.

/cc zsxwing, who wrote this original code, and yhuai.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10441 from JoshRosen/simplify-paginated-table-sorting.
2015-12-28 16:42:11 -08:00
Shixiong Zhu 710b411729 [SPARK-12489][CORE][SQL][MLIB] Fix minor issues found by FindBugs
Include the following changes:

1. Close `java.sql.Statement`
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10440 from zsxwing/findbugs.
2015-12-28 15:01:51 -08:00
Daoyuan Wang a6d385322e [SPARK-12222][CORE] Deserialize RoaringBitmap using Kryo serializer throw Buffer underflow exception
Since we only need to implement `def skipBytes(n: Int)`,
code in #10213 could be simplified.
davies scwf

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #10253 from adrian-wang/kryo.
2015-12-29 07:02:30 +09:00
Yaron Weinsberg 73b70f076d [SPARK-12517] add default RDD name for one created via sc.textFile
The feature was first added at commit: 7b877b2705 but was later removed (probably by mistake) at commit: fc8b58195a.
This change sets the default path of RDDs created via sc.textFile(...) to the path argument.

Here is the symptom:

* Using spark-1.5.2-bin-hadoop2.6:

scala> sc.textFile("/home/root/.bashrc").name
res5: String = null

scala> sc.binaryFiles("/home/root/.bashrc").name
res6: String = /home/root/.bashrc

* while using Spark 1.3.1:

scala> sc.textFile("/home/root/.bashrc").name
res0: String = /home/root/.bashrc

scala> sc.binaryFiles("/home/root/.bashrc").name
res1: String = /home/root/.bashrc

Author: Yaron Weinsberg <wyaron@gmail.com>
Author: yaron <yaron@il.ibm.com>

Closes #10456 from wyaron/master.
2015-12-29 05:19:11 +09:00
echo2mei 1e97813951 [SPARK-12396][CORE] Modify the function scheduleAtFixedRate to schedule.
Instead of just cancelling the registrationRetryTimer to stop the driver from retrying its connection to the master, change the function to schedule.
There is no need to register with the master repeatedly.

Author: echo2mei <534384876@qq.com>

Closes #10447 from echoTomei/master.
2015-12-25 17:42:24 -08:00
pierre-borckmans ea4aab7e87 [SPARK-12440][CORE] Avoid setCheckpoint warning when directory is not local
In SparkContext method `setCheckpointDir`, a warning is issued when spark master is not local and the passed directory for the checkpoint dir appears to be local.

In practice, when relying on HDFS configuration file and using a relative path for the checkpoint directory (using an incomplete URI without HDFS scheme, ...), this warning should not be issued and might be confusing.
In fact, in this case, the checkpoint directory is successfully created, and the checkpointing mechanism works as expected.

This PR uses the `FileSystem` instance created with the given directory, and checks whether it is local or not.
(The rationale is that since this same `FileSystem` instance is used to create the checkpoint dir anyway and can therefore be reliably used to determine if it is local or not).

The warning is only issued if the directory is not local, on top of the existing conditions.

Author: pierre-borckmans <pierre.borckmans@realimpactanalytics.com>

Closes #10392 from pierre-borckmans/SPARK-12440_CheckpointDir_Warning_NonLocal.
2015-12-24 13:48:21 +00:00
Kazuaki Ishizaki 3920466118 [SPARK-12311][CORE] Restore previous value of "os.arch" property in test suites after forcing to set specific value to "os.arch" property
Restore the original value of os.arch property after each test

Since some tests force a specific value for the os.arch property, we need to restore the original value afterwards.
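
The save-and-restore pattern can be sketched as follows. This is a Python illustration with a plain dictionary standing in for JVM system properties; `system_properties`, `overridden_property`, and the architecture values are hypothetical.

```python
from contextlib import contextmanager

# Hypothetical stand-in for the JVM system properties.
system_properties = {"os.arch": "amd64"}


@contextmanager
def overridden_property(props, key, value):
    missing = object()
    original = props.get(key, missing)
    props[key] = value
    try:
        yield
    finally:
        # Restore the previous state even if the test body raises.
        if original is missing:
            props.pop(key, None)
        else:
            props[key] = original


with overridden_property(system_properties, "os.arch", "ppc64"):
    arch_during_test = system_properties["os.arch"]

arch_after_test = system_properties["os.arch"]
print(arch_during_test, arch_after_test)
```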

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #10289 from kiszk/SPARK-12311.
2015-12-24 13:37:28 +00:00
Sean Owen ae1f54aa0e [SPARK-12500][CORE] Fix Tachyon deprecations; pull Tachyon dependency into one class
Fix Tachyon deprecations; pull Tachyon dependency into `TachyonBlockManager` only

CC calvinjia as I probably need a double-check that the usage of the new API is correct.

Author: Sean Owen <sowen@cloudera.com>

Closes #10449 from srowen/SPARK-12500.
2015-12-23 13:24:06 -08:00
Nong Li 575a132797 [SPARK-12471][CORE] Spark daemons will log their pid on start up.
Author: Nong Li <nong@databricks.com>

Closes #10422 from nongli/12471-pids.
2015-12-22 13:27:28 -08:00
Jacek Laskowski 7c970f9093 Minor corrections, i.e. typo fixes and follow deprecated
Author: Jacek Laskowski <jacek@japila.pl>

Closes #10432 from jaceklaskowski/minor-corrections.
2015-12-22 10:47:10 -08:00
Reynold Xin 0a38637d05 [SPARK-11807] Remove support for Hadoop < 2.2
i.e. Hadoop 1 and Hadoop 2.0

Author: Reynold Xin <rxin@databricks.com>

Closes #10404 from rxin/SPARK-11807.
2015-12-21 22:15:52 -08:00
Davies Liu 29cecd4a42 [SPARK-12388] change default compression to lz4
According to the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.

After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).

[1] https://github.com/ning/jvm-compressor-benchmark/wiki

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #10342 from davies/lz4.
2015-12-21 14:21:43 -08:00
Andrew Or d655d37ddf [SPARK-12466] Fix harmless NPE in tests
```
[info] ReplayListenerSuite:
[info] - Simple replay (58 milliseconds)
java.lang.NullPointerException
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:982)
	at org.apache.spark.deploy.master.Master$$anonfun$asyncRebuildSparkUI$1.applyOrElse(Master.scala:980)
```
https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-Master-SBT/4316/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/consoleFull

This was introduced in #10284. It's harmless because the NPE is caused by a race that occurs mainly in `local-cluster` tests (but doesn't actually fail the tests).

Tested locally to verify that the NPE is gone.

Author: Andrew Or <andrew@databricks.com>

Closes #10417 from andrewor14/fix-harmless-npe.
2015-12-21 14:09:04 -08:00
Reynold Xin a820ca19de [SPARK-2331] SparkContext.emptyRDD should return RDD[T] not EmptyRDD[T]
Author: Reynold Xin <rxin@databricks.com>

Closes #10394 from rxin/SPARK-2331.
2015-12-21 14:07:48 -08:00
Takeshi YAMAMURO 935f466306 [SPARK-12392][CORE] Optimize a location order of broadcast blocks by considering preferred local hosts
When multiple workers exist in a host, we can bypass unnecessary remote access for broadcasts; block managers fetch broadcast blocks from the same host instead of remote hosts.
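
One way to picture the reordering is a stable sort that moves same-host locations to the front. The Python sketch below is only an illustration: the `(host, executor_id)` tuples and the function name are assumptions, and the patch itself may order the remaining locations differently.

```python
def order_locations(locations, local_host):
    """Put block locations on local_host first, keeping the relative order otherwise."""
    # sorted() is stable, and False sorts before True,
    # so same-host entries move to the front.
    return sorted(locations, key=lambda loc: loc[0] != local_host)


locations = [("hostB", "1"), ("hostA", "2"), ("hostC", "3"), ("hostA", "4")]
print(order_locations(locations, "hostA"))
```

A fetcher that walks this list in order tries the workers on its own host before any remote host.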

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #10346 from maropu/OptimizeBlockLocationOrder.
2015-12-21 14:03:23 -08:00
gatorsmile 4883a5087d [SPARK-12374][SPARK-12150][SQL] Adding logical/physical operators for Range
Based on the suggestions from marmbrus, added logical/physical operators for Range to improve performance.

Also added another API to resolve the JIRA SPARK-12150.

Could you take a look at my implementation, marmbrus? If it is not good, I can rework it. : )

Thank you very much!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10335 from gatorsmile/rangeOperators.
2015-12-21 13:46:58 -08:00
Reynold Xin 284e29a870 [SPARK-11808] Remove Bagel.
Author: Reynold Xin <rxin@databricks.com>

Closes #10395 from rxin/SPARK-11808.
2015-12-19 22:40:35 -08:00
Reynold Xin f496031bd2 Bump master version to 2.0.0-SNAPSHOT.
Author: Reynold Xin <rxin@databricks.com>

Closes #10387 from rxin/version-bump.
2015-12-19 15:13:05 -08:00
Andrew Or a78a91f4d7 Revert "[SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode."
This reverts commit ad8c1f0b84.
2015-12-18 16:22:51 -08:00
Andrew Or 8a9417bc4b Revert "[SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server"
This reverts commit 8184568810.
2015-12-18 16:22:41 -08:00
Andrew Or 14be5dece2 Revert "[SPARK-12413] Fix Mesos ZK persistence"
This reverts commit 2bebaa39d9.
2015-12-18 16:22:33 -08:00
Luc Bourlier ba9332edd8 [SPARK-12345][CORE] Do not send SPARK_HOME through Spark submit REST interface
It is usually an invalid location on the remote machine executing the job.
It is picked up by the Mesos support in cluster mode, and most of the time causes
the job to fail.

Fixes SPARK-12345

Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #10329 from skyluc/issue/SPARK_HOME.
2015-12-18 16:21:01 -08:00
Shixiong Zhu 007a32f90a [SPARK-11097][CORE] Add channelActive callback to RpcHandler to monitor the new connections
Added `channelActive` to `RpcHandler` so that `NettyRpcHandler` doesn't need `clients` any more.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10301 from zsxwing/network-events.
2015-12-18 16:06:37 -08:00
Nong Li 0514e8d4b6 [SPARK-12411][CORE] Decrease executor heartbeat timeout to match heartbeat interval
Previously, the rpc timeout was the default network timeout, which is the same value
the driver uses to determine dead executors. This means if there is a network issue,
the executor is determined dead after one heartbeat attempt. There is a separate config
for the heartbeat interval which is a better value to use for the heartbeat RPC. With
this change, the executor will make multiple heartbeat attempts even with RPC issues.
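
A rough model of the effect, as a Python sketch: if every heartbeat RPC blocks for its full timeout, the number of attempts that fit before the driver gives up is about the driver timeout divided by the RPC timeout. The function name and the 120 s and 10 s values are illustrative assumptions, and scheduling delays are ignored.

```python
def attempts_before_declared_dead(driver_timeout_s, heartbeat_rpc_timeout_s):
    """Approximate number of heartbeat attempts that can time out before the driver gives up."""
    return max(1, driver_timeout_s // heartbeat_rpc_timeout_s)


# RPC timeout equal to the driver's dead-executor timeout: a single attempt.
print(attempts_before_declared_dead(120, 120))
# RPC timeout equal to a shorter heartbeat interval: several attempts.
print(attempts_before_declared_dead(120, 10))
```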

Author: Nong Li <nong@databricks.com>

Closes #10365 from nongli/spark-12411.
2015-12-18 16:05:18 -08:00
Grace 60da0e11f6 [SPARK-9552] Return "false" while nothing to kill in killExecutors
In the discussion on SPARK-9552, we proposed a force kill in `killExecutors`. But if there is nothing to kill, it returns true (acknowledgement). As a result, executors that are not eligible to be killed are added to the pendingToRemove list for further action.

In this patch, we'd like to change the return semantics. If there is nothing to kill, we return "false", and therefore those non-eligible executors won't be added to the pendingToRemove list.
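
The changed return semantics can be sketched like this. It is a Python illustration under stated assumptions: the function names, the use of a `busy` set as the eligibility rule, and the caller logic are hypothetical, not the Spark implementation.

```python
def kill_executors(executor_ids, busy):
    """Return True only if at least one requested executor is eligible to be killed."""
    to_kill = [e for e in executor_ids if e not in busy]
    return len(to_kill) > 0


def remove_executor(executor_id, busy, pending_to_remove):
    acknowledged = kill_executors([executor_id], busy)
    if acknowledged:
        # Only executors that were really scheduled for removal are tracked.
        pending_to_remove.add(executor_id)
    return acknowledged


pending = set()
print(remove_executor("exec-1", {"exec-1"}, pending), pending)
print(remove_executor("exec-2", {"exec-1"}, pending), pending)
```

Under the old semantics the first call would have been acknowledged, leaving `exec-1` in the pending set even though nothing was killed.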

vanzin andrewor14 As the follow up of PR#7888, please let me know your comments.

Author: Grace <jie.huang@intel.com>
Author: Jie Huang <hjie@fosun.com>
Author: Andrew Or <andrew@databricks.com>

Closes #9796 from GraceH/emptyPendingToRemove.
2015-12-18 16:04:42 -08:00
Marcelo Vanzin 2782818287 [SPARK-12350][CORE] Don't log errors when requested stream is not found.
If a client requests a non-existent stream, just send a failure message
back, without logging any error on the server side (since it's not a
server error).

On the executor side, avoid error logs by translating any errors during
transfer to a `ClassNotFoundException`, so that loading the class is
retried on the parent class loader. This can mask IO errors during
transmission, but the most common cause is that the class is not
served by the remote end.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10337 from vanzin/SPARK-12350.
2015-12-18 09:49:08 -08:00
Michael Gummelt 2bebaa39d9 [SPARK-12413] Fix Mesos ZK persistence
I believe this fixes SPARK-12413.  I'm currently running an integration test to verify.

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #10366 from mgummelt/fix-zk-mesos.
2015-12-18 20:18:00 +09:00
Jeff Zhang 40e52a27c7 [CORE][TESTS] minor fix of JavaSerializerSuite
No JIRA is created.
The original test passed because the class cast is lazy (it only happens when the object's method is invoked).

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10371 from zjffdu/minor_fix.
2015-12-18 00:49:56 -08:00
Iulian Dragos 8184568810 [SPARK-12345][MESOS] Properly filter out SPARK_HOME in the Mesos REST server
Fix problem with #10332, this one should fix Cluster mode on Mesos

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #10359 from dragos/issue/fix-spark-12345-one-more-time.
2015-12-18 03:19:31 +09:00
Shixiong Zhu 86e405f357 [SPARK-12220][CORE] Make Utils.fetchFile support files that contain special characters
This PR encodes and decodes the file name to fix the issue.
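
The round trip can be illustrated with percent-encoding. This is a Python sketch using `urllib.parse`. The helper names and the sample file name are hypothetical, and the exact encoding used by the patch may differ.

```python
from urllib.parse import quote, unquote


def encode_file_name(name):
    # Percent-encode every reserved character so the name is safe inside a URI.
    return quote(name, safe="")


def decode_file_name(encoded):
    return unquote(encoded)


original = "my file#1?.jar"
encoded = encode_file_name(original)
print(encoded)
print(decode_file_name(encoded))
```

Characters such as spaces, `#`, and `?` would otherwise be interpreted as URI syntax rather than as part of the file name.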

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10208 from zsxwing/uri.
2015-12-17 09:55:37 -08:00
Davies Liu cd3d937b0c Revert "Once driver register successfully, stop it to connect to master."
This reverts commit 5a514b61bb.
2015-12-17 08:01:27 -08:00
echo2mei 5a514b61bb Once driver register successfully, stop it to connect to master.
This commit is to resolve SPARK-12396.

Author: echo2mei <534384876@qq.com>

Closes #10354 from echoTomei/master.
2015-12-17 07:59:17 -08:00
Andrew Or 97678edeaa [SPARK-12390] Clean up unused serializer parameter in BlockManager
No change in functionality is intended. This only changes internal API.

Author: Andrew Or <andrew@databricks.com>

Closes #10343 from andrewor14/clean-bm-serializer.
2015-12-16 20:01:47 -08:00
Marcelo Vanzin d1508dd9b7 [SPARK-12386][CORE] Fix NPE when spark.executor.port is set.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #10339 from vanzin/SPARK-12386.
2015-12-16 19:47:49 -08:00
Rohit Agarwal fdb3822756 [SPARK-12186][WEB UI] Send the complete request URI including the query string when redirecting.
Author: Rohit Agarwal <rohita@qubole.com>

Closes #10180 from mindprince/SPARK-12186.
2015-12-16 19:04:33 -08:00
tedyu f590178d7a [SPARK-12365][CORE] Use ShutdownHookManager where Runtime.getRuntime.addShutdownHook() is called
SPARK-9886 fixed ExternalBlockStore.scala

This PR fixes the remaining references to Runtime.getRuntime.addShutdownHook()

Author: tedyu <yuzhihong@gmail.com>

Closes #10325 from ted-yu/master.
2015-12-16 19:02:12 -08:00
Imran Rashid 38d9795a4f [SPARK-10248][CORE] track exceptions in dagscheduler event loop in tests
`DAGSchedulerEventLoop` normally only logs errors (so it can continue to process more events, from other jobs).  However, this is not desirable in the tests -- the tests should be able to easily detect any exception, and also shouldn't silently succeed if there is an exception.

This was suggested by mateiz on https://github.com/apache/spark/pull/7699.  It may have already turned up an issue in "zero split job".

Author: Imran Rashid <irashid@cloudera.com>

Closes #8466 from squito/SPARK-10248.
2015-12-16 19:01:05 -08:00
Andrew Or 861549acdb [MINOR] Add missing interpolation in NettyRPCEnv
```
Exception in thread "main" org.apache.spark.rpc.RpcTimeoutException:
Cannot receive any reply in ${timeout.duration}. This timeout is controlled by spark.rpc.askTimeout
	at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:48)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:63)
	at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
	at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:33)
```

Author: Andrew Or <andrew@databricks.com>

Closes #10334 from andrewor14/rpc-typo.
2015-12-16 16:13:48 -08:00
Timothy Chen ad8c1f0b84 [SPARK-12345][MESOS] Filter SPARK_HOME when submitting Spark jobs with Mesos cluster mode.
SPARK_HOME is now causing problems with Mesos cluster mode, since the spark-submit script was recently changed so that, when running spark-class scripts, looking in SPARK_HOME takes precedence if it's defined.

We should skip passing SPARK_HOME from the Spark client in cluster mode with Mesos, since Mesos shouldn't use this configuration but should use spark.executor.home instead.

Author: Timothy Chen <tnachen@gmail.com>

Closes #10332 from tnachen/scheduler_ui.
2015-12-16 10:54:15 -08:00
Bryan Cutler c5b6b398d5 [SPARK-12062][CORE] Change Master to async rebuild UI when application completes
This change builds the event history of completed apps asynchronously, so that the RPC thread is not blocked and new workers can still register/remove even if the event log history is very large and takes a long time to rebuild.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #10284 from BryanCutler/async-MasterUI-SPARK-12062.
2015-12-15 18:28:16 -08:00
Naveen 8a215d2338 [SPARK-9886][CORE] Fix to use ShutdownHookManager in
ExternalBlockStore.scala

Author: Naveen <naveenminchu@gmail.com>

Closes #10313 from naveenminchu/branch-fix-SPARK-9886.
2015-12-15 18:25:22 -08:00
jerryshao 63ccdef813 [SPARK-10123][DEPLOY] Support specifying deploy mode from configuration
Please help to review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #10195 from jerryshao/SPARK-10123.
2015-12-15 18:24:23 -08:00
Richard W. Eggert II 765a488494 [SPARK-9026][SPARK-4514] Modifications to JobWaiter, FutureAction, and AsyncRDDActions to support non-blocking operation
These changes rework the implementations of `SimpleFutureAction`, `ComplexFutureAction`, `JobWaiter`, and `AsyncRDDActions` such that asynchronous callbacks on the generated `Futures` NEVER block waiting for a job to complete. A small amount of mutex synchronization is necessary to protect the internal fields that manage cancellation, but these locks are only held very briefly and in practice should almost never cause any blocking to occur. The existing blocking APIs of these classes are retained, but they simply delegate to the underlying non-blocking API and `Await` the results with indefinite timeouts.
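
The general shape of this design, a non-blocking completion signal with a blocking API layered on top, can be sketched in Python with `concurrent.futures.Future`. The class and method names are hypothetical and this is not the Scala code from the pull request.

```python
import threading
from concurrent.futures import Future


class JobWaiterSketch:
    """Hypothetical waiter that completes a future when all tasks finish."""

    def __init__(self, total_tasks):
        self._remaining = total_tasks
        self._lock = threading.Lock()  # held only briefly to update the counter
        self.completion = Future()
        if total_tasks == 0:
            self.completion.set_result(None)

    def task_succeeded(self):
        with self._lock:
            self._remaining -= 1
            finished = self._remaining == 0
        if finished:
            # Callbacks registered on the future run now; nothing blocks waiting.
            self.completion.set_result(None)

    def await_result(self):
        # The blocking API simply delegates to the non-blocking future.
        return self.completion.result()


events = []
waiter = JobWaiterSketch(2)
waiter.completion.add_done_callback(lambda _f: events.append("job finished"))
waiter.task_succeeded()
events_after_first = list(events)
waiter.task_succeeded()
waiter.await_result()
print(events_after_first, events)
```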

Associated JIRA ticket: https://issues.apache.org/jira/browse/SPARK-9026
Also fixes: https://issues.apache.org/jira/browse/SPARK-4514

This pull request contains all my own original work, which I release to the Spark project under its open source license.

Author: Richard W. Eggert II <richard.eggert@gmail.com>

Closes #9264 from reggert/fix-futureaction.
2015-12-15 18:22:58 -08:00
CodingCat a63d9edcfb [SPARK-9516][UI] Improvement of Thread Dump Page
https://issues.apache.org/jira/browse/SPARK-9516

- [x] new look of Thread Dump Page

- [x] click column title to sort

- [x] grep

- [x] search as you type

squito JoshRosen It's ready for the review now

Author: CodingCat <zhunansjtu@gmail.com>

Closes #7910 from CodingCat/SPARK-9516.
2015-12-15 18:21:00 -08:00
Lianhui Wang 369127f032 [SPARK-12130] Replace shuffleManagerClass with shortShuffleMgrNames in ExternalShuffleBlockResolver
Replace shuffleManagerClassName with shortShuffleMgrName to reduce the time spent on string comparison, and put the sort comparison first. cc JoshRosen andrewor14

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #10131 from lianhuiwang/spark-12130.
2015-12-15 18:17:48 -08:00
Holden Karau c59df8c516 [SPARK-12332][TRIVIAL][TEST] Fix minor typo in ResetSystemProperties
Fix a minor typo (unbalanced bracket) in ResetSystemProperties.

Author: Holden Karau <holden@us.ibm.com>

Closes #10303 from holdenk/SPARK-12332-trivial-typo-in-ResetSystemProperties-comment.
2015-12-15 11:38:57 +00:00
Shixiong Zhu 2aecda284e [SPARK-12281][CORE] Fix a race condition when reporting ExecutorState in the shutdown hook
1. Make sure workers and masters exit so that no worker or master will still be running when triggering the shutdown hook.
2. Set ExecutorState to FAILED if it's still RUNNING when executing the shutdown hook.

This should fix the potential exceptions when exiting a local cluster
```
java.lang.AssertionError: assertion failed: executor 4 state transfer from RUNNING to RUNNING is illegal
	at scala.Predef$.assert(Predef.scala:179)
	at org.apache.spark.deploy.master.Master$$anonfun$receive$1.applyOrElse(Master.scala:260)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)

java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
	at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:246)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:191)
	at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:180)
	at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:73)
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:474)
	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
```

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10269 from zsxwing/executor-state.
2015-12-13 22:06:39 -08:00
Shixiong Zhu 8af2f8c61a [SPARK-12267][CORE] Store the remote RpcEnv address to send the correct disconnection message
Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10261 from zsxwing/SPARK-12267.
2015-12-12 21:58:55 -08:00
Andrew Or 5030923ea8 [SPARK-12155][SPARK-12253] Fix executor OOM in unified memory management
**Problem.** In unified memory management, acquiring execution memory may lead to eviction of storage memory. However, the space freed from evicting cached blocks is distributed among all active tasks. Thus, an incorrect upper bound on the execution memory per task can cause the acquisition to fail, leading to OOM's and premature spills.

**Example.** Suppose total memory is 1000B, cached blocks occupy 900B, `spark.memory.storageFraction` is 0.4, and there are two active tasks. In this case, the cap on task execution memory is 100B / 2 = 50B. If task A tries to acquire 200B, it will evict 100B of storage but can only acquire 50B because of the incorrect cap. For another example, see this [regression test](https://github.com/andrewor14/spark/blob/fix-oom/core/src/test/scala/org/apache/spark/memory/UnifiedMemoryManagerSuite.scala#L233) that I stole from JoshRosen.

**Solution.** Fix the cap on task execution memory. It should take into account the space that could have been freed by storage in addition to the current amount of memory available to execution. In the example above, the correct cap should have been 600B / 2 = 300B.
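
The arithmetic in the example can be checked with a small sketch. The object and function names are hypothetical, and the model is simplified (no execution memory in use yet, no per-task minimum); it is not the actual `UnifiedMemoryManager` code.

```scala
object ExecutionCapSketch {
  // Memory that storage is guaranteed to keep (immune to eviction by execution).
  def storageRegion(total: Long, storageFraction: Double): Long =
    (total * storageFraction).toLong

  // Old (incorrect) cap: only the currently free memory is shared among tasks.
  def oldCapPerTask(total: Long, storageUsed: Long, numTasks: Int): Long =
    (total - storageUsed) / numTasks

  // Fixed cap: also count the storage memory that execution could evict.
  def fixedCapPerTask(
      total: Long,
      storageUsed: Long,
      storageFraction: Double,
      numTasks: Int): Long = {
    val evictable = math.max(storageUsed - storageRegion(total, storageFraction), 0L)
    val free = total - storageUsed
    (free + evictable) / numTasks
  }

  def main(args: Array[String]): Unit = {
    // Numbers from the example: 1000B total, 900B cached, fraction 0.4, two tasks.
    println(oldCapPerTask(1000L, 900L, 2))
    println(fixedCapPerTask(1000L, 900L, 0.4, 2))
  }
}
```

With the numbers above, the old cap is 100B / 2 and the fixed cap is (100B free + 500B evictable) / 2.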

This patch also guards against the race condition (SPARK-12253):
(1) Existing tasks collectively occupy all execution memory
(2) New task comes in and blocks while existing tasks spill
(3) After tasks finish spilling, another task jumps in and puts in a large block, stealing the freed memory
(4) New task still cannot acquire memory and goes back to sleep

Author: Andrew Or <andrew@databricks.com>

Closes #10240 from andrewor14/fix-oom.
2015-12-10 15:30:08 -08:00
Josh Rosen 23a9e62bad [SPARK-12251] Document and improve off-heap memory configurations
This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.

- Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
- Deprecate `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix.
- Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion.
- Document these configurations on the configuration page.
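
The validation described in the third bullet could look roughly like the sketch below. It works on a plain `Map` rather than a real `SparkConf`, assumes the size is given as a plain byte count, and the object name and error message are hypothetical.

```scala
object OffHeapConfigCheck {
  // Fail fast if off-heap memory is enabled but no size has been configured.
  def validate(conf: Map[String, String]): Unit = {
    val enabled = conf.getOrElse("spark.memory.offHeap.enabled", "false").toBoolean
    // Assumption: the size is a plain number of bytes (no "1g"-style suffixes).
    val size = conf.getOrElse("spark.memory.offHeap.size", "0").toLong
    require(
      !enabled || size > 0,
      "spark.memory.offHeap.size must be > 0 when spark.memory.offHeap.enabled == true")
  }

  def main(args: Array[String]): Unit = {
    // Passes: off-heap enabled with a positive size.
    validate(Map(
      "spark.memory.offHeap.enabled" -> "true",
      "spark.memory.offHeap.size" -> "1048576"))
    // Would throw IllegalArgumentException: enabled without a size.
    // validate(Map("spark.memory.offHeap.enabled" -> "true"))
  }
}
```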

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10237 from JoshRosen/SPARK-12251.
2015-12-10 15:29:04 -08:00
Marcelo Vanzin 4a46b8859d [SPARK-11563][CORE][REPL] Use RpcEnv to transfer REPL-generated classes.
This avoids bringing up yet another HTTP server on the driver, and
instead reuses the file server already managed by the driver's
RpcEnv. As a bonus, the repl now inherits the security features of
the network library.

There's also a small change to create the directory for storing classes
under the root temp dir for the application (instead of directly
under java.io.tmpdir).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9923 from vanzin/SPARK-11563.
2015-12-10 13:26:30 -08:00
Andrew Or 8770bd1213 [SPARK-12165][ADDENDUM] Fix outdated comments on unroll test
JoshRosen

Author: Andrew Or <andrew@databricks.com>

Closes #10229 from andrewor14/unroll-test-comments.
2015-12-09 17:24:04 -08:00
Sean Owen 1eb7c22ce7 [SPARK-11824][WEBUI] WebUI does not render descriptions with 'bad' HTML, throws console error
Don't warn when description isn't valid HTML since it may properly be like "SELECT ... where foo <= 1"

The tests for this code indicate that it's normal to handle strings like this, which don't contain HTML, as plain strings rather than markup. Logging every such instance as a warning is therefore too noisy, since it's not a problem. This is an issue for stages whose names contain SQL like the above.

CC tdas as author of this bit of code

Author: Sean Owen <sowen@cloudera.com>

Closes #10159 from srowen/SPARK-11824.
2015-12-09 19:47:38 +00:00
Josh Rosen aec5ea000e [SPARK-12165][SPARK-12189] Fix bugs in eviction of storage memory by execution
This patch fixes a bug in the eviction of storage memory by execution.

## The bug:

In general, execution should be able to evict storage memory when the total storage memory usage is greater than `maxMemory * spark.memory.storageFraction`. Due to a bug, however, Spark might wind up evicting no storage memory in certain cases where the storage memory usage was between `maxMemory * spark.memory.storageFraction` and `maxMemory`. For example, here is a regression test which illustrates the bug:

```scala
    val maxMemory = 1000L
    val taskAttemptId = 0L
    val (mm, ms) = makeThings(maxMemory)
    // Since we used the default storage fraction (0.5), we should be able to allocate 500 bytes
    // of storage memory which are immune to eviction by execution memory pressure.

    // Acquire enough storage memory to exceed the storage region size
    assert(mm.acquireStorageMemory(dummyBlock, 750L, evictedBlocks))
    assertEvictBlocksToFreeSpaceNotCalled(ms)
    assert(mm.executionMemoryUsed === 0L)
    assert(mm.storageMemoryUsed === 750L)

    // At this point, storage is using 250 more bytes of memory than it is guaranteed, so execution
    // should be able to reclaim up to 250 bytes of storage memory.
    // Therefore, execution should now be able to acquire up to 500 bytes of memory:
    assert(mm.acquireExecutionMemory(500L, taskAttemptId, MemoryMode.ON_HEAP) === 500L) // <--- fails by only returning 250L
    assert(mm.storageMemoryUsed === 500L)
    assert(mm.executionMemoryUsed === 500L)
    assertEvictBlocksToFreeSpaceCalled(ms, 250L)
```

The problem relates to the control flow / interaction between `StorageMemoryPool.shrinkPoolToReclaimSpace()` and `MemoryStore.ensureFreeSpace()`. While trying to allocate the 500 bytes of execution memory, the `UnifiedMemoryManager` discovers that it will need to reclaim 250 bytes of memory from storage, so it calls `StorageMemoryPool.shrinkPoolToReclaimSpace(250L)`. This method, in turn, calls `MemoryStore.ensureFreeSpace(250L)`. However, `ensureFreeSpace()` first checks whether the requested space is less than `maxStorageMemory - storageMemoryUsed`, which will be true if there is any free execution memory because it turns out that `MemoryStore.maxStorageMemory = (maxMemory - onHeapExecutionMemoryPool.memoryUsed)` when the `UnifiedMemoryManager` is used.
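
To make the faulty check concrete, here is a simplified model of it using the numbers from the regression test. The object and function names are hypothetical and the non-strict comparison is an assumption; this is not the actual `MemoryStore` code.

```scala
object EnsureFreeSpaceSketch {
  // Under the UnifiedMemoryManager, the store's view of its maximum is
  // everything that execution is not currently using.
  def maxStorageMemory(maxMemory: Long, executionUsed: Long): Long =
    maxMemory - executionUsed

  // Pre-fix behavior (simplified): skip eviction whenever the requested space
  // appears to fit into the store's "free" space.
  def oldCheckSkipsEviction(
      maxMemory: Long,
      executionUsed: Long,
      storageUsed: Long,
      requested: Long): Boolean = {
    val apparentFree = maxStorageMemory(maxMemory, executionUsed) - storageUsed
    requested <= apparentFree
  }

  def main(args: Array[String]): Unit = {
    // maxMemory = 1000, no execution memory used yet, storage holds 750 bytes,
    // and execution asks storage to give back 250 bytes.
    println(oldCheckSkipsEviction(1000L, 0L, 750L, 250L))
  }
}
```

In this model the check reports that no eviction is needed, so nothing is reclaimed from storage even though storage sits 250 bytes above its guaranteed region.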

The control flow here is somewhat confusing (it grew to be messy / confusing over time / as a result of the merging / refactoring of several components). In the pre-Spark 1.6 code, `ensureFreeSpace` was called directly by the `MemoryStore` itself, whereas in 1.6 it's involved in a confusing control flow where `MemoryStore` calls `MemoryManager.acquireStorageMemory`, which then calls back into `MemoryStore.ensureFreeSpace`, which, in turn, calls `MemoryManager.freeStorageMemory`.

## The solution:

The solution implemented in this patch is to remove the confusing circular control flow between `MemoryManager` and `MemoryStore`, making the storage memory acquisition process much more linear / straightforward. The key changes:

- Remove a layer of inheritance which made the memory manager code harder to understand (53841174760a24a0df3eb1562af1f33dbe340eb9).
- Move some bounds checks earlier in the call chain (13ba7ada77f87ef1ec362aec35c89a924e6987cb).
- Refactor `ensureFreeSpace()` so that the part which evicts blocks can be called independently from the part which checks whether there is enough free space to avoid eviction (7c68ca09cb1b12f157400866983f753ac863380e).
- Realize that this lets us remove a layer of overloads from `ensureFreeSpace` (eec4f6c87423d5e482b710e098486b3bbc4daf06).
- Realize that `ensureFreeSpace()` can simply be replaced with an `evictBlocksToFreeSpace()` method which is called [after we've already figured out](2dc842aea8/core/src/main/scala/org/apache/spark/memory/StorageMemoryPool.scala (L88)) how much memory needs to be reclaimed via eviction (2dc842aea82c8895125d46a00aa43dfb0d121de9).

Along the way, I fixed some problems with the mocks in `MemoryManagerSuite`: the old mocks would [unconditionally](80a824d36e/core/src/test/scala/org/apache/spark/memory/MemoryManagerSuite.scala (L84)) report that a block had been evicted even if there was enough space in the storage pool such that eviction would be avoided.

I also fixed a problem where `StorageMemoryPool._memoryUsed` might become negative due to freed memory being double-counted when execution evicts storage. The problem was that `StorageMemoryPool.shrinkPoolToFreeSpace` would [decrement `_memoryUsed`](7c68ca09cb (diff-935c68a9803be144ed7bafdd2f756a0fL133)) even though `StorageMemoryPool.freeMemory` had already decremented it as each evicted block was freed. See SPARK-12189 for details.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Andrew Or <andrew@databricks.com>

Closes #10170 from JoshRosen/SPARK-12165.
2015-12-09 11:39:59 -08:00
jerryshao 6900f01737 [SPARK-10582][YARN][CORE] Fix AM failure situation for dynamic allocation
Because of an AM failure, the target executor number on the driver and on the AM can diverge, which leads to unexpected behavior in dynamic allocation. So when the AM re-registers with the driver, the state in `ExecutorAllocationManager` and `CoarseGrainedSchedulerBackend` should be reset.

This issue was originally addressed in #8737 and is re-opened here. Thanks a lot to KaiXinXiaoLei for finding this issue.

andrewor14 and vanzin would you please help to review this, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9963 from jerryshao/SPARK-10582.
2015-12-09 09:52:03 -08:00
uncleGen a113216865 [SPARK-12031][CORE][BUG] Integer overflow when sampling
Author: uncleGen <hustyugm@gmail.com>

Closes #10023 from uncleGen/1.6-bugfix.
2015-12-09 15:09:40 +00:00
Fei Wang 3934562d34 [SPARK-12222] [CORE] Deserializing RoaringBitmap with the Kryo serializer throws a Buffer underflow exception
Jira: https://issues.apache.org/jira/browse/SPARK-12222

Deserializing a RoaringBitmap with the Kryo serializer throws a Buffer underflow exception:
```
com.esotericsoftware.kryo.KryoException: Buffer underflow.
	at com.esotericsoftware.kryo.io.Input.require(Input.java:156)
	at com.esotericsoftware.kryo.io.Input.skip(Input.java:131)
	at com.esotericsoftware.kryo.io.Input.skip(Input.java:264)
```

This is caused by a bug in Kryo's `Input.skip(long count)` (https://github.com/EsotericSoftware/kryo/issues/119), which we call in `KryoInputDataInputBridge`.

Instead of upgrading Kryo, this PR bypasses Kryo's `Input.skip(long count)` by directly calling the other `skip` method in Kryo's Input.java (https://github.com/EsotericSoftware/kryo/blob/kryo-2.21/src/com/esotericsoftware/kryo/io/Input.java#L124), i.e. it implements the bug-fixed version of `Input.skip(long count)` in `KryoInputDataInputBridge`'s `skipBytes` method.

For more detail, see https://github.com/apache/spark/pull/9748#issuecomment-162860246

Author: Fei Wang <wangfei1@huawei.com>

Closes #10213 from scwf/patch-1.
2015-12-08 21:32:31 -08:00
Andrew Or 9494521695 [SPARK-12187] *MemoryPool classes should not be fully public
This patch tightens them to `private[memory]`.

Author: Andrew Or <andrew@databricks.com>

Closes #10182 from andrewor14/memory-visibility.
2015-12-08 14:34:15 -08:00
tedyu 75c60bf4ba [SPARK-12074] Avoid memory copy involving ByteBuffer.wrap(ByteArrayOutputStream.toByteArray)
SPARK-12060 fixed JavaSerializerInstance.serialize
This PR applies the same technique on two other classes.

zsxwing

Author: tedyu <yuzhihong@gmail.com>

Closes #10177 from tedyu/master.
2015-12-08 10:01:44 -08:00
Xin Ren 6cb06e8711 [SPARK-11155][WEB UI] Stage summary json should include stage duration
The JSON endpoint for stages doesn't include the stage duration information that is present in the UI. This looks like a simple oversight; the metrics should be included, e.g. at api/v1/applications/<appId>/stages.

Metrics I've added are: submissionTime, firstTaskLaunchedTime and completionTime

Author: Xin Ren <iamshrek@126.com>

Closes #10107 from keypointt/SPARK-11155.
2015-12-08 11:46:46 -06:00
Shixiong Zhu 3f4efb5c23 [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize
Merged #10051 again since #10083 is resolved.

This reverts commit 328b757d5d.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10167 from zsxwing/merge-SPARK-12060.
2015-12-07 12:01:09 -08:00
Shixiong Zhu 3af53e61fd [SPARK-12084][CORE] Fix codes that uses ByteBuffer.array incorrectly
`ByteBuffer` doesn't guarantee that all contents of `ByteBuffer.array` are valid, e.g. for a `ByteBuffer` returned by `ByteBuffer.slice`. We should not use the whole content of the array returned by `ByteBuffer.array` unless we know that's correct.

This patch fixed all places that use `ByteBuffer.array` incorrectly.
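
A small illustration of the pitfall, using only `java.nio.ByteBuffer`. The object name and the `validBytes` helper are hypothetical and only show one safe way to read the valid range; they are not the code from this patch.

```scala
import java.nio.ByteBuffer

object ByteBufferArraySketch {
  // Copy only the readable bytes (from position up to limit). Working on a
  // duplicate leaves the caller's position and limit untouched.
  def validBytes(buffer: ByteBuffer): Array[Byte] = {
    val out = new Array[Byte](buffer.remaining())
    buffer.duplicate().get(out)
    out
  }

  def main(args: Array[String]): Unit = {
    val whole = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4, 5, 6))
    whole.position(2)
    val slice = whole.slice() // a view over the last four bytes only

    // The slice still shares the full six-byte backing array.
    println(slice.array().length)
    println(slice.arrayOffset())
    println(validBytes(slice).toList)
  }
}
```

Code that reads `slice.array()` directly would also see the two bytes before `arrayOffset()`, which do not belong to the slice.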

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10083 from zsxwing/bytebuffer-array.
2015-12-04 17:02:04 -08:00
rotems f30373f5ee [SPARK-12080][CORE] Kryo - Support multiple user registrators
Author: rotems <roter>

Closes #10078 from Botnaim/KryoMultipleCustomRegistrators.
2015-12-04 16:58:34 -08:00
meiyoula bbfc16ec9d [SPARK-12142][CORE] Reply false when container allocator is not ready and reset target
With dynamic allocation, when a new AM is starting and `ExecutorAllocationManager` sends a RequestExecutor message to the AM, the whole app will hang if the container allocator is not ready.

Author: meiyoula <1039320815@qq.com>

Closes #10138 from XuTingjun/patch-1.
2015-12-04 16:50:43 -08:00
Josh Rosen b7204e1d41 [SPARK-12112][BUILD] Upgrade to SBT 0.13.9
We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).

I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
2015-12-05 08:15:30 +08:00
Dmitry Erastov d0d8222778 [SPARK-6990][BUILD] Add Java linting script; fix minor warnings
This replaces https://github.com/apache/spark/pull/9696

Invoke Checkstyle and print any errors to the console, failing the step.
Use Google's style rules modified according to
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide
Some important checks are disabled (see TODOs in `checkstyle.xml`) due to
multiple violations being present in the codebase.

Suggest fixing those TODOs in a separate PR(s).

More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).

Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I ran the build twice with different profiles):

> Checkstyle checks failed at following occurrences:
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
> [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
> [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1

Also fix some of the minor violations that didn't require sweeping changes.

Apologies for the previous botched PRs - I finally figured out the issue.

cr: JoshRosen, pwendell

> I state that the contribution is my original work, and I license the work to the project under the project's open source license.

Author: Dmitry Erastov <derastov@gmail.com>

Closes #9867 from dskrvk/master.
2015-12-04 12:03:45 -08:00
Nong 95296d9b1a [SPARK-12089] [SQL] Fix memory corrupt due to freeing a page being referenced
When the spillable sort iterator was spilled, it was mistakenly keeping
the last page in memory rather than the current page. This causes the
current record to get corrupted.

Author: Nong <nong@cloudera.com>

Closes #10142 from nongli/spark-12089.
2015-12-04 10:01:20 -08:00
Carson Wang b6e9963ee4 [SPARK-11206] Support SQL UI on the history server (resubmit)
Resubmit #9297 and #9991
On the live web UI, there is a SQL tab which provides valuable information for the SQL query. But once the workload is finished, we won't see the SQL tab on the history server. It will be helpful if we support SQL UI on the history server so we can analyze it even after its execution.

To support SQL UI on the history server:
1. I added an onOtherEvent method to the SparkListener trait and post all SQL related events to the same event bus.
2. Two SQL events SparkListenerSQLExecutionStart and SparkListenerSQLExecutionEnd are defined in the sql module.
3. The new SQL events are written to event log using Jackson.
4. A new trait SparkHistoryListenerFactory is added to allow the history server to feed events to the SQL history listener. The SQL implementation is loaded at runtime using java.util.ServiceLoader.

Author: Carson Wang <carson.wang@intel.com>

Closes #10061 from carsonwang/SqlHistoryUI.
2015-12-03 16:39:12 -08:00
Anderson de Andrade f434f36d50 [SPARK-12056][CORE] Create a TaskAttemptContext only after calling setConf.
TaskAttemptContext's constructor clones the configuration instead of referencing it. If setConf is called after the TaskAttemptContext is created, any changes that setConf makes to the configuration are invisible to RecordReader instances.

As an example, Titan's InputFormat will change conf when calling setConf. They wrap their InputFormat around Cassandra's ColumnFamilyInputFormat, and append Cassandra's configuration. This change fixes the following error when using Titan's CassandraInputFormat with Spark:

*java.lang.RuntimeException: org.apache.thrift.protocol.TProtocolException: Required field 'keyspace' was not present! Struct: set_keyspace_args(keyspace:null)*

There's a discussion of this error here: https://groups.google.com/forum/#!topic/aureliusgraphs/4zpwyrYbGAE

Author: Anderson de Andrade <adeandrade@verticalscope.com>

Closes #10046 from adeandrade/newhadooprdd-fix.
2015-12-03 16:37:00 -08:00
Andrew Or 688e521c28 [SPARK-12108] Make event logs smaller
**Problem.** Event logs in 1.6 were much bigger than 1.5. I ran page rank and the event log size in 1.6 was almost 5x that in 1.5. I did a bisect to find that the RDD callsite added in #9398 is largely responsible for this.

**Solution.** This patch removes the long form of the callsite (which is not used!) from the event log. This reduces the size of the event log significantly.

*Note on compatibility*: if this patch is merged into 1.6.0, it won't break any compatibility. Otherwise, if it is merged into 1.6.1, we might need to add more backward compatibility handling logic (which does not exist yet).

Author: Andrew Or <andrew@databricks.com>

Closes #10115 from andrewor14/smaller-event-logs.
2015-12-03 11:09:29 -08:00
Shixiong Zhu 649be4fa45 [SPARK-12101][CORE] Fix thread pools that cannot cache tasks in Worker and AppClient
`SynchronousQueue` cannot cache any task. This issue is similar to #9978. It's an easy fix. Just use the fixed `ThreadUtils.newDaemonCachedThreadPool`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10108 from zsxwing/fix-threadpool.
2015-12-03 11:06:25 -08:00
jerryshao 7bc9e1db2c [SPARK-12059][CORE] Avoid assertion error when unexpected state transition met in Master
Downgrade to warning log for unexpected state transition.

andrewor14 please review, thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #10091 from jerryshao/SPARK-12059.
2015-12-03 11:05:12 -08:00
Steve Loughran 8fa3e474a8 [SPARK-11314][YARN] add service API and test service for Yarn Cluster schedulers
This is purely the yarn/src/main and yarn/src/test bits of the YARN ATS integration: the extension model to load and run implementations of `SchedulerExtensionService` in the yarn cluster scheduler process, and to stop them afterwards.

There's duplication between the two schedulers, yarn-client and yarn-cluster, at least in terms of setting everything up, because the common superclass, `YarnSchedulerBackend` is in spark-core, and the extension services need the YARN app/attempt IDs.

If you look at how the extension services are loaded, the case class `SchedulerExtensionServiceBinding` is used to pass in config info (currently just the spark context and the yarn IDs, of which one, the attemptID, will be null when running client-side). I'm passing in a case class to ensure that it would be possible in future to add extra arguments to the binding class, yet, as the method signature will not have changed, still be able to load existing services.

There's no functional extension service here, just one for testing. The real tests come in the bigger pull requests. At the same time, there's no restriction of this extension service purely to the ATS history publisher. Anything else that wants to listen to the spark context and publish events could use this, and I'd also consider writing one for the YARN-913 registry service, so that the URLs of the web UI would be locatable through that (low priority; would make more sense if integrated with a REST client).

There's no minicluster test. Given the test execution overhead of setting up minicluster tests, it'd probably be better to add an extension service into one of the existing tests.

Author: Steve Loughran <stevel@hortonworks.com>

Closes #9182 from steveloughran/stevel/feature/SPARK-1537-service.
2015-12-03 10:33:06 -08:00
Josh Rosen ae40253373 [SPARK-12082][FLAKY-TEST] Increase timeouts in NettyBlockTransferSecuritySuite
We should try increasing a timeout in NettyBlockTransferSecuritySuite in order to reduce that suite's flakiness in Jenkins.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10113 from JoshRosen/SPARK-12082.
2015-12-03 11:12:02 +08:00
Jeroen Schot 128c29035b [SPARK-3580][CORE] Add Consistent Method To Get Number of RDD Partitions Across Different Languages
I have tried to address all the comments in pull request https://github.com/apache/spark/pull/2447.

Note that the second commit (using the new method in all internal code of all components) is quite intrusive and could be omitted.

Author: Jeroen Schot <jeroen.schot@surfsara.nl>

Closes #9767 from schot/master.
2015-12-02 09:40:07 +00:00
Andrew Or d96f8c997b [SPARK-12081] Make unified memory manager work with small heaps
The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g. default 1GB leaves only 250MB system memory. This is especially a problem in local mode, where the driver and executor are crammed in the same JVM. Members of the community have reported driver OOM's in such cases.

**New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
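
The calculation can be written down as a short sketch. The object, constant, and function names are hypothetical; only the 300MB reservation and the 0.75 fraction come from the description above.

```scala
object SmallHeapSketch {
  // Amount set aside for the system before the fraction is applied.
  val ReservedMb: Long = 300L

  // Memory shared by execution and storage under the new proposal.
  def unifiedMemoryMb(heapMb: Long, memoryFraction: Double): Long =
    ((heapMb - ReservedMb) * memoryFraction).toLong

  def main(args: Array[String]): Unit = {
    // 1GB heap with the default spark.memory.fraction of 0.75.
    println(unifiedMemoryMb(1024L, 0.75))
  }
}
```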

Author: Andrew Or <andrew@databricks.com>

Closes #10081 from andrewor14/unified-memory-small-heaps.
2015-12-01 19:51:12 -08:00
Andrew Or 1ce4adf55b [SPARK-8414] Ensure context cleaner periodic cleanups
Garbage collection triggers cleanups. If the driver JVM is huge and there is little memory pressure, we may never clean up shuffle files on executors. This is a problem for long-running applications (e.g. streaming).

Author: Andrew Or <andrew@databricks.com>

Closes #10070 from andrewor14/periodic-gc.
2015-12-01 19:36:34 -08:00
Shixiong Zhu 328b757d5d Revert "[SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize"
This reverts commit 1401166576.
2015-12-01 15:13:10 -08:00
Tathagata Das 60b541ee1b [SPARK-12004] Preserve the RDD partitioner through RDD checkpointing
The solution is to save the RDD partitioner in a separate file in the RDD checkpoint directory, that is, `<checkpoint dir>/_partitioner`. In most cases, whether or not the RDD partitioner was recovered does not affect correctness; failing to recover it only reduces performance. So this solution makes a best-effort attempt to save and recover the partitioner. If either fails, the checkpointing is not affected. This makes this patch safe and backward compatible.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9983 from tdas/SPARK-12004.
2015-12-01 14:08:36 -08:00
Josh Rosen 34e7093c11 [SPARK-12065] Upgrade Tachyon from 0.8.1 to 0.8.2
This commit upgrades the Tachyon dependency from 0.8.1 to 0.8.2.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #10054 from JoshRosen/upgrade-to-tachyon-0.8.2.
2015-12-01 11:49:20 -08:00
woj-i 6a8cf80cc8 [SPARK-11821] Propagate Kerberos keytab for all environments
andrewor14 the same PR as in branch 1.5
harishreedharan

Author: woj-i <wojciechindyk@gmail.com>

Closes #9859 from woj-i/master.
2015-12-01 11:05:45 -08:00
Cheng Lian 69dbe6b40d [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues
This PR backports PR #10039 to master

Author: Cheng Lian <lian@databricks.com>

Closes #10063 from liancheng/spark-12046.doc-fix.master.
2015-12-01 10:21:31 -08:00
Shixiong Zhu 1401166576 [SPARK-12060][CORE] Avoid memory copy in JavaSerializerInstance.serialize
`JavaSerializerInstance.serialize` uses `ByteArrayOutputStream.toByteArray` to get the serialized data. `ByteArrayOutputStream.toByteArray` needs to copy the content in the internal array to a new array. However, since the array will be converted to `ByteBuffer` at once, we can avoid the memory copy.

This PR added `ByteBufferOutputStream` to access the protected `buf` and convert it to a `ByteBuffer` directly.
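
A minimal sketch of the idea, assuming a `toByteBuffer` method name (the exact API added by the PR may differ). It relies on the protected `buf` and `count` fields that `java.io.ByteArrayOutputStream` exposes to subclasses.

```scala
import java.io.ByteArrayOutputStream
import java.nio.ByteBuffer

// Subclassing gives access to the protected `buf` and `count` fields.
class ByteBufferOutputStream extends ByteArrayOutputStream {
  // Wrap the internal array without copying it. Only the first `count`
  // bytes are valid, so the buffer's limit is set accordingly.
  def toByteBuffer: ByteBuffer = ByteBuffer.wrap(buf, 0, count)
}

object ByteBufferOutputStreamDemo {
  def main(args: Array[String]): Unit = {
    val out = new ByteBufferOutputStream
    out.write(Array[Byte](10, 20, 30))
    val bytes = out.toByteBuffer
    println(bytes.remaining())
  }
}
```

Because the returned buffer shares the stream's internal array, later writes to the stream may be visible through it or may reallocate the array, so the buffer should be taken only once writing is finished.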

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10051 from zsxwing/SPARK-12060.
2015-12-01 09:45:55 -08:00
Sean Owen 96bf468c78 [SPARK-12049][CORE] User JVM shutdown hook can cause deadlock at shutdown
Avoid potential deadlock with a user app's shutdown hook thread by more narrowly synchronizing access to 'hooks'

Author: Sean Owen <sowen@cloudera.com>

Closes #10042 from srowen/SPARK-12049.
2015-11-30 17:33:09 -08:00
Marcelo Vanzin 9bf2120672 [SPARK-12007][NETWORK] Avoid copies in the network lib's RPC layer.
This change seems large, but most of it is just replacing `byte[]`
with `ByteBuffer` and `new byte[]` with `ByteBuffer.allocate()`,
since it changes the network library's API.

The following are parts of the code that actually have meaningful
changes:

- The Message implementations were changed to inherit from a new
  AbstractMessage that can optionally hold a reference to a body
  (in the form of a ManagedBuffer); this is similar to how
  ResponseWithBody worked before, except now it's not restricted
  to just responses.

- The TransportFrameDecoder was pretty much rewritten to avoid
  copies as much as possible; it doesn't rely on CompositeByteBuf
  to accumulate incoming data anymore, since CompositeByteBuf
  has issues when slices are retained. The code now is able to
  create frames without having to resort to copying bytes except
  for a few bytes (containing the frame length) in very rare cases.

- Some minor changes in the SASL layer to convert things back to
  `byte[]` since the JDK SASL API operates on those.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9987 from vanzin/SPARK-12007.
2015-11-30 17:22:05 -08:00
CodingCat 0a46e43772 [SPARK-12037][CORE] initialize heartbeatReceiverRef before calling startDriverHeartbeat
https://issues.apache.org/jira/browse/SPARK-12037

a simple fix by changing the order of the statements

Author: CodingCat <zhunansjtu@gmail.com>

Closes #10032 from CodingCat/SPARK-12037.
2015-11-30 17:19:26 -08:00
Teng Qiu a8ceec5e8c [SPARK-12053][CORE] EventLoggingListener.getLogPath needs 4 parameters
`EventLoggingListener.getLogPath` needs 4 input arguments:
https://github.com/apache/spark/blob/v1.6.0-preview2/core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala#L276-L280

the 3rd parameter should be appAttemptId, 4th parameter is codec...

Author: Teng Qiu <teng.qiu@gmail.com>

Closes #10044 from chutium/SPARK-12053.
2015-12-01 07:27:32 +09:00
Josh Rosen 2c5dee0fb8 Revert "[SPARK-11206] Support SQL UI on the history server"
This reverts commit cc243a079b / PR #9297

I'm reverting this because it broke SQLListenerMemoryLeakSuite in the master Maven builds.

See #9991 for a discussion of why this broke the tests.
2015-11-30 13:42:35 -08:00
Davies Liu 8df584b020 [SPARK-11982] [SQL] improve performance of cartesian product
This PR improves the performance of CartesianProduct by caching the result of the right plan.

After this patch, the query time of TPC-DS Q65 goes down from 28 minutes to 4 seconds (420X faster).

cc nongli

Author: Davies Liu <davies@databricks.com>

Closes #9969 from davies/improve_cartesian.
2015-11-30 11:54:18 -08:00
Wieland Hoffmann 26c3581f17 [DOC] Explicitly state that top maintains the order of elements
Top is implemented in terms of takeOrdered, which already maintains the
order, so top should, too.

Author: Wieland Hoffmann <themineo@gmail.com>

Closes #10013 from mineo/top-order.
2015-11-30 09:32:48 +00:00
toddwan e074944205 [SPARK-11859][MESOS] SparkContext accepts invalid Master URLs in the form zk://host:port for a multi-master Mesos cluster using ZooKeeper
* According to the doc below and the validation logic in [SparkSubmit.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L231), the master URL for a Mesos cluster should always start with `mesos://`

http://spark.apache.org/docs/latest/running-on-mesos.html
`The Master URLs for Mesos are in the form mesos://host:5050 for a single-master Mesos cluster, or mesos://zk://host:2181 for a multi-master Mesos cluster using ZooKeeper.`

* However, the validation in [SparkContext.scala](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L2749) is too lenient and accepts master URLs in the form `zk://host:port`

* For master URLs in the form `zk://host:port`, the valid form should be `mesos://zk://host:port`

* This PR restricts the validation in `SparkContext.scala`, so that only Mesos master URLs prefixed with `mesos://` are accepted.

* This PR also updated corresponding unit test.
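
The stricter rule can be expressed as a one-line check. The sketch below uses hypothetical names and is not the pattern matching used in `SparkContext.scala`.

```scala
object MesosMasterUrlSketch {
  // Accept only master URLs that carry the mesos:// prefix, including the
  // ZooKeeper form mesos://zk://host:port.
  def isValidMesosMaster(master: String): Boolean =
    master.startsWith("mesos://")

  def main(args: Array[String]): Unit = {
    println(isValidMesosMaster("mesos://host:5050"))
    println(isValidMesosMaster("mesos://zk://host:2181"))
    println(isValidMesosMaster("zk://host:2181"))
  }
}
```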

Author: toddwan <tawan0109@outlook.com>

Closes #9886 from toddwan/S11859.
2015-11-30 09:26:29 +00:00
Shixiong Zhu 0c1e72e7f7 [SPARK-11996][CORE] Make the executor thread dump work again
In the previous implementation, the driver needs to know the executor listening address to send the thread dump request. However, in Netty RPC, the executor doesn't listen to any port, so the executor thread dump feature is broken.

This patch makes the driver use the endpointRef stored in BlockManagerMasterEndpoint to send the thread dump request to fix it.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9976 from zsxwing/executor-thread-dump.
2015-11-26 18:56:22 -08:00
Shixiong Zhu d3ef693325 [SPARK-11999][CORE] Fix the issue that ThreadUtils.newDaemonCachedThreadPool doesn't cache any task
In the previous code, `newDaemonCachedThreadPool` used `SynchronousQueue`, which is wrong. `SynchronousQueue` is an empty queue that cannot cache any task. This patch uses `LinkedBlockingQueue` to fix it, along with other fixes to make sure `newDaemonCachedThreadPool` uses at most `maxThreadNumber` threads and, after that, caches tasks in the `LinkedBlockingQueue`.
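
A rough sketch of the queueing behaviour described above, assuming the pool is built directly on `java.util.concurrent.ThreadPoolExecutor`. The names `CachedPoolSketch` and `newBoundedCachedPool` are made up for illustration, and thread naming and daemon flags are left out; the real helper may differ.

```scala
import java.util.concurrent.{LinkedBlockingQueue, ThreadPoolExecutor, TimeUnit}

// Hypothetical stand-in for the helper discussed above.
object CachedPoolSketch {
  def newBoundedCachedPool(maxThreadNumber: Int, keepAliveSeconds: Long = 60L): ThreadPoolExecutor = {
    val pool = new ThreadPoolExecutor(
      maxThreadNumber,                     // corePoolSize: threads created before tasks are queued
      maxThreadNumber,                     // maximumPoolSize: same value, the unbounded queue absorbs the rest
      keepAliveSeconds,
      TimeUnit.SECONDS,
      new LinkedBlockingQueue[Runnable]()) // holds tasks once all threads are busy
    // Let idle core threads time out so an unused pool shrinks to zero threads.
    pool.allowCoreThreadTimeOut(true)
    pool
  }
}
```

With a `SynchronousQueue` there is no capacity at all, so a task is either handed to a thread immediately or rejected; the unbounded `LinkedBlockingQueue` is what lets extra tasks wait.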

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9978 from zsxwing/cached-threadpool.
2015-11-25 23:31:21 -08:00
Carson Wang cc243a079b [SPARK-11206] Support SQL UI on the history server
On the live web UI, there is a SQL tab which provides valuable information for the SQL query. But once the workload is finished, we won't see the SQL tab on the history server. It will be helpful if we support SQL UI on the history server so we can analyze it even after its execution.

To support SQL UI on the history server:
1. I added an `onOtherEvent` method to the `SparkListener` trait and post all SQL related events to the same event bus.
2. Two SQL events `SparkListenerSQLExecutionStart` and `SparkListenerSQLExecutionEnd` are defined in the sql module.
3. The new SQL events are written to event log using Jackson.
4. A new trait `SparkHistoryListenerFactory` is added to allow the history server to feed events to the SQL history listener. The SQL implementation is loaded at runtime using `java.util.ServiceLoader`.
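
The first two steps above can be pictured with the simplified sketch below. All types here (`ListenerEvent`, `SQLExecutionStart`, `SQLExecutionEnd`, `Listener`, `SqlHistoryListener`) are hypothetical stand-ins rather than the real Spark classes, and the Jackson serialization and `ServiceLoader` steps are not shown.

```scala
import scala.collection.mutable

// Simplified, hypothetical stand-ins for the Spark types named above.
trait ListenerEvent
case class SQLExecutionStart(executionId: Long, description: String) extends ListenerEvent
case class SQLExecutionEnd(executionId: Long) extends ListenerEvent

trait Listener {
  // Catch-all hook: events without a dedicated callback are delivered here.
  def onOtherEvent(event: ListenerEvent): Unit = {}
}

class SqlHistoryListener extends Listener {
  // Ids of SQL executions that have started but not yet ended.
  val running: mutable.Set[Long] = mutable.Set[Long]()

  override def onOtherEvent(event: ListenerEvent): Unit = event match {
    case SQLExecutionStart(id, _) => running += id
    case SQLExecutionEnd(id)      => running -= id
    case _                        => // ignore events defined by other modules
  }
}
```

The point of a catch-all callback is that a module can define its own event types without the core listener trait having to know about them.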

Author: Carson Wang <carson.wang@intel.com>

Closes #9297 from carsonwang/SqlHistoryUI.
2015-11-25 15:13:13 -08:00
Marcelo Vanzin 4e81783e92 [SPARK-11866][NETWORK][CORE] Make sure timed out RPCs are cleaned up.
This change does a couple of different things to make sure that the RpcEnv-level
code and the network library agree about the status of outstanding RPCs.

For RPCs that do not expect a reply ("RpcEnv.send"), support for one way
messages (hello CORBA!) was added to the network layer. This is a
"fire and forget" message that does not require any state to be kept
by the TransportClient; as a result, the RpcEnv 'Ack' message is not needed
anymore.

For RPCs that do expect a reply ("RpcEnv.ask"), the network library now
returns the internal RPC id; if the RpcEnv layer decides to time out the
RPC before the network layer does, it now asks the TransportClient to
forget about the RPC, so that if the network-level timeout occurs, the
client is not killed.

As part of implementing the above, I cleaned up some of the code in the
netty rpc backend, removing types that were not necessary and factoring
out some common code. Of interest is a slight change in the exceptions
when posting messages to a stopped RpcEnv; that's mostly to avoid nasty
error messages from the local-cluster backend when shutting down, which
pollutes the terminal output.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9917 from vanzin/SPARK-11866.
2015-11-25 12:58:18 -08:00
jerryshao 88875d9413 [SPARK-10558][CORE] Fix wrong executor state in Master
`ExecutorAdded` can only be sent to `AppClient` when the worker reports back the executor state as `LOADING`; otherwise, because of a concurrency issue, `AppClient` may receive `ExecutorAdded` first and then `ExecutorStateUpdated` with the `LOADING` state.

Also, Master changes the executor state from `LAUNCHING` to `RUNNING` (when `AppClient` reports back the state as `RUNNING`) and then to `LOADING` (when the worker reports back the state as `LOADING`); it should be `LAUNCHING` -> `LOADING` -> `RUNNING`.

The state is also wrongly shown in the master UI; the state of the executor should be `RUNNING` rather than `LOADING`:

![screen shot 2015-09-11 at 2 30 28 pm](https://cloud.githubusercontent.com/assets/850797/9809254/3155d840-5899-11e5-8cdf-ad06fef75762.png)

Author: jerryshao <sshao@hortonworks.com>

Closes #8714 from jerryshao/SPARK-10558.
2015-11-25 11:42:53 -08:00
Alex Bozarth 83653ac5e7 [SPARK-10864][WEB UI] app name is hidden if window is resized
Currently the Web UI navbar has a minimum width of 1200px; so if a window is resized smaller than that the app name goes off screen. The 1200px width seems to have been chosen since it fits the longest example app name without wrapping.

To work with smaller window widths I made the tabs wrap since it looked better than wrapping the app name. This is a distinct change in how the navbar looks and I'm not sure if it's what we actually want to do.

Other notes:
- min-width set to 600px to keep the tabs from wrapping individually (will need to be adjusted if tabs are added)
- app name will also wrap (making three levels) if a really really long app name is used

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #9874 from ajbozarth/spark10864.
2015-11-25 11:39:00 -08:00
Zhongshuai Pei 6b781576a1 [SPARK-11974][CORE] Not all the temp dirs had been deleted when the JVM exits
The temp dirs were being deleted in a way like this:

```

scala> import scala.collection.mutable
import scala.collection.mutable

scala> val a = mutable.Set(1,2,3,4,7,0,8,98,9)
a: scala.collection.mutable.Set[Int] = Set(0, 9, 1, 2, 3, 7, 4, 8, 98)

scala> a.foreach(x => {a.remove(x) })

scala> a.foreach(println(_))
98
```

You may not modify a collection while traversing or iterating over it. Doing so cannot delete all elements of the collection.
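
One common way around this, shown here only as a sketch and not necessarily what the patch does, is to iterate over a snapshot of the collection and mutate the original. The names `SnapshotRemoval` and `removeAll` are hypothetical.

```scala
import scala.collection.mutable

object SnapshotRemoval {
  // Removes every element by iterating over an immutable copy,
  // so the set is never mutated while it is being traversed.
  def removeAll[A](set: mutable.Set[A]): Unit =
    set.toList.foreach(x => set.remove(x))
}
```

Applied to the set from the REPL session above, `SnapshotRemoval.removeAll(a)` leaves `a` empty, because the traversal happens over the copy returned by `toList`.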

Author: Zhongshuai Pei <peizhongshuai@huawei.com>

Closes #9951 from DoingDone9/Bug_RemainDir.
2015-11-25 10:37:34 -08:00
Marcelo Vanzin c1f85fc71e [SPARK-11956][CORE] Fix a few bugs in network lib-based file transfer.
- NettyRpcEnv::openStream() now correctly propagates errors to
  the read side of the pipe.
- NettyStreamManager now throws if the file being transferred does
  not exist.
- The network library now correctly handles zero-sized streams.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9941 from vanzin/SPARK-11956.
2015-11-25 09:47:20 -08:00
Mark Hamstra 0a5aef753e [SPARK-10666][SPARK-6880][CORE] Use properties from ActiveJob associated with a Stage
This issue was addressed in https://github.com/apache/spark/pull/5494, but the fix in that PR, while safe in the sense that it will prevent the SparkContext from shutting down, misses the actual bug.  The intent of `submitMissingTasks` should be understood as "submit the Tasks that are missing for the Stage, and run them as part of the ActiveJob identified by jobId".  Because of a long-standing bug, the `jobId` parameter was never being used.  Instead, we were trying to use the jobId with which the Stage was created -- which may no longer exist as an ActiveJob, hence the crash reported in SPARK-6880.

The correct fix is to use the ActiveJob specified by the supplied jobId parameter, which is guaranteed to exist at the call sites of submitMissingTasks.

This fix should be applied to all maintenance branches, since it has existed since 1.0.

kayousterhout pankajarora12

Author: Mark Hamstra <markhamstra@gmail.com>
Author: Imran Rashid <irashid@cloudera.com>

Closes #6291 from markhamstra/SPARK-6880.
2015-11-25 09:34:34 -06:00
Ashwin Swaroop 6385002657 [SPARK-11686][CORE] Issue WARN when dynamic allocation is disabled due to spark.dynamicAllocation.enabled and spark.executor.instances both set
Changed the log type to a 'warning' instead of 'info' as required.

Author: Ashwin Swaroop <Ashwin Swaroop>

Closes #9926 from ashwinswaroop/master.
2015-11-25 13:41:14 +00:00
Davies Liu 58d9b26055 [SPARK-11805] free the array in UnsafeExternalSorter during spilling
After calling spill() on SortedIterator, the array inside InMemorySorter is not needed and should be freed during spilling. This could help when joining multiple tables with limited memory.

Author: Davies Liu <davies@databricks.com>

Closes #9793 from davies/free_array.
2015-11-24 14:33:28 -08:00
Marcelo Vanzin e6dd237463 [SPARK-11929][CORE] Make the repl log4j configuration override the root logger.
In the default Spark distribution, there are currently two separate
log4j config files, with different default values for the root logger,
so that when running the shell you have a different default log level.
This makes the shell more usable, since the logs don't overwhelm the
output.

But if you install a custom log4j.properties, you lose that, because
then it's going to be used no matter whether you're running a regular
app or the shell.

With this change, the overriding of the log level is done differently;
the log level of the repl's main class (org.apache.spark.repl.Main) is used
to define the root logger's level when running the shell, defaulting
to WARN if it's not set explicitly.

In a somewhat related change, the shell output about the "sc" variable
was changed a bit to contain a little more useful information about
the application, since when the root logger's log level is WARN, that
information is never shown to the user.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9816 from vanzin/shell-logging.
2015-11-24 15:08:02 -06:00
Reynold Xin f315272279 [SPARK-11946][SQL] Audit pivot API for 1.6.
Currently pivot's signature looks like

```scala
@scala.annotation.varargs
def pivot(pivotColumn: Column, values: Column*): GroupedData

@scala.annotation.varargs
def pivot(pivotColumn: String, values: Any*): GroupedData
```

I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It'd also be clearer if the values were not varargs, but rather Seq or java.util.List.

I also made similar changes for Python.

Author: Reynold Xin <rxin@databricks.com>

Closes #9929 from rxin/SPARK-11946.
2015-11-24 12:54:37 -08:00
tedyu 81012546ee [SPARK-11872] Prevent the call to SparkContext#stop() in the listener bus's thread
This is a continuation of SPARK-11761

Andrew suggested adding this protection. See tail of https://github.com/apache/spark/pull/9741

Author: tedyu <yuzhihong@gmail.com>

Closes #9852 from tedyu/master.
2015-11-24 12:22:33 -08:00
Forest Fang 800bd799ac [SPARK-11906][WEB UI] Speculation Tasks Cause ProgressBar UI Overflow
When there are speculative tasks in the stage, the running progress bar could overflow and become hidden on a new line:
![image](https://cloud.githubusercontent.com/assets/4317392/11326841/5fd3482e-9142-11e5-8ca5-cb2f0c0c8964.png)
3 completed / 2 running (including 1 speculative) out of 4 total tasks

This is a simple fix by capping the started tasks at `total - completed` tasks
![image](https://cloud.githubusercontent.com/assets/4317392/11326842/6bb67260-9142-11e5-90f0-37f9174878ec.png)

I should note my preferred way to fix it is via css style
```css
.progress { display: flex; }
```
which shifts the correction burden from the driver to the web browser. However, I couldn't get the selenium test to measure the position/dimension of the progress bar correctly to get this unit tested.

It also has the side effect that the width will be calibrated so the running occupies 2 / 5 instead of 1 / 4.
![image](https://cloud.githubusercontent.com/assets/4317392/11326848/7b03e9f0-9142-11e5-89ad-bd99cb0647cf.png)

All in all, since this cosmetic bug is minor enough, I suppose the original simple fix should be good enough.
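
The capping arithmetic from the simple fix can be sketched as follows. The object `ProgressBarSketch` and its method names are hypothetical and are not taken from the UI code; the sketch assumes `total` is greater than zero.

```scala
object ProgressBarSketch {
  // Caps the number of started (running) tasks so that
  // completed + started never exceeds total.
  def cappedStarted(started: Int, completed: Int, total: Int): Int =
    math.min(started, total - completed)

  // Percentage widths of the completed and running segments of the bar.
  // Assumes total > 0.
  def widths(started: Int, completed: Int, total: Int): (Double, Double) = {
    val capped = cappedStarted(started, completed, total)
    (completed.toDouble / total * 100, capped.toDouble / total * 100)
  }
}
```

For the case quoted above (3 completed, 2 running, 4 total), the running count is capped at 1, so the two segments take 75% and 25% of the bar instead of adding up to more than 100%.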

Author: Forest Fang <forest.fang@outlook.com>

Closes #9896 from saurfang/progressbar.
2015-11-24 09:03:32 +00:00
Reynold Xin 8d57524662 [SPARK-11933][SQL] Rename mapGroup -> mapGroups and flatMapGroup -> flatMapGroups.
Based on feedback from Matei, this is more consistent with mapPartitions in Spark.

Also addresses some of the cleanups from a previous commit that renames the type variables.

Author: Reynold Xin <rxin@databricks.com>

Closes #9919 from rxin/SPARK-11933.
2015-11-23 22:22:15 -08:00
Marcelo Vanzin c2467dadae [SPARK-11140][CORE] Transfer files using network lib when using NettyRpcEnv.
This change abstracts the code that serves jars / files to executors so that
each RpcEnv can have its own implementation; the akka version uses the existing
HTTP-based file serving mechanism, while the netty version uses the new
stream support added to the network lib, which makes file transfers benefit
from the easier security configuration of the network library, and should also
reduce overhead overall.

The change includes a small fix to TransportChannelHandler so that it propagates
user events to downstream handlers.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9530 from vanzin/SPARK-11140.
2015-11-23 13:54:19 -08:00
Reynold Xin ff442bbcff [SPARK-11899][SQL] API audit for GroupedDataset.
1. Renamed map to mapGroup, flatMap to flatMapGroup.
2. Renamed asKey -> keyAs.
3. Added more documentation.
4. Changed type parameter T to V on GroupedDataset.
5. Added since versions for all functions.

Author: Reynold Xin <rxin@databricks.com>

Closes #9880 from rxin/SPARK-11899.
2015-11-21 15:00:37 -08:00
Nong Li 58b4e4f88a [SPARK-11787][SPARK-11883][SQL][FOLLOW-UP] Cleanup for this patch.
This mainly moves SqlNewHadoopRDD to the sql package. There is some state that is
shared between core and I've left that in core. This allows some other associated
minor cleanup.

Author: Nong Li <nong@databricks.com>

Closes #9845 from nongli/spark-11787.
2015-11-20 15:30:53 -08:00
Josh Rosen 89fd9bd061 [SPARK-11887] Close PersistenceEngine at the end of PersistenceEngineSuite tests
In PersistenceEngineSuite, we do not call `close()` on the PersistenceEngine at the end of the test. For the ZooKeeperPersistenceEngine, this causes us to leak a ZooKeeper client, causing the logs of unrelated tests to be periodically spammed with connection error messages from that client:

```
15/11/20 05:13:35.789 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) INFO ClientCnxn: Opening socket connection to server localhost/127.0.0.1:15741. Will not attempt to authenticate using SASL (unknown error)
15/11/20 05:13:35.790 pool-1-thread-1-ScalaTest-running-PersistenceEngineSuite-SendThread(localhost:15741) WARN ClientCnxn: Session 0x15124ff48dd0000 for server null, unexpected error, closing socket connection and attempting reconnect
java.net.ConnectException: Connection refused
	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
	at org.apache.zookeeper.ClientCnxnSocketNIO.doTransport(ClientCnxnSocketNIO.java:350)
	at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1068)
```

This patch fixes this by using a `finally` block.
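
The general pattern looks roughly like the sketch below. `CloseInFinally` and `withResource` are hypothetical names, and the sketch uses `AutoCloseable` as an assumption in place of the PersistenceEngine API.

```scala
object CloseInFinally {
  // Runs `body` against `resource` and always closes the resource afterwards,
  // even if `body` throws (for example, on a failed test assertion).
  def withResource[R <: AutoCloseable, T](resource: R)(body: R => T): T =
    try body(resource) finally resource.close()
}
```

Because the `close()` call sits in `finally`, a failing test body can no longer leave a client running in the background.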

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9864 from JoshRosen/close-zookeeper-client-in-tests.
2015-11-20 14:31:26 -08:00
Josh Rosen 652def318e [SPARK-11650] Reduce RPC timeouts to speed up slow AkkaUtilsSuite test
This patch reduces some RPC timeouts in order to speed up the slow "AkkaUtilsSuite.remote fetch ssl on - untrusted server", which used to take two minutes to run.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9869 from JoshRosen/SPARK-11650.
2015-11-20 13:17:35 -08:00
Tathagata Das b2cecb80ec [SPARK-11845][STREAMING][TEST] Added unit test to verify TrackStateRDD is correctly checkpointed
To make sure that all lineage is correctly truncated for TrackStateRDD when checkpointed.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9831 from tdas/SPARK-11845.
2015-11-19 16:50:08 -08:00
Marcelo Vanzin 880128f37e [SPARK-4134][CORE] Lower severity of some executor loss logs.
Don't log ERROR messages when executors are explicitly killed or when
the exit reason is not yet known.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9780 from vanzin/SPARK-11789.
2015-11-19 16:49:18 -08:00
hushan 01403aa97b [SPARK-11746][CORE] Use cache-aware method dependencies
a small change

Author: hushan <hushan@xiaomi.com>

Closes #9691 from suyanNone/unify-getDependency.
2015-11-19 14:56:00 -08:00
Marcelo Vanzin f7135ed719 [SPARK-11828][CORE] Register DAGScheduler metrics source after app id is known.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9820 from vanzin/SPARK-11828.
2015-11-19 14:53:58 -08:00
Srinivasa Reddy Vundela 3bd77b213a [SPARK-11799][CORE] Make it explicit in executor logs that uncaught e…
…xceptions are thrown during executor shutdown

This commit makes sure that uncaught exceptions are prepended with [Container in shutdown] when the JVM is shutting down.

Author: Srinivasa Reddy Vundela <vsr@cloudera.com>

Closes #9809 from vundela/master_11799.
2015-11-19 14:51:40 -08:00
Shixiong Zhu 90d384dcbc [SPARK-11831][CORE][TESTS] Use port 0 to avoid port conflicts in tests
Use port 0 to fix port-contention-related flakiness
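
As a small illustration of the idea, using a plain `java.net.ServerSocket` rather than any Spark server class: binding to port 0 lets the operating system pick a free port, which is then read back from the socket. The names `EphemeralPort` and `bindToFreePort` are hypothetical.

```scala
import java.net.ServerSocket

object EphemeralPort {
  // Binding to port 0 asks the OS for any free port;
  // the port actually chosen is available via getLocalPort.
  def bindToFreePort(): ServerSocket = new ServerSocket(0)
}
```

A test would then pass `socket.getLocalPort` to its client instead of a hard-coded number, so two tests running at the same time cannot collide on a fixed port.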

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9841 from zsxwing/SPARK-11831.
2015-11-19 14:49:25 -08:00
zsxwing 72d150c271 [SPARK-11830][CORE] Make NettyRpcEnv bind to the specified host
This PR includes the following changes:

1. Bind NettyRpcEnv to the specified host
2. Fix the port information in the log for NettyRpcEnv.
3. Fix the service name of NettyRpcEnv.

Author: zsxwing <zsxwing@gmail.com>
Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9821 from zsxwing/SPARK-11830.
2015-11-19 11:57:50 -08:00
Nong Li 6d0848b53b [SPARK-11787][SQL] Improve Parquet scan performance when using flat schemas.
This patch adds an alternative to the Parquet RecordReader from the parquet-mr project
that is much faster for flat schemas. Instead of using the general converter mechanism
from parquet-mr, this directly uses the lower level APIs from parquet-columnar and a
custom RecordReader that directly assembles into UnsafeRows.

This is optionally disabled and only used for supported schemas.

Using the tpcds store sales table and doing a sum of increasingly more columns, the results
are:

For 1 Column:
  Before: 11.3M rows/second
  After: 18.2M rows/second

For 2 Columns:
  Before: 7.2M rows/second
  After: 11.2M rows/second

For 5 Columns:
  Before: 2.9M rows/second
  After: 4.5M rows/second

Author: Nong Li <nong@databricks.com>

Closes #9774 from nongli/parquet.
2015-11-18 18:38:45 -08:00
Josh Rosen 4b11712190 [SPARK-11495] Fix potential socket / file handle leaks that were found via static analysis
The HP Fortify Open Source Review team (https://www.hpfod.com/open-source-review-project) reported a handful of potential resource leaks that were discovered using their static analysis tool. We should fix the issues identified by their scan.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9455 from JoshRosen/fix-potential-resource-leaks.
2015-11-18 16:00:35 -08:00
Derek Dagit c07a50b862 [SPARK-10930] History "Stages" page "duration" can be confusing
Author: Derek Dagit <derekd@yahoo-inc.com>

Closes #9051 from d2r/spark-10930-ui-max-task-dur.
2015-11-18 15:56:54 -08:00
Josh Rosen 3a9851936d [SPARK-11649] Properly set Akka frame size in SparkListenerSuite test
SparkListenerSuite's _"onTaskGettingResult() called when result fetched remotely"_ test was extremely slow (1 to 4 minutes to run) and recently became extremely flaky, frequently failing with OutOfMemoryError.

The root cause was the fact that this was using `System.setProperty` to set the Akka frame size, which was not actually modifying the frame size. As a result, this test would allocate much more data than necessary. The fix here is to simply use SparkConf in order to configure the frame size.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9822 from JoshRosen/SPARK-11649.
2015-11-18 15:55:41 -08:00
Shixiong Zhu 7c5b641808 [SPARK-10745][CORE] Separate configs between shuffle and RPC
[SPARK-6028](https://issues.apache.org/jira/browse/SPARK-6028) uses network module to implement RPC. However, there are some configurations named with `spark.shuffle` prefix in the network module.

This PR refactors them to make sure the user can control them in shuffle and RPC separately. The user can use `spark.rpc.*` to set the configuration for netty RPC.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9481 from zsxwing/SPARK-10745.
2015-11-18 12:53:22 -08:00
Reynold Xin a416e41e28 [SPARK-11809] Switch the default Mesos mode to coarse-grained mode
Based on my conversations with people, I believe the consensus is that the coarse-grained mode is more stable and easier to reason about. It is best to use that as the default rather than the more flaky fine-grained mode.

Author: Reynold Xin <rxin@databricks.com>

Closes #9795 from rxin/SPARK-11809.
2015-11-18 12:50:29 -08:00
Bryan Cutler 31921e0f0b [SPARK-4557][STREAMING] Spark Streaming foreachRDD Java API method should accept a VoidFunction<...>
Currently streaming foreachRDD Java API uses a function prototype requiring a return value of null.  This PR deprecates the old method and uses VoidFunction to allow for more concise declaration.  Also added VoidFunction2 to Java API in order to use in Streaming methods.  Unit test is added for using foreachRDD with VoidFunction, and changes have been tested with Java 7 and Java 8 using lambdas.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9488 from BryanCutler/foreachRDD-VoidFunction-SPARK-4557.
2015-11-18 12:09:54 -08:00
Yin Huai 6f99522d13 [SPARK-11792] [SQL] [FOLLOW-UP] Change SizeEstimation to KnownSizeEstimation and make estimatedSize return Long instead of Option[Long]
https://issues.apache.org/jira/browse/SPARK-11792

The main changes include:
* Renaming `SizeEstimation` to `KnownSizeEstimation`. Hopefully this new name has more information.
* Making `estimatedSize` return `Long` instead of `Option[Long]`.
* In `UnsafeHashedRelation`, `estimatedSize` will delegate the work to `SizeEstimator` if we have not created a `BytesToBytesMap`.

Since we will put `UnsafeHashedRelation` into `BlockManager`, it is generally good to let it provide a more accurate size estimation. Also, if we do not put `BytesToBytesMap` directly into `BlockManager`, I feel it is not really necessary to make `BytesToBytesMap` extend `KnownSizeEstimation`.

Author: Yin Huai <yhuai@databricks.com>

Closes #9813 from yhuai/SPARK-11792-followup.
2015-11-18 11:49:12 -08:00
Hurshal Patel 3cca5ffb3d [SPARK-11195][CORE] Use correct classloader for TaskResultGetter
Make sure we are using the context classloader when deserializing failed TaskResults instead of the Spark classloader.

The issue is that `enqueueFailedTask` was using the incorrect classloader which results in `ClassNotFoundException`.

Adds a test in TaskResultGetterSuite that compiles a custom exception, throws it on the executor, and asserts that Spark handles the TaskResult deserialization instead of returning `UnknownReason`.

See #9367 for previous comments
See SPARK-11195 for a full repro

Author: Hurshal Patel <hpatel516@gmail.com>

Closes #9779 from choochootrain/spark-11195-master.
2015-11-18 09:28:59 -08:00
Jean-Baptiste Onofré e62820c85f [SPARK-6541] Sort executors by ID (numeric)
"Force" the executor ID sort with Int.

Author: Jean-Baptiste Onofré <jbonofre@apache.org>

Closes #9165 from jbonofre/SPARK-6541.
2015-11-18 08:57:58 +00:00
Yin Huai 1714350bdd [SPARK-11792][SQL] SizeEstimator cannot provide a good size estimation of UnsafeHashedRelations
https://issues.apache.org/jira/browse/SPARK-11792

Right now, SizeEstimator will "think" a small UnsafeHashedRelation is several GBs.

Author: Yin Huai <yhuai@databricks.com>

Closes #9788 from yhuai/SPARK-11792.
2015-11-18 00:42:52 -08:00
tedyu 446738e51f [SPARK-11761] Prevent the call to StreamingContext#stop() in the listener bus's thread
See discussion toward the tail of https://github.com/apache/spark/pull/9723
From zsxwing :
```
The user should not call stop or other long-time work in a listener since it will block the listener thread, and prevent from stopping SparkContext/StreamingContext.

I cannot see an approach since we need to stop the listener bus's thread before stopping SparkContext/StreamingContext totally.
```
Proposed solution is to prevent the call to StreamingContext#stop() in the listener bus's thread.
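
One way such a guard could look is sketched below, assuming a per-thread flag that is set while listener callbacks run. `ListenerBusSketch`, `withinListenerThread` and `runOnListenerThread` are names chosen for this illustration and may not match the patch.

```scala
import scala.util.DynamicVariable

object ListenerBusSketch {
  // True only while code is executing inside a listener callback.
  val withinListenerThread = new DynamicVariable[Boolean](false)

  // Runs a listener callback with the flag set for the current thread.
  def runOnListenerThread(callback: () => Unit): Unit =
    withinListenerThread.withValue(true) { callback() }

  // Refuses to stop when called from a listener callback, since stopping
  // has to wait for the very thread that is making the call.
  def stop(): Unit = {
    if (withinListenerThread.value) {
      throw new IllegalStateException("Cannot call stop() from within the listener thread")
    }
    // ... the actual shutdown logic would go here ...
  }
}
```

Failing fast with an exception turns what would be a silent hang into an error that the listener's author sees immediately.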

Author: tedyu <yuzhihong@gmail.com>

Closes #9741 from tedyu/master.
2015-11-17 22:47:53 -08:00
Kent Yao e33053ee00 [SPARK-11583] [CORE] MapStatus Using RoaringBitmap More Properly
This PR upgrades RoaringBitmap to version 0.5.10 to optimize the memory layout; the bitmap will be much smaller when most of the blocks are empty.

This PR is based on #9661 (fix conflicts), see all of the comments at https://github.com/apache/spark/pull/9661 .

Author: Kent Yao <yaooqinn@hotmail.com>
Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>

Closes #9746 from davies/roaring_mapstatus.
2015-11-17 19:44:29 -08:00
Davies Liu bf25f9bdfc [SPARK-11016] Move RoaringBitmap to explicit Kryo serializer
Fix the serialization of RoaringBitmap with the Kryo serializer

This PR came from https://github.com/metamx/spark/pull/1, thanks to drcrallen

Author: Davies Liu <davies@databricks.com>
Author: Charles Allen <charles@allen-net.com>

Closes #9748 from davies/SPARK-11016.
2015-11-17 19:39:39 -08:00
Jacek Lewandowski b362d50fca [SPARK-11726] Throw exception on timeout when waiting for REST server response
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #9692 from jacek-lewandowski/SPARK-11726.
2015-11-17 16:00:00 -08:00
Grace 965245d087 [SPARK-9552] Add force control for killExecutors to avoid false killing for those busy executors
With dynamic allocation, busy executors are sometimes killed by mistake. Some executors with assigned tasks are killed because they have been idle for long enough (say 60 seconds). The root cause is that the task-launch listener event is delivered asynchronously.

For example, some executors are being assigned tasks, but the listener notification has not been sent out yet. Meanwhile, the dynamic allocation's executor idle time is up (e.g., 60 seconds), which triggers a killExecutor event at the same time.
 1. The timer expires before the listener event arrives.
 2. Then, the task goes on to run on that killed/killing executor, which finally leads to task failure.

Here is the proposal to fix it. We can add a force control for killExecutor. If the force control is not set (i.e., false), we check whether the executor being killed is idle or busy. If the executor has some assignment, we do not kill it and return false (to indicate that the kill failed). In dynamic allocation, we turn off force killing (i.e., force = false), so an attempt to kill a busy executor fails and the executor's idle timer stays valid. Later, when the task assignment event arrives, we can remove the idle timer accordingly. This avoids falsely killing busy executors in dynamic allocation.

For the remaining usages, end users can decide for themselves whether to use force killing. If that option is turned on, killExecutor performs the action without any status checking.
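
The decision described above can be sketched as a small function. `KillExecutorSketch`, `shouldKill` and the `runningTasks` map are hypothetical bookkeeping for this illustration, not the scheduler backend's actual data structures.

```scala
object KillExecutorSketch {
  // runningTasks: hypothetical map from executor id to the number of
  // tasks currently assigned to that executor.
  // Returns true if the kill may proceed, false if it is refused.
  def shouldKill(runningTasks: Map[String, Int], executorId: String, force: Boolean): Boolean = {
    val busy = runningTasks.getOrElse(executorId, 0) > 0
    // Without force, a busy executor is left alone and false is returned.
    force || !busy
  }
}
```

Dynamic allocation would call this with `force = false`, while an explicit user request could pass `force = true` to skip the busy check.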

Author: Grace <jie.huang@intel.com>
Author: Andrew Or <andrew@databricks.com>
Author: Jie Huang <jie.huang@intel.com>

Closes #7888 from GraceH/forcekill.
2015-11-17 15:43:35 -08:00
Marcelo Vanzin 936bc0bcbf [SPARK-11786][CORE] Tone down messages from akka error monitor.
These events happen normally during the app's lifecycle, so printing
out ERROR logs all the time is misleading, and can actually affect usability
of interactive shells.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9772 from vanzin/SPARK-11786.
2015-11-17 14:23:28 -08:00
Chris Bannister cc567b6634 [SPARK-11695][CORE] Set s3a credentials
Set s3a credentials when creating a new default hadoop configuration.

Author: Chris Bannister <chris.bannister@swiftkey.com>

Closes #9663 from Zariel/set-s3a-creds.
2015-11-17 10:03:46 -08:00
Kousuke Saruta 30f3cfda1c [SPARK-11480][CORE][WEBUI] Wrong callsite is displayed when using AsyncRDDActions#takeAsync
When we call AsyncRDDActions#takeAsync, DAGScheduler#runJob is actually called from another thread, so we cannot get the proper callsite information.

The following screenshots show the UI before and after this patch is applied.

Before:
<img width="1268" alt="2015-11-04 1 26 40" src="https://cloud.githubusercontent.com/assets/4736016/10914069/0ffc1306-8294-11e5-8e89-c4fadf58dd12.png">
<img width="1258" alt="2015-11-04 1 26 52" src="https://cloud.githubusercontent.com/assets/4736016/10914070/0ffe84ce-8294-11e5-8b2a-69d36276bedb.png">

After:
<img width="1268" alt="2015-11-04 0 48 07" src="https://cloud.githubusercontent.com/assets/4736016/10914080/1d8cfb7a-8294-11e5-9e09-ede25c2563e8.png">
<img width="1269" alt="2015-11-04 0 48 26" src="https://cloud.githubusercontent.com/assets/4736016/10914081/1d934e3a-8294-11e5-8b5e-e3dc37aaced3.png">

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #9437 from sarutak/SPARK-11480.
2015-11-16 16:59:16 -08:00
Shivaram Venkataraman ea6f53e48a [SPARKR][HOTFIX] Disable flaky SparkR package build test
See https://github.com/apache/spark/pull/9390#issuecomment-157160063 and https://gist.github.com/shivaram/3a2fecce60768a603dac for more information

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #9744 from shivaram/sparkr-package-test-disable.
2015-11-16 16:57:50 -08:00
Wenchen Fan fd14936be7 [SPARK-11625][SQL] add java test for typed aggregate
Author: Wenchen Fan <wenchen@databricks.com>

Closes #9591 from cloud-fan/agg-test.
2015-11-16 15:32:49 -08:00
Davies Liu 3c025087b5 Revert "[SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of RoaringBitmap to reduce memory usage"
This reverts commit e209fa271a.
2015-11-16 14:50:38 -08:00
jerryshao 24477d2705 [SPARK-11718][YARN][CORE] Fix explicitly killed executor dies silently issue
Currently, if dynamic allocation is enabled, explicitly killing an executor gets no response, so the executor metadata is wrong on the driver side. This makes dynamic allocation on YARN fail to work.

The problem is that `disableExecutor` returns false for executors pending kill when `onDisconnect` is detected, so no further action is taken.

One solution is to bypass these explicitly killed executors and use `super.onDisconnect` to remove the executor. This is simple.

Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may get killed and reported in the same AM-RM communication, the current way of adding a pending loss reason request does not work (container completion is already processed), so we should store this loss reason for later queries.

This PR chooses solution 2.

Please help to review. vanzin I think this part is changed by you previously, would you please help to review? Thanks a lot.

Author: jerryshao <sshao@hortonworks.com>

Closes #9684 from jerryshao/SPARK-11718.
2015-11-16 11:43:18 -08:00
xin Wu 0e79604aed [SPARK-11522][SQL] input_file_name() returns "" for external tables
When computing partitions for a non-parquet relation, `HadoopRDD.compute` is used, but it does not set the thread local variable `inputFileName` in `NewSqlHadoopRDD` like `NewSqlHadoopRDD.compute` does. Yet, when getting the `inputFileName`, `NewSqlHadoopRDD.inputFileName` is expected, which is empty now.
Setting inputFileName in HadoopRDD.compute resolves this issue.

Author: xin Wu <xinwu@us.ibm.com>

Closes #9542 from xwu0226/SPARK-11522.
2015-11-16 08:10:48 -08:00
Josh Rosen fd50fa4c3e Revert "[SPARK-11572] Exit AsynchronousListenerBus thread when stop() is called"
This reverts commit 3e0a6cf1e0.
2015-11-15 22:38:30 -08:00
Sun Rui 835a79d78e [SPARK-10500][SPARKR] sparkr.zip cannot be created if /R/lib is unwritable
The basic idea is that:
The archive of the SparkR package itself, that is sparkr.zip, is created during build process and is contained in the Spark binary distribution. No change to it after the distribution is installed as the directory it resides ($SPARK_HOME/R/lib) may not be writable.

When there is R source code contained in jars or Spark packages specified with "--jars" or "--packages" command line option, a temporary directory is created by calling Utils.createTempDir() where the R packages built from the R source code will be installed. The temporary directory is writable, and won't interfere with each other when there are multiple SparkR sessions, and will be deleted when this SparkR session ends. The R binary packages installed in the temporary directory then are packed into an archive named rpkg.zip.

sparkr.zip and rpkg.zip are distributed to the cluster in YARN modes.

The distribution of rpkg.zip in Standalone modes is not supported in this PR, and will be addressed in another PR.

Various R files are updated to accept multiple lib paths (one is for the SparkR package, the other is for other R packages) so that these packages can be accessed in R.

Author: Sun Rui <rui.sun@intel.com>

Closes #9390 from sun-rui/SPARK-10500.
2015-11-15 19:29:09 -08:00
Yu Gao 72c1d68b4a [SPARK-10181][SQL] Do kerberos login for credentials during hive client initialization
On driver process start up, UserGroupInformation.loginUserFromKeytab is called with the principal and keytab passed in, and therefore the static var UserGroupInformation.loginUser is set to that principal with kerberos credentials saved in its private credential set, and all threads within the driver process are supposed to see and use these login credentials to authenticate with Hive and Hadoop. However, because of IsolatedClientLoader, the UserGroupInformation class is not shared for hive metastore clients; instead it is loaded separately and is of course not able to see the prepared kerberos login credentials in the main thread.

The first proposed fix would cause other classloader conflict errors, and is not an appropriate solution. This new change does kerberos login during hive client initialization, which will make credentials ready for the particular hive client instance.

 yhuai Please take a look and let me know. If you are not the right person to talk to, could you point me to someone responsible for this?

Author: Yu Gao <ygao@us.ibm.com>
Author: gaoyu <gaoyu@gaoyu-macbookpro.roam.corp.google.com>
Author: Yu Gao <crystalgaoyu@gmail.com>

Closes #9272 from yolandagao/master.
2015-11-15 14:53:59 -08:00
nitin goyal c939c70ac1 [SPARK-7970] Skip closure cleaning for SQL operations
Also introduces a new Spark-private API in RDD.scala named 'mapPartitionsInternal', which skips closure cleaning.

Author: nitin goyal <nitin.goyal@guavus.com>
Author: nitin.goyal <nitin.goyal@guavus.com>

Closes #9253 from nitin2goyal/master.
2015-11-13 18:09:08 -08:00
Davies Liu ad960885bf [SPARK-8029] Robust shuffle writer
Currently, all the shuffle writers write to the target path directly, so the file could be corrupted by another attempt of the same partition on the same executor. They should write to a temporary file and then rename it to the target path, as we do in the output committer. In order to make the rename atomic, the temporary file should be created in the same local directory (FileSystem).
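
A minimal sketch of the write-to-temp-then-rename pattern, in Python rather than the Scala/Java of the actual writers. `write_atomically` is a hypothetical helper; the only point illustrated is that the temporary file lives in the target's directory so the final rename does not cross filesystems.

```python
import os
import tempfile


def write_atomically(target_path, data):
    """Write data to target_path via a temporary file in the same directory."""
    directory = os.path.dirname(os.path.abspath(target_path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # os.replace swaps the file in one step; readers see old or new content.
        os.replace(tmp_path, target_path)
    except BaseException:
        # Leave no partial file behind if the write or rename failed.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

A second attempt writing the same target then replaces the whole file at once instead of interleaving bytes with the first attempt.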

This PR is based on #9214 , thanks to squito . Closes #9214

Author: Davies Liu <davies@databricks.com>

Closes #9610 from davies/safe_shuffle.
2015-11-12 22:44:57 -08:00
Shixiong Zhu 0f1d00a905 [SPARK-11663][STREAMING] Add Java API for trackStateByKey
TODO
- [x] Add Java API
- [x] Add API tests
- [x] Add a function test

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9636 from zsxwing/java-track.
2015-11-12 17:48:43 -08:00
Xiangrui Meng bc092966f8 [SPARK-11709] include creation site info in SparkContext.assertNotStopped error message
This helps debug issues caused by multiple SparkContext instances. JoshRosen andrewor14

~~~
scala> sc.stop()

scala> sc.parallelize(0 until 10)
java.lang.IllegalStateException: Cannot call methods on a stopped SparkContext.
This stopped SparkContext was created at:

org.apache.spark.SparkContext.<init>(SparkContext.scala:82)
org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:1017)
$iwC$$iwC.<init>(<console>:9)
$iwC.<init>(<console>:18)
<init>(<console>:20)
.<init>(<console>:24)
.<clinit>(<console>)
.<init>(<console>:7)
.<clinit>(<console>)
$print(<console>)
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
java.lang.reflect.Method.invoke(Method.java:606)
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1340)
org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)

The active context was created at:

(No active SparkContext.)
~~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #9675 from mengxr/SPARK-11709.
2015-11-12 16:43:04 -08:00
Jean-Baptiste Onofré 74c30049a8 [SPARK-2533] Add locality levels on stage summary view
Author: Jean-Baptiste Onofré <jbonofre@apache.org>

Closes #9487 from jbonofre/SPARK-2533-2.
2015-11-12 15:46:21 -08:00
Marcelo Vanzin 767d288b6b [SPARK-11655][CORE] Fix deadlock in handling of launcher stop().
The stop() callback was trying to close the launcher connection in the
same thread that handles connection data, which ended up causing a
deadlock. So avoid that by dispatching the stop() request in its own
thread.

On top of that, add some exception safety to a few parts of the code,
and use "destroyForcibly" from Java 8 if it's available, to force
kill the child process. The flip side is that "kill()" may not actually
work if running Java 7.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9633 from vanzin/SPARK-11655.
2015-11-12 14:29:16 -08:00
Reynold Xin e1bcf6af9b [SPARK-10827] replace volatile with Atomic* in AppClient.scala.
This is a followup for #9317 to replace volatile fields with AtomicBoolean and AtomicReference.

Author: Reynold Xin <rxin@databricks.com>

Closes #9611 from rxin/SPARK-10827.
2015-11-11 15:30:21 -08:00
Josh Rosen 2d76e44b1a [SPARK-11647] Attempt to reduce time/flakiness of Thriftserver CLI and SparkSubmit tests
This patch aims to reduce the test time and flakiness of HiveSparkSubmitSuite, SparkSubmitSuite, and CliSuite.

Key changes:

- Disable IO synchronization calls for Derby writes, since durability doesn't matter for tests. This was done for HiveCompatibilitySuite in #6651 and resulted in huge test speedups.
- Add a few missing `--conf`s to disable various Spark UIs. The CliSuite, in particular, never disabled these UIs, leaving it prone to port-contention-related flakiness.
- Fix two instances where tests defined `beforeAll()` methods which were never called because the appropriate traits were not mixed in. I updated these test suites to extend `BeforeAndAfterEach` so that they play nicely with our `ResetSystemProperties` trait.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9623 from JoshRosen/SPARK-11647.
2015-11-11 14:30:38 -08:00
Josh Rosen 529a1d3380 [SPARK-6152] Use shaded ASM5 to support closure cleaning of Java 8 compiled classes
This patch modifies Spark's closure cleaner (and a few other places) to use ASM 5, which is necessary in order to support cleaning of closures that were compiled by Java 8.

In order to avoid ASM dependency conflicts, Spark excludes ASM from all of its dependencies and uses a shaded version of ASM 4 that comes from `reflectasm` (see [SPARK-782](https://issues.apache.org/jira/browse/SPARK-782) and #232). This patch updates Spark to use a shaded version of ASM 5.0.4 that was published by the Apache XBean project; the POM used to create the shaded artifact can be found at https://github.com/apache/geronimo-xbean/blob/xbean-4.4/xbean-asm5-shaded/pom.xml.

http://movingfulcrum.tumblr.com/post/80826553604/asm-framework-50-the-missing-migration-guide was a useful resource while upgrading the code to use the new ASM5 opcodes.

I also added new regression tests in the `java8-tests` subproject; the existing tests were insufficient to catch this bug, which only affected Scala 2.11 user code which was compiled targeting Java 8.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9512 from JoshRosen/SPARK-6152.
2015-11-11 11:16:39 -08:00
Reynold Xin 95daff6459 [SPARK-11646] WholeTextFileRDD should return Text rather than String
If it returns Text, we can reuse this in Spark SQL to provide a WholeTextFile data source and directly convert the Text into UTF8String without extra string decoding and encoding.

Author: Reynold Xin <rxin@databricks.com>

Closes #9622 from rxin/SPARK-11646.
2015-11-11 10:17:54 -08:00
Josh Rosen fac53d8ec0 [SPARK-10192][HOTFIX] Fix NPE in test that was added in #8402
This fixes an NPE introduced in SPARK-10192 / #8402.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9620 from JoshRosen/SPARK-10192-hotfix.
2015-11-10 22:24:00 -08:00
Tathagata Das 6600786ddd [SPARK-11361][STREAMING] Show scopes of RDD operations inside DStream.foreachRDD and DStream.transform in DAG viz
Currently, when a DStream sets the scope for RDD generated by it, that scope is not allowed to be overridden by the RDD operations. So in case of `DStream.foreachRDD`, all the RDDs generated inside the foreachRDD get the same scope - `foreachRDD  <time>`, as set by the `ForeachDStream`. So it is hard to debug generated RDDs in the RDD DAG viz in the Spark UI.

This patch allows the RDD operations inside `DStream.transform` and `DStream.foreachRDD` to append their own scopes to the earlier DStream scope.

I have also slightly tweaked how callsites are set such that the short callsite reflects the RDD operation name and line number. This tweak is necessary as callsites are not managed through scopes (which support nesting and overriding) and I didn't want to add another local property to control nesting and overriding of callsites.

## Before:
![image](https://cloud.githubusercontent.com/assets/663212/10808548/fa71c0c4-7da9-11e5-9af0-5737793a146f.png)

## After:
![image](https://cloud.githubusercontent.com/assets/663212/10808659/37bc45b6-7dab-11e5-8041-c20be6a9bc26.png)

The code that was used to generate this is:
```
    val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.flatMap(_.split(" "))
    val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
    wordCounts.foreachRDD { rdd =>
      val temp = rdd.map { _ -> 1 }.reduceByKey( _ + _)
      val temp2 = temp.map { _ -> 1}.reduceByKey(_ + _)
      val count = temp2.count
      println(count)
    }
```

Note
- The inner scopes of the RDD operations map/reduceByKey inside foreachRDD are visible
- The short callsites of stages refer to the line number of the RDD ops rather than the same line number of foreachRDD in all three cases.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #9315 from tdas/SPARK-11361.
2015-11-10 16:54:06 -08:00
tedyu 9009175416 [SPARK-11615] Drop @VisibleForTesting annotation
See http://search-hadoop.com/m/q3RTtjpe8r1iRbTj2 for discussion.

Summary: addition of VisibleForTesting annotation resulted in spark-shell malfunctioning.

Author: tedyu <yuzhihong@gmail.com>

Closes #9585 from tedyu/master.
2015-11-10 16:52:59 -08:00
tedyu 3e0a6cf1e0 [SPARK-11572] Exit AsynchronousListenerBus thread when stop() is called
As vonnagy reported in the following thread:
http://search-hadoop.com/m/q3RTtk982kvIow22

Attempts to join the thread in AsynchronousListenerBus resulted in lock up because AsynchronousListenerBus thread was still getting messages `SparkListenerExecutorMetricsUpdate` from the DAGScheduler

Author: tedyu <yuzhihong@gmail.com>

Closes #9546 from ted-yu/master.
2015-11-10 16:51:25 -08:00
Imran Rashid 33112f9c48 [SPARK-10192][CORE] simple test w/ failure involving a shared dependency
just trying to increase test coverage in the scheduler, this already works.  It includes a regression test for SPARK-9809

copied some test utils from https://github.com/apache/spark/pull/5636, we can wait till that is merged first

Author: Imran Rashid <irashid@cloudera.com>

Closes #8402 from squito/test_retry_in_shared_shuffle_dep.
2015-11-10 16:50:22 -08:00
Bryan Cutler a3989058c0 [SPARK-10827][CORE] AppClient should not use askWithReply in receiveAndReply
Changed AppClient to be non-blocking in `receiveAndReply` by using a separate thread to wait for response and reply to the context.  The threads are managed by a thread pool.  Also added unit tests for the AppClient interface.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9317 from BryanCutler/appClient-receiveAndReply-SPARK-10827.
2015-11-10 16:32:32 -08:00
Lianhui Wang 6e5fc37883 [SPARK-11252][NETWORK] ShuffleClient should release connection after fetching blocks had been completed for external shuffle
With YARN's external shuffle service, the ExternalShuffleClient of each executor keeps its connection to YARN's NodeManager open until the application completes, so the NodeManager and executors end up holding many socket connections.
To reduce network pressure on the NodeManager's shuffle service, the connection to it should be closed once registerWithShuffleServer or fetchBlocks has completed in ExternalShuffleClient. andrewor14 rxin vanzin

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #9227 from lianhuiwang/spark-11252.
2015-11-10 10:40:08 -08:00
Paul Chandler 5507a9d093 Fix typo in driver page
"Comamnd property" => "Command property"

Author: Paul Chandler <pestilence669@users.noreply.github.com>

Closes #9578 from pestilence669/fix_spelling.
2015-11-10 12:59:53 +01:00
Wenchen Fan fcb57e9c73 [SPARK-11564][SQL][FOLLOW-UP] improve java api for GroupedDataset
created `MapGroupFunction`, `FlatMapGroupFunction`, `CoGroupFunction`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #9564 from cloud-fan/map.
2015-11-09 15:16:47 -08:00
Felix Bechstein 5039a49b63 [SPARK-10471][CORE][MESOS] prevent getting offers for unmet constraints
This change rejects offers for slaves with unmet constraints for 120s to mitigate offer starvation.
This prevents Mesos from sending us these offers again and again.
In return, we get more offers for slaves which might meet our constraints,
and it enables Mesos to send the rejected offers to other frameworks.

Author: Felix Bechstein <felix.bechstein@otto.de>

Closes #8639 from felixb/decline_offers_constraint_mismatch.
2015-11-09 13:36:14 -08:00
tedyu 404a28f4ed [SPARK-11112] Fix Scala 2.11 compilation error in RDDInfo.scala
As shown in https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Compile/job/Spark-Master-Scala211-Compile/1946/console , compilation fails with:
```
[error] /home/jenkins/workspace/Spark-Master-Scala211-Compile/core/src/main/scala/org/apache/spark/storage/RDDInfo.scala:25: in class RDDInfo, multiple overloaded alternatives of constructor RDDInfo define default arguments.
[error] class RDDInfo(
[error]
```
This PR tries to fix the compilation error

Author: tedyu <yuzhihong@gmail.com>

Closes #9538 from tedyu/master.
2015-11-09 10:07:58 -08:00
Charles Yeh 08a7a836c3 [SPARK-10565][CORE] add missing web UI stats to /api/v1/applications JSON
I looked at the other endpoints, and they don't seem to be missing any fields.
Added fields:
![image](https://cloud.githubusercontent.com/assets/613879/10948801/58159982-82e4-11e5-86dc-62da201af910.png)

Author: Charles Yeh <charlesyeh@dropbox.com>

Closes #9472 from CharlesYeh/api_vars.
2015-11-09 11:59:32 -06:00
Reynold Xin 97b7080cf2 [SPARK-11564][SQL] Dataset Java API audit
A few changes:

1. Removed fold, since it can be confusing for distributed collections.
2. Created specific interfaces for each Dataset function (e.g. MapFunction, ReduceFunction, MapPartitionsFunction)
3. Added more documentation and test cases.

The other thing I'm considering doing is to have a "collector" interface for FlatMapFunction and MapPartitionsFunction, similar to MapReduce's map function.
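
To illustrate why fold (item 1 above) can be confusing on a distributed collection, here is a small Python model. It assumes the zero value is applied once per partition and once more when the partial results are merged; `distributed_fold` is a hypothetical stand-in, not a Spark API.

```python
from functools import reduce
from operator import add


def distributed_fold(partitions, zero, op):
    """Fold each partition from `zero`, then fold the partial results from `zero` again."""
    partials = [reduce(op, part, zero) for part in partitions]
    return reduce(op, partials, zero)


one_partition = [[1, 2, 3, 4]]
two_partitions = [[1, 2], [3, 4]]

# A neutral zero gives the same answer however the data is split.
neutral = (distributed_fold(one_partition, 0, add),
           distributed_fold(two_partitions, 0, add))

# A non-neutral zero is counted once per partition plus once in the merge.
skewed = (distributed_fold(one_partition, 10, add),
          distributed_fold(two_partitions, 10, add))
```

Under this model the result depends on the number of partitions unless the zero value is a true identity for the operator, which a local-collection fold never requires.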

Author: Reynold Xin <rxin@databricks.com>

Closes #9531 from rxin/SPARK-11564.
2015-11-08 20:57:09 -08:00
Andrew Or 7f741905b0 [SPARK-11112] DAG visualization: display RDD callsite
<img width="548" alt="screen shot 2015-11-01 at 9 42 33 am" src="https://cloud.githubusercontent.com/assets/2133137/10870343/2a8cd070-807d-11e5-857a-4ebcace77b5b.png">
mateiz sarutak

Author: Andrew Or <andrew@databricks.com>

Closes #9398 from andrewor14/rdd-callsite.
2015-11-07 05:35:53 +01:00
Josh Rosen 30b706b7b3 [SPARK-11389][CORE] Add support for off-heap memory to MemoryManager
In order to lay the groundwork for proper off-heap memory support in SQL / Tungsten, we need to extend our MemoryManager to perform bookkeeping for off-heap memory.

## User-facing changes

This PR introduces a new configuration, `spark.memory.offHeapSize` (name subject to change), which specifies the absolute amount of off-heap memory that Spark and Spark SQL can use. If Tungsten is configured to use off-heap execution memory for allocating data pages, then all data page allocations must fit within this size limit.

## Internals changes

This PR contains a lot of internal refactoring of the MemoryManager. The key change at the heart of this patch is the introduction of a `MemoryPool` class (name subject to change) to manage the bookkeeping for a particular category of memory (storage, on-heap execution, and off-heap execution). These MemoryPools are not fixed-size; they can be dynamically grown and shrunk according to the MemoryManager's policies. In StaticMemoryManager, these pools have fixed sizes, proportional to the legacy `[storage|shuffle].memoryFraction`. In the new UnifiedMemoryManager, the sizes of these pools are dynamically adjusted according to its policies.

There are two subclasses of `MemoryPool`: `StorageMemoryPool` manages storage memory and `ExecutionMemoryPool` manages execution memory. The MemoryManager creates two execution pools, one for on-heap memory and one for off-heap. Instances of `ExecutionMemoryPool` manage the logic for fair sharing of their pooled memory across running tasks (in other words, the ShuffleMemoryManager-like logic has been moved out of MemoryManager and pushed into these ExecutionMemoryPool instances).

I think that this design is substantially easier to understand and reason about than the previous design, where most of these responsibilities were handled by MemoryManager and its subclasses. To see this, take a look at how simple the logic in `UnifiedMemoryManager` has become: it's now very easy to see when memory is dynamically shifted between storage and execution.
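
As a rough model of the bookkeeping described above, the following Python sketch shows a pool that tracks its size and usage and can hand free capacity to another pool. The method names and the `borrow` helper are assumptions made for illustration; they are not the API introduced by this PR.

```python
class MemoryPool:
    """Bookkeeping for one category of memory; sizes are in bytes."""

    def __init__(self, name, pool_size):
        self.name = name
        self.pool_size = pool_size
        self.used = 0

    @property
    def free(self):
        return self.pool_size - self.used

    def acquire(self, num_bytes):
        """Grant as much of the request as fits; return the granted amount."""
        granted = min(num_bytes, self.free)
        self.used += granted
        return granted

    def release(self, num_bytes):
        self.used -= min(num_bytes, self.used)

    def resize(self, delta):
        """Grow (delta > 0) or shrink (delta < 0) without dropping below usage."""
        if self.pool_size + delta < self.used:
            raise ValueError("cannot shrink below used memory")
        self.pool_size += delta


def borrow(from_pool, to_pool, num_bytes):
    """Move up to num_bytes of free capacity from one pool to another."""
    moved = min(num_bytes, from_pool.free)
    from_pool.resize(-moved)
    to_pool.resize(moved)
    return moved
```

In this model a fixed-size manager would never call `borrow`, while a unified one would call it whenever a request does not fit in the pool's current free space.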

## TODOs

- [x] Fix handful of test failures in the MemoryManagerSuites.
- [x] Fix remaining TODO comments in code.
- [ ] Document new configuration.
- [x] Fix commented-out tests / asserts:
  - [x] UnifiedMemoryManagerSuite.
- [x] Write tests that exercise the new off-heap memory management policies.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9344 from JoshRosen/offheap-memory-accounting.
2015-11-06 18:17:34 -08:00
Imran Rashid 49f1a82037 [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
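
The effect can be reproduced outside Spark. The Python sketch below uses `zlib.crc32` as a stand-in 32-bit hash (an assumption; it is not the hash used by `XORShiftRandom`) and compares plain sign extension with filling both halves from two chained hashes. `widen_naive` and `widen_mixed` are hypothetical names.

```python
import zlib

MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def to_signed32(x):
    """Interpret the low 32 bits of x as a signed 32-bit integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x


def widen_naive(seed):
    """Sign-extend one 32-bit hash to 64 bits, as a plain int-to-long conversion does."""
    h = to_signed32(zlib.crc32(seed.to_bytes(8, "big")))
    return h & MASK64  # view the result as an unsigned 64-bit pattern


def widen_mixed(seed):
    """Fill both halves by hashing twice, chaining the first hash into the second."""
    data = seed.to_bytes(8, "big")
    low = zlib.crc32(data)
    high = zlib.crc32(data, low)
    return (high << 32) | low
```

With `widen_naive` the upper 32 bits are always all zeros or all ones, so they carry at most one bit of information; `widen_mixed` is one possible way to vary them.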

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.
2015-11-06 20:06:24 +00:00
Jacek Laskowski 62bb290773 Typo fixes + code readability improvements
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9501 from jaceklaskowski/typos-with-style.
2015-11-06 20:05:18 +00:00
Davies Liu eec74ba8bd [SPARK-7542][SQL] Support off-heap index/sort buffer
This adds off-heap memory support for the arrays inside BytesToBytesMap and InMemorySorter, so that we can allocate all the memory for execution off-heap.

Closes #8068

Author: Davies Liu <davies@databricks.com>

Closes #9477 from davies/unsafe_timsort.
2015-11-05 19:02:18 -08:00
Srinivasa Reddy Vundela c76865c622 [SPARK-11484][WEBUI] Using proxyBase set by spark AM
Use the proxyBase set by the AM; if it is not found, fall back to the environment. This fixes the case where somebody accidentally sets APPLICATION_WEB_PROXY_BASE to the wrong proxyBase.
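
The lookup order can be sketched as follows. The `spark.ui.proxyBase` property key and the `resolve_proxy_base` helper are assumptions for illustration; only the environment variable name comes from the description above.

```python
import os


def resolve_proxy_base(am_properties, env=None):
    """Prefer the value set by the AM; fall back to the environment variable."""
    env = os.environ if env is None else env
    value = am_properties.get("spark.ui.proxyBase")  # assumed property key
    if value:
        return value
    return env.get("APPLICATION_WEB_PROXY_BASE", "")
```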

Author: Srinivasa Reddy Vundela <vsr@cloudera.com>

Closes #9448 from vundela/master.
2015-11-05 11:30:44 -08:00
Nishkam Ravi a4b5cefcf1 [SPARK-11501][CORE][YARN] Propagate spark.rpc config to executors
spark.rpc is supposed to be configurable but is not currently (doesn't get propagated to executors because RpcEnv.create is done before driver properties are fetched).

Author: Nishkam Ravi <nishkamravi@gmail.com>

Closes #9460 from nishkamravi2/master_akka.
2015-11-05 09:35:49 -08:00
Herman van Hovell 7bdc92197c [SPARK-11449][CORE] PortableDataStream should be a factory
```PortableDataStream``` maintains some internal state. This makes it tricky to reuse a stream (one needs to call ```close``` on both the ```PortableDataStream``` and the ```InputStream``` it produces).

This PR removes all state from ```PortableDataStream``` and effectively turns it into an ```InputStream```/```Array[Byte]``` factory. This makes the user responsible for managing the ```InputStream``` it returns.
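
The factory idea can be shown with a small Python analogue. `StreamFactory` is a hypothetical class, not the Spark one: it stores only the location, and each call to `open` returns a new stream that the caller must close.

```python
import io


class StreamFactory:
    """Holds only the location of the data; every open() returns a fresh stream."""

    def __init__(self, path):
        self.path = path

    def open(self):
        # No stream is cached here, so the caller owns and closes what it gets.
        return io.open(self.path, "rb")

    def to_array(self):
        # Convenience read that opens and closes its own stream.
        with self.open() as stream:
            return stream.read()
```

Because no stream is kept inside the object, reusing it needs no reset step, at the cost of the caller having to close each stream it asks for.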

cc srowen

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #9417 from hvanhovell/SPARK-11449.
2015-11-05 09:23:09 +00:00
Sean Owen 6f81eae24f [SPARK-11440][CORE][STREAMING][BUILD] Declare rest of @Experimental items non-experimental if they've existed since 1.2.0
Remove `Experimental` annotations in core, streaming for items that existed in 1.2.0 or before. The changes are:

* SparkContext
  * binary{Files,Records} : 1.2.0
  * submitJob : 1.0.0
* JavaSparkContext
  * binary{Files,Records} : 1.2.0
* DoubleRDDFunctions, JavaDoubleRDD
  * {mean,sum}Approx : 1.0.0
* PairRDDFunctions, JavaPairRDD
  * sampleByKeyExact : 1.2.0
  * countByKeyApprox : 1.0.0
* PairRDDFunctions
  * countApproxDistinctByKey : 1.1.0
* RDD
  * countApprox, countByValueApprox, countApproxDistinct : 1.0.0
* JavaRDDLike
  * countApprox : 1.0.0
* PythonHadoopUtil.Converter : 1.1.0
* PortableDataStream : 1.2.0 (related to binaryFiles)
* BoundedDouble : 1.0.0
* PartialResult : 1.0.0
* StreamingContext, JavaStreamingContext
  * binaryRecordsStream : 1.2.0
* HiveContext
  * analyze : 1.2.0

Author: Sean Owen <sowen@cloudera.com>

Closes #9396 from srowen/SPARK-11440.
2015-11-05 09:08:53 +00:00
Davies Liu 81498dd5c8 [SPARK-11425] [SPARK-11486] Improve hybrid aggregation
After aggregation, the dataset could be smaller than the inputs, so it's better to do hash-based aggregation for all inputs, then use sort-based aggregation to merge them.
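
A toy version of this hash-then-sort strategy in Python, assuming a sum aggregate and a hash map capped at `max_keys` entries. `hybrid_sum` and the in-memory "spill" lists are illustrative only and do not reflect the actual operator.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def hybrid_sum(records, max_keys):
    """Sum values per key with a bounded hash map, spilling sorted runs when it fills."""
    runs, table = [], {}
    for key, value in records:
        if key not in table and len(table) >= max_keys:
            # Hash map is full: spill it as a sorted run and start over.
            runs.append(sorted(table.items()))
            table = {}
        table[key] = table.get(key, 0) + value
    runs.append(sorted(table.items()))
    # Sort-based merge: equal keys from different runs become adjacent.
    merged = heapq.merge(*runs, key=itemgetter(0))
    return [(k, sum(v for _, v in group))
            for k, group in groupby(merged, key=itemgetter(0))]
```

Every input row goes through the hash map first, so each spilled run is already partially aggregated and the merge step sees fewer rows than the raw input.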

Author: Davies Liu <davies@databricks.com>

Closes #9383 from davies/fix_switch.
2015-11-04 21:30:21 -08:00
Josh Rosen d0b5633962 [SPARK-11307] Reduce memory consumption of OutputCommitCoordinator
OutputCommitCoordinator uses a map in a place where an array would suffice, increasing its memory consumption for result stages with millions of tasks.

This patch replaces that map with an array. The only tricky part of this is reasoning about the range of possible array indexes in order to make sure that we never index out of bounds.
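
The map-to-array change and its bounds concern can be sketched like this in Python. `StageCommitState`, the `-1` sentinel and the first-come-first-served rule are simplifying assumptions, not the coordinator's actual logic.

```python
NO_AUTHORIZED_COMMITTER = -1


class StageCommitState:
    """Tracks which task attempt may commit each partition of one stage."""

    def __init__(self, num_partitions):
        # One slot per partition; the partition id is the array index.
        self.authorized = [NO_AUTHORIZED_COMMITTER] * num_partitions

    def can_commit(self, partition, attempt):
        # Guard the index explicitly: a dict accepts any key, a list does not,
        # and a negative index would silently wrap around in Python.
        if not 0 <= partition < len(self.authorized):
            raise IndexError("partition %d out of range" % partition)
        current = self.authorized[partition]
        if current == NO_AUTHORIZED_COMMITTER:
            self.authorized[partition] = attempt
            return True
        return current == attempt
```

The array costs one slot per partition regardless of how many partitions ever ask to commit, but avoids the per-entry overhead of a hash map when the partition count is in the millions.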

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9274 from JoshRosen/SPARK-11307.
2015-11-04 17:19:52 -08:00
Davies Liu 1b6a5d4af9 [SPARK-11493] remove bitset from BytesToBytesMap
Since we have 4 bytes as number of records in the beginning of a page, the address can not be zero, so we do not need the bitset.

For performance concerns, the bitset could help speed up false lookups if the slot is empty (because the bitset is smaller than longArray, the cache hit rate will be higher). In practice, the map is 35% - 70% full (use 50% as the average), so only half of the false lookups can benefit from it; all others pay the cost of loading the bitset (and still need to access the longArray anyway).

For aggregation, we always need to access the longArray (insert a new key after false lookup), also confirmed by a benchmark.

For broadcast hash join, there could be a regression, but a simple benchmark showed that there may not be one (most of the lookups are false):

```
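import time

# Assumes a PySpark shell where `sqlContext` is already defined.
sqlContext.range(1<<20).write.parquet("small")
df = sqlContext.read.parquet('small')
for i in range(3):
    t = time.time()
    # Most probe keys miss the small table, so lookups are mostly false.
    df2 = sqlContext.range(1<<26).selectExpr("id * 1111111111 % 987654321 as id2")
    df2.join(df, df.id == df2.id2).count()
    print(time.time() - t)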

Having bitset (used time in seconds):
```
17.5404241085
10.2758829594
10.5786800385
```
After removing bitset (used time in seconds):
```
21.8939979076
12.4132959843
9.97224712372
```
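
The observation that a stored address can never be zero, so zero itself can mark an empty slot, can be sketched with a small open-addressing table in Python. `AddressTable` is hypothetical; the flat list of (address, hash) pairs and the use of the hash as the key are simplifying assumptions.

```python
class AddressTable:
    """Open-addressing table of (address, key hash) pairs in one flat list."""

    def __init__(self, capacity):
        self.capacity = capacity
        # Slot i uses entries 2*i (address) and 2*i + 1 (hash); address 0 means empty.
        self.slots = [0] * (2 * capacity)

    def insert(self, key_hash, address):
        if address == 0:
            raise ValueError("address 0 is reserved as the empty marker")
        for step in range(self.capacity):
            pos = (key_hash + step) % self.capacity
            if self.slots[2 * pos] == 0 or self.slots[2 * pos + 1] == key_hash:
                self.slots[2 * pos] = address
                self.slots[2 * pos + 1] = key_hash
                return
        raise RuntimeError("table is full")

    def lookup(self, key_hash):
        for step in range(self.capacity):
            pos = (key_hash + step) % self.capacity
            address = self.slots[2 * pos]
            if address == 0:
                return 0  # an empty slot ends the probe: the key is absent
            if self.slots[2 * pos + 1] == key_hash:
                return address
        return 0
```

No separate occupancy structure is consulted: the same array read that would fetch the address also answers whether the slot is in use.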

cc rxin nongli

Author: Davies Liu <davies@databricks.com>

Closes #9452 from davies/remove_bitset.
2015-11-04 14:45:02 -08:00
Adam Roberts 701fb50520 [SPARK-10949] Update Snappy version to 1.1.2
This is an updated version of #8995 by a-roberts. Original description follows:

Snappy now supports concatenation of serialized streams, this patch contains a version number change and the "does not support" test is now a "supports" test.

Snappy 1.1.2 changelog mentions:

> snappy-java-1.1.2 (22 September 2015)
> This is a backward compatible release for 1.1.x.
> Add AIX (32-bit) support.
> There is no upgrade for the native libraries of the other platforms.

> A major change since 1.1.1 is a support for reading concatenated results of SnappyOutputStream(s)
> snappy-java-1.1.2-RC2 (18 May 2015)
> Fix #107: SnappyOutputStream.close() is not idempotent
> snappy-java-1.1.2-RC1 (13 May 2015)
> SnappyInputStream now supports reading concatenated compressed results of SnappyOutputStream
> There has been no compressed format change since 1.0.5.x. So You can read the compressed results interchangeably between these versions.
> Fixes a problem when java.io.tmpdir does not exist.

Closes #8995.

Author: Adam Roberts <aroberts@uk.ibm.com>
Author: Josh Rosen <joshrosen@databricks.com>

Closes #9439 from JoshRosen/update-snappy.
2015-11-04 14:03:31 -08:00
Reynold Xin d19f4fda63 [SPARK-11505][SQL] Break aggregate functions into multiple files
functions.scala was getting pretty long. I broke it into multiple files.

I also added explicit data types for some public vals, and renamed aggregate function pretty names to lower case, which is more consistent with rest of the functions.

Author: Reynold Xin <rxin@databricks.com>

Closes #9471 from rxin/SPARK-11505.
2015-11-04 13:44:07 -08:00
Marcelo Vanzin 8790ee6d69 [SPARK-10622][CORE][YARN] Differentiate dead from "mostly dead" executors.
In YARN mode, when preemption is enabled, we may leave executors in a
zombie state while we wait to retrieve the reason for which the executor
exited. This is so that we don't account for failed tasks that were
running on a preempted executor.

The issue is that while we wait for this information, the scheduler
might decide to schedule tasks on the executor, which will never be
able to run them. Other side effects include the block manager still
considering the executor available to cache blocks, for example.

So, when we know that an executor went down but we don't know why,
stop everything related to the executor, except its running tasks.
Only when we know the reason for the exit (or give up waiting for
it) do we update the running tasks.

This is achieved by a new `disableExecutor()` method in the
`Schedulable` interface. For managers that do not behave like this
(i.e. every one but YARN), the existing `executorLost()` method
will behave the same way it did before.

On top of that change, a few minor changes that made debugging easier,
and fixed some other minor issues:
- The cluster-mode AM was printing a misleading log message every
  time an executor disconnected from the driver (because the akka
  actor system was shared between driver and AM).
- Avoid sending unnecessary requests for an executor's exit reason
  when we already know it was explicitly disabled / killed. This
  avoids both multiple requests, and unnecessary requests that would
  just cause warning messages on the AM (in the explicit kill case).
- Tone down a log message about the executor being lost when it
  exited normally (e.g. preemption)
- Wake up the AM monitor thread when requests for executor loss
  reasons arrive too, so that we can more quickly remove executors
  from this zombie state.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8887 from vanzin/SPARK-10622.
2015-11-04 09:07:22 -08:00
tedyu c09e513987 [SPARK-11442] Reduce numSlices for local metrics test of SparkListenerSuite
In the thread, http://search-hadoop.com/m/q3RTtcQiFSlTxeP/test+failed+due+to+OOME&subj=test+failed+due+to+OOME, it was discussed that memory consumption for SparkListenerSuite should be brought down.

This is an attempt in that direction by reducing numSlices for local metrics test.

Author: tedyu <yuzhihong@gmail.com>

Closes #9384 from tedyu/master.
2015-11-04 10:51:40 +00:00
Marcelo Vanzin 53e9cee3e4 [SPARK-11466][CORE] Avoid mockito in multi-threaded FsHistoryProviderSuite test.
The test functionality should be the same, but without using mockito; logs don't
really say anything useful but I suspect it may be the cause of the flakiness,
since updating mocks when multiple threads may be using it doesn't work very
well. It also allows some other cleanup (= less test code in FsHistoryProvider).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9425 from vanzin/SPARK-11466.
2015-11-03 16:26:28 -08:00
Jacek Laskowski 680b4e7bca Fix typo in WebUI
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9444 from jaceklaskowski/TImely-fix.
2015-11-03 15:26:35 -08:00
Mark Grover b2e4b314d9 [SPARK-9790][YARN] Expose in WebUI if NodeManager is the reason why executors were killed.
Author: Mark Grover <grover.markgrover@gmail.com>

Closes #8093 from markgrover/nm2.
2015-11-03 08:51:40 -08:00
Reynold Xin 57446eb69c [SPARK-11256] Mark all Stage/ResultStage/ShuffleMapStage internal state as private.
Author: Reynold Xin <rxin@databricks.com>

Closes #9219 from rxin/stage-cleanup1.
2015-11-03 07:06:00 -08:00
Jacek Lewandowski 233e534ac4 [SPARK-11344] Made ApplicationDescription and DriverDescription case classes
DriverDescription refactored to case class because it included no mutable fields.

ApplicationDescription had one mutable field, which was appUiUrl. This field was set by the driver to point to the driver web UI. Master was modifying this field when the application was removed to redirect requests to the history server. This was wrong because objects which are sent over the wire should be immutable. Now appUiUrl is immutable in ApplicationDescription and always points to the driver UI even if it is already shut down. The UI URL which master exposes to the user and modifies dynamically is now included in ApplicationInfo - a data object which describes the application state internally in master. That URL in ApplicationInfo is initialised with the value from ApplicationDescription.

ApplicationDescription also included value user, which is now a part of case class fields.
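
The split between an immutable wire object and mutable master-side state can be modelled in Python with a frozen dataclass. The field names below are simplified assumptions; the real classes are Scala and carry more fields.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ApplicationDescription:
    """Sent over the wire, so every field is read-only after construction."""
    name: str
    user: str
    app_ui_url: str


@dataclass
class ApplicationInfo:
    """Master-side state; the UI URL shown to users may change over time."""
    desc: ApplicationDescription
    ui_url: str = field(init=False)

    def __post_init__(self):
        # Start from the driver UI URL carried by the description.
        self.ui_url = self.desc.app_ui_url
```

Redirecting to the history server then means assigning to `ApplicationInfo.ui_url`, while the description keeps the original driver URL.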

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #9299 from jacek-lewandowski/SPARK-11344.
2015-11-03 12:46:11 +00:00
Calvin Jia 476f4348e2 [SPARK-11236] [TEST-MAVEN] [TEST-HADOOP1.0] [CORE] Update Tachyon dependency 0.7.1 -> 0.8.1
This is a reopening of #9204 which failed hadoop1 sbt tests.

With the original PR, a classpath issue would occur due to the MIMA plugin pulling in hadoop-2.2 dependencies regardless of the hadoop version when building the `oldDeps` project. These affect the hadoop1 sbt build because they are placed in `lib_managed` and Tachyon 0.8.0's default hadoop version is 2.2.

Author: Calvin Jia <jia.calvin@gmail.com>

Closes #9395 from calvinjia/spark-11236.
2015-11-02 17:02:31 -08:00
Marcelo Vanzin 71d1c907de [SPARK-10997][CORE] Add "client mode" to netty rpc env.
"Client mode" means the RPC env will not listen for incoming connections.
This allows certain processes in the Spark stack (such as Executors or
the YARN client-mode AM) to act as pure clients when using the netty-based
RPC backend, reducing the number of sockets needed by the app and also the
number of open ports.

Client connections are also preferred when endpoints that actually have
a listening socket are involved; so, for example, if a Worker connects
to a Master and the Master needs to send a message to a Worker endpoint,
that client connection will be used, even though the Worker is also
listening for incoming connections.

With this change, the workaround for SPARK-10987 isn't necessary anymore, and
is removed. The AM connects to the driver in "client mode", and that connection
is used for all driver <-> AM communication, and so the AM is properly notified
when the connection goes down.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9210 from vanzin/SPARK-10997.
2015-11-02 10:26:36 -08:00
Liang-Chi Hsieh e209fa271a [SPARK-11271][SPARK-11016][CORE] Use Spark BitSet instead of RoaringBitmap to reduce memory usage
JIRA: https://issues.apache.org/jira/browse/SPARK-11271

As reported in the JIRA ticket, when there are too many tasks, the memory usage of MapStatus causes problems. Using BitSet instead of RoaringBitmap should be more memory-efficient.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9243 from viirya/mapstatus-bitset.
2015-11-02 08:52:52 +00:00
Marcelo Vanzin f8d93edec8 [SPARK-11073][CORE][YARN] Remove akka dependency in secret key generation.
Use standard JDK APIs for that (with a little help from Guava). Most of the
changes here are in test code, since there were no tests specific to that
part of the code.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9257 from vanzin/SPARK-11073.
2015-11-01 15:57:42 -08:00
Marcelo Vanzin cf04fdfe71 [SPARK-11020][CORE] Wait for HDFS to leave safe mode before initializing HS.
Large HDFS clusters may take a while to leave safe mode when starting; this change
makes the HS wait for that before doing checks about its configuration. This means
the HS won't stop right away if HDFS is in safe mode and the configuration is not
correct, but that should be a very uncommon situation.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9043 from vanzin/SPARK-11020.
2015-11-01 14:42:18 -08:00
Christian Kadner dc7e399fc0 [SPARK-11338] [WEBUI] Prepend app links on HistoryPage with uiRoot path
[SPARK-11338: HistoryPage not multi-tenancy enabled ...](https://issues.apache.org/jira/browse/SPARK-11338)
- `HistoryPage.scala` ...prepending all page links with the web proxy (`uiRoot`) path
- `HistoryServerSuite.scala` ...adding a test case to verify all site-relative links are prefixed when the environment variable `APPLICATION_WEB_PROXY_BASE` (or System property `spark.ui.proxyBase`) is set
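
The link prefixing described above can be sketched as follows. This is a minimal Python illustration, not the Scala code in `HistoryPage.scala`; the function name `prepend_base_uri`, its slash normalization, and the sample paths are assumptions made for the example.

```python
def prepend_base_uri(ui_root, path):
    """Prefix a site-relative link with the proxy base (ui_root may be empty)."""
    # Assumption: links are site-relative, so make sure they start with "/".
    if not path.startswith("/"):
        path = "/" + path
    # Strip a trailing slash from the base to avoid "//" in the result.
    return ui_root.rstrip("/") + path


# Hypothetical proxy base and page links.
links = [prepend_base_uri("/proxy/history", p)
         for p in ("/history/app-1", "/?page=2")]
unproxied = prepend_base_uri("", "/history/app-1")
```

With an empty base the link is returned unchanged, which corresponds to the case where neither the environment variable nor the system property is set.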

Author: Christian Kadner <ckadner@us.ibm.com>

Closes #9291 from ckadner/SPARK-11338 and squashes the following commits:

01d2f35 [Christian Kadner] [SPARK-11338][WebUI] nit fixes
d054bd7 [Christian Kadner] [SPARK-11338][WebUI] prependBaseUri in method makePageLink
8bcb3dc [Christian Kadner] [SPARK-11338][WebUI] Prepend application links on HistoryPage with uiRoot path
2015-11-01 13:09:42 -08:00
Josh Rosen ac4118db2d [SPARK-11424] Guard against double-close() of RecordReaders
**TL;DR**: We can rule out one rare but potential cause of input stream corruption via defensive programming.

## Background

[MAPREDUCE-5918](https://issues.apache.org/jira/browse/MAPREDUCE-5918) is a bug where an instance of a decompressor ends up getting placed into a pool multiple times. Since the pool is backed by a list instead of a set, this can lead to the same decompressor being used in different places at the same time, which is not safe because those decompressors will overwrite each other's buffers. Sometimes this buffer sharing will lead to exceptions, but other times it might silently result in invalid / garbled input.

That Hadoop bug is fixed in Hadoop 2.7 but is still present in many Hadoop versions that we wish to support. As a result, I think that we should try to work around this issue in Spark via defensive programming to prevent RecordReaders from being closed multiple times.

So far, I've had a hard time coming up with explanations of exactly how double-`close()`s occur in practice, but I do have a couple of explanations that work on paper.

For instance, it looks like https://github.com/apache/spark/pull/7424, added in 1.5, introduces at least one extremely rare corner-case path where Spark could double-close() a LineRecordReader instance in a way that triggers the bug. Here are the steps involved in the bad execution that I brainstormed up:

* [The task has finished reading input, so we call close()](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L168).
* [While handling the close call and trying to close the reader, reader.close() throws an exception](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L190).
* We don't set `reader = null` after handling this exception, so the [TaskCompletionListener also ends up calling NewHadoopRDD.close()](https://github.com/apache/spark/blob/v1.5.1/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L156), which, in turn, closes the record reader again.

In this hypothetical situation, `LineRecordReader.close()` could [fail with an exception if its InputStream failed to close](https://github.com/apache/hadoop/blob/release-1.2.1/src/mapred/org/apache/hadoop/mapred/LineRecordReader.java#L212).
I googled for "Exception in RecordReader.close()" and it looks like it's possible for a closed Hadoop FileSystem to trigger an error there: [SPARK-757](https://issues.apache.org/jira/browse/SPARK-757), [SPARK-2491](https://issues.apache.org/jira/browse/SPARK-2491)

Looking at [SPARK-3052](https://issues.apache.org/jira/browse/SPARK-3052), it seems like it's possible to get spurious exceptions there when there is an error reading from Hadoop. If the Hadoop FileSystem were to get into an error state _right_ after reading the last record then it looks like we could hit the bug here in 1.5.

## The fix

This patch guards against these issues by modifying `HadoopRDD.close()` and `NewHadoopRDD.close()` so that they set `reader = null` even if an exception occurs in the `reader.close()` call. In addition, I modified `NextIterator.closeIfNeeded()` to guard against double-close if the first `close()` call throws an exception.

I don't have an easy way to test this, since I haven't been able to reproduce the bug that prompted this patch, but these changes seem safe and seem to rule out the on-paper reproductions that I was able to brainstorm up.
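
The guard described above can be sketched as follows. This is a Python illustration of the idea (clear the reader reference even when `close()` throws), not the Scala code in `HadoopRDD` or `NewHadoopRDD`; `GuardedReader` and `FlakyReader` are hypothetical names introduced for the example.

```python
class GuardedReader:
    """Wraps a closeable reader so the underlying close() runs at most once."""

    def __init__(self, reader):
        self._reader = reader

    def close(self):
        if self._reader is None:
            return  # close was already attempted; do nothing
        try:
            self._reader.close()
        finally:
            # Clear the reference even if close() raised, so that a later
            # caller cannot close the same reader a second time.
            self._reader = None


class FlakyReader:
    """Hypothetical reader whose close() always raises; counts the calls."""

    def __init__(self):
        self.close_calls = 0

    def close(self):
        self.close_calls += 1
        raise IOError("stream failed to close")


flaky = FlakyReader()
guarded = GuardedReader(flaky)
try:
    guarded.close()  # first close: the exception propagates to the caller
except IOError:
    pass
guarded.close()      # second close (e.g. from a completion listener): no-op
```

Without the `finally` clause the second call would reach `FlakyReader.close()` again, which is the double-close path the patch is meant to rule out.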

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9382 from JoshRosen/hadoop-decompressor-pooling-fix and squashes the following commits:

5ec97d7 [Josh Rosen] Add SqlNewHadoopRDD.unsetInputFileName() that I accidentally deleted.
ae46cf4 [Josh Rosen] Merge remote-tracking branch 'origin/master' into hadoop-decompressor-pooling-fix
087aa63 [Josh Rosen] Guard against double-close() of RecordReaders.
2015-10-31 10:47:22 -07:00
Yin Huai e8ec2a7b01 Revert "[SPARK-11236][CORE] Update Tachyon dependency from 0.7.1 -> 0.8.0."
This reverts commit 4f5e60c647.
2015-10-30 16:12:33 -07:00
Davies Liu 45029bfdea [SPARK-11423] remove MapPartitionsWithPreparationRDD
Since we do not need to preserve a page before calling compute(), MapPartitionsWithPreparationRDD is not needed anymore.

This PR basically reverts #8543, #8511, #8038, #8011.

Author: Davies Liu <davies@databricks.com>

Closes #9381 from davies/remove_prepare2.
2015-10-30 15:47:40 -07:00
Sun Rui fab710a917 [SPARK-11414][SPARKR] Forgot to update usage of 'spark.sparkr.r.command' in RRDD in the PR for SPARK-10971.
Author: Sun Rui <rui.sun@intel.com>

Closes #9368 from sun-rui/SPARK-11414.
2015-10-30 10:51:11 -07:00
Iulian Dragos 0451b00148 [SPARK-10986][MESOS] Set the context class loader in the Mesos executor backend.
See [SPARK-10986](https://issues.apache.org/jira/browse/SPARK-10986) for details.

This fixes the `ClassNotFoundException` for Spark classes in the serializer.

I am not sure this is the right way to handle the class loader, but I couldn't find any documentation on how the context class loader is used and who relies on it. It seems at least the serializer uses it to instantiate classes during deserialization.

I am open to suggestions (I tried this fix on a real Mesos cluster and it *does* fix the issue).

tnachen andrewor14

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #9282 from dragos/issue/mesos-classloader.
2015-10-30 16:51:32 +00:00
Davies Liu 56419cf11f [SPARK-10342] [SPARK-10309] [SPARK-10474] [SPARK-10929] [SQL] Cooperative memory management
This PR introduces a mechanism to call spill() on those SQL operators that support spilling (for example, BytesToBytesMap, UnsafeExternalSorter and ShuffleExternalSorter) if there is not enough memory for execution. The preserved first page is not needed anymore, so it is removed.

Other Spillable objects in Spark core (ExternalSorter and AppendOnlyMap) are not included in this PR, but those could benefit from this (trigger others' spilling).

The PrepareRDD may not be needed anymore and could be removed in a follow-up PR.

The following script failed with OOM before this PR; with this PR it finishes in 150 seconds with a 2G heap (it also works in the 1.5 branch, with similar duration).

```python
sqlContext.setConf("spark.sql.shuffle.partitions", "1")
df = sqlContext.range(1<<25).selectExpr("id", "repeat(id, 2) as s")
df2 = df.select(df.id.alias('id2'), df.s.alias('s2'))
j = df.join(df2, df.id==df2.id2).groupBy(df.id).max("id", "id2")
j.explain()
print(j.count())
```

For thread-safety, here is what I've got:

1) Without calling spill(), the operators should only be used by a single thread, so there are no safety problems.

2) spill() could be triggered in two cases: by the operator itself, or by other operators. We can check trigger == this in spill(); in that case it's still in the same thread, so there are no safety problems.

3) if it's triggered by other operators (right now cache will not trigger spill()), we only spill the data into disk when it's in scanning stage (building is finished), so the in-memory sorter or memory pages are read-only, we only need to synchronize the iterator and change it.

4) During scanning, the iterator will only use one record in one page, we can't free this page, because the downstream is currently using it (used by UnsafeRow or other objects). In BytesToBytesMap, we just skip the current page, and dump all others into disk. In UnsafeExternalSorter, we keep the page that is used by current record (having the same baseObject), free it when loading the next record. In ShuffleExternalSorter, the spill() will not trigger during scanning.

5) In order to avoid deadlock, we didn't call acquireMemory during spill (so we reused the pointer array in InMemorySorter).
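
The cooperative scheme described above can be sketched with a toy model. This is a Python illustration under stated assumptions, not Spark's memory manager: the class names `MemoryPool` and `SpillableConsumer`, the "other consumers first, then the requester" order, and the byte counts are all made up for the example, and the model is single-threaded, so it does not cover the synchronization in points 3) and 4).

```python
class MemoryPool:
    """Toy execution-memory pool that asks consumers to spill when short."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.consumers = []

    def acquire(self, requester, nbytes):
        # Assumed order: ask the other consumers to spill first, then the
        # requester itself, stopping as soon as enough memory is free.
        others = [c for c in self.consumers if c is not requester]
        for consumer in others + [requester]:
            if self.capacity - self.used >= nbytes:
                break
            consumer.spill(trigger=requester)
        if self.capacity - self.used < nbytes:
            return False  # still not enough memory after spilling
        self.used += nbytes
        requester.held += nbytes
        return True

    def release(self, consumer, nbytes):
        consumer.held -= nbytes
        self.used -= nbytes


class SpillableConsumer:
    """Toy operator that can give back all of its memory by spilling."""

    def __init__(self, name, pool):
        self.name, self.pool = name, pool
        self.held, self.spilled = 0, 0
        pool.consumers.append(self)

    def spill(self, trigger):
        # `trigger is self` would be the "triggered by itself" case.
        freed = self.held
        self.spilled += freed
        # Only release here; never acquire memory while spilling.
        self.pool.release(self, freed)
        return freed


pool = MemoryPool(capacity=100)
a = SpillableConsumer("a", pool)
b = SpillableConsumer("b", pool)
ok_a = pool.acquire(a, 80)  # fits without spilling
ok_b = pool.acquire(b, 50)  # forces `a` to spill its 80 bytes
```

The `spill()` method only releases memory and never calls `acquire()`, which mirrors the deadlock-avoidance rule in point 5).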

Author: Davies Liu <davies@databricks.com>

Closes #9241 from davies/force_spill.
2015-10-29 23:38:06 -07:00
Calvin Jia 4f5e60c647 [SPARK-11236][CORE] Update Tachyon dependency from 0.7.1 -> 0.8.0.
Upgrades the tachyon-client version to the latest release.

No new dependencies are added and no spark facing APIs are changed. The removal of the `tachyon-underfs-s3` exclusion will enable users to use S3 out of the box and there are no longer any additional external dependencies added by the module.

Author: Calvin Jia <jia.calvin@gmail.com>

Closes #9204 from calvinjia/spark-11236.
2015-10-29 15:13:38 -07:00
Kay Ousterhout b960a89056 [SPARK-11178] Improving naming around task failures.
Commit af3bc59d1f introduced new
functionality so that if an executor dies for a reason that's not
caused by one of the tasks running on the executor (e.g., due to
pre-emption), Spark doesn't count the failure towards the maximum
number of failures for the task.  That commit introduced some vague
naming that this commit attempts to fix; in particular:

(1) The variable "isNormalExit", which was used to refer to cases where
the executor died for a reason unrelated to the tasks running on the
machine, has been renamed (and reversed) to "exitCausedByApp". The problem
with the existing name is that it's not clear (at least to me!) what it
means for an exit to be "normal"; the new name is intended to make the
purpose of this variable more clear.

(2) The variable "shouldEventuallyFailJob" has been renamed to
"countTowardsTaskFailures". This variable is used to determine whether
a task's failure should be counted towards the maximum number of failures
allowed for a task before the associated Stage is aborted. The problem
with the existing name is that it can be confused with implying that
the task's failure should immediately cause the stage to fail because it
is somehow fatal (this is the case for a fetch failure, for example: if
a task fails because of a fetch failure, there's no point in retrying,
and the whole stage should be failed).
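
The meaning of the renamed flag can be sketched as follows. This is a Python illustration only, not the scheduler's Scala code; `TaskFailureTracker`, `record_failure`, and the limit of 2 are hypothetical, and only the flag name mirrors "countTowardsTaskFailures".

```python
class TaskFailureTracker:
    """Counts per-task failures and reports when the limit is reached."""

    def __init__(self, max_failures):
        self.max_failures = max_failures
        self.failures = {}

    def record_failure(self, task_id, count_towards_task_failures):
        """Return True when the associated stage should be aborted."""
        if not count_towards_task_failures:
            # E.g. the executor was pre-empted: the task is retried, but the
            # failure does not move it closer to the limit.
            return False
        self.failures[task_id] = self.failures.get(task_id, 0) + 1
        return self.failures[task_id] >= self.max_failures


tracker = TaskFailureTracker(max_failures=2)
# Five executor losses that were not caused by the app: never aborts.
preempted = [tracker.record_failure(0, False) for _ in range(5)]
first = tracker.record_failure(0, True)   # counted, below the limit
second = tracker.record_failure(0, True)  # counted, limit reached
```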

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #9164 from kayousterhout/SPARK-11178.
2015-10-27 16:55:10 -07:00
zsxwing 9fbd75ab5d [SPARK-11212][CORE][STREAMING] Make preferred locations support ExecutorCacheTaskLocation and update…
… ReceiverTracker and ReceiverSchedulingPolicy to use it

This PR includes the following changes:

1. Add a new preferred location format, `executor_<host>_<executorID>` (e.g., "executor_localhost_2"), to support specifying the executor locations for RDD.
2. Use the new preferred location format in `ReceiverTracker` to optimize the starting time of Receivers when there are multiple executors in a host.

The goal of this PR is to enable the streaming scheduler to place receivers (which run as tasks) in specific executors. Basically, I want to have more control over the placement of the receivers so that they are evenly distributed among the executors. We tried to do this without changing the core scheduling logic, but it does not allow specifying a particular executor as the preferred location, only a host. So if there are two executors on the same host, and I want two receivers to run on them (one on each executor), I cannot specify that. The current code only specifies the host as a preference, which may end up launching both receivers on the same executor. We tried to work around this by restarting a receiver when it does not launch in the desired executor, hoping that next time it will be started in the right one. But that causes lots of restarts, and delays in correctly launching the receiver.

So this change would allow the streaming scheduler to specify the exact executor as the preferred location. This is not exposed to the user; only the streaming scheduler uses it.
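
The `executor_<host>_<executorID>` format from item 1 can be parsed along these lines. This is a Python sketch, not Spark's `TaskLocation` code; the function name `parse_location` is hypothetical, and splitting at the first underscore after the prefix assumes that host names contain no underscores.

```python
EXECUTOR_PREFIX = "executor_"


def parse_location(location):
    """Return (host, executor_id); executor_id is None for host-only strings."""
    if not location.startswith(EXECUTOR_PREFIX):
        return location, None  # plain host-level preference
    rest = location[len(EXECUTOR_PREFIX):]
    # Assumption: the host has no "_", so the first "_" separates the parts.
    host, sep, executor_id = rest.partition("_")
    if not sep or not host or not executor_id:
        raise ValueError("malformed executor location: %r" % location)
    return host, executor_id


executor_level = parse_location("executor_localhost_2")
host_level = parse_location("localhost")
```

A scheduler that receives `("localhost", "2")` can target one executor on the host, while `("localhost", None)` leaves the choice among that host's executors open.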

Author: zsxwing <zsxwing@gmail.com>

Closes #9181 from zsxwing/executor-location.
2015-10-27 16:14:33 -07:00
Kay Ousterhout 9fc16a82ad [SPARK-11306] Fix hang when JVM exits.
This commit fixes a bug where, in Standalone mode, if a task fails and crashes the JVM, the
failure is considered a "normal failure" (meaning it's considered unrelated to the task), so
the failure isn't counted against the task's maximum number of failures:
af3bc59d1f (diff-a755f3d892ff2506a7aa7db52022d77cL138).
As a result, if a task fails in a way that results in it crashing the JVM, it will continuously be
re-launched, resulting in a hang. This commit fixes that problem.

This bug was introduced by #8007; andrewor14 mccheah vanzin can you take a look at this?

This error is hard to trigger because we handle executor losses through 2 code paths (the second is via Akka, where Akka notices that the executor endpoint is disconnected).  In my setup, the Akka code path completes first, and doesn't have this bug, so things work fine (see my recent email to the dev list about this).  If I manually disable the Akka code path, I can see the hang (and this commit fixes the issue).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #9273 from kayousterhout/SPARK-11306.
2015-10-27 10:46:43 -07:00
Sem Mulder feb8d6a44f [SPARK-11276][CORE] SizeEstimator prevents class unloading
The SizeEstimator keeps a cache of ClassInfos, but this cache uses Class objects as keys,
which results in strong references to the Class objects. If these classes are dynamically created,
this prevents the corresponding ClassLoader from being GCed, leading to PermGen exhaustion.

We use a Map with WeakKeys to prevent this issue.
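
The effect of weak keys can be illustrated in Python with `weakref.WeakKeyDictionary`. This is an analogy to the JVM fix rather than the SizeEstimator code itself; `ClassInfoCache` and the dynamically created `Generated` class are hypothetical, and the collection behaviour shown relies on CPython's garbage collector.

```python
import gc
import weakref


class ClassInfoCache:
    """Caches per-class info without keeping the class objects alive."""

    def __init__(self):
        # Keys are held weakly: an entry disappears once its class is collected.
        self._cache = weakref.WeakKeyDictionary()

    def get(self, cls):
        info = self._cache.get(cls)
        if info is None:
            # Placeholder "class info". It must not reference `cls` strongly,
            # otherwise the weak key would never be released.
            info = tuple(n for n in vars(cls) if not n.startswith("__"))
            self._cache[cls] = info
        return info

    def __len__(self):
        return len(self._cache)


cache = ClassInfoCache()
generated = type("Generated", (object,), {"x": 0, "y": 0})  # dynamic class
info = cache.get(generated)
entries_while_alive = len(cache)

del generated   # drop the last strong reference to the class
gc.collect()    # classes sit in reference cycles, so force a collection
entries_after_unload = len(cache)
```

With an ordinary `dict` in place of the `WeakKeyDictionary`, the cache itself would keep `Generated` alive and the entry count would stay at one.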

Author: Sem Mulder <sem.mulder@site2mobile.com>

Closes #9244 from SemMulder/fix-sizeestimator-classunloading.
2015-10-27 07:55:10 +00:00
Sun Rui dc3220ce11 [SPARK-11209][SPARKR] Add window functions into SparkR [step 1].
Author: Sun Rui <rui.sun@intel.com>

Closes #9193 from sun-rui/SPARK-11209.
2015-10-26 20:58:18 -07:00
Kevin Yu 616be29c7f [SPARK-5966][WIP] Spark-submit deploy-mode cluster is not compatible with master local>
… master local>

Author: Kevin Yu <qyu@us.ibm.com>

Closes #9220 from kevinyu98/working_on_spark-5966.
2015-10-26 09:35:19 +00:00
Josh Rosen 85e654c5ec [SPARK-10984] Simplify *MemoryManager class structure
This patch refactors the MemoryManager class structure. After #9000, Spark had the following classes:

- MemoryManager
- StaticMemoryManager
- ExecutorMemoryManager
- TaskMemoryManager
- ShuffleMemoryManager

This is fairly confusing. To simplify things, this patch consolidates several of these classes:

- ShuffleMemoryManager and ExecutorMemoryManager were merged into MemoryManager.
- TaskMemoryManager is moved into Spark Core.

**Key changes and tasks**:

- [x] Merge ExecutorMemoryManager into MemoryManager.
  - [x] Move pooling logic into Allocator.
- [x] Move TaskMemoryManager from `spark-unsafe` to `spark-core`.
- [x] Refactor the existing Tungsten TaskMemoryManager interactions so Tungsten code uses only this and not both this and ShuffleMemoryManager.
- [x] Refactor non-Tungsten code to use the TaskMemoryManager instead of ShuffleMemoryManager.
- [x] Merge ShuffleMemoryManager into MemoryManager.
  - [x] Move code
  - [x] ~~Simplify 1/n calculation.~~ **Will defer to followup, since this needs more work.**
- [x] Port ShuffleMemoryManagerSuite tests.
- [x] Move classes from `unsafe` package to `memory` package.
- [ ] Figure out how to handle the hacky use of the memory managers in HashedRelation's broadcast variable construction.
- [x] Test porting and cleanup: several tests relied on mock functionality (such as `TestShuffleMemoryManager.markAsOutOfMemory`) which has been changed or broken during the memory manager consolidation
  - [x] AbstractBytesToBytesMapSuite
  - [x] UnsafeExternalSorterSuite
  - [x] UnsafeFixedWidthAggregationMapSuite
  - [x] UnsafeKVExternalSorterSuite

**Compatibility notes**:

- This patch introduces breaking changes in `ExternalAppendOnlyMap`, which is marked as `DeveloperApi` (likely for legacy reasons): this class now cannot be used outside of a task.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #9127 from JoshRosen/SPARK-10984.
2015-10-25 21:19:52 -07:00
Bryan Cutler 80279ac187 [SPARK-11287] Fixed class name to properly start TestExecutor from deploy.client.TestClient
Executing deploy.client.TestClient fails due to bad class name for TestExecutor in ApplicationDescription.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9255 from BryanCutler/fix-TestClient-classname-SPARK-11287.
2015-10-25 19:05:45 +00:00
Jacek Laskowski 146da0d810 Fix typos
Two typos squashed.

BTW, let me know how to proceed with other typos if I run across any. I'm not comfortable leaving them unfixed, nor with sending pull requests for such tiny changes. Guide me.

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9250 from jaceklaskowski/typos-hunting.
2015-10-25 01:33:22 +01:00
Jeff Zhang ffed00493a [SPARK-11125] [SQL] Uninformative exception when running spark-sql witho…
…ut building with -Phive-thriftserver and SPARK_PREPEND_CLASSES is set

This is the exception after this patch. Please help review.
```
java.lang.NoClassDefFoundError: org/apache/hadoop/hive/cli/CliDriver
	at java.lang.ClassLoader.defineClass1(Native Method)
	at java.lang.ClassLoader.defineClass(ClassLoader.java:800)
	at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
	at java.net.URLClassLoader.defineClass(URLClassLoader.java:449)
	at java.net.URLClassLoader.access$100(URLClassLoader.java:71)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:361)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:412)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:270)
	at org.apache.spark.util.Utils$.classForName(Utils.scala:173)
	at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:647)
	at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
	at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hive.cli.CliDriver
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	... 21 more
Failed to load hive class.
You need to build Spark with -Phive and -Phive-thriftserver.
```

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9134 from zjffdu/SPARK-11125.
2015-10-23 22:56:55 -07:00
Sun Rui 2462dbcce8 [SPARK-10971][SPARKR] RRunner should allow setting path to Rscript.
Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes.

The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).

BTW, [environment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate R shell on the local host.

For your information, PySpark has two environment variables serving a similar purpose:
PYSPARK_PYTHON	      Python binary executable to use for PySpark in both driver and workers (default is `python`).
PYSPARK_DRIVER_PYTHON	Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).
PySpark uses the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the python executable for a python script.

Author: Sun Rui <rui.sun@intel.com>

Closes #9179 from sun-rui/SPARK-10971.
2015-10-23 21:38:04 -07:00
Jacek Laskowski b1c1597e3c Fix a (very tiny) typo
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #9230 from jaceklaskowski/utils-seconds-typo.
2015-10-22 22:42:15 -07:00
Marcelo Vanzin fa6a4fbf08 [SPARK-11134][CORE] Increase LauncherBackendSuite timeout.
This test can take a little while to finish on slow / loaded machines.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9235 from vanzin/SPARK-11134.
2015-10-22 22:41:21 -07:00
zsxwing a88c66ca87 [SPARK-11098][CORE] Add Outbox to cache the sending messages to resolve the message disorder issue
The current NettyRpc has a message order issue because it uses a thread pool to send messages. E.g., running the following two lines in the same thread,

```
ref.send("A")
ref.send("B")
```

The remote endpoint may see "B" before "A" because sending "A" and "B" are in parallel.
To resolve this issue, this PR adds an outbox for each connection. If the connection to the remote node is still being established when messages are sent, the messages are cached in the outbox and sent one by one, in order, once the connection is established.
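
The outbox idea can be sketched as follows. This is a single-threaded Python illustration, not the NettyRpc implementation; the `Outbox` class shape, the `deliver` callback, and the `on_connected` hook are assumptions made for the example, and a real implementation would also need locking.

```python
from collections import deque


class Outbox:
    """Buffers messages for one connection and delivers them in send order."""

    def __init__(self, deliver):
        self._deliver = deliver    # callable that actually sends a message
        self._pending = deque()    # FIFO buffer preserves ordering
        self._connected = False

    def send(self, message):
        # Always enqueue first, so a new message can never overtake
        # messages that are still waiting for the connection.
        self._pending.append(message)
        self._drain()

    def on_connected(self):
        self._connected = True
        self._drain()

    def _drain(self):
        if not self._connected:
            return  # keep caching until the connection is established
        while self._pending:
            self._deliver(self._pending.popleft())


received = []
outbox = Outbox(deliver=received.append)
outbox.send("A")          # cached: not connected yet
outbox.send("B")          # cached behind "A"
buffered_before_connect = list(received)
outbox.on_connected()     # flushes "A" then "B"
outbox.send("C")          # delivered immediately, after "A" and "B"
```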

Author: zsxwing <zsxwing@gmail.com>

Closes #9197 from zsxwing/rpc-outbox.
2015-10-22 21:01:01 -07:00
Andrew Or 34e71c6d89 [SPARK-11251] Fix page size calculation in local mode
```
// My machine only has 8 cores
$ bin/spark-shell --master local[32]
scala> val df = sc.parallelize(Seq((1, 1), (2, 2))).toDF("a", "b")
scala> df.as("x").join(df.as("y"), $"x.a" === $"y.a").count()

Caused by: java.io.IOException: Unable to acquire 2097152 bytes of memory
	at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPage(UnsafeExternalSorter.java:351)
```

Author: Andrew Or <andrew@databricks.com>

Closes #9209 from andrewor14/fix-local-page-size.
2015-10-22 15:58:08 -07:00
Kay Ousterhout 3535b91ddc [SPARK-11163] Remove unnecessary addPendingTask calls.
This commit removes unnecessary calls to addPendingTask in
TaskSetManager.executorLost. These calls are unnecessary: for
tasks that are still pending and haven't been launched, they're
still in all of the correct pending lists, so calling addPendingTask
has no effect. For tasks that are currently running (which may still be
in the pending lists, depending on how they were scheduled), we call
addPendingTask in handleFailedTask, so the calls at the beginning
of executorLost are redundant.

I think these calls are left over from when we re-computed the locality
levels in addPendingTask; now that we call recomputeLocality separately,
I don't think these are necessary.

Now that those calls are removed, the readding parameter in addPendingTask
is no longer necessary, so this commit also removes that parameter.

markhamstra can you take a look at this?

cc vanzin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #9154 from kayousterhout/SPARK-11163.
2015-10-22 11:39:06 -07:00
zsxwing 7bb6d31cff [SPARK-11232][CORE] Use 'offer' instead of 'put' to make sure calling send won't be interrupted
The current `NettyRpcEndpointRef.send` can be interrupted because it uses `LinkedBlockingQueue.put`, which may hang the application.

Imagine the following execution order:

  | thread 1: TaskRunner.kill | thread 2: TaskRunner.run
------------- | ------------- | -------------
1 | killed = true |
2 |  | if (killed) {
3 |  | throw new TaskKilledException
4 |  | case _: TaskKilledException \| _: InterruptedException if task.killed =>
5 | task.kill(interruptThread): interruptThread is true |
6 | | execBackend.statusUpdate(taskId, TaskState.KILLED, ser.serialize(TaskKilled))
7 | | localEndpoint.send(StatusUpdate(taskId, state, serializedData)): in LocalBackend

Then `localEndpoint.send(StatusUpdate(taskId, state, serializedData))` will throw `InterruptedException`. This will prevent the executor from updating the task status and hang the application.

A failure caused by the above issue can be seen here: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44062/consoleFull

Since `receivers` is an unbounded `LinkedBlockingQueue`, we can just use `LinkedBlockingQueue.offer` to resolve this issue.

Author: zsxwing <zsxwing@gmail.com>

Closes #9198 from zsxwing/dont-interrupt-send.
2015-10-22 11:31:47 -07:00
Josh Rosen f6d06adf05 [SPARK-10708] Consolidate sort shuffle implementations
There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
2015-10-22 09:46:30 -07:00
zhichao.li c03b6d1158 [SPARK-11121][CORE] Correct the TaskLocation type
Correct the logic to return an `HDFSCacheTaskLocation` instance when the input `str` is an in-memory location.

Author: zhichao.li <zhichao.li@intel.com>

Closes #9096 from zhichao-li/uselessBranch.
2015-10-22 03:59:26 -07:00
Reynold Xin 555b2086a1 Minor cleanup of ShuffleMapStage.outputLocs code.
I was looking at this code and found the documentation to be insufficient. I added more documentation, and refactored some relevant code paths slightly to improve encapsulation. There is more that I want to do, but I want to get these changes in before doing more work.

My goal is to reduce exposing internal fields directly in ShuffleMapStage to improve encapsulation. After this change, DAGScheduler no longer directly writes outputLocs. There are still 3 places that read outputLocs directly, but we can change those later.

Author: Reynold Xin <rxin@databricks.com>

Closes #9175 from rxin/stage-cleanup.
2015-10-21 15:33:13 -07:00
Holden Karau e18b571c33 [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9
Upgrade to Py4j0.9

Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>

Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
2015-10-20 10:52:49 -07:00
Jakob Odersky 60851bc7bf [SPARK-11110][BUILD] Remove transient annotation for parameters.
`@transient` annotations on class parameters (not case class parameters or vals) cause compilation errors with Scala 2.11.
I understand that transient *parameters* make no sense, however I don't quite understand why the 2.10 compiler accepted them.

Note: if it is preferred to keep the annotations, in case someone wants to redefine the parameters as vals in the future, it would also be possible to just add `val` after the annotation, e.g. `class Foo(@transient x: Int)` becomes `class Foo(@transient private val x: Int)`.

I chose to remove the annotation as it also reduces needless clutter, however please feel free to tell me if you prefer the second option and I'll update the PR.

Author: Jakob Odersky <jodersky@gmail.com>

Closes #9126 from jodersky/sbt-scala-2.11.
2015-10-20 08:54:34 +01:00
Jean-Baptiste Onofré 8f74aa6397 [SPARK-10876] Display total uptime for completed applications
Author: Jean-Baptiste Onofré <jbonofre@apache.org>

Closes #9059 from jbonofre/SPARK-10876.
2015-10-20 08:45:39 +01:00
Ryan Williams 16906ef23a [SPARK-11120] Allow sane default number of executor failures when dynamically allocating in YARN
I also added some information to container-failure error msgs about what host they failed on, which would have helped me identify the problem that led me to this JIRA and PR sooner.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #9147 from ryan-williams/dyn-exec-failures.
2015-10-19 16:34:15 -07:00
Chris Bannister fc26f32cf1 [SPARK-9708][MESOS] Spark should create local temporary directories in Mesos sandbox when launched with Mesos
This is my own original work and I license this to the project under the project's open source license

Author: Chris Bannister <chris.bannister@swiftkey.com>
Author: Chris Bannister <chris.bannister@swiftkey.net>

Closes #8358 from Zariel/mesos-local-dir.
2015-10-19 16:24:40 -07:00
Liang-Chi Hsieh a1413b3662 [SPARK-11051][CORE] Do not allow local checkpointing after the RDD is materialized and checkpointed
JIRA: https://issues.apache.org/jira/browse/SPARK-11051

When an `RDD` is materialized and checkpointed, its partitions and dependencies are cleared. If we allow local checkpointing on it and assign `LocalRDDCheckpointData` to its `checkpointData`, an error will be thrown the next time the RDD is materialized.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9072 from viirya/no-localcheckpoint-after-checkpoint.
2015-10-19 16:16:31 -07:00
Marcelo Vanzin 7ab0ce6501 [SPARK-11131][CORE] Fix race in worker registration protocol.
Because the registration RPC was not really an RPC, but a bunch of
disconnected messages, it was possible for other messages to be
sent before the reply to the registration arrived, and that would
confuse the Worker. Especially in local-cluster mode, the worker was
susceptible to receiving an executor request before it received a
message from the master saying registration succeeded.

On top of the above, the change also fixes a ClassCastException when
the registration fails, which also affects the executor registration
protocol. Because the `ask` is issued with a specific return type,
if the error message (of a different type) was returned instead, the
code would just die with an exception. This is fixed by having a common
base trait for these reply messages.
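
The second fix can be pictured with a small sketch. It is written in Python rather than Spark's Scala, and every name in it (`RegisterReply`, `Registered`, `RegistrationFailed`, `handle_reply`) is hypothetical; it only illustrates why giving the success and failure replies a shared base type lets the caller of a typed `ask` handle both outcomes.

```python
from dataclasses import dataclass


class RegisterReply:
    """Hypothetical common base type for every reply to a registration request."""


@dataclass
class Registered(RegisterReply):
    master_url: str


@dataclass
class RegistrationFailed(RegisterReply):
    reason: str


def handle_reply(reply: RegisterReply) -> str:
    # The caller asks for the base type, so a failure reply is a valid
    # result rather than an unexpected type that blows up on arrival.
    if isinstance(reply, Registered):
        return f"registered with {reply.master_url}"
    if isinstance(reply, RegistrationFailed):
        return f"registration failed: {reply.reason}"
    raise TypeError(f"unexpected reply: {reply!r}")
```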

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9138 from vanzin/SPARK-11131.
2015-10-19 16:14:50 -07:00
Alex Angelini dfa41e63b9 [SPARK-9643] Upgrade pyrolite to 4.9
Includes: https://github.com/irmen/Pyrolite/pull/23 which fixes datetimes with timezones.

JoshRosen

https://issues.apache.org/jira/browse/SPARK-9643

Author: Alex Angelini <alex.louis.angelini@gmail.com>

Closes #7950 from angelini/upgrade_pyrolite_up.
2015-10-19 10:07:39 -07:00
Jacek Laskowski bd64c2d550 [SPARK-10921][YARN] Completely remove the use of SparkContext.prefer…
…redNodeLocationData

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #8976 from jaceklaskowski/SPARK-10921.
2015-10-19 09:59:18 +01:00
ph 022a8f6a1f [SPARK-11129] [MESOS] Link Spark WebUI from Mesos WebUI
Mesos has a feature for linking to frameworks running on top of Mesos
from the Mesos WebUI. This commit enables Spark to make use of this
feature so one can directly visit the running Spark WebUIs from the
Mesos WebUI.

Author: ph <ph@plista.com>

Closes #9135 from philipphoffmann/SPARK-11129.
2015-10-17 15:37:51 -07:00
Reynold Xin 2549374206 [SPARK-11165] Logging trait should be private - not DeveloperApi.
Its classdoc actually says: "NOTE: DO NOT USE this class outside of Spark. It is intended as an internal utility."

Author: Reynold Xin <rxin@databricks.com>

Closes #9155 from rxin/private-logging-trait.
2015-10-17 12:41:42 -07:00
gweidner ac09a3a465 [SPARK-11109] [CORE] Move FsHistoryProvider off deprecated AccessControlException
Switched from deprecated org.apache.hadoop.fs.permission.AccessControlException to org.apache.hadoop.security.AccessControlException.

Author: gweidner <gweidner@us.ibm.com>

Closes #9144 from gweidner/SPARK-11109.
2015-10-16 14:02:12 -07:00
navis.ryu b9c5e5d4ac [SPARK-11124] JsonParser/Generator should be closed for resource recycle
Some JSON parsers are not closed, for example the parser in JacksonParser#parseJson.

Author: navis.ryu <navis@apache.org>

Closes #9130 from navis/SPARK-11124.
2015-10-16 11:19:37 -07:00
Andrew Or 3b364ff0a4 [SPARK-11078] Ensure spilling tests actually spill
#9084 uncovered that many tests that test spilling don't actually spill. This is a follow-up patch to fix that to ensure our unit tests actually catch potential bugs in spilling. The size of this patch is inflated by the refactoring of `ExternalSorterSuite`, which had a lot of duplicate code and logic.

Author: Andrew Or <andrew@databricks.com>

Closes #9124 from andrewor14/spilling-tests.
2015-10-15 14:50:01 -07:00
KaiXinXiaoLei 2d000124b7 [SPARK-10515] When killing executor, the pending replacement executors should not be lost
If the heartbeat receiver kills executors (and new ones are not yet registered to replace them), an idle timeout on the old executors causes the pending replacements to be lost (because it changes the total number of executors requested by the driver), so new ones will not be requested to replace them.
For example, suppose executorsPendingToRemove=Set(1), and executor 2 hits its idle timeout before a new executor has been requested to replace executor 1. The driver then kills executor 2 and sends RequestExecutors to the AM. But now executorsPendingToRemove=Set(1,2), so the AM doesn't allocate an executor to replace 1.

see: https://github.com/apache/spark/pull/8668

Author: KaiXinXiaoLei <huleilei1@huawei.com>
Author: huleilei <huleilei1@huawei.com>

Closes #8945 from KaiXinXiaoLei/pendingexecutor.
2015-10-15 14:48:01 -07:00
Carson Wang d45a0d3ca2 [SPARK-11047] Internal accumulators miss the internal flag when replaying events in the history server
Internal accumulators don't write the internal flag to the event log. So on the history server Web UI, no accumulator is marked as internal. This causes an incorrect peak execution memory and an unwanted accumulator table to be displayed on the stage page.
To fix it, I add the "internal" property of AccumulableInfo when writing the event log.
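
A rough sketch of the round trip, in Python rather than Spark's Scala. The function names and JSON field names here are assumptions made for illustration, as is the choice to read a log without the flag as "not internal"; the only point taken from the commit is that the flag must be written for the reader to recover it.

```python
import json


def accumulable_to_json(acc_id, name, value, internal):
    # Field names are illustrative; the point is that the flag is written out.
    return json.dumps(
        {"ID": acc_id, "Name": name, "Value": value, "Internal": internal}
    )


def accumulable_from_json(text):
    data = json.loads(text)
    return {
        "id": data["ID"],
        "name": data["Name"],
        "value": data["Value"],
        # Assumption: a log written without the flag is read as not internal.
        "internal": data.get("Internal", False),
    }
```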

Author: Carson Wang <carson.wang@intel.com>

Closes #9061 from carsonwang/accumulableBug.
2015-10-15 10:36:54 -07:00
shellberg 523adc24a6 [SPARK-11066] Update DAGScheduler's "misbehaved ResultHandler"
Restrict the job to a single task to ensure that the exception asserted as the cause of the job failure is the deliberately thrown DAGSchedulerSuiteDummyException, not an UnsupportedOperationException from a second or subsequent task, which can propagate because of a race condition during code execution.

Author: shellberg <sah@zepler.org>

Closes #9076 from shellberg/shellberg-DAGSchedulerSuite-misbehavedResultHandlerTest-patch-1.
2015-10-15 18:07:10 +01:00
Adam Lewandowski 0f62c2282b [SPARK-11093] [CORE] ChildFirstURLClassLoader#getResources should return all found resources, not just those in the child classloader
Author: Adam Lewandowski <alewandowski@ipcoop.com>

Closes #9106 from alewando/childFirstFix.
2015-10-15 09:45:54 -07:00
Reynold Xin cf2e0ae720 [SPARK-11096] Post-hoc review Netty based RPC implementation - round 2
A few more changes:

1. Renamed IDVerifier -> RpcEndpointVerifier
2. Renamed NettyRpcAddress -> RpcEndpointAddress
3. Simplified NettyRpcHandler a bit by removing the connection count tracking. This is OK because I now force spark.shuffle.io.numConnectionsPerPeer to 1
4. Reduced spark.rpc.connect.threads to 64. It would be great to eventually remove this extra thread pool.
5. Minor cleanup & documentation.

Author: Reynold Xin <rxin@databricks.com>

Closes #9112 from rxin/SPARK-11096.
2015-10-14 12:41:02 -07:00
Tom Graves 135a2ce5b0 [SPARK-10619] Can't sort columns on Executor Page
Should also be picked into Spark 1.5.2.

https://issues.apache.org/jira/browse/SPARK-10619

Looks like this was broken by commit: fb1d06fc24 (diff-b8adb646ef90f616c34eb5c98d1ebd16)
It looks like some things were changed to use UIUtils.listingTable, but the executor page wasn't converted, so when that commit removed sortable from UIUtils.TABLE_CLASS_NOT_STRIPED it broke this page.

Simply add the sortable tag back in and it fixes both active UI and the history server UI.

Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #9101 from tgravescs/SPARK-10619.
2015-10-14 10:12:25 -07:00
Andrew Or b3ffac5178 [SPARK-10983] Unified memory manager
This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:

- **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.

- **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.

- **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
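
As a worked example of how the two fractions combine, here is a minimal arithmetic sketch in Python (not Spark's Scala). The function names are hypothetical, sizes are in MiB, and the sketch ignores anything the real `MemoryManager` does beyond multiplying the fractions described above.

```python
def unified_memory_regions(heap_mib, memory_fraction=0.75, storage_fraction=0.5):
    # Space shared by execution and storage (spark.memory.fraction).
    unified = int(heap_mib * memory_fraction)
    # Part of that space in which cached data is safe from eviction
    # (spark.memory.storageFraction).
    protected_storage = int(unified * storage_fraction)
    return unified, protected_storage


def evictable_storage(storage_used_mib, protected_storage_mib):
    # Cached data may only be evicted once storage exceeds the protected region.
    return max(0, storage_used_mib - protected_storage_mib)


unified, protected = unified_memory_regions(4096)
print(unified, protected)                  # 3072 1536
print(evictable_storage(2000, protected))  # 464
```

With a 4096 MiB heap and the defaults, 3072 MiB is shared between execution and storage, and execution can reclaim cached data only down to the 1536 MiB storage region.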

For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.

Author: Andrew Or <andrew@databricks.com>

Closes #9084 from andrewor14/unified-memory-manager.
2015-10-13 13:49:59 -07:00
Sun Rui 5e3868ba13 [SPARK-10051] [SPARKR] Support collecting data of StructType in DataFrame
Two points in this PR:

1.    The original thought was that a named R list is assumed to be a struct in SerDe. But this is problematic because some R functions will implicitly generate named lists that are not intended to be a struct when transferred by SerDe. So SerDe clients have to explicitly mark a named list as struct by changing its class from "list" to "struct".

2.    SerDe is in the Spark Core module, and data of StructType is represented as GenericRow, which is defined in the Spark SQL module. SerDe can't import GenericRow because in the Maven build the Spark SQL module depends on the Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in the Spark SQL module to register its functions for serialization and deserialization of StructType.

Author: Sun Rui <rui.sun@intel.com>

Closes #8794 from sun-rui/SPARK-10051.
2015-10-13 10:02:21 -07:00
Reynold Xin 1797055dbf [SPARK-11079] Post-hoc review Netty-based RPC - round 1
I'm going through the implementation right now for post-hoc review. Adding more comments and renaming things as I go through them.

I also want to write higher level documentation about how the whole thing works -- but those will come in other pull requests.

Author: Reynold Xin <rxin@databricks.com>

Closes #9091 from rxin/rpc-review.
2015-10-13 09:51:20 -07:00
Tom Graves 63c340a710 [SPARK-10858] YARN: archives/jar/files rename with # doesn't work unl
https://issues.apache.org/jira/browse/SPARK-10858

The issue here is that in resolveURI we default to calling new File(path).getAbsoluteFile().toURI(). But if the path passed in already has a # in it, then File(path) will think that is supposed to be part of the actual file path and not a fragment, so it changes # to %23. Then when we try to parse that later in Client as a URI, it doesn't recognize that there is a fragment.

To fix this, we just check whether there is a fragment, still create the File as we did before, and then add the fragment back on.
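
The same effect can be reproduced outside the JVM. The sketch below is a Python analogue, not the Spark code: `naive_file_uri` and `fragment_aware_file_uri` are hypothetical helpers, and `urllib.parse.quote` stands in for the percent-encoding that the commit attributes to `File.toURI()`.

```python
from urllib.parse import quote, urlsplit


def naive_file_uri(abs_path):
    # Treats the whole string as a file path, so '#' becomes '%23'.
    return "file://" + quote(abs_path)


def fragment_aware_file_uri(abs_path):
    # Split the fragment off first, encode only the path part,
    # then add the fragment back on.
    file_part, sep, fragment = abs_path.partition("#")
    return "file://" + quote(file_part) + sep + quote(fragment)


path = "/tmp/app.jar#renamed.jar"
print(naive_file_uri(path))           # file:///tmp/app.jar%23renamed.jar
print(fragment_aware_file_uri(path))  # file:///tmp/app.jar#renamed.jar
```

Parsing the first result yields an empty fragment, while the second one keeps `renamed.jar` as the fragment.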

Author: Tom Graves <tgraves@yahoo-inc.com>

Closes #9035 from tgravescs/SPARK-10858.
2015-10-09 14:06:25 -07:00
Marcelo Vanzin 015f7ef503 [SPARK-8673] [LAUNCHER] API and infrastructure for communicating with child apps.
This change adds an API that encapsulates information about an app
launched using the library. It also creates a socket-based communication
layer for apps that are launched as child processes; the launching
application listens for connections from launched apps, and once
communication is established, the channel can be used to send updates
to the launching app, or to send commands to the child app.

The change also includes hooks for local, standalone/client and yarn
masters.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #7052 from vanzin/SPARK-8673.
2015-10-09 15:28:09 -05:00
Andrew Or 67fbecbf32 [SPARK-10956] Common MemoryManager interface for storage and execution
This patch introduces a `MemoryManager` that is the central arbiter of how much memory to grant to storage and execution. This patch is primarily concerned only with refactoring while preserving the existing behavior as much as possible.

This is the first step away from the existing rigid separation of storage and execution memory, which has several major drawbacks discussed on the [issue](https://issues.apache.org/jira/browse/SPARK-10956). It is the precursor of a series of patches that will attempt to address those drawbacks.

Author: Andrew Or <andrew@databricks.com>
Author: Josh Rosen <joshrosen@databricks.com>
Author: andrewor14 <andrew@databricks.com>

Closes #9000 from andrewor14/memory-manager.
2015-10-08 21:44:59 -07:00
Marcelo Vanzin 56a9692fc0 [SPARK-10987] [YARN] Workaround for missing netty rpc disconnection event.
In YARN client mode, when the AM connects to the driver, it may be the case
that the driver never needs to send a message back to the AM (i.e., no
dynamic allocation or preemption). This triggers an issue in the netty rpc
backend where no disconnection event is sent to endpoints, and the AM never
exits after the driver shuts down.

The real fix is too complicated, so this is a quick hack to unblock YARN
client mode until we can work on the real fix. It forces the driver to
send a message to the AM when the AM registers, thus establishing that
connection and enabling the disconnection event when the driver goes
away.

Also, a minor side issue: when the executor is shutting down, it needs
to send an "ack" back to the driver when using the netty rpc backend; but
that "ack" wasn't being sent because the handler was shutting down the rpc
env before returning. So added a change to delay the shutdown a little bit,
allowing the ack to be sent back.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9021 from vanzin/SPARK-10987.
2015-10-08 09:47:58 -07:00
Marcelo Vanzin 94fc57afdf [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8775 from vanzin/SPARK-10300.
2015-10-07 14:11:21 -07:00
Marcelo Vanzin 6ca27f8550 [SPARK-10964] [YARN] Correctly register the AM with the driver.
The `self` method returns null when called from the constructor;
instead, registration should happen in the `onStart` method, at
which point the `self` reference has already been initialized.
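
A toy model of the ordering problem, in Python rather than Scala. `ToyRpcEnv`, `AmEndpoint`, `self_ref`, and `on_start` are hypothetical stand-ins chosen for illustration; the sketch assumes only what the message says, namely that the reference is not available during construction and is available by the time the start hook runs.

```python
class ToyRpcEnv:
    def __init__(self):
        self.registered = []

    def setup_endpoint(self, name, endpoint):
        # The reference can only be assigned once the endpoint exists...
        endpoint.self_ref = f"ref:{name}"
        # ...and the start hook runs after it has been assigned.
        endpoint.on_start()
        return endpoint.self_ref


class AmEndpoint:
    def __init__(self, env):
        self.env = env
        self.self_ref = None  # not yet known while the constructor runs
        self.seen_in_constructor = self.self_ref

    def on_start(self):
        # Registration is deferred to here, where self_ref is initialized.
        self.env.registered.append(self.self_ref)
```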

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #9005 from vanzin/SPARK-10964.
2015-10-07 11:38:47 -07:00
Marcelo Vanzin 744f03e700 [SPARK-10916] [YARN] Set perm gen size when launching containers on YARN.
This makes YARN containers behave like all other processes launched by
Spark, which launch with a default perm gen size of 256m unless
overridden by the user (or not needed by the vm).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #8970 from vanzin/SPARK-10916.
2015-10-06 10:17:12 -07:00
Reynold Xin ae6570ec2b Remove TODO in ShuffleMemoryManager. 2015-10-03 18:08:25 -07:00
zsxwing 107320c9bb [SPARK-6028] [CORE] Remerge #6457: new RPC implemetation and also pick #8905
This PR just reverted 02144d6745 to remerge #6457 and also included the commits in #8905.

Author: zsxwing <zsxwing@gmail.com>

Closes #8944 from zsxwing/SPARK-6028.
2015-10-03 01:04:35 -07:00
Joshi f85aa06464 [SPARK-10317] [CORE] Compatibility between history server script and functionality
Compatibility between history server script and functionality

The history server has its argument parsing class in HistoryServerArguments. However, this doesn't get involved in the start-history-server.sh codepath, where the $0 arg is assigned to spark.history.fs.logDirectory and all other arguments are discarded (e.g. --property-file).
This stops the other options from being usable from this script.

Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #8758 from rekhajoshm/SPARK-10317.
2015-10-02 15:26:11 -07:00
Takeshi YAMAMURO 2272962eb0 [SPARK-9867] [SQL] Move utilities for binary data into ByteArray
The utilities for binary data, such as Substring#substringBinarySQL and BinaryPrefixComparator#computePrefix, are put together in ByteArray for readability.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #8122 from maropu/CleanUpForBinaryType.
2015-10-01 21:33:27 -04:00
zsxwing 9b3e7768a2 [SPARK-10058] [CORE] [TESTS] Fix the flaky tests in HeartbeatReceiverSuite
Fixed the test failure here: https://amplab.cs.berkeley.edu/jenkins/view/Spark-QA-Test/job/Spark-1.5-SBT/116/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=spark-test/testReport/junit/org.apache.spark/HeartbeatReceiverSuite/normal_heartbeat/

This failure is because `HeartbeatReceiverSuite.heartbeatReceiver` may receive `SparkListenerExecutorAdded("driver")` sent from [LocalBackend](8fb3a65cbb/core/src/main/scala/org/apache/spark/scheduler/local/LocalBackend.scala (L121)).

There are other race conditions in `HeartbeatReceiverSuite` because `HeartbeatReceiver.onExecutorAdded` and `HeartbeatReceiver.onExecutorRemoved` are asynchronous. This PR also fixed them.

Author: zsxwing <zsxwing@gmail.com>

Closes #8946 from zsxwing/SPARK-10058.
2015-10-01 07:09:31 -07:00
Sun Rui c7b29ae641 [SPARK-10851] [SPARKR] Exception not failing R applications (in yarn cluster mode)
The YARN backend doesn't like when user code calls System.exit, since it cannot know the exit status and thus cannot set an appropriate final status for the application.

This PR removes the usage of System.exit to exit the RRunner. Instead, when the R process running a SparkR script returns an exit code other than 0, it throws SparkUserAppException, which is caught by ApplicationMaster so that ApplicationMaster knows the application failed. For other failures, it throws SparkException.
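
The pattern of turning a child's exit code into an exception, instead of exiting the parent, can be sketched as follows. This is Python, not the RRunner code; `UserAppError` and `run_child` are hypothetical names standing in for the exception and the runner.

```python
import subprocess
import sys


class UserAppError(Exception):
    """Hypothetical counterpart of an exception that carries the exit code."""

    def __init__(self, exit_code):
        super().__init__(f"user application exited with {exit_code}")
        self.exit_code = exit_code


def run_child(argv):
    exit_code = subprocess.run(argv).returncode
    if exit_code != 0:
        # Raise instead of calling sys.exit, so the caller can observe
        # the failure and record a final status.
        raise UserAppError(exit_code)


if __name__ == "__main__":
    try:
        run_child([sys.executable, "-c", "import sys; sys.exit(3)"])
    except UserAppError as err:
        print("caller saw failure with exit code", err.exit_code)
```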

Author: Sun Rui <rui.sun@intel.com>

Closes #8938 from sun-rui/SPARK-10851.
2015-09-30 11:03:08 -07:00
zsxwing dba95ea032 [SPARK-10825] [CORE] [TESTS] Fix race conditions in StandaloneDynamicAllocationSuite
Fix the following issues in StandaloneDynamicAllocationSuite:

1. It should not assume master and workers start in order
2. It should not assume master and workers get ready at once
3. It should not assume the application is already registered with master after creating SparkContext
4. It should not access Master.app and idToApp which are not thread safe

The changes include:
* Use `eventually` to wait until master and workers are ready to fix 1 and 2
* Use `eventually` to wait until the application is registered with master to fix 3
* Use `askWithRetry[MasterStateResponse](RequestMasterState)` to get the application info to fix 4
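
For readers unfamiliar with the retry-until-true style used here, the idea can be sketched in a few lines. This is a Python approximation written for illustration, not ScalaTest's `eventually`; the parameter names and defaults are assumptions.

```python
import time


def eventually(check, timeout=5.0, interval=0.05):
    """Re-run `check` until it stops raising AssertionError or `timeout` elapses."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return check()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise  # give up and surface the last failure
            time.sleep(interval)
```

A test then wraps the assertion about cluster state in `eventually(...)` instead of asserting once right after startup, which removes the assumption that everything is ready immediately.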

Author: zsxwing <zsxwing@gmail.com>

Closes #8914 from zsxwing/fix-StandaloneDynamicAllocationSuite.
2015-09-29 11:53:28 -07:00
Sean Owen bf4199e261 [SPARK-10833] [BUILD] Inline, organize BSD/MIT licenses in LICENSE
In the course of https://issues.apache.org/jira/browse/LEGAL-226 it came to light that the guidance at http://www.apache.org/dev/licensing-howto.html#permissive-deps for permissively-licensed dependencies has a different interpretation than we (er, I) had been operating under. "pointer ... to the license within the source tree" specifically means a copy of the license within Spark's distribution, whereas at the moment, Spark's LICENSE has a pointer to the project's license in the other project's source tree.

The remedy is simply to inline all such license references (i.e. BSD/MIT licenses) or include their text in "licenses" subdirectory and point to that.

Along the way, we can also treat other BSD/MIT licenses, whose text has been inlined into LICENSE, in the same way.

The LICENSE file can continue to provide a helpful list of BSD/MIT licensed projects and a pointer to their sites. This would be over and above including license text in the distro, which is the essential thing.

Author: Sean Owen <sowen@cloudera.com>

Closes #8919 from srowen/SPARK-10833.
2015-09-28 22:56:43 -04:00
Holden Karau d8d50ed388 [SPARK-10812] [YARN] Spark hadoop util support switching to yarn
While this is likely not a huge issue for real production systems, it can be an issue for test systems, which may set up a Spark Context, tear it down, and stand up a Spark Context with a different master (e.g. some local-mode and some yarn-mode tests). Discovered during work on spark-testing-base on Spark 1.4.1, but it seems the logic that triggers it is present in master (see SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a different JVM; however, this can be heavyweight.

```
[info] SampleMiniClusterTest:
[info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED ***
[info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
[info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163)
[info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257)
[info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
[info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
[info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
[info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
[info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
[info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186)
[info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26)
[info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103)
```

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
2015-09-28 06:33:45 -07:00
Matei Zaharia 21fd12cb17 [SPARK-9852] Let reduce tasks fetch multiple map output partitions
This makes two changes:

- Allow reduce tasks to fetch multiple map output partitions -- this is a pretty small change to HashShuffleFetcher
- Move shuffle locality computation out of DAGScheduler and into ShuffledRDD / MapOutputTracker; this was needed because the code in DAGScheduler wouldn't work for RDDs that fetch multiple map output partitions from each reduce task

I also added an AdaptiveSchedulingSuite that creates RDDs depending on multiple map output partitions.

Author: Matei Zaharia <matei@databricks.com>

Closes #8844 from mateiz/spark-9852.
2015-09-24 23:39:04 -04:00
Josh Rosen 8023242e77 [SPARK-10761] Refactor DiskBlockObjectWriter to not require BlockId
The DiskBlockObjectWriter constructor took a BlockId parameter but never used it. As part of some general cleanup in these interfaces, this patch refactors its constructor to eliminate this parameter.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8871 from JoshRosen/disk-block-object-writer-blockid-cleanup.
2015-09-24 14:18:33 -07:00
Xiangrui Meng 02144d6745 Revert "[SPARK-6028][Core]A new RPC implemetation based on the network module"
This reverts commit 084e4e1262.
2015-09-24 08:25:44 -07:00
Andrew Or 83f6f54d12 [SPARK-10474] [SQL] Aggregation fails to allocate memory for pointer array (round 2)
This patch reverts most of the changes in a previous fix #8827.

The real cause of the issue is that in `TungstenAggregate`'s prepare method we only reserve 1 page, but later when we switch to sort-based aggregation we try to acquire 1 page AND a pointer array. The longer-term fix should be to reserve also the pointer array, but for now ***we will simply not track the pointer array***. (Note that elsewhere we already don't track the pointer array, e.g. [here](a18208047f/sql/core/src/main/java/org/apache/spark/sql/execution/UnsafeKVExternalSorter.java (L88)))

Note: This patch reuses the unit test added in #8827 so it doesn't show up in the diff.

Author: Andrew Or <andrew@databricks.com>

Closes #8888 from andrewor14/dont-track-pointer-array.
2015-09-23 19:34:31 -07:00
zsxwing 084e4e1262 [SPARK-6028][Core]A new RPC implemetation based on the network module
Design doc: https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing

Author: zsxwing <zsxwing@gmail.com>

Closes #6457 from zsxwing/new-rpc.
2015-09-23 18:59:49 -07:00
Reynold Xin 9952217749 [SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame.
Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take).

This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.

Author: Reynold Xin <rxin@databricks.com>

Closes #8876 from rxin/SPARK-10731.
2015-09-23 16:43:21 -07:00
tedyu 27bfa9ab3a [SPARK-10721] Log warning when file deletion fails
Author: tedyu <yuzhihong@gmail.com>

Closes #8843 from tedyu/master.
2015-09-23 10:01:28 +01:00
Tathagata Das 5548a25475 [SPARK-10652] [SPARK-10742] [STREAMING] Set meaningful job descriptions for all streaming jobs
Here are the screenshots after adding the job descriptions to threads that run receivers and the scheduler thread running the batch jobs.

## All jobs page
* Added job descriptions with links to relevant batch details page
![image](https://cloud.githubusercontent.com/assets/663212/9924165/cda4a372-5cb1-11e5-91ca-d43a32c699e9.png)

## All stages page
* Added stage descriptions with links to relevant batch details page
![image](https://cloud.githubusercontent.com/assets/663212/9923814/2cce266a-5cae-11e5-8a3f-dad84d06c50e.png)

## Streaming batch details page
* Added the +details link
![image](https://cloud.githubusercontent.com/assets/663212/9921977/24014a32-5c98-11e5-958e-457b6c38065b.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8791 from tdas/SPARK-10652.
2015-09-22 22:44:09 -07:00
Andrew Or 61d4c07f4b [SPARK-10640] History server fails to parse TaskCommitDenied
... simply because the code is missing!

Author: Andrew Or <andrew@databricks.com>

Closes #8828 from andrewor14/task-end-reason-json.
2015-09-22 16:35:43 -07:00
Reynold Xin a96ba40f7e [SPARK-10714] [SPARK-8632] [SPARK-10685] [SQL] Refactor Python UDF handling
This patch refactors Python UDF handling:

1. Extract the per-partition Python UDF calling logic from PythonRDD into a PythonRunner. PythonRunner itself expects iterator as input/output, and thus has no dependency on RDD. This way, we can use PythonRunner directly in a mapPartitions call, or in the future in an environment without RDDs.
2. Use PythonRunner in Spark SQL's BatchPythonEvaluation.
3. Updated BatchPythonEvaluation to only use its input once, rather than twice. This should fix Python UDF performance regression in Spark 1.5.
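
The shape of points 1 and 3 can be sketched as a plain iterator-to-iterator function. The Python below is illustrative only: `run_udf` is a hypothetical name, and the sketch does not reproduce how `PythonRunner` talks to a worker process. It shows just that a runner with iterator input and output needs no RDD and can pair each row with its result in a single pass.

```python
from typing import Callable, Iterable, Iterator, Tuple


def run_udf(udf: Callable, rows: Iterable) -> Iterator[Tuple]:
    """Hypothetical iterator-in, iterator-out runner: one pass over the input."""
    for row in rows:
        # Each input row is read exactly once and emitted with its result,
        # so the input never has to be traversed a second time.
        yield row, udf(row)


one_shot = iter([1, 2, 3])  # a one-shot iterator cannot be consumed twice
print(list(run_udf(lambda x: x * 10, one_shot)))  # [(1, 10), (2, 20), (3, 30)]
```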

There are a number of small cleanups I wanted to do when I looked at the code, but I kept most of those out so the diff looks small.

This basically implements the approach in https://github.com/apache/spark/pull/8833, but with some code moving around so the correctness doesn't depend on the inner workings of Spark serialization and task execution.

Author: Reynold Xin <rxin@databricks.com>

Closes #8835 from rxin/python-iter-refactor.
2015-09-22 14:11:46 -07:00
Josh Rosen 1ca5e2e0b8 [SPARK-10704] Rename HashShuffleReader to BlockStoreShuffleReader
The current shuffle code has an interface named ShuffleReader with only one implementation, HashShuffleReader. This naming is confusing, since the same read path code is used for both sort- and hash-based shuffle. This patch addresses this by renaming HashShuffleReader to BlockStoreShuffleReader.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8825 from JoshRosen/shuffle-reader-cleanup.
2015-09-22 11:50:22 -07:00
xutingjun 2ea0f2e11b [SPARK-9585] Delete the input format caching because some input format are non thread safe
If we cache the InputFormat, all tasks on the same executor will share it.
Some InputFormats are thread safe, but some are not, such as HiveHBaseTableInputFormat. If tasks share a non-thread-safe InputFormat, unexpected errors may occur.
To avoid this, I think we should delete the input format caching.

Author: xutingjun <xutingjun@huawei.com>
Author: meiyoula <1039320815@qq.com>
Author: Xutingjun <xutingjun@huawei.com>

Closes #7918 from XuTingjun/cached_inputFormat.
2015-09-22 11:01:32 -07:00