Commit graph

4701 commits

Author SHA1 Message Date
Andrew Or 5287eec5a6 [SPARK-7718] [SQL] Speed up partitioning by avoiding closure cleaning
According to yhuai, we spent 6-7 seconds cleaning closures in a partitioning job that takes 12 seconds. Since we provide these closures in Spark, we know for sure they are serializable, so we can bypass the cleaning (a sketch of the pattern follows this entry).

Author: Andrew Or <andrew@databricks.com>

Closes #6256 from andrewor14/sql-partition-speed-up and squashes the following commits:

a82b451 [Andrew Or] Fix style
10f7e3e [Andrew Or] Avoid getting call sites and cleaning closures
17e2943 [Andrew Or] Merge branch 'master' of github.com:apache/spark into sql-partition-speed-up
523f042 [Andrew Or] Skip unnecessary Utils.getCallSites too
f7fe143 [Andrew Or] Avoid unnecessary closure cleaning
2015-05-21 14:33:11 -07:00
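
A minimal sketch of the bypass described above, assuming a stand-in `clean` function (illustrative names, not the actual Spark code):

```scala
object ClosureCleaningSketch {
  // Stand-in for Spark's ClosureCleaner pass; the real one reflectively walks
  // the closure's outer references, which is what cost 6-7 seconds here.
  private def clean[T](f: T): T = f

  // Closures constructed inside Spark are known to be serializable, so they
  // can skip the pass; user-supplied closures are still cleaned defensively.
  def prepare[T, U](f: Iterator[T] => Iterator[U],
                    knownSerializable: Boolean): Iterator[T] => Iterator[U] =
    if (knownSerializable) f else clean(f)
}
```
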
Sean Owen 6e53402696 [SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative
Document current limitation of rdd.fold.

This does not resolve SPARK-6416; it just documents the issue (illustrated after this entry).
CC JoshRosen

Author: Sean Owen <sowen@cloudera.com>

Closes #6231 from srowen/SPARK-6416 and squashes the following commits:

9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed
da40d84 [Sean Owen] Document current limitation of rdd.fold.
2015-05-21 19:42:51 +01:00
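
An illustration of the documented caveat, assuming a spark-shell session where `sc` is defined:

```scala
// fold first folds within each partition and then folds the per-partition
// results, so a non-commutative operator can return different answers for
// different partitionings of the same data.
val rdd = sc.parallelize(1 to 4, 2)

rdd.fold(0)(_ + _)  // 10 for any partitioning: + is commutative and associative
rdd.fold(0)(_ - _)  // partitioning-dependent, unlike List(1, 2, 3, 4).fold(0)(_ - _)
```
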
Mingfei 04940c4975 [SPARK-7389] [CORE] Tachyon integration improvement
Two main changes:

Add two functions, putValues and getValues, to ExternalBlockManager, because an implementation may not want to rely on putBytes and getBytes.

Improve the Tachyon integration.
Currently, when putting data into Tachyon, Spark first serializes all the data in a partition into a ByteBuffer and then writes it into Tachyon; this uses a lot of memory and increases GC overhead.

When getting data from Tachyon, getValues depends on getBytes, which also reads all the data into an on-heap byte array, again resulting in high memory usage.
This PR changes the approach of the two functions, making them read/write data by stream to reduce memory usage (sketched after this entry).

In our testing, when the data size is huge, this patch reduces GC time by about 30% and full GC time by about 70%, and total execution time by about 10%.

Author: Mingfei <mingfei.shi@intel.com>

Closes #5908 from shimingfei/Tachyon-integration-rebase and squashes the following commits:

033bc57 [Mingfei] modify according to comments
747c69a [Mingfei] modify according to comments - format changes
ce52c67 [Mingfei] put close() in a finally block
d2c60bb [Mingfei] modify according to comments, some code style change
4c11591 [Mingfei] modify according to comments split putIntoExternalBlockStore into two functions add default implementation for getValues and putValues
cc0a32e [Mingfei] Make getValues read data from Tachyon by stream Make putValues write data to Tachyon by stream
017593d [Mingfei] add getValues and putValues in ExternalBlockManager's Interface
2015-05-20 22:33:03 -07:00
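
A rough sketch of the streaming write path described above, using plain `java.io` serialization (illustrative, not the actual ExternalBlockManager code):

```scala
import java.io.{BufferedOutputStream, ObjectOutputStream, OutputStream}

object StreamingPutSketch {
  // Serialize values directly onto the store's OutputStream instead of first
  // materializing the whole partition in one ByteBuffer.
  def putValues(values: Iterator[AnyRef], out: OutputStream): Unit = {
    val oos = new ObjectOutputStream(new BufferedOutputStream(out))
    try values.foreach(v => oos.writeObject(v))
    finally oos.close() // close() in a finally block, as in the PR
  }
}
```
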
Hari Shreedharan a70bf06b79 [SPARK-7750] [WEBUI] Rename endpoints from json to api to allow further extension to non-json outputs too.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #6273 from harishreedharan/json-to-api and squashes the following commits:

e14b73b [Hari Shreedharan] Rename `getJsonServlet` to `getServletHandler` i
42f8acb [Hari Shreedharan] Import order fixes.
2ef852f [Hari Shreedharan] [SPARK-7750][WebUI] Rename endpoints from `json` to `api` to allow further extension to non-json outputs too.
2015-05-20 21:13:10 -05:00
Josh Rosen 5196efff53 [SPARK-7719] Re-add UnsafeShuffleWriterSuite test that was removed for Java 6 compat
This patch re-adds a test which was removed in 9ebb44f8ab due to a Java 6 compatibility issue.  We now use Guava's `Iterators.emptyIterator()` in place of `Collections.emptyIterator()`, which isn't present in all Java 6 versions.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6298 from JoshRosen/SPARK-7719-fix-java-6-test-code and squashes the following commits:

5c9bd85 [Josh Rosen] Re-add UnsafeShuffleWriterSuite.emptyIterator() test which was removed due to Java 6 issue
2015-05-20 17:52:50 -07:00
Tathagata Das 3c434cbfd0 [SPARK-7767] [STREAMING] Added test for checkpoint serialization in StreamingContext.start()
Currently, the background checkpointing thread fails silently if the checkpoint is not serializable. That is hard to debug, so it's best to fail fast at `start()` when checkpointing is enabled and the checkpoint is not serializable (sketched after this entry).

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #6292 from tdas/SPARK-7767 and squashes the following commits:

51304e6 [Tathagata Das] Addressed comments.
c35237b [Tathagata Das] Added test for checkpoint serialization in StreamingContext.start()
2015-05-20 16:21:23 -07:00
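
A sketch of the fail-fast idea, assuming plain Java serialization as the validity check (illustrative, not the actual StreamingContext code):

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object CheckpointValidationSketch {
  // Called from start() when checkpointing is enabled: surface a
  // non-serializable checkpoint immediately instead of letting the background
  // checkpointing thread fail silently later.
  def validate(checkpoint: AnyRef): Unit =
    try {
      val oos = new ObjectOutputStream(new ByteArrayOutputStream())
      oos.writeObject(checkpoint)
      oos.close()
    } catch {
      case e: NotSerializableException =>
        throw new IllegalArgumentException(
          "Checkpoint is not serializable; fix the non-serializable state before start()", e)
    }
}
```
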
Andrew Or 9b84443dd4 [SPARK-7237] [SPARK-7741] [CORE] [STREAMING] Clean more closures that need cleaning
SPARK-7741 is the equivalent of SPARK-7237 in streaming. This is an alternative to #6268.

Author: Andrew Or <andrew@databricks.com>

Closes #6269 from andrewor14/clean-moar and squashes the following commits:

c51c9ab [Andrew Or] Add periods (trivial)
6c686ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-moar
79a435b [Andrew Or] Fix tests
d18c9f9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-moar
65ef07b [Andrew Or] Fix tests?
4b487a3 [Andrew Or] Add tests for closures passed to DStream operations
328139b [Andrew Or] Do not forget foreachRDD
5431f61 [Andrew Or] Clean streaming closures
72b7b73 [Andrew Or] Clean core closures
2015-05-20 15:39:32 -07:00
Davies Liu 4de74d2602 [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python
cc rxin, please take a quick look, I'm working on tests.

Author: Davies Liu <davies@databricks.com>

Closes #6238 from davies/readwrite and squashes the following commits:

c7200eb [Davies Liu] update tests
9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
f0c5a04 [Davies Liu] use sqlContext.read.load
5f68bc8 [Davies Liu] update tests
6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
bcc6668 [Davies Liu] add reader and writer API in Python
2015-05-19 14:23:28 -07:00
Patrick Wendell 9ebb44f8ab [HOTFIX]: Java 6 Build Breaks
These were blocking RC1 so I fixed them manually.
2015-05-19 06:01:16 +00:00
Daoyuan Wang c2437de189 [SPARK-7150] SparkContext.range() and SQLContext.range()
This PR is based on #6081, thanks adrian-wang. (Usage is sketched after this entry.)

Closes #6081

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #6230 from davies/range and squashes the following commits:

d3ce5fe [Davies Liu] add tests
789eda5 [Davies Liu] add range() in Python
4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
cbf5200 [Daoyuan Wang] let's add python support in a separate PR
f45e3b2 [Daoyuan Wang] remove redundant toLong
617da76 [Daoyuan Wang] fix safe marge for corner cases
867c417 [Daoyuan Wang] fix
13dbe84 [Daoyuan Wang] update
bd998ba [Daoyuan Wang] update comments
d3a0c1b [Daoyuan Wang] add range api()
2015-05-18 21:43:12 -07:00
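
Usage, assuming a spark-shell session where `sc` and `sqlContext` are defined and the signatures added by this PR (both ranges are half-open, `[start, end)`):

```scala
val rdd = sc.range(0L, 10L, step = 2L)  // RDD[Long]: 0, 2, 4, 6, 8
val df  = sqlContext.range(0L, 10L)     // DataFrame with a single "id" column: 0 .. 9
```
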
Davies Liu 4fb52f9545 [SPARK-7624] Revert #4147
Author: Davies Liu <davies@databricks.com>

Closes #6172 from davies/revert_4147 and squashes the following commits:

3bfbbde [Davies Liu] Revert #4147
2015-05-18 16:55:45 -07:00
Andrew Or b93c97d79b [SPARK-7501] [STREAMING] DAG visualization: show DStream operations
This is similar to #5999, but for streaming. Roughly 200 lines are tests.

One thing to note here is that we already do some kind of scoping thing for call sites, so this patch adds the new RDD operation scoping logic in the same place. Also, this patch adds a `try finally` block to set the relevant variables in a safer way.

tdas zsxwing

------------------------
**Before**
<img src="https://cloud.githubusercontent.com/assets/2133137/7625996/d88211b8-f9b4-11e4-90b9-e11baa52d6d7.png" width="450px"/>

--------------------------
**After**
<img src="https://cloud.githubusercontent.com/assets/2133137/7625997/e0878f8c-f9b4-11e4-8df3-7dd611b13c87.png" width="650px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6034 from andrewor14/dag-viz-streaming and squashes the following commits:

932a64a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
e685df9 [Andrew Or] Rename createRDDWith
84d0656 [Andrew Or] Review feedback
697c086 [Andrew Or] Fix tests
53b9936 [Andrew Or] Set scopes for foreachRDD properly
1881802 [Andrew Or] Refactor DStream scope names again
af4ba8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
fd07d22 [Andrew Or] Make MQTT lower case
f6de871 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
0ca1801 [Andrew Or] Remove a few unnecessary withScopes on aliases
fa4e5fb [Andrew Or] Pass in input stream name rather than defining it from within
1af0b0e [Andrew Or] Fix style
074c00b [Andrew Or] Review comments
d25a324 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
e4a93ac [Andrew Or] Fix tests?
25416dc [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
9113183 [Andrew Or] Add tests for DStream scopes
b3806ab [Andrew Or] Fix test
bb80bbb [Andrew Or] Fix MIMA?
5c30360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
5703939 [Andrew Or] Rename operations that create InputDStreams
7c4513d [Andrew Or] Group RDDs by DStream operations and batches
bf0ab6e [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
05c2676 [Andrew Or] Wrap many more methods in withScope
c121047 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-streaming
65ef3e9 [Andrew Or] Fix NPE
a0d3263 [Andrew Or] Scope streaming operations instead of RDD operations
2015-05-18 14:33:33 -07:00
Davies Liu 32fbd297dd [SPARK-6216] [PYSPARK] check python version of worker with driver
This PR reverts #5404 and instead passes the Python version from the driver into the JVM; the worker checks it before deserializing the closure, so it can work with different major versions of Python.

Author: Davies Liu <davies@databricks.com>

Closes #6203 from davies/py_version and squashes the following commits:

b8fb76e [Davies Liu] fix test
6ce5096 [Davies Liu] use string for version
47c6278 [Davies Liu] check python version of worker with driver
2015-05-18 12:55:13 -07:00
Andrew Or 563bfcc1ab [SPARK-7627] [SPARK-7472] DAG visualization: style skipped stages
This patch fixes two things:

**SPARK-7627.** Cached RDDs no longer light up on the job page. This is a simple fix.
**SPARK-7472.** Display skipped stages differently from normal stages.

The latter is a major UX issue. Because we link the job viz to the stage viz even for skipped stages, the user may inadvertently click into the stage page of a skipped stage, which is empty.

-------------------
<img src="https://cloud.githubusercontent.com/assets/2133137/7675241/de1a3da6-fcea-11e4-8101-88055cef78c5.png" width="300px" />

Author: Andrew Or <andrew@databricks.com>

Closes #6171 from andrewor14/dag-viz-skipped and squashes the following commits:

f261797 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
0eda358 [Andrew Or] Tweak skipped stage border color
c604150 [Andrew Or] Tweak grayscale colors
7010676 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
762b541 [Andrew Or] Use special prefix for stage clusters to avoid collisions
51c95b9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
b928cd4 [Andrew Or] Fix potential leak + write tests for it
7c4c364 [Andrew Or] Show skipped stages differently
7cc34ce [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-skipped
c121fa2 [Andrew Or] Fix cache color
2015-05-18 10:59:35 -07:00
zsxwing ff71d34e00 [SPARK-7693][Core] Remove "import scala.concurrent.ExecutionContext.Implicits.global"
A lesson learnt from SPARK-7655: Spark should avoid using `scala.concurrent.ExecutionContext.Implicits.global`, because the user may submit blocking actions to `scala.concurrent.ExecutionContext.Implicits.global` and exhaust all threads in it. This could crash Spark. So Spark should always use its own thread pools for safety.

This PR removes all usages of `scala.concurrent.ExecutionContext.Implicits.global` and replaces them with proper thread pools (sketched after this entry).

Author: zsxwing <zsxwing@gmail.com>

Closes #6223 from zsxwing/SPARK-7693 and squashes the following commits:

a33ff06 [zsxwing] Decrease the max thread number from 1024 to 128
cf4b3fc [zsxwing] Remove "import scala.concurrent.ExecutionContext.Implicits.global"
2015-05-17 20:37:19 -07:00
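
A sketch of the replacement pattern, giving a component its own bounded daemon pool instead of sharing `global` (names are illustrative; 128 echoes the max thread number in the squashed commits):

```scala
import java.util.concurrent.{Executors, ThreadFactory}
import scala.concurrent.ExecutionContext

object OwnPoolSketch {
  private val factory = new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r, "my-component")
      t.setDaemon(true) // don't keep the JVM alive for this pool
      t
    }
  }

  // Blocking work submitted here can exhaust only this pool, not the shared
  // ExecutionContext.Implicits.global that other components depend on.
  implicit val ec: ExecutionContext =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(128, factory))
}
```
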
Josh Rosen f2cc6b5bcc [SPARK-7660] Wrap SnappyOutputStream to work around snappy-java bug
This patch wraps `SnappyOutputStream` to ensure that `close()` is idempotent and to guard against write-after-`close()` bugs. This is a workaround for https://github.com/xerial/snappy-java/issues/107, a bug where a non-idempotent `close()` method can lead to stream corruption. We can remove this workaround if we upgrade to a snappy-java version that contains my fix for this bug, but in the meantime this patch offers a backportable Spark fix (the wrapper's shape is sketched after this entry).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6176 from JoshRosen/SPARK-7660-wrap-snappy and squashes the following commits:

8b77aae [Josh Rosen] Wrap SnappyOutputStream to fix SPARK-7660
2015-05-17 09:30:49 -07:00
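
A sketch of the wrapper's shape (illustrative, not the actual Spark class): make `close()` idempotent and fail loudly on write-after-close:

```scala
import java.io.{IOException, OutputStream}

class CloseGuardOutputStream(out: OutputStream) extends OutputStream {
  private var closed = false
  private def ensureOpen(): Unit =
    if (closed) throw new IOException("Stream is closed")

  override def write(b: Int): Unit = { ensureOpen(); out.write(b) }
  override def write(b: Array[Byte], off: Int, len: Int): Unit = {
    ensureOpen(); out.write(b, off, len)
  }
  override def flush(): Unit = { ensureOpen(); out.flush() }
  // Second and later close() calls are no-ops, so the underlying snappy
  // stream's non-idempotent close() is invoked at most once.
  override def close(): Unit = if (!closed) { closed = true; out.close() }
}
```
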
zsxwing 3b6ef2c539 [SPARK-7655][Core] Deserializing value should not hold the TaskSchedulerImpl lock
We should not call `DirectTaskResult.value` while holding the `TaskSchedulerImpl` lock; deserializing a large object may take dozens of seconds (sketched after this entry).

Author: zsxwing <zsxwing@gmail.com>

Closes #6195 from zsxwing/SPARK-7655 and squashes the following commits:

21f502e [zsxwing] Add more comments
e25fa88 [zsxwing] Add comments
15010b5 [zsxwing] Deserialize value should not hold the TaskSchedulerImpl lock
2015-05-16 21:03:22 -07:00
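
A sketch of the fix's shape (illustrative names): deserialize before taking the lock, and keep only cheap bookkeeping inside it:

```scala
import java.io.{ByteArrayInputStream, ObjectInputStream}

object DeserializeOutsideLockSketch {
  private val lock = new Object
  private var results = List.empty[Any]

  private def deserialize(bytes: Array[Byte]): Any = {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try in.readObject() finally in.close()
  }

  def handleResult(resultBytes: Array[Byte]): Unit = {
    val value = deserialize(resultBytes)     // may take seconds for large results
    lock.synchronized { results ::= value }  // cheap work only, under the lock
  }
}
```
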
zsxwing 47e7ffe36b [SPARK-7655][Core][SQL] Remove 'scala.concurrent.ExecutionContext.Implicits.global' in 'ask' and 'BroadcastHashJoin'
Both `AkkaRpcEndpointRef.ask` and `BroadcastHashJoin` use `scala.concurrent.ExecutionContext.Implicits.global`. However, the tasks in `BroadcastHashJoin` are usually long-running and can occupy all threads in `global`, so `ask` never gets a chance to process the replies.

For `ask`, the tasks are actually very simple, so we can use `MoreExecutors.sameThreadExecutor()`. For `BroadcastHashJoin`, it's better to use `ThreadUtils.newDaemonCachedThreadPool`.

Author: zsxwing <zsxwing@gmail.com>

Closes #6200 from zsxwing/SPARK-7655-2 and squashes the following commits:

cfdc605 [zsxwing] Remove redundant import and minor doc fix
cf83153 [zsxwing] Add "sameThread" and "newDaemonCachedThreadPool with maxThreadNumber" to ThreadUtils
08ad0ee [zsxwing] Remove 'scala.concurrent.ExecutionContext.Implicits.global' in 'ask' and 'BroadcastHashJoin'
2015-05-16 00:44:29 -07:00
Nishkam Ravi 0ac8b01a07 [SPARK-7672] [CORE] Use int conversion in translating kryoserializer.buffer.mb to kryoserializer.buffer
In translating spark.kryoserializer.buffer.mb to spark.kryoserializer.buffer, using toDouble leads to a "Fractional values not supported" error even when spark.kryoserializer.buffer.mb is an integer (illustrated after this entry).
ilganeli, andrewor14

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #6198 from nishkamravi2/master_nravi and squashes the following commits:

171a53c [nishkamravi2] Update SparkConfSuite.scala
5261bf6 [Nishkam Ravi] Add a test for deprecated config spark.kryoserializer.buffer.mb
5190f79 [Nishkam Ravi] In translating from deprecated spark.kryoserializer.buffer.mb to spark.kryoserializer.buffer use int conversion since fractions are not permissible
059ce82 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
eaa13b5 [nishkamravi2] Update Client.scala
981afd2 [Nishkam Ravi] Check for read permission before initiating copy
1b81383 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
0f1abd0 [nishkamravi2] Update Utils.scala
474e3bf [nishkamravi2] Update DiskBlockManager.scala
97c383e [nishkamravi2] Update Utils.scala
8691e0c [Nishkam Ravi] Add a try/catch block around Utils.removeShutdownHook
2be1e76 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
1c13b79 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
bad4349 [nishkamravi2] Update Main.java
36a6f87 [Nishkam Ravi] Minor changes and bug fixes
b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
d9658d6 [Nishkam Ravi] Changes for SPARK-6406
ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ac58975 [Nishkam Ravi] spark-class changes
06bfeb0 [nishkamravi2] Update spark-class
35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2015-05-16 08:24:21 +01:00
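
Why the int conversion matters (illustrative, not the actual translation code; the MB-to-KB factor is an assumption): the deprecated value is a whole number of megabytes, and the new key parses a size string that rejects fractional values:

```scala
// Converting with toInt keeps the value integral:
def translateBufferMb(mb: String): String = s"${mb.toInt * 1024}k"

translateBufferMb("64")  // "65536k", which parses fine
// With toDouble the same input becomes "65536.0k",
// which fails with "Fractional values not supported":
// s"${"64".toDouble * 1024}k"
```
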
Josh Rosen 2c04c8a1ae [SPARK-7563] OutputCommitCoordinator.stop() should only run on the driver
This fixes a bug where an executor that exits can cause the driver's OutputCommitCoordinator to stop. To fix this, we use an `isDriver` flag and check it in `stop()` (sketched after this entry).

See https://issues.apache.org/jira/browse/SPARK-7563 for more details.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6197 from JoshRosen/SPARK-7563 and squashes the following commits:

04b2cc5 [Josh Rosen] [SPARK-7563] OutputCommitCoordinator.stop() should only be executed on the driver
2015-05-15 18:06:01 -07:00
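
A sketch of the guard (illustrative names): the coordinator exists on both driver and executors, but only the driver's `stop()` should tear it down:

```scala
class CommitCoordinatorSketch(isDriver: Boolean) {
  private var running = true

  def stop(): Unit = {
    // An exiting executor calls stop() too; ignore it there so the driver's
    // coordinator keeps serving commit requests from the remaining executors.
    if (isDriver && running) {
      running = false
      // release coordination state here
    }
  }
}
```
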
Kay Ousterhout e745456476 [SPARK-7676] Bug fix and cleanup of stage timeline view
cc pwendell sarutak

This commit cleans up some unnecessary code; eliminates the feature where mousing over a box in the timeline highlights the corresponding task in the table (that feature is only useful in the rare case of a very small number of tasks, where the mapping is easy to figure out anyway); and fixes a bug where nothing shows up if you try to visualize a stage with only 1 task.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #6202 from kayousterhout/SPARK-7676 and squashes the following commits:

dfd29d4 [Kay Ousterhout] [SPARK-7676] Bug fix and cleanup of stage timeline view
2015-05-15 17:45:14 -07:00
Kousuke Saruta 9b6cf285d0 [SPARK-7296] Add timeline visualization for stages in the UI.
This PR builds on #2342 by adding a timeline view for the Stage page,
showing how tasks spend their time.

With this timeline, we can understand the following things about a stage:

* When and where each task ran
* The total duration of each task
* The proportion of time each task spent in each phase

Also, this timeline view is scrollable and zoomable.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5843 from sarutak/stage-page-timeline and squashes the following commits:

4ba9604 [Kousuke Saruta] Fixed the order of legends
16bb552 [Kousuke Saruta] Removed border of legend area
2e5d605 [Kousuke Saruta] Modified warning message
16cb2e6 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into stage-page-timeline
7ae328f [Kousuke Saruta] Modified code style
d5f794a [Kousuke Saruta] Fixed performance issues more
64e6642 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into stage-page-timeline
e4a3354 [Kousuke Saruta] minor code style change
878e3b8 [Kousuke Saruta] Fixed a bug that tooltip remains
b9d8f1b [Kousuke Saruta] Fixed performance issue
ac8842b [Kousuke Saruta] Fixed layout
2319739 [Kousuke Saruta] Modified appearances more
81903ab [Kousuke Saruta] Modified appearances
a79dcc3 [Kousuke Saruta] Modified appearance
55a390c [Kousuke Saruta] Ignored scalastyle for a line-comment
29eae3e [Kousuke Saruta] limited to longest 1000 tasks
2a9e376 [Kousuke Saruta] Minor cleanup
385b6d2 [Kousuke Saruta] Added link feature
ba1ac3e [Kousuke Saruta] Fixed style
2ae8520 [Kousuke Saruta] Updated bootstrap-tooltip.js from 2.2.2 to 2.3.2
af430f1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into stage-page-timeline
e694b8e [Kousuke Saruta] Added timeline view to StagePage
8f6610c [Kousuke Saruta] Fixed conflict
b587cf2 [Kousuke Saruta] initial commit
11fe67d [Kousuke Saruta] Fixed conflict
79ac03d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
a91abd3 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into timeline-viewer-feature
ef34a5b [Kousuke Saruta] Implement tooltip using bootstrap
b09d0c5 [Kousuke Saruta] Move `stroke` and `fill` attribute of rect elements to css
d3c63c8 [Kousuke Saruta] Fixed a little bit bugs
a36291b [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into timeline-viewer-feature
28714b6 [Kousuke Saruta] Fixed highlight issue
0dc4278 [Kousuke Saruta] Addressed most of Patrics's feedbacks
8110acf [Kousuke Saruta] Added scroll limit to Job timeline
974a64a [Kousuke Saruta] Removed unused function
ee7a7f0 [Kousuke Saruta] Refactored
6a91872 [Kousuke Saruta] Temporary commit
6693f34 [Kousuke Saruta] Added link to job/stage box in the timeline in order to move to corresponding row when we click
8f88222 [Kousuke Saruta] Added job/stage description
aeed4b1 [Kousuke Saruta] Removed stage timeline
fc1696c [Kousuke Saruta] Merge branch 'timeline-viewer-feature' of github.com:sarutak/spark into timeline-viewer-feature
999ccd4 [Kousuke Saruta] Improved scalability
0fc6a31 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
19815ae [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
68b7540 [Kousuke Saruta] Merge branch 'timeline-viewer-feature' of github.com:sarutak/spark into timeline-viewer-feature
52b5f0b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
dec85db [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
fcdab7d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
dab7cc1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
09cce97 [Kousuke Saruta] Cleanuped
16f82cf [Kousuke Saruta] Cleanuped
9fb522e [Kousuke Saruta] Cleanuped
d05f2c2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
e85e9aa [Kousuke Saruta] Cleanup: Added TimelineViewUtils.scala
a76e569 [Kousuke Saruta] Removed unused setting in timeline-view.css
5ce1b21 [Kousuke Saruta] Added vis.min.js, vis.min.css and vis.map to .rat-exclude
082f709 [Kousuke Saruta] Added Timeline-View feature for Applications, Jobs and Stages
2015-05-15 13:54:26 -07:00
ehnalis 8e3822a079 [SPARK-7504] [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode
Added a simple check for SparkContext.
Also added two null checks in the AM object.

Author: ehnalis <zoltan.zvara@gmail.com>

Closes #6083 from ehnalis/cluster and squashes the following commits:

926bd96 [ehnalis] Moved check to SparkContext.
7c89b6e [ehnalis] Remove false line.
ea2a5fe [ehnalis] [SPARK-7504] [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode
4924e01 [ehnalis] [SPARK-7504] [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode
39e4fa3 [ehnalis] SPARK-7504 [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode
9f287c5 [ehnalis] [SPARK-7504] [YARN] NullPointerException when initializing SparkContext in YARN-cluster mode
2015-05-15 12:15:42 -07:00
Kousuke Saruta ad92af9dbb [SPARK-7664] [WEBUI] DAG visualization: Fix incorrect link paths of DAG.
In JobPage, we can jump to a StagePage by clicking the corresponding box in the DAG viz, but the link path is incorrect.

When we click a box like the following ...
![screenshot_from_2015-05-15 19 24 25](https://cloud.githubusercontent.com/assets/4736016/7651528/5f7ef824-fb3c-11e4-9518-8c9ade2dff7a.png)

We jump to the index page instead.
![screenshot_from_2015-05-15 19 24 45](https://cloud.githubusercontent.com/assets/4736016/7651534/6d666274-fb3c-11e4-971c-c3f2dc2b1da2.png)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #6184 from sarutak/fix-link-path-of-dag-viz and squashes the following commits:

faba3ba [Kousuke Saruta] Fix an incorrect link
2015-05-15 11:54:32 -07:00
Tim Ellison 270d4b5181 [CORE] Protect additional test vars from early GC
Fix more places in which some test variables could be collected early by aggressive JVM optimization.
Added a couple of comments to note where existing references are sufficient in the same test pattern.

Author: Tim Ellison <t.p.ellison@gmail.com>

Closes #6187 from tellison/DefeatEarlyGC and squashes the following commits:

27329d9 [Tim Ellison] [CORE] Protect additional test vars from early GC
2015-05-15 11:27:24 -07:00
Oleksii Kostyliev b1b9d5802e [SPARK-7233] [CORE] Detect REPL mode once
<h3>Description</h3>
Detect REPL mode once per JVM lifespan.
Previously, the presence of interpreter mode was checked every time a job was submitted. When executing many short-lived jobs, this caused massive mutual blocking between submission threads (the once-per-JVM pattern is sketched after this entry).

For more details please refer to https://issues.apache.org/jira/browse/SPARK-7233.

<h3>Notes</h3>
* I inverted the return value in the exception-catching case from `true` to `false`. It seems more logical to assume that if the REPL class is not found, we aren't in interpreter mode.
* I'd personally call `classForName` with just a Spark classloader (`org.apache.spark.util.Utils#getSparkClassLoader`), but `org.apache.spark.util.Utils#getContextOrSparkClassLoader` is said to be preferable.
* I struggled to come up with a concise, readable and clear unit test. Suggestions are welcome if you feel one is necessary.

Author: Oleksii Kostyliev <etander@gmail.com>
Author: Oleksii Kostyliev <okostyliev@thunderhead.com>

Closes #5835 from preeze/SPARK-7233 and squashes the following commits:

69bb9e4 [Oleksii Kostyliev] SPARK-7527: fixed explanatory comment to meet style-checker requirements
26dcc24 [Oleksii Kostyliev] SPARK-7527: fixed explanatory comment to meet style-checker requirements
c6f9685 [Oleksii Kostyliev] Merge remote-tracking branch 'remotes/upstream/master' into SPARK-7233
b78a983 [Oleksii Kostyliev] SPARK-7527: revert the fix and let it be addressed separately at a later stage
b64d441 [Oleksii Kostyliev] SPARK-7233: inline inInterpreter parameter into instantiateClass
86e2606 [Oleksii Kostyliev] SPARK-7233, SPARK-7527: Handle interpreter mode properly.
c7ee69c [Oleksii Kostyliev] Merge remote-tracking branch 'upstream/master' into SPARK-7233
d6c07fc [Oleksii Kostyliev] SPARK-7233: properly handle the inverted meaning of isInInterpreter
c319039 [Oleksii Kostyliev] SPARK-7233: move inInterpreter to Utils and make it lazy
2015-05-15 11:20:19 -07:00
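
A sketch of detecting the REPL once per JVM with a `lazy val` (the class name and classloader choice are illustrative assumptions):

```scala
object ReplDetectSketch {
  // Evaluated at most once; concurrent job submissions no longer repeat the
  // reflective lookup or block each other on it.
  lazy val inInterpreter: Boolean =
    try {
      Class.forName("org.apache.spark.repl.Main", false,
        Thread.currentThread().getContextClassLoader)
      true
    } catch {
      // If the REPL class is absent, assume we are NOT in the interpreter
      // (the inverted return value discussed above).
      case _: ClassNotFoundException => false
    }
}
```
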
zsxwing cf842d42a7 [SPARK-7650] [STREAMING] [WEBUI] Move streaming css and js files to the streaming project
cc tdas

Author: zsxwing <zsxwing@gmail.com>

Closes #6160 from zsxwing/SPARK-7650 and squashes the following commits:

fe6ae15 [zsxwing] Fix the import order
a4ffd99 [zsxwing] Merge branch 'master' into SPARK-7650
dc402b6 [zsxwing] Move streaming css and js files to the streaming project
2015-05-14 23:51:41 -07:00
Kan Zhang daf4ae72fe [CORE] Remove unreachable Heartbeat message from Worker
It doesn't look to me like Heartbeat is sent to the Worker by anyone.

Author: Kan Zhang <kzhang@apache.org>

Closes #6163 from kanzhang/deadwood and squashes the following commits:

56be118 [Kan Zhang] [core] Remove unreachable Heartbeat message from Worker
2015-05-14 23:50:50 -07:00
Josh Rosen 7da33ce505 [HOTFIX] Add workaround for SPARK-7660 to fix JavaAPISuite failures. 2015-05-14 23:17:41 -07:00
zsxwing b208f998b5 [SPARK-7645] [STREAMING] [WEBUI] Show milliseconds in the UI if the batch interval < 1 second
I also updated the summary of the Streaming page.

![screen shot 2015-05-14 at 11 52 59 am](https://cloud.githubusercontent.com/assets/1000778/7640103/13cdf68e-fa36-11e4-84ec-e2a3954f4319.png)
![screen shot 2015-05-14 at 12 39 33 pm](https://cloud.githubusercontent.com/assets/1000778/7640151/4cc066ac-fa36-11e4-8494-2821d6a6f17c.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6154 from zsxwing/SPARK-7645 and squashes the following commits:

5db6ca1 [zsxwing] Add UIUtils.formatBatchTime
e4802df [zsxwing] Show milliseconds in the UI if the batch interval < 1 second
2015-05-14 16:58:36 -07:00
zsxwing 0a317c124c [SPARK-7649] [STREAMING] [WEBUI] Use window.localStorage to store the status rather than the url
Use window.localStorage to store the status rather than the url so that the url won't be changed.

cc tdas

Author: zsxwing <zsxwing@gmail.com>

Closes #6158 from zsxwing/SPARK-7649 and squashes the following commits:

3c56fef [zsxwing] Use window.localStorage to store the status rather than the url
2015-05-14 16:57:33 -07:00
Xiangrui Meng 57ed16cf93 [SPARK-7643] [UI] use the correct size in RDDPage for storage info and partitions
`dataDistribution` and `partitions` are `Option[Seq[_]]`. andrewor14 squito

Author: Xiangrui Meng <meng@databricks.com>

Closes #6157 from mengxr/SPARK-7643 and squashes the following commits:

99fe8a4 [Xiangrui Meng] use the correct size in RDDPage for storage info and partitions
2015-05-14 16:56:32 -07:00
Rex Xiong 93dbb3ad83 [SPARK-7598] [DEPLOY] Add aliveWorkers metrics in Master
In a Spark Standalone setup, when some workers are DEAD, they stay in the master's worker list for a while.
The master.workers metric only shows the total number of workers; we need to monitor how many workers are really ALIVE to ensure the cluster is healthy (sketched after this entry).

Author: Rex Xiong <pengx@microsoft.com>

Closes #6117 from twilightgod/add-aliveWorker-metrics and squashes the following commits:

6be69a5 [Rex Xiong] Fix comment for aliveWorkers metrics
a882f39 [Rex Xiong] Fix style for aliveWorkers metrics
38ce955 [Rex Xiong] Add aliveWorkers metrics in Master
2015-05-14 16:55:31 -07:00
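
A sketch of the metric, using the Dropwizard `Gauge` that Spark's metrics system is built on (field names and the state representation are illustrative):

```scala
import com.codahale.metrics.Gauge

// Reports only workers whose state is ALIVE, alongside the existing
// total-worker count.
class AliveWorkersGaugeSketch(workerStates: () => Seq[String]) {
  val aliveWorkers: Gauge[Int] = new Gauge[Int] {
    override def getValue: Int = workerStates().count(_ == "ALIVE")
  }
}
```
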
Andrew Or 3113da9c70 [HOT FIX #6125] Do not wait for all stages to start rendering
zsxwing

Author: Andrew Or <andrew@databricks.com>

Closes #6138 from andrewor14/dag-viz-clean-properly and squashes the following commits:

19d4e98 [Andrew Or] Add synchronize
02542d6 [Andrew Or] Rename overloaded variable
d11bee1 [Andrew Or] Don't wait until all stages have started before rendering
2015-05-13 21:04:13 -07:00
Josh Rosen 73bed408fb [SPARK-7081] Faster sort-based shuffle path using binary processing cache-aware sort
This patch introduces a new shuffle manager that enhances the existing sort-based shuffle with a new cache-friendly sort algorithm that operates directly on binary data. The goals of this patch are to lower memory usage and Java object overheads during shuffle and to speed up sorting. It also lays groundwork for follow-up patches that will enable end-to-end processing of serialized records.

The new shuffle manager, `UnsafeShuffleManager`, can be enabled by setting `spark.shuffle.manager=tungsten-sort` in SparkConf (a configuration sketch follows this entry).

The new shuffle manager uses directly-managed memory to implement several performance optimizations for certain types of shuffles. In cases where the new performance optimizations cannot be applied, the new shuffle manager delegates to SortShuffleManager to handle those shuffles.

UnsafeShuffleManager's optimizations will apply when _all_ of the following conditions hold:

 - The shuffle dependency specifies no aggregation or output ordering.
 - The shuffle serializer supports relocation of serialized values (this is currently supported
   by KryoSerializer and Spark SQL's custom serializers).
 - The shuffle produces fewer than 16777216 output partitions.
 - No individual record is larger than 128 MB when serialized.

In addition, extra spill-merging optimizations are automatically applied when the shuffle compression codec supports concatenation of serialized streams. This is currently supported by Spark's LZF compression codec.

At a high level, UnsafeShuffleManager's design is similar to Spark's existing SortShuffleManager.  In sort-based shuffle, incoming records are sorted according to their target partition ids, then written to a single map output file. Reducers fetch contiguous regions of this file in order to read their portion of the map output. In cases where the map output data is too large to fit in memory, sorted subsets of the output can be spilled to disk, and those on-disk files are merged to produce the final output file.

UnsafeShuffleManager optimizes this process in several ways:

 - Its sort operates on serialized binary data rather than Java objects, which reduces memory consumption and GC overheads. This optimization requires the record serializer to have certain properties to allow serialized records to be re-ordered without requiring deserialization.  See SPARK-4550, where this optimization was first proposed and implemented, for more details.

 - It uses a specialized cache-efficient sorter (UnsafeShuffleExternalSorter) that sorts arrays of compressed record pointers and partition ids. By using only 8 bytes of space per record in the sorting array, this fits more of the array into cache.

 - The spill merging procedure operates on blocks of serialized records that belong to the same partition and does not need to deserialize records during the merge.

 - When the spill compression codec supports concatenation of compressed data, the spill merge simply concatenates the serialized and compressed spill partitions to produce the final output partition.  This allows efficient data copying methods, like NIO's `transferTo`, to be used and avoids the need to allocate decompression or copying buffers during the merge.

The shuffle read path is unchanged.

This patch is similar to [SPARK-4550](http://issues.apache.org/jira/browse/SPARK-4550) / #4450 but uses a slightly different implementation. The `unsafe`-based implementation featured in this patch lays the groundwork for followup patches that will enable sorting to operate on serialized data pages that will be prepared by Spark SQL's new `unsafe` operators (such as the new aggregation operator introduced in #5725).

### Future work

There are several tasks that build upon this patch, which will be left to future work:

- [SPARK-7271](https://issues.apache.org/jira/browse/SPARK-7271) Redesign / extend the shuffle interfaces to accept binary data as input. The goal here is to let us bypass serialization steps in cases where the sort input is produced by an operator that operates directly on binary data.
- Extension / redesign of the `Serializer` API. We can add new methods which allow serializers to determine the size requirements for serializing objects and for serializing objects directly to a specified memory address (similar to how `UnsafeRowConverter` works in Spark SQL).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5868 from JoshRosen/unsafe-sort and squashes the following commits:

ef0a86e [Josh Rosen] Fix scalastyle errors
7610f2f [Josh Rosen] Add tests for proper cleanup of shuffle data.
d494ffe [Josh Rosen] Fix deserialization of JavaSerializer instances.
52a9981 [Josh Rosen] Fix some bugs in the address packing code.
51812a7 [Josh Rosen] Change shuffle manager sort name to tungsten-sort
4023fa4 [Josh Rosen] Add @Private annotation to some Java classes.
de40b9d [Josh Rosen] More comments to try to explain metrics code
df07699 [Josh Rosen] Attempt to clarify confusing metrics update code
5e189c6 [Josh Rosen] Track time spend closing / flushing files; split TimeTrackingOutputStream into separate file.
d5779c6 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
c2ce78e [Josh Rosen] Fix a missed usage of MAX_PARTITION_ID
e3b8855 [Josh Rosen] Cleanup in UnsafeShuffleWriter
4a2c785 [Josh Rosen] rename 'sort buffer' to 'pointer array'
6276168 [Josh Rosen] Remove ability to disable spilling in UnsafeShuffleExternalSorter.
57312c9 [Josh Rosen] Clarify fileBufferSize units
2d4e4f4 [Josh Rosen] Address some minor comments in UnsafeShuffleExternalSorter.
fdcac08 [Josh Rosen] Guard against overflow when expanding sort buffer.
85da63f [Josh Rosen] Cleanup in UnsafeShuffleSorterIterator.
0ad34da [Josh Rosen] Fix off-by-one in nextInt() call
56781a1 [Josh Rosen] Rename UnsafeShuffleSorter to UnsafeShuffleInMemorySorter
e995d1a [Josh Rosen] Introduce MAX_SHUFFLE_OUTPUT_PARTITIONS.
e58a6b4 [Josh Rosen] Add more tests for PackedRecordPointer encoding.
4f0b770 [Josh Rosen] Attempt to implement proper shuffle write metrics.
d4e6d89 [Josh Rosen] Update to bit shifting constants
69d5899 [Josh Rosen] Remove some unnecessary override vals
8531286 [Josh Rosen] Add tests that automatically trigger spills.
7c953f9 [Josh Rosen] Add test that covers UnsafeShuffleSortDataFormat.swap().
e1855e5 [Josh Rosen] Fix a handful of misc. IntelliJ inspections
39434f9 [Josh Rosen] Avoid integer multiplication overflow in getMemoryUsage (thanks FindBugs!)
1e3ad52 [Josh Rosen] Delete unused ByteBufferOutputStream class.
ea4f85f [Josh Rosen] Roll back an unnecessary change in Spillable.
ae538dc [Josh Rosen] Document UnsafeShuffleManager.
ec6d626 [Josh Rosen] Add notes on maximum # of supported shuffle partitions.
0d4d199 [Josh Rosen] Bump up shuffle.memoryFraction to make tests pass.
b3b1924 [Josh Rosen] Properly implement close() and flush() in DummySerializerInstance.
1ef56c7 [Josh Rosen] Revise compression codec support in merger; test cross product of configurations.
b57c17f [Josh Rosen] Disable some overly-verbose logs that rendered DEBUG useless.
f780fb1 [Josh Rosen] Add test demonstrating which compression codecs support concatenation.
4a01c45 [Josh Rosen] Remove unnecessary log message
27b18b0 [Josh Rosen] That for inserting records AT the max record size.
fcd9a3c [Josh Rosen] Add notes + tests for maximum record / page sizes.
9d1ee7c [Josh Rosen] Fix MiMa excludes for ShuffleWriter change
fd4bb9e [Josh Rosen] Use own ByteBufferOutputStream rather than Kryo's
67d25ba [Josh Rosen] Update Exchange operator's copying logic to account for new shuffle manager
8f5061a [Josh Rosen] Strengthen assertion to check partitioning
01afc74 [Josh Rosen] Actually read data in UnsafeShuffleWriterSuite
1929a74 [Josh Rosen] Update to reflect upstream ShuffleBlockManager -> ShuffleBlockResolver rename.
e8718dd [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
9b7ebed [Josh Rosen] More defensive programming RE: cleaning up spill files and memory after errors
7cd013b [Josh Rosen] Begin refactoring to enable proper tests for spilling.
722849b [Josh Rosen] Add workaround for transferTo() bug in merging code; refactor tests.
9883e30 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
b95e642 [Josh Rosen] Refactor and document logic that decides when to spill.
1ce1300 [Josh Rosen] More minor cleanup
5e8cf75 [Josh Rosen] More minor cleanup
e67f1ea [Josh Rosen] Remove upper type bound in ShuffleWriter interface.
cfe0ec4 [Josh Rosen] Address a number of minor review comments:
8a6fe52 [Josh Rosen] Rename UnsafeShuffleSpillWriter to UnsafeShuffleExternalSorter
11feeb6 [Josh Rosen] Update TODOs related to shuffle write metrics.
b674412 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-sort
aaea17b [Josh Rosen] Add comments to UnsafeShuffleSpillWriter.
4f70141 [Josh Rosen] Fix merging; now passes UnsafeShuffleSuite tests.
133c8c9 [Josh Rosen] WIP towards testing UnsafeShuffleWriter.
f480fb2 [Josh Rosen] WIP in mega-refactoring towards shuffle-specific sort.
57f1ec0 [Josh Rosen] WIP towards packed record pointers for use in optimized shuffle sort.
69232fd [Josh Rosen] Enable compressible address encoding for off-heap mode.
7ee918e [Josh Rosen] Re-order imports in tests
3aeaff7 [Josh Rosen] More refactoring and cleanup; begin cleaning iterator interfaces
3490512 [Josh Rosen] Misc. cleanup
f156a8f [Josh Rosen] Hacky metrics integration; refactor some interfaces.
2776aca [Josh Rosen] First passing test for ExternalSorter.
5e100b2 [Josh Rosen] Super-messy WIP on external sort
595923a [Josh Rosen] Remove some unused variables.
8958584 [Josh Rosen] Fix bug in calculating free space in current page.
f17fa8f [Josh Rosen] Add missing newline
c2fca17 [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
b8a09fe [Josh Rosen] Back out accidental log4j.properties change
bfc12d3 [Josh Rosen] Add tests for serializer relocation property.
240864c [Josh Rosen] Remove PrefixComputer and require prefix to be specified as part of insert()
1433b42 [Josh Rosen] Store record length as int instead of long.
026b497 [Josh Rosen] Re-use a buffer in UnsafeShuffleWriter
0748458 [Josh Rosen] Port UnsafeShuffleWriter to Java.
87e721b [Josh Rosen] Renaming and comments
d3cc310 [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
e2d96ca [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
e267cee [Josh Rosen] Fix compilation of UnsafeSorterSuite
9c6cf58 [Josh Rosen] Refactor to use DiskBlockObjectWriter.
253f13e [Josh Rosen] More cleanup
8e3ec20 [Josh Rosen] Begin code cleanup.
4d2f5e1 [Josh Rosen] WIP
3db12de [Josh Rosen] Minor simplification and sanity checks in UnsafeSorter
767d3ca [Josh Rosen] Fix invalid range in UnsafeSorter.
e900152 [Josh Rosen] Add test for empty iterator in UnsafeSorter
57a4ea0 [Josh Rosen] Make initialSize configurable in UnsafeSorter
abf7bfe [Josh Rosen] Add basic test case.
81d52c5 [Josh Rosen] WIP on UnsafeSorter
2015-05-13 17:07:31 -07:00
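
Enabling the new shuffle manager, per the description above (`KryoSerializer` satisfies the serialized-value-relocation condition):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("tungsten-sort-demo")
  .set("spark.shuffle.manager", "tungsten-sort")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
// Shuffles that don't meet the listed conditions fall back to the
// sort-based path automatically.
val sc = new SparkContext(conf)
```
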
Andrew Or aa1837875a [SPARK-7502] DAG visualization: gracefully handle removed stages
Old stages are removed without much feedback to the user. This happens very often in streaming. See screenshots below for more detail. zsxwing

**Before**

<img src="https://cloud.githubusercontent.com/assets/2133137/7621031/643cc1e0-f978-11e4-8f42-09decaac44a7.png" width="500px"/>

-------------------------
**After**
<img src="https://cloud.githubusercontent.com/assets/2133137/7621037/6e37348c-f978-11e4-84a5-e44e154f9b13.png" width="400px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6132 from andrewor14/dag-viz-remove-gracefully and squashes the following commits:

43175cd [Andrew Or] Handle removed jobs and stages gracefully
2015-05-13 16:29:52 -07:00
Andrew Or 44403414d3 [SPARK-7464] DAG visualization: highlight the same RDDs on hover
This is pretty useful for MLlib.

<img src="https://cloud.githubusercontent.com/assets/2133137/7599650/c7d03dd8-f8b8-11e4-8c0a-0a89e786c90f.png" width="400px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6100 from andrewor14/dag-viz-hover and squashes the following commits:

fefe2af [Andrew Or] Link tooltips for nodes that belong to the same RDD
90c6a7e [Andrew Or] Assign classes to clusters and nodes, not IDs
2015-05-13 16:29:10 -07:00
Andrew Or f88ac70155 [SPARK-7399] Spark compilation error for scala 2.11
Subsequent fix following #5966. I tried this out locally.

Author: Andrew Or <andrew@databricks.com>

Closes #6129 from andrewor14/211-compilation and squashes the following commits:

713868f [Andrew Or] Fix compilation issue for scala 2.11
2015-05-13 16:28:37 -07:00
Andrew Or f6e18388d9 [SPARK-7608] Clean up old state in RDDOperationGraphListener
This is necessary for streaming and long-running Spark applications. zsxwing tdas

Author: Andrew Or <andrew@databricks.com>

Closes #6125 from andrewor14/viz-listener-leak and squashes the following commits:

8660949 [Andrew Or] Fix thing + add tests
33c0843 [Andrew Or] Clean up old job state
2015-05-13 16:27:48 -07:00
Tim Ellison 51030b8a9d [MINOR] [CORE] Accept alternative mesos unsatisfied link error in test.
The IBM JVM reports an failed library load with a slightly different error message to Oracle's JVM.  Update the test case to allow for either form.

Author: Tim Ellison <tellison@users.noreply.github.com>
Author: Tim Ellison <t.p.ellison@gmail.com>

Closes #6119 from tellison/LibraryLoading and squashes the following commits:

2c5cd4e [Tim Ellison] Reduce assertion to check for the mesos library name
f48c194 [Tim Ellison] Split long line
b1079d7 [Tim Ellison] [MINOR] [CORE] Accept alternative mesos unsatisfied link error in test.
2015-05-13 21:16:32 +01:00
Tim Ellison 3cd9ad2406 [MINOR] Enhance SizeEstimator to detect IBM compressed refs and s390 arch.

 - zSeries 64-bit Java reports its architecture as s390x, so enhance the 64-bit check to accommodate that value.

 - SizeEstimator can detect whether IBM Java is using compressed object pointers from info in the "java.vm.info" property, so it will do a better job than failing on the HotSpot MBean and guessing (both checks are sketched after this entry).

Author: Tim Ellison <t.p.ellison@gmail.com>

Closes #6085 from tellison/SizeEstimator and squashes the following commits:

1b6ff6a [Tim Ellison] Merge branch 'master' of https://github.com/apache/spark into SizeEstimator
0968989 [Tim Ellison] [MINOR] Enhance SizeEstimator to detect IBM compressed refs and s390 arch.
2015-05-13 21:01:42 +01:00
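
A sketch of the two checks described above (the exact property tokens are assumptions):

```scala
object ArchDetectSketch {
  // zSeries 64-bit Java reports "s390x", which a naive contains("64") check misses.
  val is64bit: Boolean = {
    val arch = System.getProperty("os.arch", "")
    arch.contains("64") || arch == "s390x"
  }

  // IBM JVMs advertise compressed object pointers in "java.vm.info", so there
  // is no need to fall back to the HotSpot MBean and guessing.
  val ibmCompressedOops: Boolean =
    Option(System.getProperty("java.vm.info")).exists(_.contains("Compressed Ref"))
}
```
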
zsxwing bec938f777 [SPARK-7589] [STREAMING] [WEBUI] Make "Input Rate" in the Streaming page consistent with other pages
This PR makes "Input Rate" in the Streaming page consistent with Job and Stage pages.

![screen shot 2015-05-12 at 5 03 35 pm](https://cloud.githubusercontent.com/assets/1000778/7601444/f943f8ac-f8ca-11e4-8280-a715d814f434.png)
![screen shot 2015-05-12 at 5 07 25 pm](https://cloud.githubusercontent.com/assets/1000778/7601445/f9571c0c-f8ca-11e4-9b12-9317cb55c002.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6102 from zsxwing/SPARK-7589 and squashes the following commits:

2745225 [zsxwing] Make "Input Rate" in the Streaming page consistent with other pages
2015-05-13 10:01:26 -07:00
Masayoshi TSUZUKI 50c7270801 [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path
Escape spaces in the arguments.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5447 from tsudukim/feature/SPARK-6568-2 and squashes the following commits:

3f9a188 [Masayoshi TSUZUKI] modified some errors.
ed46047 [Masayoshi TSUZUKI] avoid scalastyle errors.
1784239 [Masayoshi TSUZUKI] removed Utils.formatPath.
e03f289 [Masayoshi TSUZUKI] removed testWindows from Utils.resolveURI and Utils.resolveURIs. replaced SystemUtils.IS_OS_WINDOWS to Utils.isWindows. removed Utils.formatPath from PythonRunner.scala.
84c33d0 [Masayoshi TSUZUKI] - use resolveURI in nonLocalPaths - run tests for Windows path only on Windows
016128d [Masayoshi TSUZUKI] fixed to use File.toURI()
2c62e3b [Masayoshi TSUZUKI] Merge pull request #1 from sarutak/SPARK-6568-2
7019a8a [Masayoshi TSUZUKI] Merge branch 'master' of https://github.com/apache/spark into feature/SPARK-6568-2
45946ee [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6568-2
10f1c73 [Kousuke Saruta] Added a comment
93c3c40 [Kousuke Saruta] Merge branch 'classpath-handling-fix' of github.com:sarutak/spark into SPARK-6568-2
649da82 [Kousuke Saruta] Fix classpath handling
c7ba6a7 [Masayoshi TSUZUKI] [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path
2015-05-13 09:43:40 +01:00
linweizhong 98195c3031 [SPARK-7526] [SPARKR] Specify ip of RBackend, MonitorServer and RRDD Socket server
These R processes are only used to communicate with the JVM process locally, so binding to localhost is more reasonable than the wildcard ip (sketched after this entry).

Author: linweizhong <linweizhong@huawei.com>

Closes #6053 from Sephiroth-Lin/spark-7526 and squashes the following commits:

5303af7 [linweizhong] bind to localhost rather than wildcard ip
2015-05-12 23:55:44 -07:00
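
The shape of the change on the JVM side (illustrative): bind helper sockets to the loopback address rather than the wildcard address:

```scala
import java.net.{InetAddress, ServerSocket}

// Port 0 picks any free port; binding to localhost keeps the socket
// unreachable from other hosts, which is all these local-only helpers need.
val server = new ServerSocket(0, 1, InetAddress.getByName("localhost"))
```
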
zsxwing 1422e79e51 [SPARK-7406] [STREAMING] [WEBUI] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"
Screenshots:
![screen shot 2015-05-06 at 2 29 03 pm](https://cloud.githubusercontent.com/assets/1000778/7504129/9c57f710-f3fc-11e4-9c6e-1b79c17c546d.png)

![screen shot 2015-05-06 at 2 24 35 pm](https://cloud.githubusercontent.com/assets/1000778/7504140/b63bb216-f3fc-11e4-83a5-6dfc6481d192.png)

tdas as we discussed offline

Author: zsxwing <zsxwing@gmail.com>

Closes #5952 from zsxwing/SPARK-7406 and squashes the following commits:

2b004ea [zsxwing] Merge branch 'master' into SPARK-7406
e9eb506 [zsxwing] Update tooltip contents
2215b2a [zsxwing] Add tooltips for "Scheduling Delay", "Processing Time" and "Total Delay"
2015-05-12 14:41:21 -07:00
Andrew Or b9b01f44f6 [HOT FIX #6076] DAG visualization: curve the edges 2015-05-12 12:06:30 -07:00
Andrew Or 65697bbeaf [SPARK-7500] DAG visualization: move cluster labeling to dagre-d3
This fixes the label bleeding issue described in the JIRA and pictured in the screenshots below. I also took the opportunity to move some code to the places where it belongs. In particular:

(1) Drawing cluster labels is now implemented in my branch of dagre-d3 instead of in Spark
(2) All graph styling is now moved from Scala to JS

Note that these changes are related because our existing mechanism of "tacking on cluster labels" afterwards isn't flexible enough for us to fix issues like this one easily. For the other half of the changes, visit http://github.com/andrewor14/dagre-d3.

-------------------

**Before.**
<img src="https://cloud.githubusercontent.com/assets/2133137/7582769/b1423440-f845-11e4-8248-b3446a01bf79.png" width="300px"/>

-------------------

**After.**
<img src="https://cloud.githubusercontent.com/assets/2133137/7582742/74891ae6-f845-11e4-96c4-41c7b8aedbdf.png" width="400px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6076 from andrewor14/dag-viz-bleed and squashes the following commits:

5858d7a [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-bleed
c686dc4 [Andrew Or] Fix tooltip placement
d908c36 [Andrew Or] Add link to dagre-d3 changes (minor)
4a4fb58 [Andrew Or] Fix bleeding + move all styling to JS
2015-05-12 11:17:59 -07:00
Cheng Lian 0595b6de8f [SPARK-3928] [SPARK-5182] [SQL] Partitioning support for the data sources API
This PR adds partitioning support for the external data sources API. It aims to simplify development of file system based data sources, and provide first class partitioning support for both read path and write path.  Existing data sources like JSON and Parquet can be simplified with this work.

## New features provided

1. Hive compatible partition discovery

   This actually generalizes the partition discovery strategy used in the Parquet data source in Spark 1.3.0 (an example layout appears after this entry).

1. Generalized partition pruning optimization

   Now partition pruning is handled during physical planning phase.  Specific data sources don't need to worry about this harness anymore.

   (This also implies that we can remove `CatalystScan` after migrating the Parquet data source, since now we don't need to pass Catalyst expressions to data source implementations.)

1. Insertion with dynamic partitions

   When inserting data to a `FSBasedRelation`, data can be partitioned dynamically by specified partition columns.

## New structures provided

### Developer API

1. `FSBasedRelation`

   Base abstract class for file system based data sources.

1. `OutputWriter`

   Base abstract class for output row writers, responsible for writing a single row object.

1. `FSBasedRelationProvider`

   A new relation provider for `FSBasedRelation` subclasses. Note that data sources extending `FSBasedRelation` don't need to extend `RelationProvider` and `SchemaRelationProvider`.

### User API

New overloaded versions of

1. `DataFrame.save()`
1. `DataFrame.saveAsTable()`
1. `SQLContext.load()`

are provided to allow users to save/load DataFrames with user defined dynamic partition columns.

### Spark SQL query planning

1. `InsertIntoFSBasedRelation`

   Used to implement write path for `FSBasedRelation`s.

1. New rules for `FSBasedRelation` in `DataSourceStrategy`

   These are added to hook `FSBasedRelation` into physical query plan in read path, and perform partition pruning.

## TODO

- [ ] Use scratch directories when overwriting a table with data selected from itself.

      Currently, this is not supported, because the table being overwritten is always deleted before any data is written to it.

- [ ] When inserting with dynamic partition columns, use external sorter to group the data first.

      This ensures that we only need to open a single `OutputWriter` at a time.  For data sources like Parquet, `OutputWriter`s can be quite memory consuming.  One issue is that this approach breaks the row distribution in the original DataFrame.  However, we didn't promise to preserve data distribution when writing a DataFrame.

- [x] More tests.  Specifically, test cases for

      - [x] Self-join
      - [x] Loading partitioned relations with a subset of partition columns stored in data files.
      - [x] `SQLContext.load()` with user defined dynamic partition columns.

## Parquet data source migration

Parquet data source migration is covered in PR https://github.com/liancheng/spark/pull/6, which is against this PR branch and for preview only. A formal PR needs to be made after this one is merged.

Author: Cheng Lian <lian@databricks.com>

Closes #5526 from liancheng/partitioning-support and squashes the following commits:

5351a1b [Cheng Lian] Fixes compilation error introduced while rebasing
1f9b1a5 [Cheng Lian] Tweaks data schema passed to FSBasedRelations
43ba50e [Cheng Lian] Avoids serializing generated projection code
edf49e7 [Cheng Lian] Removed commented stale code block
348a922 [Cheng Lian] Adds projection in FSBasedRelation.buildScan(requiredColumns, inputPaths)
ad4d4de [Cheng Lian] Enables HDFS style globbing
8d12e69 [Cheng Lian] Fixes compilation error
c71ac6c [Cheng Lian] Addresses comments from @marmbrus
7552168 [Cheng Lian] Fixes typo in MimaExclude.scala
0349e09 [Cheng Lian] Fixes compilation error introduced while rebasing
52b0c9b [Cheng Lian] Adjusts project/MimaExclude.scala
c466de6 [Cheng Lian] Addresses comments
bc3f9b4 [Cheng Lian] Uses projection to separate partition columns and data columns while inserting rows
795920a [Cheng Lian] Fixes compilation error after rebasing
0b8cd70 [Cheng Lian] Adds Scala/Catalyst row conversion when writing non-partitioned tables
fa543f3 [Cheng Lian] Addresses comments
5849dd0 [Cheng Lian] Fixes doc typos.  Fixes partition discovery refresh.
51be443 [Cheng Lian] Replaces FSBasedRelation.outputCommitterClass with FSBasedRelation.prepareForWrite
c4ed4fe [Cheng Lian] Bug fixes and a new test suite
a29e663 [Cheng Lian] Bug fix: should only pass actuall data files to FSBaseRelation.buildScan
5f423d3 [Cheng Lian] Bug fixes. Lets data source to customize OutputCommitter rather than OutputFormat
54c3d7b [Cheng Lian] Enforces that FileOutputFormat must be used
be0c268 [Cheng Lian] Uses TaskAttempContext rather than Configuration in OutputWriter.init
0bc6ad1 [Cheng Lian] Resorts to new Hadoop API, and now FSBasedRelation can customize output format class
f320766 [Cheng Lian] Adds prepareForWrite() hook, refactored writer containers
422ff4a [Cheng Lian] Fixes style issue
ce52353 [Cheng Lian] Adds new SQLContext.load() overload with user defined dynamic partition columns
8d2ff71 [Cheng Lian] Merges partition columns when reading partitioned relations
ca1805b [Cheng Lian] Removes duplicated partition discovery code in new Parquet
f18dec2 [Cheng Lian] More strict schema checking
b746ab5 [Cheng Lian] More tests
9b487bf [Cheng Lian] Fixes compilation errors introduced while rebasing
ea6c8dd [Cheng Lian] Removes remote debugging stuff
327bb1d [Cheng Lian] Implements partitioning support for data sources API
3c5073a [Cheng Lian] Fixes SaveModes used in test cases
fb5a607 [Cheng Lian] Fixes compilation error
9d17607 [Cheng Lian] Adds the contract that OutputWriter should have zero-arg constructor
5de194a [Cheng Lian] Forgot Apache licence header
95d0b4d [Cheng Lian] Renames PartitionedSchemaRelationProvider to FSBasedRelationProvider
770b5ba [Cheng Lian] Adds tests for FSBasedRelation
3ba9bbf [Cheng Lian] Adds DataFrame.saveAsTable() overrides which support partitioning
1b8231f [Cheng Lian] Renames FSBasedPrunedFilteredScan to FSBasedRelation
aa8ba9a [Cheng Lian] Javadoc fix
012ed2d [Cheng Lian] Adds PartitioningOptions
7dd8dd5 [Cheng Lian] Adds new interfaces and stub methods for data sources API partitioning support
2015-05-13 01:32:28 +08:00
Daoyuan Wang d86ce84584 [SPARK-6876] [PySpark] [SQL] add DataFrame na.replace in pyspark
Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #6003 from adrian-wang/pynareplace and squashes the following commits:

672efba [Daoyuan Wang] remove py2.7 feature
4a148f7 [Daoyuan Wang] to_replace support dict, value support single value, and add full tests
9e232e7 [Daoyuan Wang] rename scala map
af0268a [Daoyuan Wang] remove na
63ac579 [Daoyuan Wang] add na.replace in pyspark
2015-05-12 10:23:41 -07:00
Andrew Or f3e8e60063 [SPARK-7467] Dag visualization: treat checkpoint as an RDD operation
This ensures that a checkpoint RDD does not go into random scopes on the UI, e.g. `take`. We've seen this happen in streaming.

Author: Andrew Or <andrew@databricks.com>

Closes #6004 from andrewor14/dag-viz-checkpoint and squashes the following commits:

9217439 [Andrew Or] Fix checkpoints
4ae8806 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-checkpoint
19bc07b [Andrew Or] Treat checkpoint as an RDD operation
2015-05-12 01:40:55 -07:00
Marcelo Vanzin 82e890fb19 [SPARK-7485] [BUILD] Remove pyspark files from assembly.
The sbt part of the build is hacky; it basically tricks sbt
into generating the zip by using a generator, but returns
an empty list for the generated files so that nothing is
actually added to the assembly.
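
A hypothetical build.sbt sketch of that trick (names assumed, not the actual build code): run a resource generator for its side effect, but hand sbt an empty file list.

```
// Build pyspark.zip as a side effect, but report no generated resources,
// so nothing from this task lands in the assembly jar.
resourceGenerators in Compile += Def.task {
  val zip = target.value / "pyspark.zip"
  // ... zip up the python/ directory into `zip` here ...
  Seq.empty[File]  // empty list: sbt adds nothing to the assembly
}.taskValue
```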

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:

22c1e04 [Marcelo Vanzin] Remove unneeded code.
4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.
2015-05-12 01:39:21 -07:00
linweizhong 9847875266 [MINOR] [PYSPARK] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark
As of PR #5580 we create pyspark.zip during the build and set PYTHONPATH to python/lib/pyspark.zip, so update this to keep things consistent.

Author: linweizhong <linweizhong@huawei.com>

Closes #6047 from Sephiroth-Lin/pyspark_pythonpath and squashes the following commits:

8cc3d96 [linweizhong] Set PYTHONPATH to python/lib/pyspark.zip rather than python/pyspark as PR#5580 we have create pyspark.zip on build
2015-05-12 01:36:27 -07:00
zsxwing 8a4edecc48 [SPARK-7534] [CORE] [WEBUI] Fix the Stage table when a stage is missing
Just improved the Stage table when a stage is missing.

Before:

![screen shot 2015-05-11 at 10 11 51 am](https://cloud.githubusercontent.com/assets/1000778/7570842/2ba37380-f7c8-11e4-9b5f-cf1a6264b2a4.png)

After:

![screen shot 2015-05-11 at 10 26 09 am](https://cloud.githubusercontent.com/assets/1000778/7570848/33703152-f7c8-11e4-81a8-d53dd72d7b8d.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6061 from zsxwing/SPARK-7534 and squashes the following commits:

09fe862 [zsxwing] Leave it blank rather than '-'
6299197 [zsxwing] Fix the Stage table when a stage is missing
2015-05-12 01:34:33 -07:00
Steve Loughran 7ce2a33c3a [SPARK-7508] JettyUtils-generated servlets to log & report all errors
Patch for SPARK-7508

This logs a warning, then generates a response which includes the message body and stack trace as text/plain, no-cache. The status code is 500.

In practice (in some tests in SPARK-1537, to be precise), Jetty is getting in between this servlet and the web response the user sees: the body of the response is lost for any error response (500, even 404 and bad request). The standard Jetty handlers must be getting in the way.

This patch doesn't address that; it ensures that
1. if the Jetty handlers were put to one side, the users would see the errors
2. at least the exceptions appear in the server-side logs.

This is better than users saying "I saw a 500 error" while you have nothing in the logs to show what went wrong.
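
A minimal sketch of the idea (not the actual Spark code): log the failure server-side, then reply with an uncacheable text/plain 500 carrying the details.

```
import java.io.{PrintWriter, StringWriter}
import javax.servlet.http.{HttpServlet, HttpServletRequest, HttpServletResponse}

class ErrorReportingServlet(body: HttpServletRequest => String) extends HttpServlet {
  override protected def doGet(req: HttpServletRequest, resp: HttpServletResponse): Unit = {
    try {
      resp.getWriter.write(body(req))
    } catch {
      case e: Exception =>
        e.printStackTrace()                        // Spark would log a warning here
        val trace = new StringWriter()
        e.printStackTrace(new PrintWriter(trace))
        resp.setStatus(HttpServletResponse.SC_INTERNAL_SERVER_ERROR)
        resp.setContentType("text/plain")
        resp.setHeader("Cache-Control", "no-cache, no-store")
        resp.getWriter.write(e.getMessage + "\n" + trace)
    }
  }
}
```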

Author: Steve Loughran <stevel@hortonworks.com>

Closes #6033 from steveloughran/stevel/feature/SPARK-7508-JettyUtils and squashes the following commits:

584836f [Steve Loughran] SPARK-7508 drop trailing semicolon
ad6f185 [Steve Loughran] SPARK-7508: jetty handles exception reporting itself; spark just sets this up and logs exceptions before being relayed
258d9f9 [Steve Loughran] SPARK-7508 fix typo manually-edited before patch pushed
69c8263 [Steve Loughran] SPARK-7508 JettyUtils-generated servlets to log & report all errors
2015-05-11 13:35:42 -07:00
Kousuke Saruta 12b95abc70 [SPARK-7403] [WEBUI] Link URL in objects on Timeline View is wrong in case of running on YARN
When we use Spark on YARN and open AllJobsPage via the ResourceManager's proxy, the link URL in the objects which represent each job on the timeline view is wrong.

In timeline-view.js, the link is generated as follows.
```
window.location.href = "job/?id=" + getJobId(this);
```

This assumes the URL displayed in the web browser ends with "jobs/", but when we access AllJobsPage via the proxy, the displayed URL does not end with "jobs/".

The proxy doesn't return status code 301 or 302, so the URL displayed still indicates the base URL, not "/jobs", even though AllJobsPage is being displayed.

![2015-05-07 3 34 37](https://cloud.githubusercontent.com/assets/4736016/7501079/a8507ad6-f46c-11e4-9bed-62abea170f4c.png)

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5947 from sarutak/fix-link-in-timeline and squashes the following commits:

aaf40e1 [Kousuke Saruta] Added Copyright for vis.js
01bee7b [Kousuke Saruta] Fixed timeline-view.js in order to get correct href
2015-05-09 10:10:29 +01:00
Vinod K C dda6d9f404 [SPARK-7438] [SPARK CORE] Fixed validation of relativeSD in countApproxDistinct
Author: Vinod K C <vinod.kc@huawei.com>

Closes #5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:

3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
8ddbfae [Vinod K C] Remove blank line
b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
122d378 [Vinod K C] Fixed validation of relativeSD in  countApproxDistinct
2015-05-09 10:03:15 +01:00
tedyu 54e6fa0563 [SPARK-7237] Clean function in several RDD methods
Author: tedyu <yuzhihong@gmail.com>

Closes #5959 from ted-yu/master and squashes the following commits:

f83d445 [tedyu] Move cleaning outside of mapPartitionsWithIndex
56d7c92 [tedyu] Consolidate import of Random
f6014c0 [tedyu] Remove cleaning in RDD#filterWith
36feb6c [tedyu] Try to get correct syntax
55d01eb [tedyu] Try to get correct syntax
c2786df [tedyu] Correct syntax
d92bfcf [tedyu] Correct syntax in test
164d3e4 [tedyu] Correct variable name
8b50d93 [tedyu] Address Andrew's review comments
0c8d47e [tedyu] Add test for mapWith()
6846e40 [tedyu] Add test for flatMapWith()
6c124a9 [tedyu] Clean function in several RDD methods
2015-05-08 17:16:38 -07:00
Andrew Or bd61f07039 [SPARK-7469] [SQL] DAG visualization: show SQL query operators
The DAG visualization currently displays only low-level Spark primitives (e.g. `map`, `reduceByKey`, `filter` etc.). For SQL, these aren't particularly useful. Instead, we should display higher level physical operators (e.g. `Filter`, `Exchange`, `ShuffleHashJoin`). cc marmbrus

-----------------
**Before**
<img src="https://issues.apache.org/jira/secure/attachment/12731586/before.png" width="600px"/>
-----------------
**After** (Pay attention to the words)
<img src="https://issues.apache.org/jira/secure/attachment/12731587/after.png" width="600px"/>
-----------------

Author: Andrew Or <andrew@databricks.com>

Closes #5999 from andrewor14/dag-viz-sql and squashes the following commits:

0db23a4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
1e211db [Andrew Or] Update comment
0d49fd6 [Andrew Or] Merge branch 'master' of github.com:apache/spark into dag-viz-sql
ffd237a [Andrew Or] Fix style
202dac1 [Andrew Or] Make ignoreParent false by default
e61b1ab [Andrew Or] Visualize SQL operators, not low-level Spark primitives
569034a [Andrew Or] Add a flag to ignore parent settings and scopes
2015-05-08 17:15:10 -07:00
Aaron Davidson ffdc40ce7a [SPARK-6955] Perform port retries at NettyBlockTransferService level
Currently we're doing port retries at the TransportServer level, but this is not specified by the TransportContext API, and it has further-reaching impacts like causing undesirable behavior for the YARN and Standalone shuffle services.
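
Sketched, the service-level retry loop looks something like this (helper shape assumed for illustration):

```
// Try successive ports until one binds; the transport layer itself no
// longer retries internally.
def startOnFirstFreePort[T](startPort: Int, maxRetries: Int)(bind: Int => T): T = {
  var lastError: Exception = null
  for (offset <- 0 to maxRetries) {
    try return bind(startPort + offset)
    catch { case e: java.net.BindException => lastError = e }
  }
  throw lastError
}
```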

Author: Aaron Davidson <aaron@databricks.com>

Closes #5575 from aarondav/port-bind and squashes the following commits:

3c2d6ed [Aaron Davidson] Oops, never do it.
a5d9432 [Aaron Davidson] Remove shouldHostShuffleServiceIfEnabled
e901eb2 [Aaron Davidson] fix local-cluster mode for ExternalShuffleServiceSuite
59e5e38 [Aaron Davidson] [SPARK-6955] Perform port retries at NettyBlockTransferService level
2015-05-08 17:13:55 -07:00
Marcelo Vanzin 5467c34c3d [SPARK-7378] [CORE] Handle deep links to unloaded apps.
The code was treating deep links as if they were attempt IDs, so
for example if you tried to load "/history/app1/jobs" directly,
that would fail because the code would treat "jobs" as an attempt id.

This change modifies the code to try both cases - first without an
attempt id, then with it, so that deep links are handled correctly.
This assumes that the links in the Spark UI do not clash with the
attempt id namespace, though, which is the case for YARN at least,
which is the only backend that currently publishes attempt IDs.
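
A sketch of the lookup order; `loadAppUi` is a stand-in helper, not the real method:

```
// Stand-in for the real history provider lookup:
def loadAppUi(appId: String, attemptId: Option[String]): Option[AnyRef] = None

// First try the link with no attempt ID (deep link case), then fall back
// to treating the path segment as an attempt ID.
def resolveUI(appId: String, segment: String): Option[AnyRef] =
  loadAppUi(appId, attemptId = None)
    .orElse(loadAppUi(appId, attemptId = Some(segment)))
```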

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5922 from vanzin/SPARK-7378 and squashes the following commits:

96f648b [Marcelo Vanzin] Fix comparison.
ed3bcd4 [Marcelo Vanzin] Merge branch 'master' into SPARK-7378
23483e4 [Marcelo Vanzin] Fat fingers.
b728f08 [Marcelo Vanzin] [SPARK-7378] [core] Handle deep links to unloaded apps.
2015-05-08 14:12:58 -07:00
Marcelo Vanzin 9042f8f378 [MINOR] [CORE] Allow History Server to read kerberos opts from config file.
Order of initialization code was wrong.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5998 from vanzin/hs-conf-fix and squashes the following commits:

00b6b6b [Marcelo Vanzin] [minor] [core] Allow History Server to read kerberos opts from config file.
2015-05-08 14:10:27 -07:00
Andrew Or 3b0c5e71f1 [SPARK-7466] DAG visualization: fix orphan nodes
Simple fix. We were comparing an option with `null`.
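
The bug class, in miniature (illustrative, not the actual diff):

```
val parentId: Option[String] = None
if (parentId != null) { /* always taken: an Option is never null */ }
if (parentId.isDefined) { /* the correct guard */ }
```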

Before:
<img src="https://issues.apache.org/jira/secure/attachment/12731383/before.png" width="250px"/>
After:
<img src="https://issues.apache.org/jira/secure/attachment/12731384/after.png" width="250px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #6002 from andrewor14/dag-viz-orphan-nodes and squashes the following commits:

a1468dc [Andrew Or] Fix null check
2015-05-08 14:09:39 -07:00
Tim Ellison 31da40dfee [MINOR] Defeat early garbage collection of test suite variable
The JVM is free to collect references to variables that no longer participate in a computation.  This simple patch adds an operation to the variable 'rdd' to ensure it is not collected early in the test suite's explicit calls to GC.

ref: http://bugs.java.com/view_bug.do?bug_id=6721588
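
A sketch of the pattern, assuming a local test SparkContext (`getRDDStorageInfo` used here just to stand in for the suite's assertions):

```
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("gc-test"))
val rdd = sc.parallelize(1 to 10).cache()
rdd.count()                               // materialize the cached blocks
System.gc()                               // the suite's explicit GC call
assert(sc.getRDDStorageInfo.nonEmpty)     // inspect storage status
rdd.count()                               // touch `rdd` so it stays reachable past the GC
sc.stop()
```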

Author: Tim Ellison <t.p.ellison@gmail.com>

Closes #6010 from tellison/master and squashes the following commits:

77d1c8f [Tim Ellison] Defeat early garbage collection of test suite variable by aggressive JVMs
2015-05-08 14:08:58 -07:00
Kousuke Saruta c45c09b015 [WEBUI] Remove debug feature for vis.js
`vis.min.js` refers to `vis.map`, which in turn refers to `vis.js`; these are used for debugging `vis.js`, but this debug feature is not needed by Spark itself.

This issue is really minor, so I didn't file it in JIRA.

/CC andrewor14

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5994 from sarutak/remove-debug-feature-for-vis and squashes the following commits:

8be038f [Kousuke Saruta] Remove vis.map entry from .rat-exclude
7404945 [Kousuke Saruta] Removed debug feature for vis.js
2015-05-08 14:06:37 -07:00
Evan Jones 25889d8d97 [SPARK-7490] [CORE] [Minor] MapOutputTracker.deserializeMapStatuses: close input streams
GZIPInputStream allocates native memory that is not freed until close() or
when the finalizer runs. It is best to close() these streams explicitly.
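
The pattern, sketched (illustrative, not the Spark code):

```
import java.io.ByteArrayInputStream
import java.util.zip.GZIPInputStream

def decompress(bytes: Array[Byte]): Array[Byte] = {
  val in = new GZIPInputStream(new ByteArrayInputStream(bytes))
  try {
    Iterator.continually(in.read()).takeWhile(_ != -1).map(_.toByte).toArray
  } finally {
    in.close()  // frees native zlib memory now, not at finalization
  }
}
```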

stephenh made the same change for serializeMapStatuses in commit b0d884f0. This is the same change for deserialize.

(I ran the unit test suite! It seems to have passed. I did not make a JIRA since this seems "trivial", and the guidelines suggest it is not required for trivial changes.)

Author: Evan Jones <ejones@twitter.com>

Closes #5982 from evanj/master and squashes the following commits:

0d76e85 [Evan Jones] [CORE] MapOutputTracker.deserializeMapStatuses: close input streams
2015-05-08 22:00:39 +01:00
Kay Ousterhout 4b3bb0e43c [SPARK-6627] Finished rename to ShuffleBlockResolver
The previous cleanup-commit for SPARK-6627 renamed ShuffleBlockManager
to ShuffleBlockResolver, but didn't rename the associated subclasses and
variables; this commit does that.

I'm unsure whether it's ok to rename ExternalShuffleBlockManager, since that's technically a public class?

cc pwendell

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #5764 from kayousterhout/SPARK-6627 and squashes the following commits:

43add1e [Kay Ousterhout] Spacing fix
96080bf [Kay Ousterhout] Test fixes
d8a5d36 [Kay Ousterhout] [SPARK-6627] Finished rename to ShuffleBlockResolver
2015-05-08 12:24:06 -07:00
Matei Zaharia a1ec08f7ed [SPARK-7298] Harmonize style of new visualizations
- Colors on the timeline now match the rest of the UI
- The expandable buttons to show timeline view, DAG, etc are now more visible
- Timeline text is smaller
- DAG visualization text and colors are more consistent throughout
- Fix some JavaScript style issues
- Various small fixes throughout (e.g. inconsistent capitalization, some confusing names, HTML escaping, etc)

Author: Matei Zaharia <matei@databricks.com>

Closes #5942 from mateiz/ui and squashes the following commits:

def38d0 [Matei Zaharia] Add some tooltips
4c5a364 [Matei Zaharia] Reduce stage and rank separation slightly
43dcbe3 [Matei Zaharia] Some updates to DAG
fac734a [Matei Zaharia] tweaks
6a6705d [Matei Zaharia] More fixes
67629f5 [Matei Zaharia] Various small tweaks
2015-05-08 14:41:42 -04:00
Jacek Lewandowski 35d6a99cbe [SPARK-7436] Fixed instantiation of custom recovery mode factory and added tests
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #5977 from jacek-lewandowski/SPARK-7436 and squashes the following commits:

ff0a3c2 [Jacek Lewandowski] SPARK-7436: Fixed instantiation of custom recovery mode factory and added tests
2015-05-08 11:36:30 -07:00
Imran Rashid c796be70f3 [SPARK-3454] separate json endpoints for data in the UI
Exposes data available in the UI as json over http.  Key points:

* new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
* Uses jersey + jackson for routing & converting POJOs into json
* tests against known results in `HistoryServerSuite`
* also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.

Author: Imran Rashid <irashid@cloudera.com>

Closes #5940 from squito/SPARK-3454_better_test_files and squashes the following commits:

1a72ed6 [Imran Rashid] rats
85fdb3e [Imran Rashid] Merge branch 'no_php' into SPARK-3454
1fc65b0 [Imran Rashid] Revert "Revert "[SPARK-3454] separate json endpoints for data in the UI""
1276900 [Imran Rashid] get rid of giant event file, replace w/ smaller one; check both shuffle read & shuffle write
4e12013 [Imran Rashid] just use test case name for expectation file name
863ef64 [Imran Rashid] rename json files to avoid strange file names and not look like php
2015-05-08 16:54:32 +01:00
Lianhui Wang ebff7327af [SPARK-6869] [PYSPARK] Add pyspark archives path to PYTHONPATH
Based on https://github.com/apache/spark/pull/5478, which provides a PYSPARK_ARCHIVES_PATH env variable. With this PR, we just need to export PYSPARK_ARCHIVES_PATH=/user/spark/pyspark.zip,/user/spark/python/lib/py4j-0.8.2.1-src.zip in conf/spark-env.sh when we don't install PySpark on each node of YARN. I ran Python applications successfully on yarn-client and yarn-cluster with this PR.
andrewor14 sryza Sephiroth-Lin Can you take a look at this? Thanks.

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #5580 from lianhuiwang/SPARK-6869 and squashes the following commits:

66ffa43 [Lianhui Wang] Update Client.scala
c2ad0f9 [Lianhui Wang] Update Client.scala
1c8f664 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
008850a [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
f0b4ed8 [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
150907b [Lianhui Wang] Merge remote-tracking branch 'remotes/apache/master' into SPARK-6869
20402cd [Lianhui Wang] use ZipEntry
9d87c3f [Lianhui Wang] update scala style
e7bd971 [Lianhui Wang] address vanzin's comments
4b8a3ed [Lianhui Wang] use pyArchivesEnvOpt
e6b573b [Lianhui Wang] address vanzin's comments
f11f84a [Lianhui Wang] zip pyspark archives
5192cca [Lianhui Wang] update import path
3b1e4c8 [Lianhui Wang] address tgravescs's comments
9396346 [Lianhui Wang] put zip to make-distribution.sh
0d2baf7 [Lianhui Wang] update import paths
e0179be [Lianhui Wang] add zip pyspark archives in build or sparksubmit
31e8e06 [Lianhui Wang] update code style
9f31dac [Lianhui Wang] update code and add comments
f72987c [Lianhui Wang] add archives path to PYTHONPATH
2015-05-08 08:44:46 -05:00
Zhang, Liye c2f0821aad [SPARK-7392] [CORE] bugfix: Kryo buffer size cannot be larger than 2M
Author: Zhang, Liye <liye.zhang@intel.com>

Closes #5934 from liyezhang556520/kryoBufSize and squashes the following commits:

5707e04 [Zhang, Liye] fix import order
8693288 [Zhang, Liye] replace multiplier with ByteUnit methods
9bf93e9 [Zhang, Liye] add tests
d91e5ed [Zhang, Liye] change kb to mb
2015-05-08 09:10:58 +01:00
Andrew Or 88717ee4e7 [SPARK-7347] DAG visualization: add tooltips to RDDs
This is an addition to #5729.

Here's an example with ALS.
<img src="https://issues.apache.org/jira/secure/attachment/12731039/tooltip.png" width="400px"></img>

Author: Andrew Or <andrew@databricks.com>

Closes #5957 from andrewor14/viz-hover2 and squashes the following commits:

60e3758 [Andrew Or] Add tooltips for RDDs on job page
2015-05-07 12:29:56 -07:00
Andrew Or f1216514b8 [SPARK-7391] DAG visualization: auto expand if linked from another viz
This is an addition to #5729.

If you click into a stage from the DAG viz on the job page, you might expect to expand on the stage. However, once you get to the stage page, you actually have to expand the DAG viz there yourself.

This patch makes this happen automatically. It's a small UX improvement.

Author: Andrew Or <andrew@databricks.com>

Closes #5958 from andrewor14/viz-auto-expand and squashes the following commits:

03cd157 [Andrew Or] Automatically expand DAG viz if from job page
2015-05-07 12:29:18 -07:00
Timothy Chen 4eecf550aa [SPARK-7373] [MESOS] Add docker support for launching drivers in mesos cluster mode.
Using the existing Docker support for Mesos, this also enables the Mesos cluster mode scheduler to launch Spark drivers in Docker images.

This also allows the executors launched by the drivers to run in the same Docker image, by passing through the Docker settings.

Author: Timothy Chen <tnachen@gmail.com>

Closes #5917 from tnachen/spark_cluster_docker and squashes the following commits:

1e842f5 [Timothy Chen] Add docker support for launching drivers in mesos cluster mode.
2015-05-07 12:23:16 -07:00
Tijo Thomas 0c33bf817c [SPARK-7399] [SPARK CORE] Fixed compilation error in scala 2.11
Scala has a deterministic naming scheme for the generated methods that return default arguments. Here, one of the default arguments of the overloaded method has to be removed.

Author: Tijo Thomas <tijoparacka@gmail.com>

Closes #5966 from tijoparacka/fix_compilation_error_in_scala2.11 and squashes the following commits:

c90bba8 [Tijo Thomas] Fixed compilation error in scala 2.11
2015-05-07 12:21:26 -07:00
Andrew Or 71a452b61c [HOT FIX] For DAG visualization #5954 2015-05-06 18:02:51 -07:00
Andrew Or 8fa6829f5e [SPARK-7371] [SPARK-7377] [SPARK-7408] DAG visualization addendum (#5729)
This is a follow-up patch for #5729.

**[SPARK-7408]** Move as much style code from JS to CSS as possible
**[SPARK-7377]** Fix JS error if a job / stage contains only one RDD
**[SPARK-7371]** Decrease emphasis on RDD on stage page as requested by mateiz pwendell

This patch also includes general code clean up.

<img src="https://issues.apache.org/jira/secure/attachment/12730992/before-after.png" width="500px"></img>

Author: Andrew Or <andrew@databricks.com>

Closes #5954 from andrewor14/viz-emphasize-rdd and squashes the following commits:

3c0d4f0 [Andrew Or] Guard against JS error by rendering arrows only if needed
f23e15b [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz-emphasize-rdd
565801f [Andrew Or] Clean up code
9dab5f0 [Andrew Or] Move styling from JS to CSS + clean up code
107c0b6 [Andrew Or] Tweak background color, stroke width, font size etc.
1610c62 [Andrew Or] Implement cluster padding for stage page
2015-05-06 17:52:34 -07:00
Andrew Or fbf1f342a0 [HOT FIX] [SPARK-7418] Ignore flaky SparkSubmitUtilsSuite test 2015-05-06 17:08:39 -07:00
Josh Rosen 845d1d4d0c Add Private annotation.
This was originally added as part of #4435, which was reverted.
2015-05-06 11:03:17 -07:00
Josh Rosen 002c12384d [SPARK-7311] Introduce internal Serializer API for determining if serializers support object relocation
This patch extends the `Serializer` interface with a new `Private` API which allows serializers to indicate whether they support relocation of serialized objects in serializer stream output.

This relocatability property is described in more detail in `Serializer.scala`, but in a nutshell a serializer supports relocation if reordering the bytes of serialized objects in the serialization stream output is equivalent to having re-ordered those elements prior to serializing them.  The optimized shuffle paths introduced in #4450 and #5868 both rely on serializers having this property; this patch just centralizes the logic for determining whether a serializer has it.  I also added tests and comments clarifying when this works for KryoSerializer.
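
Stated as a (hypothetical) testable law, with stand-in types rather than the real `Serializer` API:

```
import java.io.{ByteArrayOutputStream, OutputStream}

trait SerStream { def write(x: Any): Unit; def close(): Unit }
trait Ser { def newStream(out: OutputStream): SerStream }

// Serialize a sequence of records through a single stream.
def bytesOf(ser: Ser, xs: Seq[Any]): Array[Byte] = {
  val bos = new ByteArrayOutputStream()
  val s = ser.newStream(bos)
  xs.foreach(s.write); s.close()
  bos.toByteArray
}

// Relocation holds (for this pair) if the stream output of the swapped
// records equals the swapped concatenation of each record's own bytes.
def relocationHolds(ser: Ser, a: Any, b: Any): Boolean =
  java.util.Arrays.equals(
    bytesOf(ser, Seq(b, a)),
    bytesOf(ser, Seq(b)) ++ bytesOf(ser, Seq(a)))
```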

This change allows the optimizations in #4450 to be applied for shuffles that use `SqlSerializer2`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5924 from JoshRosen/SPARK-7311 and squashes the following commits:

50a68ca [Josh Rosen] Address minor nits
0a7ebd7 [Josh Rosen] Clarify reason why SqlSerializer2 supports this serializer
123b992 [Josh Rosen] Cleanup for submitting as standalone patch.
4aa61b2 [Josh Rosen] Add missing newline
2c1233a [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
0ba75e6 [Josh Rosen] Add tests for serializer relocation property.
450fa21 [Josh Rosen] Back out accidental log4j.properties change
86d4dcd [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
b9624ee [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
2015-05-06 10:52:55 -07:00
zsxwing 9f019c7223 [SPARK-7384][Core][Tests] Fix flaky tests for distributed mode in BroadcastSuite
Fixed the following failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-1.3-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/452/testReport/junit/org.apache.spark.broadcast/BroadcastSuite/Unpersisting_HttpBroadcast_on_executors_and_driver_in_distributed_mode/

The tests should wait until all slaves are up. Otherwise, only some of the `BlockManager`s may be registered, which fails the tests.

Author: zsxwing <zsxwing@gmail.com>

Closes #5925 from zsxwing/SPARK-7384 and squashes the following commits:

783cb7b [zsxwing] Add comments for _jobProgressListener and remove postfixOps
1009ef1 [zsxwing] [SPARK-7384][Core][Tests] Fix flaky tests for distributed mode in BroadcastSuite
2015-05-05 23:25:28 -07:00
Reynold Xin 51b3d41e16 Revert "[SPARK-3454] separate json endpoints for data in the UI"
This reverts commit d49735800d.

The commit broke Spark on Windows.
2015-05-05 19:27:30 -07:00
Sandy Ryza 0092abb47a Some minor cleanup after SPARK-4550.
JoshRosen this PR addresses the comments you left on #4450 after it got merged.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5916 from sryza/sandy-spark-4550-cleanup and squashes the following commits:

dee3d85 [Sandy Ryza] Some minor cleanup after SPARK-4550.
2015-05-05 18:32:16 -07:00
zsxwing 489700c809 [SPARK-6939] [STREAMING] [WEBUI] Add timeline and histogram graphs for streaming statistics
This is the initial work of SPARK-6939. Not yet ready for code review. Here are the screenshots:

![graph1](https://cloud.githubusercontent.com/assets/1000778/7165766/465942e0-e3dc-11e4-9b05-c184b09d75dc.png)

![graph2](https://cloud.githubusercontent.com/assets/1000778/7165779/53f13f34-e3dc-11e4-8714-a4a75b7e09ff.png)

TODOs:
- [x] Display more information on mouse hover
- [x] Align the timeline and distribution graphs
- [x] Clean up the codes

Author: zsxwing <zsxwing@gmail.com>

Closes #5533 from zsxwing/SPARK-6939 and squashes the following commits:

9f7cd19 [zsxwing] Merge branch 'master' into SPARK-6939
deacc3f [zsxwing] Remove unused import
cd03424 [zsxwing] Fix .rat-excludes
70cc87d [zsxwing] Streaming Scheduling Delay => Scheduling Delay
d457277 [zsxwing] Fix UIUtils in BatchPage
b3f303e [zsxwing] Add comments for unclear classes and methods
ff0bff8 [zsxwing] Make InputDStream.name private[streaming]
cc392c5 [zsxwing] Merge branch 'master' into SPARK-6939
e275e23 [zsxwing] Move time related methods to Streaming's UIUtils
d5d86f6 [zsxwing] Fix incorrect lastErrorTime
3be4b7a [zsxwing] Use InputInfo
b50fa32 [zsxwing] Jump to the batch page when clicking a point in the timeline graphs
203605d [zsxwing] Merge branch 'master' into SPARK-6939
74307cf [zsxwing] Reuse the data for histogram graphs to reduce the page size
2586916 [zsxwing] Merge branch 'master' into SPARK-6939
70d8533 [zsxwing] Remove BatchInfo.numRecords and a few renames
7bbdc0a [zsxwing] Hide the receiver sub table if no receiver
a2972e9 [zsxwing] Add some ui tests for StreamingPage
fd03ad0 [zsxwing] Add a test to verify no memory leak
4a8f886 [zsxwing] Merge branch 'master' into SPARK-6939
18607a1 [zsxwing] Merge branch 'master' into SPARK-6939
d0b0aec [zsxwing] Clean up the codes
a459f49 [zsxwing] Add a dash line to processing time graphs
8e4363c [zsxwing] Prepare for the demo
c81a1ee [zsxwing] Change time unit in the graphs automatically
4c0b43f [zsxwing] Update Streaming UI
04c7500 [zsxwing] Make the server and client use the same timezone
fed8219 [zsxwing] Move the x axis at the top and show a better tooltip
c23ce10 [zsxwing] Make two graphs close
d78672a [zsxwing] Make the X axis use the same range
881c907 [zsxwing] Use histogram for distribution
5688702 [zsxwing] Fix the unit test
ddf741a [zsxwing] Fix the unit test
ad93295 [zsxwing] Remove unnecessary codes
a0458f9 [zsxwing] Clean the codes
b82ed1e [zsxwing] Update the graphs as per comments
dd653a1 [zsxwing] Add timeline and histogram graphs for streaming statistics
2015-05-05 12:52:16 -07:00
jerryshao 9f1f9b1037 [SPARK-7007] [CORE] Add a metric source for ExecutorAllocationManager
Add a metric source to expose the internal status of ExecutorAllocationManager, to better monitor the resource usage of executors when dynamic allocation is enabled. Please help to review, thanks a lot.
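
A sketch of what such a source might look like, using the com.codahale.metrics API that Spark's metrics system is built on (the source and gauge names here are assumed):

```
import com.codahale.metrics.{Gauge, MetricRegistry}

class AllocationManagerSource(numExecutorsToAdd: () => Int) {
  val sourceName = "ExecutorAllocationManager"
  val metricRegistry = new MetricRegistry()

  // Gauges pull the allocation manager's state on demand.
  metricRegistry.register(
    MetricRegistry.name("executors", "numberToAdd"),
    new Gauge[Int] { override def getValue: Int = numExecutorsToAdd() })
}
```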

Author: jerryshao <saisai.shao@intel.com>

Closes #5589 from jerryshao/dynamic-allocation-source and squashes the following commits:

104d155 [jerryshao] rebase and address the comments
c501a2c [jerryshao] Address the comments
d237ba5 [jerryshao] Address the comments
2c3540f [jerryshao] Add a metric source for ExecutorAllocationManager
2015-05-05 09:43:49 -07:00
Andrew Or 57e9f29e17 [SPARK-7318] [STREAMING] DStream cleans objects that are not closures
I added a check in `ClosureCleaner#clean` to fail fast if this is detected in the future. tdas
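
The check, sketched; this mirrors the heuristic Spark uses to recognize closure classes, though the exact placement in `clean` is not shown here:

```
// Anonymous function classes generated by scalac contain "$anonfun$";
// fail fast (or skip cleaning) for anything else.
def isClosure(cls: Class[_]): Boolean =
  cls.getName.contains("$anonfun$")
```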

Author: Andrew Or <andrew@databricks.com>

Closes #5860 from andrewor14/streaming-closure-cleaner and squashes the following commits:

8e971d7 [Andrew Or] Do not throw exception if object to clean is not closure
5ee4e25 [Andrew Or] Fix tests
eed3390 [Andrew Or] Merge branch 'master' of github.com:apache/spark into streaming-closure-cleaner
67eeff4 [Andrew Or] Add tests
a4fa768 [Andrew Or] Clean the closure, not the RDD
2015-05-05 09:37:49 -07:00
Andrew Or 1fdabf8dcd [SPARK-7237] Many user provided closures are not actually cleaned
Note: ~140 lines are tests.

In a nutshell, we never cleaned closures the user provided through the following operations:
- sortBy
- keyBy
- mapPartitions
- mapPartitionsWithIndex
- aggregateByKey
- foldByKey
- foreachAsync
- one of the aliases for runJob
- runApproximateJob

For more details on a reproduction and why they were not cleaned, please see [SPARK-7237](https://issues.apache.org/jira/browse/SPARK-7237).
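
The fix follows one pattern throughout, sketched here as it would appear inside `RDD[T]` (not the verbatim diff):

```
def keyBy[K](f: T => K): RDD[(K, T)] = {
  val cleanedF = sc.clean(f)      // was missing: checks serializability and
                                  // nulls out unused enclosing references
  map(x => (cleanedF(x), x))
}
```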

Author: Andrew Or <andrew@databricks.com>

Closes #5787 from andrewor14/clean-more and squashes the following commits:

2f1f476 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-more
7265865 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-more
df3caa3 [Andrew Or] Address comments
7a3cc80 [Andrew Or] Merge branch 'master' of github.com:apache/spark into clean-more
6498f44 [Andrew Or] Add missing test for groupBy
e83699e [Andrew Or] Clean one more
8ac3074 [Andrew Or] Prevent NPE in tests when CC is used outside of an app
9ac5f9b [Andrew Or] Clean closures that are not currently cleaned
19e33b4 [Andrew Or] Add tests for all public RDD APIs that take in closures
2015-05-05 09:37:04 -07:00
zsxwing 5ffc73e68b [SPARK-5074] [CORE] [TESTS] Fix the flakey test 'run shuffle with map stage failure' in DAGSchedulerSuite
Test failure: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/2240/testReport/junit/org.apache.spark.scheduler/DAGSchedulerSuite/run_shuffle_with_map_stage_failure/

This is because many tests share the same `JobListener`, and `scheduler` isn't stopped after each test, so it's actually still running. While the test `run shuffle with map stage failure` runs, a previous test may trigger the [ResubmitFailedStages](ebc25a4ddf/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (L1120)) logic, report `jobFailed`, and overwrite the global `failure` variable.

This PR uses `after` to call `scheduler.stop()` for each test.
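
Sketched with ScalaTest's `BeforeAndAfter` (the scheduler type is a stand-in for `DAGScheduler`):

```
import org.scalatest.{BeforeAndAfter, FunSuite}

trait StoppableScheduler { def stop(): Unit }

class ExampleSuite extends FunSuite with BeforeAndAfter {
  private var scheduler: StoppableScheduler = _
  after {
    if (scheduler != null) scheduler.stop()  // no scheduler survives a test
  }
}
```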

Author: zsxwing <zsxwing@gmail.com>

Closes #5903 from zsxwing/SPARK-5074 and squashes the following commits:

1e6f13e [zsxwing] Fix the flakey test 'run shuffle with map stage failure' in DAGSchedulerSuite
2015-05-05 15:04:14 +01:00
Imran Rashid d49735800d [SPARK-3454] separate json endpoints for data in the UI
Exposes data available in the UI as json over http.  Key points:

* new endpoints, handled independently of existing XyzPage classes.  Root entrypoint is `JsonRootResource`
* Uses jersey + jackson for routing & converting POJOs into json
* tests against known results in `HistoryServerSuite`
* also fixes some minor issues w/ the UI -- synchronizing on access to `StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/ the way we handle retained jobs & stages.

Author: Imran Rashid <irashid@cloudera.com>

Closes #4435 from squito/SPARK-3454 and squashes the following commits:

da1e35f [Imran Rashid] typos etc.
5e78b4f [Imran Rashid] fix rendering problems
5ae02ad [Imran Rashid] Merge branch 'master' into SPARK-3454
f016182 [Imran Rashid] change all constructors json-pojo class constructors to be private[spark] to protect us from mima-false-positives if we add fields
3347b72 [Imran Rashid] mark EnumUtil as @Private
ec140a2 [Imran Rashid] create @Private
cc1febf [Imran Rashid] add docs on the metrics-as-json api
cbaf287 [Imran Rashid] Merge branch 'master' into SPARK-3454
56db31e [Imran Rashid] update tests for mulit-attempt
7f3bc4e [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt"
67008b4 [Imran Rashid] rats
9e51400 [Imran Rashid] style
c9bae1c [Imran Rashid] handle multiple attempts per app
b87cd63 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
188762c [Imran Rashid] multi-attempt
2af11e5 [Imran Rashid] Merge branch 'master' into SPARK-3454
befff0c [Imran Rashid] review feedback
14ac3ed [Imran Rashid] jersey-core needs to be explicit; move version & scope to parent pom.xml
f90680e [Imran Rashid] Merge branch 'master' into SPARK-3454
dc8a7fe [Imran Rashid] style, fix errant comments
acb7ef6 [Imran Rashid] fix indentation
7bf1811 [Imran Rashid] move MetricHelper so mima doesnt think its exposed; comments
9d889d6 [Imran Rashid] undo some unnecessary changes
f48a7b0 [Imran Rashid] docs
52bbae8 [Imran Rashid] StorageListener & StorageStatusListener needs to synchronize internally to be thread-safe
31c79ce [Imran Rashid] asm no longer needed for SPARK_PREPEND_CLASSES
b2f8b91 [Imran Rashid] @DeveloperApi
2e19be2 [Imran Rashid] lazily convert ApplicationInfo to avoid memory overhead
ba3d9d2 [Imran Rashid] upper case enums
39ac29c [Imran Rashid] move EnumUtil
d2bde77 [Imran Rashid] update error handling & scoping
4a234d3 [Imran Rashid] avoid jersey-media-json-jackson b/c of potential version conflicts
a157a2f [Imran Rashid] style
7bd4d15 [Imran Rashid] delete security test, since it doesnt do anything
a325563 [Imran Rashid] style
a9c5cf1 [Imran Rashid] undo changes superceeded by master
0c6f968 [Imran Rashid] update deps
1ed0d07 [Imran Rashid] Merge branch 'master' into SPARK-3454
4c92af6 [Imran Rashid] style
f2e63ad [Imran Rashid] Merge branch 'master' into SPARK-3454
c22b11f [Imran Rashid] fix compile error
9ea682c [Imran Rashid] go back to good ol' java enums
cf86175 [Imran Rashid] style
d493b38 [Imran Rashid] Merge branch 'master' into SPARK-3454
f05ae89 [Imran Rashid] add in ExecutorSummaryInfo for MiMa :(
101a698 [Imran Rashid] style
d2ef58d [Imran Rashid] revert changes that had HistoryServer refresh the application listing more often
b136e39b [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier to start & stop http servers in sbt"
e031719 [Imran Rashid] fixes from review
1f53a66 [Imran Rashid] style
b4a7863 [Imran Rashid] fix compile error
2c8b7ee [Imran Rashid] rats
1578a4a [Imran Rashid] doc
674f8dc [Imran Rashid] more explicit about total numbers of jobs & stages vs. number retained
9922be0 [Imran Rashid] Merge branch 'master' into stage_distributions
f5a5196 [Imran Rashid] undo removal of renderJson from MasterPage, since there is no substitute yet
db61211 [Imran Rashid] get JobProgressListener directly from UI
fdfc181 [Imran Rashid] stage/taskList
63eb4a6 [Imran Rashid] tests for taskSummary
ad27de8 [Imran Rashid] error handling on quantile values
b2efcaf [Imran Rashid] cleanup, combine stage-related paths into one resource
aaba896 [Imran Rashid] wire up task summary
a4b1397 [Imran Rashid] stage metric distributions
e48ba32 [Imran Rashid] rename
eaf3bbb [Imran Rashid] style
25cd894 [Imran Rashid] if only given day, assume GMT
51eaedb [Imran Rashid] more visibility fixes
9f28b7e [Imran Rashid] ack, more cleanup
99764e1 [Imran Rashid] Merge branch 'SPARK-3454_w_jersey' into SPARK-3454
a61a43c [Imran Rashid] oops, remove accidental checkin
a066055 [Imran Rashid] set visibility on a lot of classes
1f361c8 [Imran Rashid] update rat-excludes
0be5120 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
2382bef [Imran Rashid] switch to using new "enum"
fef6605 [Imran Rashid] some utils for working w/ new "enum" format
dbfc7bf [Imran Rashid] style
b86bcb0 [Imran Rashid] update test to look at one stage attempt
5f9df24 [Imran Rashid] style
7fd156a [Imran Rashid] refactor jsonDiff to avoid code duplication
73f1378 [Imran Rashid] test json; also add test cases for cleaned stages & jobs
97d411f [Imran Rashid] json endpoint for one job
0c96147 [Imran Rashid] better error msgs for bad stageId vs bad attemptId
dddbd29 [Imran Rashid] stages have attempt; jobs are sorted; resource for all attempts for one stage
190c17a [Imran Rashid] StagePage should distinguish no task data, from unknown stage
84cd497 [Imran Rashid] AllJobsPage should still report correct completed & failed job count, even if some have been cleaned, to make it consistent w/ AllStagesPage
36e4062 [Imran Rashid] SparkUI needs to know about startTime, so it can list its own applicationInfo
b4c75ed [Imran Rashid] fix merge conflicts; need to widen visibility in a few cases
e91750a [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
56d2fc7 [Imran Rashid] jersey needs asm for SPARK_PREPEND_CLASSES to work
f7df095 [Imran Rashid] add test for accumulables, and discover that I need update after all
9c0c125 [Imran Rashid] add accumulableInfo
00e9cc5 [Imran Rashid] more style
3377e61 [Imran Rashid] scaladoc
d05f7a9 [Imran Rashid] dont use case classes for status api POJOs, since they have binary compatibility issues
654cecf [Imran Rashid] move all the status api POJOs to one file
b86e2b0 [Imran Rashid] style
18a8c45 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
5598f19 [Imran Rashid] delete some unnecessary code, more to go
56edce0 [Imran Rashid] style
017c755 [Imran Rashid] add in metrics now available
1b78cb7 [Imran Rashid] fix some import ordering
0dc3ea7 [Imran Rashid] if app isnt found, reload apps from FS before giving up
c7d884f [Imran Rashid] fix merge conflicts
0c12b50 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
b6a96a8 [Imran Rashid] compare json by AST, not string
cd37845 [Imran Rashid] switch to using java.util.Dates for times
a4ab5aa [Imran Rashid] add in explicit dependency on jersey 1.9 -- maven wasn't happy before this
4fdc39f [Imran Rashid] refactor case insensitive enum parsing
cba1ef6 [Imran Rashid] add security (maybe?) for metrics json
f0264a7 [Imran Rashid] switch to using jersey for metrics json
bceb3a9 [Imran Rashid] set http response code on error, some testing
e0356b6 [Imran Rashid] put new test expectation files in rat excludes (is this OK?)
b252e7a [Imran Rashid] small cleanup of accidental changes
d1a8c92 [Imran Rashid] add sbt-revolved plugin, to make it easier to start & stop http servers in sbt
4b398d0 [Imran Rashid] expose UI data as json in new endpoints
2015-05-05 07:25:40 -05:00
Sandy Ryza 4222da68dc [SPARK-5112] Expose SizeEstimator as a developer api
"The best way to size the amount of memory consumption your dataset will require is to create an RDD, put it into cache, and look at the SparkContext logs on your driver program. The logs will tell you how much memory each partition is consuming, which you can aggregate to get the total size of the RDD."
-the Tuning Spark page

This is a pain. It would be much nicer to simply expose functionality for understanding the memory footprint of a Java object.
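
A usage sketch, assuming the exposed entry point is `org.apache.spark.util.SizeEstimator` as the squashed commits suggest:

```
import org.apache.spark.util.SizeEstimator

val footprint: Long = SizeEstimator.estimate(Array.fill(1000)("spark"))
println(s"estimated size: $footprint bytes")
```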

Author: Sandy Ryza <sandy@cloudera.com>

Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits:

8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark
2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util"
93f4cd0 [Sandy Ryza] Move SizeEstimator out of util
e21c1f4 [Sandy Ryza] Remove unused import
798ab88 [Sandy Ryza] Update documentation and add to SparkContext
34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api
2015-05-05 12:38:46 +01:00
Tathagata Das 8776fe0b93 [HOTFIX] [TEST] Ignoring flaky tests
org.apache.spark.DriverSuite.driver should exit after finishing without cleanup (SPARK-530)
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2267/

org.apache.spark.deploy.SparkSubmitSuite.includes jars passed in through --jars
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/

org.apache.spark.streaming.flume.FlumePollingStreamSuite.flume polling test
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2269/

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #5901 from tdas/ignore-flaky-tests and squashes the following commits:

9cd8667 [Tathagata Das] Ignoring tests.
2015-05-05 01:58:51 -07:00
Tathagata Das 1854ac326a [SPARK-7139] [STREAMING] Allow received block metadata to be saved to WAL and recovered on driver failure
- Enabled ReceivedBlockTracker WAL by default
- Stored block metadata in the WAL
- Optimized WALBackedBlockRDD by skipping block fetch when the block is known to not exist in Spark

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #5732 from tdas/SPARK-7139 and squashes the following commits:

575476e [Tathagata Das] Added more tests to get 100% coverage of the WALBackedBlockRDD
19668ba [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7139
685fab3 [Tathagata Das] Addressed comments in PR
637bc9c [Tathagata Das] Changed segment to handle
466212c [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-7139
5f67a59 [Tathagata Das] Fixed HdfsUtils to handle append in local file system
1bc5bc3 [Tathagata Das] Fixed bug on unexpected recovery
d06fa21 [Tathagata Das] Enabled ReceivedBlockTracker by default, stored block metadata and optimized block fetching in WALBackedBlockRDD
2015-05-05 01:45:19 -07:00
Marcelo Vanzin c5790a2f77 [MINOR] [BUILD] Declare ivy dependency in root pom.
Without this, any dependency that pulls ivy transitively may override
the version and potentially cause issues. On my machine, the hive tests
were pulling an old version of ivy, and subsequently failing with a
"NoSuchMethodError".

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5893 from vanzin/ivy-dep-fix and squashes the following commits:

ea2112d [Marcelo Vanzin] [minor] [build] Declare ivy dependency in root pom.
2015-05-05 08:56:16 +01:00
Xiangrui Meng e9b16e67c6 [SPARK-7314] [SPARK-3524] [PYSPARK] upgrade Pyrolite to 4.4
This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524 and some other performance improvements (e.g., SPARK-6288). The artifact is still under `org.spark-project` on Maven Central since there is no official release published there.

Author: Xiangrui Meng <meng@databricks.com>

Closes #5850 from mengxr/SPARK-7314 and squashes the following commits:

2ed4a95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7314
da3c2dd [Xiangrui Meng] remove my repo
fe7e29b [Xiangrui Meng] switch to maven central
6ddac0e [Xiangrui Meng] reverse the machine code for float/double
d2d5b5b [Xiangrui Meng] change back to 4.4
7824a9c [Xiangrui Meng] use Pyrolite 3.1
cc3903a [Xiangrui Meng] upgrade Pyrolite to 4.4-0 for testing
2015-05-04 23:52:42 -07:00
Bryan Cutler 8aa5aea7fe [SPARK-7236] [CORE] Fix to prevent AkkaUtils askWithReply from sleeping on final attempt
Added a check so that if `AkkaUtils.askWithReply` is on the final attempt, it will not sleep for the `retryInterval`.  This should also prevent the thread from sleeping for `Int.Max` when using `askWithReply` with default values for `maxAttempts` and `retryInterval`.
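
The shape of the fix, sketched generically (not the AkkaUtils code itself):

```
def askWithRetry[T](maxAttempts: Int, retryIntervalMs: Long)(attempt: () => T): T = {
  var lastError: Throwable = null
  for (i <- 1 to maxAttempts) {
    try return attempt()
    catch {
      case e: Throwable =>
        lastError = e
        if (i < maxAttempts) Thread.sleep(retryIntervalMs)  // never sleep after the last try
    }
  }
  throw lastError
}
```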

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #5896 from BryanCutler/askWithReply-sleep-7236 and squashes the following commits:

653a07b [Bryan Cutler] [SPARK-7236] Fix to prevent AkkaUtils askWithReply from sleeping on final attempt
2015-05-04 18:29:22 -07:00
Andrew Or fc8b58195a [SPARK-6943] [SPARK-6944] DAG visualization on SparkUI
This patch adds the functionality to display the RDD DAG on the SparkUI.

This DAG describes the relationships between
- an RDD and its dependencies,
- an RDD and its operation scopes, and
- an RDD's operation scopes and the stage / job hierarchy

An operation scope here refers to the existing public APIs that created the RDDs (e.g. `textFile`, `treeAggregate`). In the future, we can expand this to include higher level operations like SQL queries.
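
One plausible sketch of how a scope can be threaded through to the RDDs created inside it; the local-property key is an assumption for illustration:

```
import org.apache.spark.SparkContext

def withScope[T](sc: SparkContext, name: String)(body: => T): T = {
  val key = "spark.rdd.scope"           // assumed key, for illustration
  sc.setLocalProperty(key, name)        // RDDs created in body pick this up
  try body finally sc.setLocalProperty(key, null)
}
```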

*Note: This blatantly stole a few lines of HTML and JavaScript from #5547 (thanks shroffpradyumn!)*

Here's what the job page looks like:
<img src="https://issues.apache.org/jira/secure/attachment/12730286/job-page.png" width="700px"/>
and the stage page:
<img src="https://issues.apache.org/jira/secure/attachment/12730287/stage-page.png" width="300px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #5729 from andrewor14/viz2 and squashes the following commits:

666c03b [Andrew Or] Round corners of RDD boxes on stage page (minor)
01ba336 [Andrew Or] Change RDD cache color to red (minor)
6f9574a [Andrew Or] Add tests for RDDOperationScope
1c310e4 [Andrew Or] Wrap a few more RDD functions in an operation scope
3ffe566 [Andrew Or] Restore "null" as default for RDD name
5fdd89d [Andrew Or] children -> child (minor)
0d07a84 [Andrew Or] Fix python style
afb98e2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
0d7aa32 [Andrew Or] Fix python tests
3459ab2 [Andrew Or] Fix tests
832443c [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
429e9e1 [Andrew Or] Display cached RDDs on the viz
b1f0fd1 [Andrew Or] Rename OperatorScope -> RDDOperationScope
31aae06 [Andrew Or] Extract visualization logic from listener
83f9c58 [Andrew Or] Implement a programmatic representation of operator scopes
5a7faf4 [Andrew Or] Rename references to viz scopes to viz clusters
ee33d52 [Andrew Or] Separate HTML generating code from listener
f9830a2 [Andrew Or] Refactor + clean up + document JS visualization code
b80cc52 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
0706992 [Andrew Or] Add link from jobs to stages
deb48a0 [Andrew Or] Translate stage boxes taking into account the width
5c7ce16 [Andrew Or] Connect RDDs across stages + update style
ab91416 [Andrew Or] Introduce visualization to the Job Page
5f07e9c [Andrew Or] Remove more return statements from scopes
5e388ea [Andrew Or] Fix line too long
43de96e [Andrew Or] Add parent IDs to StageInfo
6e2cfea [Andrew Or] Remove all return statements in `withScope`
d19c4da [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
7ef957c [Andrew Or] Fix scala style
4310271 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
aa868a9 [Andrew Or] Ensure that HadoopRDD is actually serializable
c3bfcae [Andrew Or] Re-implement scopes using closures instead of annotations
52187fc [Andrew Or] Rat excludes
09d361e [Andrew Or] Add ID to node label (minor)
71281fa [Andrew Or] Embed the viz in the UI in a toggleable manner
8dd5af2 [Andrew Or] Fill in documentation + miscellaneous minor changes
fe7816f [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz
205f838 [Andrew Or] Reimplement rendering with dagre-d3 instead of viz.js
5e22946 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz
6a7cdca [Andrew Or] Move RDD scope util methods and logic to its own file
494d5c2 [Andrew Or] Revert a few unintended style changes
9fac6f3 [Andrew Or] Re-implement scopes through annotations instead
f22f337 [Andrew Or] First working implementation of visualization with vis.js
2184348 [Andrew Or] Translate RDD information to dot file
5143523 [Andrew Or] Expose the necessary information in RDDInfo
a9ed4f9 [Andrew Or] Add a few missing scopes to certain RDD methods
6b3403b [Andrew Or] Scope all RDD methods
2015-05-04 16:21:36 -07:00
Michael Armbrust daa70bf135 [SPARK-6907] [SQL] Isolated client for HiveMetastore
This PR adds initial support for loading multiple versions of Hive in a single JVM and provides a common interface for extracting metadata from the `HiveMetastoreClient` for a given version.  This is accomplished by creating an isolated `ClassLoader` that operates according to the following rules:

 - __Shared Classes__: Java, Scala, logging, and Spark classes are delegated to `baseClassLoader`
  allowing the results of calls to the `ClientInterface` to be visible externally.
 - __Hive Classes__: new instances are loaded from `execJars`.  These classes are not
  accessible externally due to their custom loading.
 - __Barrier Classes__: Classes such as `ClientWrapper` are defined in Spark but must link to a specific version of Hive.  As a result, the bytecode is acquired from the Spark `ClassLoader` but a new copy is created for each instance of `IsolatedClientLoader`.
  This new instance is able to see a specific version of Hive, without using reflection, wherever Hive is consistent across versions. Since
  this is a unique instance, it is not visible externally other than as a generic
  `ClientInterface`, unless `isolationOn` is set to `false`.
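
A condensed sketch of the shared-vs-Hive delegation rule (the class-name prefixes and structure are assumed; barrier classes are omitted):

```
import java.net.{URL, URLClassLoader}

class IsolatedLoader(hiveJars: Array[URL], baseClassLoader: ClassLoader)
    extends URLClassLoader(hiveJars, null) {  // null parent: inherit nothing

  private def isShared(name: String): Boolean =
    name.startsWith("java.") || name.startsWith("scala.") ||
    name.startsWith("org.apache.spark.") || name.startsWith("org.slf4j.")

  override protected def loadClass(name: String, resolve: Boolean): Class[_] =
    if (isShared(name)) baseClassLoader.loadClass(name)   // shared classes
    else super.loadClass(name, resolve)                   // Hive classes from execJars
}
```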

In addition to the unit tests, I have also tested this locally against mysql instances of the Hive Metastore.  I've also successfully ported Spark SQL to run with this client, but due to the size of the changes, that will come in a follow-up PR.

By default, Hive jars are currently downloaded from Maven automatically for a given version to ease packaging and testing.  However, there is also support for specifying their location manually for deployments without internet.

Author: Michael Armbrust <michael@databricks.com>

Closes #5851 from marmbrus/isolatedClient and squashes the following commits:

c72f6ac [Michael Armbrust] rxins comments
1e271fa [Michael Armbrust] [SPARK-6907][SQL] Isolated client for HiveMetastore
2015-05-03 13:12:50 -07:00
Ye Xianjin bfcd528d6f [SPARK-6030] [CORE] Using simulated field layout method to compute class shellSize
SizeEstimator gives the wrong result for Integer on a 64-bit JVM with UseCompressedOops on; this PR fixes that. For more details, please refer to [SPARK-6030](https://issues.apache.org/jira/browse/SPARK-6030).
sryza, I noticed there is a PR to expose SizeEstimator; maybe that should wait until this PR gets merged, if we confirm this problem.
And shivaram, would you mind reviewing this PR, since you contributed the related code? Also cc srowen and mateiz.

Author: Ye Xianjin <advancedxy@gmail.com>

Closes #4783 from advancedxy/SPARK-6030 and squashes the following commits:

c4dcb41 [Ye Xianjin] Add super.beforeEach in the beforeEach method to make the trait stackable.. Remove useless leading whitespace.
3f80640 [Ye Xianjin] The size of Integer class changes from 24 to 16 on a 64-bit JVM with -UseCompressedOops flag on after the fix. I don't how 100000 was originally calculated, It looks like 100000 is the magic number which makes sure spilling. Because of the size change, It fails because there is no spilling at all. Change the number to a slightly larger number fixes that.
e849d2d [Ye Xianjin] Merge two shellSize assignments into one. Add some explanation to alignSizeUp method.
85a0b51 [Ye Xianjin] Fix typos and update wording in comments. Using alignSizeUp to compute alignSize.
d27eb77 [Ye Xianjin] Add some detailed comments in the code. Add some test cases. It's very difficult to design test cases as the final object alignment will hide a lot of filed layout details if we just considering the whole size.
842aed1 [Ye Xianjin] primitiveSize(cls) can just return Int. Use a simplified class field layout method to calculate class instance size. Will add more documents and test cases. Add a new alignSizeUp function which uses bitwise operators to speedup.
62e8ab4 [Ye Xianjin] Don't alignSize for objects' shellSize, alignSize when added to state.size. Add some primitive wrapper objects size tests.
2015-05-02 23:08:09 +01:00
Mridul Muralidharan da303526e5 [SPARK-7323] [SPARK CORE] Use insertAll instead of insert while merging combiners in reducer
Author: Mridul Muralidharan <mridulm@yahoo-inc.com>

Closes #5862 from mridulm/optimize_aggregator and squashes the following commits:

61cf43a [Mridul Muralidharan] Use insertAll instead of insert - much more expensive to do it per tuple
2015-05-02 23:05:51 +01:00
Andrew Or 7394e7adeb [SPARK-7120] [SPARK-7121] Closure cleaner nesting + documentation + tests
Note: ~600 lines of this is test code, and ~100 lines documentation.

**[SPARK-7121]** ClosureCleaner does not handle nested closures properly. For instance, in SparkContext, I tried to do the following:
```
def scope[T](body: => T): T = body // no-op
def myCoolMethod(path: String): RDD[String] = scope {
  parallelize(1 to 10).map { _ => path }
}
```
and I got an exception complaining that SparkContext is not serializable. The issue here is that the inner closure is getting its path from the outer closure (the scope), but the outer closure references the SparkContext object itself to get the `parallelize` method.

Note, however, that the inner closure doesn't actually need the SparkContext; it just needs a field from the outer closure. If we modify ClosureCleaner to clean the outer closure recursively using only the fields accessed by the inner closure, then we can serialize the inner closure.

**[SPARK-7120]** Also, this file is one of the least understood, partly because it is very low level and was written a long time ago. This patch attempts to change that by adding the missing documentation.

This is blocking my effort on a separate task #5729.

Author: Andrew Or <andrew@databricks.com>

Closes #5685 from andrewor14/closure-cleaner and squashes the following commits:

cd46230 [Andrew Or] Revert a small change that affected streaming
0bbe77f [Andrew Or] Fix style
ea874bc [Andrew Or] Fix tests
26c5072 [Andrew Or] Address comments
16fbcfd [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner
26c7aba [Andrew Or] Revert "In sc.runJob, actually clean the inner closure"
6f75784 [Andrew Or] Revert "Guard against NPE if CC is used outside of an application"
e909a42 [Andrew Or] Guard against NPE if CC is used outside of an application
3998168 [Andrew Or] In sc.runJob, actually clean the inner closure
9187066 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner
d889950 [Andrew Or] Revert "Bypass SerializationDebugger for now (SPARK-7180)"
9419efe [Andrew Or] Bypass SerializationDebugger for now (SPARK-7180)
6d4d3f1 [Andrew Or] Fix scala style?
4aab379 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner
e45e904 [Andrew Or] More minor updates (wording, renaming etc.)
8b71cdb [Andrew Or] Update a few comments
eb127e5 [Andrew Or] Use private method tester for a few things
a3aa465 [Andrew Or] Add more tests for individual closure cleaner operations
e672170 [Andrew Or] Guard against potential infinite cycles in method visitor
6d36f38 [Andrew Or] Fix closure cleaner visibility
2106f12 [Andrew Or] Merge branch 'master' of github.com:apache/spark into closure-cleaner
263593d [Andrew Or] Finalize tests
06fd668 [Andrew Or] Make closure cleaning idempotent
a4866e3 [Andrew Or] Add tests (still WIP)
438c68f [Andrew Or] Minor changes
2390a60 [Andrew Or] Feature flag this new behavior
86f7823 [Andrew Or] Implement transitive cleaning + add missing documentation
2015-05-01 23:57:58 -07:00
Mridul Muralidharan b79aeb95b4 [SPARK-7317] [Shuffle] Expose shuffle handle
Details are in the JIRA; in a nutshell, all the machinery for custom RDDs to leverage Spark shuffle directly (without exposing shuffle implementation details) already exists, except for this small piece.

Exposing this will allow custom dependencies to get a handle to ShuffleHandle, which they can then leverage on the reduce side.
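
A hedged sketch of what this enables on the reduce side (the `getReader` call matches the ShuffleManager API of this era, but treat the whole snippet as illustrative rather than a verified recipe):

```scala
import org.apache.spark.{Partition, ShuffleDependency, SparkEnv, TaskContext}

// Hypothetical compute() body for a custom RDD that consumes shuffle
// output directly through the newly exposed handle.
def computeSketch[K, C](dep: ShuffleDependency[K, _, C],
                        split: Partition,
                        context: TaskContext): Iterator[Product2[K, C]] = {
  SparkEnv.get.shuffleManager
    .getReader[K, C](dep.shuffleHandle, split.index, split.index + 1, context)
    .read()
}
```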

Author: Mridul Muralidharan <mridulm@yahoo-inc.com>

Closes #5857 from mridulm/expose_shuffle_handle and squashes the following commits:

d8b6bd4 [Mridul Muralidharan] Expose ShuffleHandle
2015-05-01 21:23:42 -07:00
Marcelo Vanzin 38d4e9e446 [SPARK-6229] Add SASL encryption to network library.
There are two main parts of this change:

- Extending the bootstrap mechanism in the network library to add a server-side
  bootstrap (which works a little bit differently than the client-side bootstrap), and
  to allow the bootstraps to modify the underlying channel.

- Use SASL to encrypt data going through the RPC channel.

The second item requires some non-optimal code to be able to work around the
fact that the outbound path in netty is not thread-safe, and ordering is very important
when encryption is in the picture.

A lot of the changes outside the network/common library are just to adjust to the
changed API for initializing the RPC server.
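
A hedged configuration sketch (key names inferred from the squashed commits below, e.g. "Add config option to enable SASL encryption" and "Add option to only allow encrypted connections to the server"; verify against the final docs):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.authenticate", "true")                      // SASL auth is a prerequisite
  .set("spark.authenticate.enableSaslEncryption", "true") // encrypt RPC traffic
  .set("spark.network.sasl.serverAlwaysEncrypt", "true")  // reject unencrypted clients
```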

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5377 from vanzin/SPARK-6229 and squashes the following commits:

ff01966 [Marcelo Vanzin] Use fancy new size config style.
be53f32 [Marcelo Vanzin] Merge branch 'master' into SPARK-6229
47d4aff [Marcelo Vanzin] Merge branch 'master' into SPARK-6229
7a2a805 [Marcelo Vanzin] Clean up some unneeded changes.
2f92237 [Marcelo Vanzin] Add comment.
67bb0c6 [Marcelo Vanzin] Revert "Avoid exposing ByteArrayWritableChannel outside of test code."
065f684 [Marcelo Vanzin] Add test to verify chunking.
3d1695d [Marcelo Vanzin] Minor cleanups.
73cff0e [Marcelo Vanzin] Skip bytes in decode path too.
318ad23 [Marcelo Vanzin] Avoid exposing ByteArrayWritableChannel outside of test code.
346f829 [Marcelo Vanzin] Avoid trip through channel selector by not reporting 0 bytes written.
a4a5938 [Marcelo Vanzin] Review feedback.
4797519 [Marcelo Vanzin] Remove unused import.
9908ada [Marcelo Vanzin] Fix test, SASL backend disposal.
7fe1489 [Marcelo Vanzin] Add a test that makes sure encryption is actually enabled.
adb6f9d [Marcelo Vanzin] Review feedback.
cf2a605 [Marcelo Vanzin] Clean up some code.
8584323 [Marcelo Vanzin] Fix a comment.
e98bc55 [Marcelo Vanzin] Add option to only allow encrypted connections to the server.
dad42fc [Marcelo Vanzin] Make encryption thread-safe, less memory-intensive.
b00999a [Marcelo Vanzin] Consolidate ByteArrayWritableChannel, fix SASL code to match master changes.
b923cae [Marcelo Vanzin] Make SASL encryption handler thread-safe, handle FileRegion messages.
39539a7 [Marcelo Vanzin] Add config option to enable SASL encryption.
351a86f [Marcelo Vanzin] Add SASL encryption to network library.
fbe6ccb [Marcelo Vanzin] Add TransportServerBootstrap, make SASL code use it.
2015-05-01 19:01:46 -07:00
Chris Heller 8f50a07d21 [SPARK-2691] [MESOS] Support for Mesos DockerInfo
This patch adds partial support for running spark on mesos inside of a docker container. Only fine-grained mode is presently supported, and there is no checking done to ensure that the version of libmesos is recent enough to have a DockerInfo structure in the protobuf (other than pinning a mesos version in the pom.xml).
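
A hedged usage sketch (property names follow the "Move properties under 'spark.mesos.'" commit below; the image name and values are placeholders):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "mesosphere/mesos:some-tag") // container image
  .set("spark.mesos.executor.docker.volumes", "/host/data:/data:ro")     // host:container:mode
  .set("spark.mesos.executor.docker.portmaps", "9090:8080:tcp")          // host:container:proto
```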

Author: Chris Heller <hellertime@gmail.com>

Closes #3074 from hellertime/SPARK-2691 and squashes the following commits:

d504af6 [Chris Heller] Assist type inference
f64885d [Chris Heller] Fix errant line length
17c41c0 [Chris Heller] Base Dockerfile on mesosphere/mesos image
8aebda4 [Chris Heller] Simplify Docker image docs
1ae7f4f [Chris Heller] Style points
974bd56 [Chris Heller] Convert map to flatMap
5d8bdf7 [Chris Heller] Factor out the DockerInfo construction.
7b75a3d [Chris Heller] Align to styleguide
80108e7 [Chris Heller] Bend to the will of RAT
ba77056 [Chris Heller] Explicit RAT exclude
abda5e5 [Chris Heller] Wildcard .rat-excludes
2f2873c [Chris Heller] Exclude spark-mesos from RAT
a589a5b [Chris Heller] Add example Dockerfile
b6825ce [Chris Heller] Remove use of EasyMock
eae1b86 [Chris Heller] Move properties under 'spark.mesos.'
c184d00 [Chris Heller] Use map on Option to be consistent with non-coarse code
fb9501a [Chris Heller] Bumped mesos version to current release
fa11879 [Chris Heller] Add listenerBus to EasyMock
882151e [Chris Heller] Changes to scala style
b22d42d [Chris Heller] Exclude template from RAT
db536cf [Chris Heller] Remove unneeded mocks
dea1bd5 [Chris Heller] Force default protocol
7dac042 [Chris Heller] Add test for DockerInfo
5456c0c [Chris Heller] Adjust syntax style
521c194 [Chris Heller] Adjust version info
6e38f70 [Chris Heller] Document Mesos Docker properties
29572ab [Chris Heller] Support all DockerInfo fields
b8c0dea [Chris Heller] Support for mesos DockerInfo in coarse-mode.
482a9fd [Chris Heller] Support for mesos DockerInfo in fine-grained mode.
2015-05-01 18:41:22 -07:00
WangTaoTheTonic b4b43df8a3 [SPARK-6443] [SPARK SUBMIT] Could not submit app in standalone cluster mode when HA is enabled
**3/26 update:**
* Akka-based:
  Use an array of `ActorSelection` to represent multiple masters. Add an `activeMasterActor` for querying the status of the driver. Lost masters (including the standby one) are added to `lostMasters`.
  When the size of `lostMasters` is equal to or greater than the number of masters, we should give an error that no master is available.

* Rest-based:
  When no master is available (an exception is thrown), we fall back to the Akka gateway to submit apps.

I have tested this simply on a standalone HA cluster (with both masters alive, and with one alive/one dead), and it worked.

There might remain some issues with style or message printing, but we can review the solution and then fix them together.

/cc srowen andrewor14

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5116 from WangTaoTheTonic/SPARK-6443 and squashes the following commits:

2a28aab [WangTaoTheTonic] based the newest change https://github.com/apache/spark/pull/5144
76fd411 [WangTaoTheTonic] rebase
f4f972b [WangTaoTheTonic] rebase...again
a41de0b [WangTaoTheTonic] rebase
220cb3c [WangTaoTheTonic] move connect exception inside
35119a0 [WangTaoTheTonic] style and compile issues
9d636be [WangTaoTheTonic] per Andrew's comments
979760c [WangTaoTheTonic] rebase
e4f4ece [WangTaoTheTonic] fix failed test
5d23958 [WangTaoTheTonic] refact some duplicated code, style and comments
7a881b3 [WangTaoTheTonic] when one of masters is gone, we still can submit
2b011c9 [WangTaoTheTonic] fix broken tests
60d97a4 [WangTaoTheTonic] rebase
fa1fa80 [WangTaoTheTonic] submit app to HA cluster in standalone cluster mode
2015-05-01 18:38:20 -07:00
Timothy Chen 2022193412 [SPARK-7216] [MESOS] Add driver details page to Mesos cluster UI.
Add a details page that displays the Mesos driver's details in the Mesos cluster UI

Author: Timothy Chen <tnachen@gmail.com>

Closes #5763 from tnachen/mesos_cluster_page and squashes the following commits:

55f36eb [Timothy Chen] Add driver details page to Mesos cluster UI.
2015-05-01 18:36:42 -07:00
Sandy Ryza 099327d537 [SPARK-6954] [YARN] ExecutorAllocationManager can end up requesting a negative n...
...umber of executors

Author: Sandy Ryza <sandy@cloudera.com>

Closes #5704 from sryza/sandy-spark-6954 and squashes the following commits:

b7890fb [Sandy Ryza] Avoid ramping up to an existing number of executors
6eb516a [Sandy Ryza] SPARK-6954. ExecutorAllocationManager can end up requesting a negative number of executors
2015-05-01 18:33:15 -07:00
Holden Karau ae98eec730 [SPARK-3444] Provide an easy way to change log level
Add support for changing the log level at run time through the SparkContext. Based on an earlier PR (#2433); includes CR feedback from pwendell & davies
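
Usage sketch (method name per the "rename setLoggingLevel to setLogLevel" commit below; `sc` is an existing SparkContext):

```scala
// Change the log level at run time; an invalid level throws an exception.
sc.setLogLevel("WARN")
```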

Author: Holden Karau <holden@pigscanfly.ca>

Closes #5791 from holdenk/SPARK-3444-provide-an-easy-way-to-change-log-level-r2 and squashes the following commits:

3bf3be9 [Holden Karau] fix exception
42ba873 [Holden Karau] fix exception
9117244 [Holden Karau] Only allow valid log levels, throw exception if invalid log level.
338d7bf [Holden Karau] rename setLoggingLevel to setLogLevel
fac14a0 [Holden Karau] Fix style errors
d9d03f3 [Holden Karau] Add support for changing the log level at run time through the SparkContext. Based on an earlier PR, #2433 includes CR feedback from @pwendel & @davies
2015-05-01 18:02:51 -07:00
zsxwing ebc25a4ddf [SPARK-7309] [CORE] [STREAMING] Shutdown the thread pools in ReceivedBlockHandler and DAGScheduler
Shutdown the thread pools in ReceivedBlockHandler and DAGScheduler when stopping them.

Author: zsxwing <zsxwing@gmail.com>

Closes #5845 from zsxwing/SPARK-7309 and squashes the following commits:

6c004fd [zsxwing] Shutdown the thread pools in ReceivedBlockHandler and DAGScheduler
2015-05-01 17:41:55 -07:00
Patrick Wendell 5c1fabafab Ignore flakey test in SparkSubmitUtilsSuite 2015-05-01 14:42:58 -07:00
Hari Shreedharan b1f4ca82d1 [SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS
Take 2. Does the same thing as #4688, but fixes Hadoop-1 build.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #5823 from harishreedharan/kerberos-longrunning and squashes the following commits:

3c86bba [Hari Shreedharan] Import fixes. Import postfixOps explicitly.
4d04301 [Hari Shreedharan] Minor formatting fixes.
b5e7a72 [Hari Shreedharan] Remove reflection, use a method in SparkHadoopUtil to update the token renewer.
7bff6e9 [Hari Shreedharan] Make sure all required classes are present in the jar. Fix import order.
e851f70 [Hari Shreedharan] Move the ExecutorDelegationTokenRenewer to yarn module. Use reflection to use it.
36eb8a9 [Hari Shreedharan] Change the renewal interval config param. Fix a bunch of comments.
611923a [Hari Shreedharan] Make sure the namenodes are listed correctly for creating tokens.
09fe224 [Hari Shreedharan] Use token.renew to get token's renewal interval rather than using hdfs-site.xml
6963bbc [Hari Shreedharan] Schedule renewal in AM before starting user class. Else, a restarted AM cannot access HDFS if the user class tries to.
072659e [Hari Shreedharan] Fix build failure caused by thread factory getting moved to ThreadUtils.
f041dd3 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
42eead4 [Hari Shreedharan] Remove RPC part. Refactor and move methods around, use renewal interval rather than max lifetime to create new tokens.
ebb36f5 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
bc083e3 [Hari Shreedharan] Overload RegisteredExecutor to send tokens. Minor doc updates.
7b19643 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
8a4f268 [Hari Shreedharan] Added docs in the security guide. Changed some code to ensure that the renewer objects are created only if required.
e800c8b [Hari Shreedharan] Restore original RegisteredExecutor message, and send new tokens via NewTokens message.
0e9507e [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
7f1bc58 [Hari Shreedharan] Minor fixes, cleanup.
bcd11f9 [Hari Shreedharan] Refactor AM and Executor token update code into separate classes, also send tokens via akka on executor startup.
f74303c [Hari Shreedharan] Move the new logic into specialized classes. Add cleanup for old credentials files.
2f9975c [Hari Shreedharan] Ensure new tokens are written out immediately on AM restart. Also, pick up the latest suffix from HDFS if the AM is restarted.
61b2b27 [Hari Shreedharan] Account for AM restarts by making sure lastSuffix is read from the files on HDFS.
62c45ce [Hari Shreedharan] Relogin from keytab periodically.
fa233bd [Hari Shreedharan] Adding logging, fixing minor formatting and ordering issues.
42813b4 [Hari Shreedharan] Remove utils.sh, which was re-added due to merge with master.
0de27ee [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
55522e3 [Hari Shreedharan] Fix failure caused by Preconditions ambiguity.
9ef5f1b [Hari Shreedharan] Added explanation of how the credentials refresh works, some other minor fixes.
f4fd711 [Hari Shreedharan] Fix SparkConf usage.
2debcea [Hari Shreedharan] Change the file structure for credentials files. I will push a followup patch which adds a cleanup mechanism for old credentials files. The credentials files are small and few enough that they should not cause issues on HDFS.
af6d5f0 [Hari Shreedharan] Cleaning up files where changes weren't required.
f0f54cb [Hari Shreedharan] Be more defensive when updating the credentials file.
f6954da [Hari Shreedharan] Got rid of Akka communication to renew, instead the executors check a known file's modification time to read the credentials.
5c11c3e [Hari Shreedharan] Move tests to YarnSparkHadoopUtil to fix compile issues.
b4cb917 [Hari Shreedharan] Send keytab to AM via DistributedCache rather than directly via HDFS
0985b4e [Hari Shreedharan] Write tokens to HDFS and read them back when required, rather than sending them over the wire.
d79b2b9 [Hari Shreedharan] Make sure correct credentials are passed to FileSystem#addDelegationTokens()
8c6928a [Hari Shreedharan] Fix issue caused by direct creation of Actor object.
fb27f46 [Hari Shreedharan] Make sure principal and keytab are set before CoarseGrainedSchedulerBackend is started. Also schedule re-logins in CoarseGrainedSchedulerBackend#start()
41efde0 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
d282d7a [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
bcfc374 [Hari Shreedharan] Fix Hadoop-1 build by adding no-op methods in SparkHadoopUtil, with impl in YarnSparkHadoopUtil.
f8fe694 [Hari Shreedharan] Handle None if keytab-login is not scheduled.
2b0d745 [Hari Shreedharan] [SPARK-5342][YARN] Allow long running Spark apps to run on secure YARN/HDFS.
ccba5bc [Hari Shreedharan] WIP: More changes wrt kerberos
77914dd [Hari Shreedharan] WIP: Add kerberos principal and keytab to YARN client.
2015-05-01 15:32:09 -05:00
Patrick Wendell c6d9a42942 Revert "[SPARK-7224] added mock repository generator for --packages tests"
This reverts commit 7dacc08ab3.
2015-05-01 13:01:43 -07:00
Patrick Wendell 58d6584d34 Revert "[SPARK-7287] enabled fixed test"
This reverts commit 7cf1eb79b1.
2015-05-01 13:01:14 -07:00
Sean Owen 1262e310cd [SPARK-6846] [WEBUI] [HOTFIX] return to GET for kill link in UI since YARN AM won't proxy POST
Partial undoing of SPARK-6846; YARN AM proxy won't forward POSTs, so go back to GET for kill links in Spark UI. Standalone UIs are not affected.

Author: Sean Owen <sowen@cloudera.com>

Closes #5837 from srowen/SPARK-6846.2 and squashes the following commits:

c17c386 [Sean Owen] Partial undoing of SPARK-6846; YARN AM proxy won't forward POSTs, so go back to GET for kill links in Spark UI. Standalone UIs are not affected.
2015-05-01 19:57:37 +01:00
niranda 27de6fef6a changing persistence engine trait to an abstract class
Author: niranda <niranda.perera@gmail.com>

Closes #5832 from nirandaperera/PersistanceEngine_abstract_class and squashes the following commits:

67b9d5a [niranda] changing persistence engine trait to an abstract class
2015-05-01 11:27:45 -07:00
Marcelo Vanzin 3052f4916e [SPARK-4705] Handle multiple app attempts event logs, history server.
This change modifies the event logging listener to write the logs for different application
attempts to different files. The attempt ID is set by the scheduler backend, so as long
as the backend returns that ID to SparkContext, things should work. Currently, the
YARN backend does that.

The history server was also modified to model multiple attempts per application. Each
attempt has its own UI and a separate row in the listing table, so that users can look at
all the attempts separately. The UI "adapts" itself to avoid showing attempt-specific info
when all the applications being shown have a single attempt.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: twinkle sachdeva <twinkle@kite.ggn.in.guavus.com>
Author: twinkle.sachdeva <twinkle.sachdeva@guavus.com>
Author: twinkle sachdeva <twinkle.sachdeva@guavus.com>

Closes #5432 from vanzin/SPARK-4705 and squashes the following commits:

7e289fa [Marcelo Vanzin] Review feedback.
f66dcc5 [Marcelo Vanzin] Merge branch 'master' into SPARK-4705
bc885b7 [Marcelo Vanzin] Review feedback.
76a3651 [Marcelo Vanzin] Fix log cleaner, add test.
7c381ec [Marcelo Vanzin] Merge branch 'master' into SPARK-4705
1aa309d [Marcelo Vanzin] Improve sorting of app attempts.
2ad77e7 [Marcelo Vanzin] Missed a reference to the old property name.
9d59d92 [Marcelo Vanzin] Scalastyle...
d5a9c37 [Marcelo Vanzin] Update JsonProtocol test, make property name consistent.
ba34b69 [Marcelo Vanzin] Use Option[String] for attempt id.
f1cb9b3 [Marcelo Vanzin] Merge branch 'master' into SPARK-4705
c14ec19 [Marcelo Vanzin] Merge branch 'master' into SPARK-4705
9092d39 [Marcelo Vanzin] Merge branch 'master' into SPARK-4705
86de638 [Marcelo Vanzin] Merge branch 'master' into SPARK-4705
07446c6 [Marcelo Vanzin] Disable striping for app id / name when multiple attempts exist.
9092af5 [Marcelo Vanzin] Fix HistoryServer test.
3a14503 [Marcelo Vanzin] Argh scalastyle.
657ec18 [Marcelo Vanzin] Fix yarn history URL, app links.
c3e0a82 [Marcelo Vanzin] Move app name to app info, more UI fixes.
ce5ee5d [Marcelo Vanzin] Misc UI, test, style fixes.
cbe8bba [Marcelo Vanzin] Attempt ID in listener event should be an option.
88b1de8 [Marcelo Vanzin] Add a test for apps with multiple attempts.
3245aa2 [Marcelo Vanzin] Make app attempts part of the history server model.
5fd5c6f [Marcelo Vanzin] Fix my broken rebase.
318525a [twinkle.sachdeva] SPARK-4705: 1) moved from directory structure to single file, as per the master branch. 2) Added the attempt id inside the SparkListenerApplicationStart, to make the info available independent of directory structure. 3) Changes in History Server to render the UI as per the snapshot II
6b2e521 [twinkle sachdeva] SPARK-4705 Incorporating the review comments regarding formatting, will do the rest of the changes after this
4c1fc26 [twinkle sachdeva] SPARK-4705 Incorporating the review comments regarding formatting, will do the rest of the changes after this
0eb7722 [twinkle sachdeva] SPARK-4705: Doing cherry-pick of fix into master
2015-05-01 09:50:55 -05:00
Kousuke Saruta 7fe0f3f2b4 [SPARK-3468] [WEBUI] Timeline-View feature
I sometimes troubleshoot and analyse the causes of long-running jobs.

At such times, I find the stages which take a long time or fail, then I find the tasks which take a long time or fail, and next I analyse the proportion of each phase in a task.

In another case, I find executors which take a long time to run a task and analyse the details of that task.

In such situations, I think it's helpful to visualize a timeline for each application, job, and task, along with the proportion of activity within each task.

I added 3 timeline features to the existing Web UI.

[Application Timeline View]
This view shows the following things.

* When each executor was added/removed and the reason why it was removed.
* When each job was started/finished.
* Status of each job.

![screenshot from 2015-04-01 16 49 25](https://cloud.githubusercontent.com/assets/4736016/6936886/e35fd582-d891-11e4-980d-8de13f50e442.png)

[Stage Timeline View]
Similar to the Application Timeline View, this view shows the following things.

* When each executor was added/removed and the reason why it was removed.
* When each job was started/finished.
* Status of each stage.

![screenshot from 2015-04-01 16 50 59](https://cloud.githubusercontent.com/assets/4736016/6936900/0dca6526-d892-11e4-84a8-efd9037af444.png)

[Task Assignment Timeline View]
This view shows the following things.

* When each task started/finished.
* How long each task took and the proportion of time spent in each phase.
* Status of each task.
* Where each task ran.

![screenshot from 2015-04-01 16 51 54](https://cloud.githubusercontent.com/assets/4736016/6936910/20fd5acc-d892-11e4-9018-80e463881fc2.png)

All the views above are zoomable with the mouse wheel and scrollable by dragging.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2342 from sarutak/timeline-viewer-feature and squashes the following commits:

11fe67d [Kousuke Saruta] Fixed conflict
79ac03d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
a91abd3 [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into timeline-viewer-feature
ef34a5b [Kousuke Saruta] Implement tooltip using bootstrap
b09d0c5 [Kousuke Saruta] Move `stroke` and `fill` attribute of rect elements to css
d3c63c8 [Kousuke Saruta] Fixed a few small bugs
a36291b [Kousuke Saruta] Merge branch 'master' of https://github.com/apache/spark into timeline-viewer-feature
28714b6 [Kousuke Saruta] Fixed highlight issue
0dc4278 [Kousuke Saruta] Addressed most of Patrick's feedback
8110acf [Kousuke Saruta] Added scroll limit to Job timeline
974a64a [Kousuke Saruta] Removed unused function
ee7a7f0 [Kousuke Saruta] Refactored
6a91872 [Kousuke Saruta] Temporary commit
6693f34 [Kousuke Saruta] Added link to job/stage box in the timeline in order to move to corresponding row when we click
8f88222 [Kousuke Saruta] Added job/stage description
aeed4b1 [Kousuke Saruta] Removed stage timeline
fc1696c [Kousuke Saruta] Merge branch 'timeline-viewer-feature' of github.com:sarutak/spark into timeline-viewer-feature
999ccd4 [Kousuke Saruta] Improved scalability
0fc6a31 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
19815ae [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
68b7540 [Kousuke Saruta] Merge branch 'timeline-viewer-feature' of github.com:sarutak/spark into timeline-viewer-feature
52b5f0b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
dec85db [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
fcdab7d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
dab7cc1 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
09cce97 [Kousuke Saruta] Cleanuped
16f82cf [Kousuke Saruta] Cleanuped
9fb522e [Kousuke Saruta] Cleanuped
d05f2c2 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into timeline-viewer-feature
e85e9aa [Kousuke Saruta] Cleanup: Added TimelineViewUtils.scala
a76e569 [Kousuke Saruta] Removed unused setting in timeline-view.css
5ce1b21 [Kousuke Saruta] Added vis.min.js, vis.min.css and vis.map to .rat-exclude
082f709 [Kousuke Saruta] Added Timeline-View feature for Applications, Jobs and Stages
2015-05-01 01:39:56 -07:00
zsxwing 14b32886fa [SPARK-7291] [CORE] Fix a flaky test in AkkaRpcEnvSuite
Read the port from RpcEnv to check the result so that it will succeed even if the port conflicts

Author: zsxwing <zsxwing@gmail.com>

Closes #5822 from zsxwing/SPARK-7291 and squashes the following commits:

e521b84 [zsxwing] Fix a flaky test in AkkaRpcEnvSuite
2015-04-30 23:44:33 -07:00
Burak Yavuz 7cf1eb79b1 [SPARK-7287] enabled fixed test
andrewor14 pwendell I re-enabled the test; let's see if it's fixed. I also noticed in the JIRA that `--jars` started to fail after this test was ignored. For example, [here](https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.0,label=centos/2238/consoleFull) you can see that `--jars` fails for the exact same reason.

Has any change been made to Spark Submit recently? Did the test setup on Jenkins change? If we look at last month's flaky tests, this test isn't among them.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5826 from brkyvz/restart-test and squashes the following commits:

f509f68 [Burak Yavuz] enabled fixed test
2015-04-30 23:39:58 -07:00
Sandy Ryza 0a2b15ce43 [SPARK-4550] In sort-based shuffle, store map outputs in serialized form
Refer to the JIRA for the design doc and some perf results.

I wanted to call out some of the more possibly controversial changes up front:
* Map outputs are only stored in serialized form when Kryo is in use.  I'm still unsure whether Java-serialized objects can be relocated.  At the very least, Java serialization writes out a stream header which causes problems with the current approach, so I decided to leave investigating this to future work.
* The shuffle now explicitly operates on key-value pairs instead of any object.  Data is written to shuffle files in alternating keys and values instead of key-value tuples.  `BlockObjectWriter.write` now accepts a key argument and a value argument instead of any object.
* The map output buffer can hold a max of Integer.MAX_VALUE bytes, though this wouldn't be terribly difficult to change.
* When spilling occurs, the objects that are still in memory at merge time end up serialized and deserialized an extra time.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #4450 from sryza/sandy-spark-4550 and squashes the following commits:

8c70dd9 [Sandy Ryza] Fix serialization
9c16fe6 [Sandy Ryza] Fix a couple tests and move getAutoReset to KryoSerializerInstance
6c54e06 [Sandy Ryza] Fix scalastyle
d8462d8 [Sandy Ryza] SPARK-4550
2015-04-30 23:14:14 -07:00
Zhan Zhang 36a7a6807e [SPARK-6479] [BLOCK MANAGER] Create off-heap block storage API
These are the classes for creating the off-heap block storage API. It also includes the migration for Tachyon. The diff seems big, but it mainly just renames tachyon to offheap. A new implementation for HDFS will be submitted for review in SPARK-6112.

Author: Zhan Zhang <zhazhan@gmail.com>

Closes #5430 from zhzhan/SPARK-6479 and squashes the following commits:

60acd84 [Zhan Zhang] minor change to kickoff the test
12f54c9 [Zhan Zhang] solve merge conflicts
a54132c [Zhan Zhang] solve review comments
ffb8e00 [Zhan Zhang] rebase to sparkcontext change
6e121e0 [Zhan Zhang] resolve review comments and restructure blockmanager code
a7aed6c [Zhan Zhang] add Tachyon migration code
186de31 [Zhan Zhang] initial commit for off-heap block storage api
2015-04-30 22:24:31 -07:00
Patrick Wendell e0628f2fae Revert "[SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS"
This reverts commit 6c65da6bb7.
2015-04-30 14:59:20 -07:00
Hari Shreedharan 6c65da6bb7 [SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS
Current Spark apps running on Secure YARN/HDFS would not be able to write data
to HDFS after 7 days, since delegation tokens cannot be renewed beyond that. This
means Spark Streaming apps will not be able to run on Secure YARN.

This commit adds basic functionality to fix this issue. In this patch:
- new parameters are added - principal and keytab, which can be used to login to a KDC
- the client logs in, and then get tokens to start the AM
- the keytab is copied to the staging directory
- the AM waits for 60% of the tokens' lifetime to elapse and then logs in using the keytab
- after each such interval (60% of the token lifetime), new tokens are created and sent to the executors

Currently, to avoid complicating the architecture, we set the keytab and principal in the
SparkHadoopUtil singleton, and schedule a login. Once the login is completed, a callback is scheduled.
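
A hedged sketch of the new parameters expressed as configuration (the conf key names are assumptions based on the YARN config namespace; the canonical entry point is the new principal/keytab parameters described above):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.principal", "user@EXAMPLE.COM")  // identity used to log in to the KDC
  .set("spark.yarn.keytab", "/path/to/user.keytab") // keytab copied to the staging directory
```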

This is being posted for feedback, so I can gather feedback on the general implementation.

There are currently a bunch of things to do:
- [x] logging
- [x] testing - I plan to manually test this soon. If you have ideas of how to add unit tests, comment.
- [x] add code to ensure that if these params are set in non-YARN cluster mode, we complain
- [x] documentation
- [x] Have the executors request for credentials from the AM, so that retries are possible.

Author: Hari Shreedharan <hshreedharan@apache.org>

Closes #4688 from harishreedharan/kerberos-longrunning and squashes the following commits:

36eb8a9 [Hari Shreedharan] Change the renewal interval config param. Fix a bunch of comments.
611923a [Hari Shreedharan] Make sure the namenodes are listed correctly for creating tokens.
09fe224 [Hari Shreedharan] Use token.renew to get token's renewal interval rather than using hdfs-site.xml
6963bbc [Hari Shreedharan] Schedule renewal in AM before starting user class. Else, a restarted AM cannot access HDFS if the user class tries to.
072659e [Hari Shreedharan] Fix build failure caused by thread factory getting moved to ThreadUtils.
f041dd3 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
42eead4 [Hari Shreedharan] Remove RPC part. Refactor and move methods around, use renewal interval rather than max lifetime to create new tokens.
ebb36f5 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
bc083e3 [Hari Shreedharan] Overload RegisteredExecutor to send tokens. Minor doc updates.
7b19643 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
8a4f268 [Hari Shreedharan] Added docs in the security guide. Changed some code to ensure that the renewer objects are created only if required.
e800c8b [Hari Shreedharan] Restore original RegisteredExecutor message, and send new tokens via NewTokens message.
0e9507e [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
7f1bc58 [Hari Shreedharan] Minor fixes, cleanup.
bcd11f9 [Hari Shreedharan] Refactor AM and Executor token update code into separate classes, also send tokens via akka on executor startup.
f74303c [Hari Shreedharan] Move the new logic into specialized classes. Add cleanup for old credentials files.
2f9975c [Hari Shreedharan] Ensure new tokens are written out immediately on AM restart. Also, pick up the latest suffix from HDFS if the AM is restarted.
61b2b27 [Hari Shreedharan] Account for AM restarts by making sure lastSuffix is read from the files on HDFS.
62c45ce [Hari Shreedharan] Relogin from keytab periodically.
fa233bd [Hari Shreedharan] Adding logging, fixing minor formatting and ordering issues.
42813b4 [Hari Shreedharan] Remove utils.sh, which was re-added due to merge with master.
0de27ee [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
55522e3 [Hari Shreedharan] Fix failure caused by Preconditions ambiguity.
9ef5f1b [Hari Shreedharan] Added explanation of how the credentials refresh works, some other minor fixes.
f4fd711 [Hari Shreedharan] Fix SparkConf usage.
2debcea [Hari Shreedharan] Change the file structure for credentials files. I will push a followup patch which adds a cleanup mechanism for old credentials files. The credentials files are small and few enough that they should not cause issues on HDFS.
af6d5f0 [Hari Shreedharan] Cleaning up files where changes weren't required.
f0f54cb [Hari Shreedharan] Be more defensive when updating the credentials file.
f6954da [Hari Shreedharan] Got rid of Akka communication to renew, instead the executors check a known file's modification time to read the credentials.
5c11c3e [Hari Shreedharan] Move tests to YarnSparkHadoopUtil to fix compile issues.
b4cb917 [Hari Shreedharan] Send keytab to AM via DistributedCache rather than directly via HDFS
0985b4e [Hari Shreedharan] Write tokens to HDFS and read them back when required, rather than sending them over the wire.
d79b2b9 [Hari Shreedharan] Make sure correct credentials are passed to FileSystem#addDelegationTokens()
8c6928a [Hari Shreedharan] Fix issue caused by direct creation of Actor object.
fb27f46 [Hari Shreedharan] Make sure principal and keytab are set before CoarseGrainedSchedulerBackend is started. Also schedule re-logins in CoarseGrainedSchedulerBackend#start()
41efde0 [Hari Shreedharan] Merge branch 'master' into kerberos-longrunning
d282d7a [Hari Shreedharan] Fix ClientSuite to set YARN mode, so that the correct class is used in tests.
bcfc374 [Hari Shreedharan] Fix Hadoop-1 build by adding no-op methods in SparkHadoopUtil, with impl in YarnSparkHadoopUtil.
f8fe694 [Hari Shreedharan] Handle None if keytab-login is not scheduled.
2b0d745 [Hari Shreedharan] [SPARK-5342][YARN] Allow long running Spark apps to run on secure YARN/HDFS.
ccba5bc [Hari Shreedharan] WIP: More changes wrt kerberos
77914dd [Hari Shreedharan] WIP: Add kerberos principal and keytab to YARN client.
2015-04-30 13:03:23 -05:00
Burak Yavuz 7dacc08ab3 [SPARK-7224] added mock repository generator for --packages tests
This patch contains an `IvyTestUtils` file, which dynamically generates jars and pom files to test the `--packages` feature without having to rely on the internet or Maven Central.

cc pwendell I know there already existed Util functions to create jars and such, but they didn't really serve my purposes, as they appended random prefixes that broke things.

I also added the local repository tests. Notice that they work without passing the `repo` to `resolveMavenCoordinates`.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5790 from brkyvz/maven-utils and squashes the following commits:

3ec79b7 [Burak Yavuz] addressed comments v0.2
a39151b [Burak Yavuz] address comments v0.1
172dfef [Burak Yavuz] use Ivy format
7476d06 [Burak Yavuz] added mock repository generator
2015-04-30 10:19:08 -07:00
Patrick Wendell 47bf406d60 [HOTFIX] Disabling flaky test (fix in progress as part of SPARK-7224) 2015-04-30 01:02:33 -07:00
zsxwing 1b7106b867 [SPARK-6862] [STREAMING] [WEBUI] Add BatchPage to display details of a batch
This is an initial commit for SPARK-6862. Once SPARK-6796 is merged, I will add the links to StreamingPage so that the user can jump to BatchPage.

Screenshots:
![success](https://cloud.githubusercontent.com/assets/1000778/7102439/bbe75406-e0b3-11e4-84fe-3e6de629a49a.png)
![failure](https://cloud.githubusercontent.com/assets/1000778/7102440/bc124454-e0b3-11e4-921a-c8b39d6b61bc.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #5473 from zsxwing/SPARK-6862 and squashes the following commits:

0727d35 [zsxwing] Change BatchUIData to a case class
b380cfb [zsxwing] Add createJobStart to eliminate duplicate codes
9a3083d [zsxwing] Rename XxxDatas -> XxxData
087ba98 [zsxwing] Refactor BatchInfo to store only necessary fields
cb62e4f [zsxwing] Use Seq[(OutputOpId, SparkJobId)] to store the id relations
72f8e7e [zsxwing] Add unit tests for BatchPage
1282b10 [zsxwing] Handle some corner cases and add tests for StreamingJobProgressListener
77a69ae [zsxwing] Refactor codes as per TD's comments
35ffd80 [zsxwing] Merge branch 'master' into SPARK-6862
15bdf9b [zsxwing] Add batch links and unit tests
4bf66b6 [zsxwing] Merge branch 'master' into SPARK-6862
7168807 [zsxwing] Limit the max width of the error message and fix nits in the UI
0b226f9 [zsxwing] Change 'Last Error' to 'Error'
fc98a43 [zsxwing] Put clearing local properties to finally and remove redundant private[streaming]
0c7b2eb [zsxwing] Add BatchPage to display details of a batch
2015-04-29 18:22:14 -07:00
yongtang 3fc6cfd079 [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input
See JIRA: https://issues.apache.org/jira/browse/SPARK-7155

SparkContext's newAPIHadoopFile() does not support a comma-separated list of files. For example, the following:
```scala
sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt", classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```
will throw
```
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/file1.txt,/root/file2.txt
```
However, the other API, hadoopFile(), is able to process a comma-separated list of files correctly. In addition, since sc.textFile() uses hadoopFile(), it is also able to process a comma-separated list of files correctly.

That means the behaviors of hadoopFile() and newAPIHadoopFile() are not aligned.

This pull request fixes this issue and allows newAPIHadoopFile() to support a comma-separated list of files as input.
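
With the fix, the failing call from the example above succeeds (a sketch, assuming `sc` and the same Hadoop imports):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Both paths are now split on commas and registered via setInputPaths.
val rdd = sc.newAPIHadoopFile("/root/file1.txt,/root/file2.txt",
  classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
```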

A unit test has also been added in SparkContextSuite.scala. It creates two temporary text files as the input and tests against sc.textFile(), sc.hadoopFile(), and sc.newAPIHadoopFile().

Note: The contribution is my original work and that I license the work to the project under the project's open source license.

Author: yongtang <yongtang@users.noreply.github.com>

Closes #5708 from yongtang/SPARK-7155 and squashes the following commits:

654c80c [yongtang] [SPARK-7155] [CORE] Remove unneeded temp file deletion in unit test as parent dir is already temporary.
26faa6a [yongtang] [SPARK-7155] [CORE] Support comma-separated list of files as input for newAPIHadoopFile, wholeTextFiles, and binaryFiles. Use setInputPaths for consistency.
73e1f16 [yongtang] [SPARK-7155] [CORE] Allow newAPIHadoopFile to support comma-separated list of files as input.
2015-04-29 23:55:51 +01:00
Qiping Li 7f4b583733 [SPARK-7181] [CORE] fix infinite loop in ExternalSorter's mergeWithAggregation
see [SPARK-7181](https://issues.apache.org/jira/browse/SPARK-7181).

Author: Qiping Li <liqiping1991@gmail.com>

Closes #5737 from chouqin/externalsorter and squashes the following commits:

2924b93 [Qiping Li] fix infinite loop in ExternalSorter's mergeWithAggregation
2015-04-29 23:52:16 +01:00
Burak Yavuz d7dbce8f7d [SPARK-7156][SQL] support RandomSplit in DataFrames
This is built on top of kaka1992's PR #5711, using logical plans.
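
A usage sketch of the new API (assuming `df` is an existing DataFrame; the weights/seed signature follows the DataFrame API of this era):

```scala
// Split a DataFrame into 80%/20% train/test partitions with a fixed seed.
val Array(train, test) = df.randomSplit(Array(0.8, 0.2), seed = 42L)
```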

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5761 from brkyvz/random-sample and squashes the following commits:

a1fb0aa [Burak Yavuz] remove unrelated file
69669c3 [Burak Yavuz] fix broken test
1ddb3da [Burak Yavuz] copy base
6000328 [Burak Yavuz] added python api and fixed test
3c11d1b [Burak Yavuz] fixed broken test
f400ade [Burak Yavuz] fix build errors
2384266 [Burak Yavuz] addressed comments v0.1
e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
2015-04-29 15:34:05 -07:00
Josh Rosen 3a180c19a4 [SPARK-6629] cancelJobGroup() may not work for jobs whose job groups are inherited from parent threads
When a job is submitted with a job group and that job group is inherited from a parent thread, there are multiple bugs that may prevent this job from being cancelable via `SparkContext.cancelJobGroup()`:

- When filtering jobs based on their job group properties, DAGScheduler calls `get()` instead of `getProperty()`, which does not respect inheritance, so it will skip over jobs whose job group properties were inherited.
- `Properties` objects are mutable, but we do not make defensive copies / snapshots, so modifications of the parent thread's job group will cause running jobs' groups to change; this also breaks cancelation.

Both of these issues are easy to fix: use `getProperty()` and perform defensive copying.
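
For context, a sketch of the job-group API involved (assuming `sc` is a SparkContext; the bug bit when `setJobGroup` ran on a parent thread and the jobs were submitted from a child thread):

```scala
// Tag subsequent jobs with a group; another thread can cancel the group later.
sc.setJobGroup("nightly-etl", "nightly ETL jobs", interruptOnCancel = true)
val result = sc.parallelize(1 to 1000000).map(_ * 2).count() // runs under the group
sc.cancelJobGroup("nightly-etl") // relies on getProperty() honoring inheritance
```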

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5288 from JoshRosen/localProperties-mutability-race and squashes the following commits:

9e29654 [Josh Rosen] Fix style issue
5d90750 [Josh Rosen] Merge remote-tracking branch 'origin/master' into localProperties-mutability-race
3f7b9e8 [Josh Rosen] Add JIRA reference; move clone into DAGScheduler
707e417 [Josh Rosen] Clone local properties to prevent mutations from breaking job cancellation.
b376114 [Josh Rosen] Fix bug that prevented jobs with inherited job group properties from being cancelled.
2015-04-29 13:31:52 -07:00
Tathagata Das a9c4e29950 [SPARK-6752] [STREAMING] [REOPENED] Allow StreamingContext to be recreated from checkpoint and existing SparkContext
Original PR #5428 got reverted due to issues between MutableBoolean and Hadoop 1.0.4 (see JIRA). This replaces MutableBoolean with AtomicBoolean.

srowen pwendell

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #5773 from tdas/SPARK-6752 and squashes the following commits:

a0c0ead [Tathagata Das] Fix for hadoop 1.0.4
70ae85b [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-6752
94db63c [Tathagata Das] Fix long line.
524f519 [Tathagata Das] Many changes based on PR comments.
eabd092 [Tathagata Das] Added Function0, Java API and unit tests for StreamingContext.getOrCreate
36a7823 [Tathagata Das] Minor changes.
204814e [Tathagata Das] Added StreamingContext.getOrCreate with existing SparkContext
2015-04-29 13:10:31 -07:00
Reynold Xin 687273d915 [SPARK-7223] Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask.
The old naming scheme was very confusing between askWithReply and sendWithReply. I also divided RpcEnv.scala into multiple files.

Author: Reynold Xin <rxin@databricks.com>

Closes #5768 from rxin/rpc-rename and squashes the following commits:

a84058e [Reynold Xin] [SPARK-7223] Rename RPC askWithReply -> askWithRetry, sendWithReply -> ask.
2015-04-29 09:46:37 -07:00
Josh Rosen f49284b5bf [SPARK-7076][SPARK-7077][SPARK-7080][SQL] Use managed memory for aggregations
This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory.  This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses.  As a result, this allows for higher memory utilization.  It can also result in better cache locality since objects will be stored closer together in memory.

This feature can be enabled by setting `spark.sql.unsafe.enabled=true`.  For now, this feature is only supported when codegen is enabled and only supports aggregations for which the grouping columns are primitive numeric types or strings and aggregated values are numeric.
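
As a configuration sketch (the first key is stated above; treating `spark.sql.codegen` as the codegen flag of this era is an assumption):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.sql.codegen", "true")        // prerequisite: codegen enabled (assumed key)
  .set("spark.sql.unsafe.enabled", "true") // opt in to managed-memory aggregation
```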

### Managing memory with sun.misc.Unsafe

This patch supports both on- and off-heap managed memory.

- In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object.
- In off-heap mode, memory is addressed directly with 64-bit long addresses.

To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields.  In off-heap mode, we simply pass `null` as `baseObject`.

We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks.

By default, we use on-heap mode.  To enable off-heap mode, set `spark.unsafe.offHeap=true`.

To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`.  These classes work together to track allocations and detect memory leaks.

### Compact tuple format

This patch introduces `UnsafeRow`, a compact row layout.  In this format, each tuple has three parts: a null bit set, fixed length values, and variable-length values:

![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png)

- Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes)
- The bit set is used for null tracking:
	- Position _i_ is set if and only if field _i_ is null
	- The bit set is aligned to an 8-byte word boundary.
- Every field appears as an 8-byte word in the fixed-length values part:
	- If a field is null, we zero out the values.
	- If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part.
- Each variable-length data type can have its own encoding:
	- For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes.  If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment.

For example, a tuple that consists of 3 fields of type (int, string, string), with values (null, “data”, “bricks”), would look like this:

![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png)

This format allows us to compare tuples for equality by directly comparing their raw bytes.  This also enables fast hashing of tuples.
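
A worked size calculation for that example tuple, derived purely from the layout rules above (a sketch, not taken from the code):

```scala
// (null, "data", "bricks") as a compact row, per the rules above:
val numFields = 3
val bitSetBytes = ((numFields + 63) / 64) * 8 // one 8-byte word covers 3 null bits
val fixedBytes = numFields * 8                // one 8-byte word per field
def varBytes(s: String): Int =                // length word + UTF-8 bytes, 8-byte padded
  8 + ((s.getBytes("UTF-8").length + 7) / 8) * 8
val total = bitSetBytes + fixedBytes + varBytes("data") + varBytes("bricks")
println(total) // 8 + 24 + 16 + 16 = 64 bytes
```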

### Hash map for performing aggregations

This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-with.  This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values.

`BytesToBytesMap` stores pointers to key and value tuples.  For each record with a new key, we copy the key and create the aggregation value buffer for that key and put them in a buffer. The hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place.

This map is implemented using open hashing with triangular sequence probing.  Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode.  By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions (since the chance of a hashcode collision is much lower than that of a position collision).
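
A minimal sketch of triangular-sequence probing (illustrative; `isDefined` and `sameKey` are hypothetical helpers, not the BytesToBytesMap API, and `capacity` is assumed to be a power of two):

```scala
// Probe offsets grow 1, 2, 3, ..., visiting h, h+1, h+3, h+6, ... mod capacity;
// with a power-of-two capacity this triangular sequence touches every slot.
def probe(hashcode: Int, capacity: Int,
          isDefined: Int => Boolean, sameKey: Int => Boolean): Int = {
  var pos = hashcode & (capacity - 1)
  var step = 1
  while (isDefined(pos) && !sameKey(pos)) {
    pos = (pos + step) & (capacity - 1)
    step += 1
  }
  pos // either an empty slot or the matching key's slot
}
```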

`UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map.  Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`.  This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5725 from JoshRosen/unsafe and squashes the following commits:

eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns.
81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate
1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing.
017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize()
50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication
70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager:
6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods.
de5e001 [Josh Rosen] Fix debug vs. trace in logging message.
a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention.
78a5b84 [Josh Rosen] Add logging to MemoryManager
ce3c565 [Josh Rosen] More comments, formatting, and code cleanup.
529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds.
3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace
162caf7 [Josh Rosen] Fix test compilation
b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size.
a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv.
0925847 [Josh Rosen] Disable MiMa checks for new unsafe module
cde4132 [Josh Rosen] Add missing pom.xml
9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap
6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow.
31eaabc [Josh Rosen] Lots of TODO and doc cleanup.
a95291e [Josh Rosen] Cleanups to string handling code
afe8dca [Josh Rosen] Some Javadoc cleanup
f3dcbfe [Josh Rosen] More mod replacement
854201a [Josh Rosen] Import and comment cleanup
06e929d [Josh Rosen] More warning cleanup
ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections
29a7575 [Josh Rosen] Remove debug logging
49aed30 [Josh Rosen] More long -> int conversion.
b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation.
765243d [Josh Rosen] Enable optional performance metrics for hash map.
23a440a [Josh Rosen] Bump up default hash map size
628f936 [Josh Rosen] Use ints intead of longs for indexing.
92d5a06 [Josh Rosen] Address a number of minor code review comments.
1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests.
d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap
bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...)
b3eaccd [Josh Rosen] Extract aggregation map into its own class.
d2bb986 [Josh Rosen] Update to implement new Row methods added upstream
58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable)
7df6008 [Josh Rosen] Optimizations related to zeroing out memory:
c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free():
738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate
c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed.
62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns.
c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter
ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort"
c754ae1 [Josh Rosen] Now that the store*() contract has been strengthened, we can remove an extra lookup
f764d13 [Josh Rosen] Simplify address + length calculation in Location.
079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract.
1a483c5 [Josh Rosen] First version that passes some aggregation tests:
fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate
53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters
1ff814d [Josh Rosen] Add reminder to free memory on iterator completion
8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration.
5d55cef [Josh Rosen] Add skeleton for Row implementation.
f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods.
ab68e08 [Josh Rosen] Begin merging the UTF8String implementations.
480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.
2015-04-29 01:07:26 -07:00
Patrick Wendell 1fd6ed9a56 [SPARK-7204] [SQL] Fix callSite for Dataframe and SQL operations
This patch adds SQL to the set of excluded libraries when
generating a callSite. This makes the callSite mechanism work
properly for the data frame API. I also added a small improvement for
JDBC queries where we just use the string "Spark JDBC Server Query"
instead of trying to give a callsite that doesn't make any sense
to the user.

Before (DF):
![screen shot 2015-04-28 at 1 29 26 pm](https://cloud.githubusercontent.com/assets/320616/7380170/ef63bfb0-edae-11e4-989c-f88a5ba6bbee.png)

After (DF):
![screen shot 2015-04-28 at 1 34 58 pm](https://cloud.githubusercontent.com/assets/320616/7380181/fa7f6d90-edae-11e4-9559-26f163ed63b8.png)

After (JDBC):
![screen shot 2015-04-28 at 2 00 10 pm](https://cloud.githubusercontent.com/assets/320616/7380185/02f5b2a4-edaf-11e4-8e5b-99bdc3df66dd.png)

Author: Patrick Wendell <patrick@databricks.com>

Closes #5757 from pwendell/dataframes and squashes the following commits:

0d931a4 [Patrick Wendell] Attempting to fix PySpark tests
85bf740 [Patrick Wendell] [SPARK-7204] Fix callsite for dataframe operations.
2015-04-29 00:35:08 -07:00
Burak Yavuz f98773a90d [SPARK-7205] Support .ivy2/local and .m2/repositories/ in --packages
In addition, I made a small change that allows users to import 2 different artifacts with the same name: cached files are now named `[organization]_[artifact]-[revision].[ext]`. This used to be only `[artifact].[ext]`, which might have caused collisions between artifacts with the same artifactId but different groupIds.

cc pwendell

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5755 from brkyvz/local-caches and squashes the following commits:

c47c9c5 [Burak Yavuz] Small fixes to --packages
2015-04-28 23:05:02 -07:00
Marcelo Vanzin 28b1af7420 [MINOR] [CORE] Warn users who try to cache RDDs with dynamic allocation on.
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5751 from vanzin/cached-rdd-warning and squashes the following commits:

554cc07 [Marcelo Vanzin] Change message.
9efb9da [Marcelo Vanzin] [minor] [core] Warn users who try to cache RDDs with dynamic allocation on.
2015-04-28 13:49:29 -07:00
Timothy Chen 53befacced [SPARK-5338] [MESOS] Add cluster mode support for Mesos
This patch adds the support for cluster mode to run on Mesos.
It introduces a new Mesos framework dedicated to launching new apps/drivers, which can be used via the spark-submit script by pointing the --master flag at the cluster mode REST interface instead of the Mesos master.

Example:
./bin/spark-submit --deploy-mode cluster --class org.apache.spark.examples.SparkPi --master mesos://10.0.0.206:8077 --executor-memory 1G --total-executor-cores 100 examples/target/spark-examples_2.10-1.3.0-SNAPSHOT.jar 30

Part of this patch is also to abstract the StandaloneRestServer so it can have different implementations of the REST endpoints.

Features of the cluster mode in this PR:
- Supports supervise mode, where the scheduler will keep trying to reschedule an exited job.
- Adds a new UI for the cluster mode scheduler to see all the running jobs, finished jobs, and supervised jobs waiting to be retried
- Supports state persistence to ZK, so when the cluster scheduler fails over it can pick up all the queued and running jobs

Author: Timothy Chen <tnachen@gmail.com>
Author: Luc Bourlier <luc.bourlier@typesafe.com>

Closes #5144 from tnachen/mesos_cluster_mode and squashes the following commits:

069e946 [Timothy Chen] Fix rebase.
e24b512 [Timothy Chen] Persist submitted driver.
390c491 [Timothy Chen] Fix zk conf key for mesos zk engine.
e324ac1 [Timothy Chen] Fix merge.
fd5259d [Timothy Chen] Address review comments.
1553230 [Timothy Chen] Address review comments.
c6c6b73 [Timothy Chen] Pass spark properties to mesos cluster tasks.
f7d8046 [Timothy Chen] Change app name to spark cluster.
17f93a2 [Timothy Chen] Fix head of line blocking in scheduling drivers.
6ff8e5c [Timothy Chen] Address comments and add logging.
df355cd [Timothy Chen] Add metrics to mesos cluster scheduler.
20f7284 [Timothy Chen] Address review comments
7252612 [Timothy Chen] Fix tests.
a46ad66 [Timothy Chen] Allow zk cli param override.
920fc4b [Timothy Chen] Fix scala style issues.
862b5b5 [Timothy Chen] Support asking driver status when it's retrying.
7f214c2 [Timothy Chen] Fix RetryState visibility
e0f33f7 [Timothy Chen] Add supervise support and persist retries.
371ce65 [Timothy Chen] Handle cluster mode recovery and state persistence.
3d4dfa1 [Luc Bourlier] Adds support to kill submissions
febfaba [Timothy Chen] Bound the finished drivers in memory
543a98d [Timothy Chen] Schedule multiple jobs
6887e5e [Timothy Chen] Support looking at SPARK_EXECUTOR_URI env variable in schedulers
8ec76bc [Timothy Chen] Fix Mesos dispatcher UI.
d57d77d [Timothy Chen] Add documentation
825afa0 [Luc Bourlier] Supports more spark-submit parameters
b8e7181 [Luc Bourlier] Adds a shutdown latch to keep the daemon running
0fa7780 [Luc Bourlier] Launch task through the mesos scheduler
5b7a12b [Timothy Chen] WIP: Making a cluster mode a mesos framework.
4b2f5ef [Timothy Chen] Specify user jar in command to be replaced with local.
e775001 [Timothy Chen] Support fetching remote uris in driver runner.
7179495 [Timothy Chen] Change Driver page output and add logging
880bc27 [Timothy Chen] Add Mesos Cluster UI to display driver results
9986731 [Timothy Chen] Kill drivers when shutdown
67cbc18 [Timothy Chen] Rename StandaloneRestClient to RestClient and add sbin scripts
e3facdd [Timothy Chen] Add Mesos Cluster dispatcher
2015-04-28 13:33:57 -07:00
Zhang, Liye 80098109d9 [SPARK-6314] [CORE] handle JsonParseException for history server
This is handled in the same way as [SPARK-6197](https://issues.apache.org/jira/browse/SPARK-6197). With this PR, the exception shown in the history server log is replaced by a warning, and applications with incomplete history log files are still listed on the history server web UI.
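A hedged sketch of the handling described; the replay callback and logging are stand-ins for the history server's internals:

```scala
// Sketch: a malformed (often still-being-written) event log should produce
// a warning, not an exception that aborts the application listing.
import com.fasterxml.jackson.core.JsonParseException

object ReplaySketch {
  def replayLog(appId: String, replay: () => Unit): Unit = {
    try {
      replay()
    } catch {
      case e: JsonParseException =>
        // Incomplete logs can end mid-record; warn and keep listing the app.
        Console.err.println(s"WARN: malformed line in event log for $appId: ${e.getMessage}")
    }
  }
}
```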

Author: Zhang, Liye <liye.zhang@intel.com>

Closes #5736 from liyezhang556520/SPARK-6314 and squashes the following commits:

b8d2d88 [Zhang, Liye] handle JsonParseException for history server
2015-04-28 12:36:10 -07:00
Ilya Ganelin 2d222fb39d [SPARK-5932] [CORE] Use consistent naming for size properties
I've added an interface to JavaUtils to do byte conversion, and added hooks within Utils.scala to handle the conversion within Spark code (as was done for time strings). I've added matching tests for size conversion, and updated all deprecated configs and documentation as per SPARK-5933.
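A simplified sketch of the byte-string parsing this change is about; the real JavaUtils helpers are more thorough, and the regex and unit table here are illustrative:

```scala
// Sketch: interpret sizes like "512k" or "1g" using binary units
// (1k = 1024 bytes, and so on), mirroring the time-string helpers.
object ByteStringSketch {
  private val units = Map("b" -> 1L, "k" -> 1024L,
    "m" -> 1024L * 1024, "g" -> 1024L * 1024 * 1024)

  private val sizePattern = "([0-9]+)([bkmg]?)".r

  def byteStringAsBytes(str: String): Long = str.trim.toLowerCase match {
    case sizePattern(num, suffix) => num.toLong * units.getOrElse(suffix, 1L)
    case _ => throw new NumberFormatException(s"Invalid size string: $str")
  }

  def main(args: Array[String]): Unit = {
    println(byteStringAsBytes("512k")) // 524288
    println(byteStringAsBytes("1g"))   // 1073741824
  }
}
```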

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #5574 from ilganeli/SPARK-5932 and squashes the following commits:

11f6999 [Ilya Ganelin] Nit fixes
49a8720 [Ilya Ganelin] Whitespace fix
2ab886b [Ilya Ganelin] Scala style
fc85733 [Ilya Ganelin] Got rid of floating point math
852a407 [Ilya Ganelin] [SPARK-5932] Added much improved overflow handling. Can now handle sizes up to Long.MAX_VALUE Petabytes instead of being capped at Long.MAX_VALUE Bytes
9ee779c [Ilya Ganelin] Simplified fraction matches
22413b1 [Ilya Ganelin] Made MAX private
3dfae96 [Ilya Ganelin] Fixed some nits. Added automatic conversion of old parameter for kryoserializer.mb to new values.
e428049 [Ilya Ganelin] resolving merge conflict
8b43748 [Ilya Ganelin] Fixed error in pattern matching for doubles
84a2581 [Ilya Ganelin] Added smoother handling of fractional values for size parameters. This now throws an exception and added a warning for old spark.kryoserializer.buffer
d3d09b6 [Ilya Ganelin] [SPARK-5932] Fixing error in KryoSerializer
fe286b4 [Ilya Ganelin] Resolved merge conflict
c7803cd [Ilya Ganelin] Empty lines
54b78b4 [Ilya Ganelin] Simplified byteUnit class
69e2f20 [Ilya Ganelin] Updates to code
f32bc01 [Ilya Ganelin] [SPARK-5932] Fixed error in API in SparkConf.scala where Kb conversion wasn't being done properly (was Mb). Added test cases for both timeUnit and ByteUnit conversion
f15f209 [Ilya Ganelin] Fixed conversion of kryo buffer size
0f4443e [Ilya Ganelin]     Merge remote-tracking branch 'upstream/master' into SPARK-5932
35a7fa7 [Ilya Ganelin] Minor formatting
928469e [Ilya Ganelin] [SPARK-5932] Converted some longs to ints
5d29f90 [Ilya Ganelin] [SPARK-5932] Finished documentation updates
7a6c847 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer
afc9a38 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize and spark.storage.memoryMapThreshold
ae7e9f6 [Ilya Ganelin] [SPARK-5932] Updated spark.io.compression.snappy.block.size
2d15681 [Ilya Ganelin] [SPARK-5932] Updated spark.executor.logs.rolling.size.maxBytes
1fbd435 [Ilya Ganelin] [SPARK-5932] Updated spark.broadcast.blockSize
eba4de6 [Ilya Ganelin] [SPARK-5932] Updated spark.shuffle.file.buffer.kb
b809a78 [Ilya Ganelin] [SPARK-5932] Updated spark.kryoserializer.buffer.max
0cdff35 [Ilya Ganelin] [SPARK-5932] Updated to use kibibytes in method names. Updated spark.kryoserializer.buffer.mb and spark.reducer.maxMbInFlight
475370a [Ilya Ganelin] [SPARK-5932] Simplified ByteUnit code, switched to using longs. Updated docs to clarify that we use kibi, mebi etc instead of kilo, mega
851d691 [Ilya Ganelin] [SPARK-5932] Updated memoryStringToMb to use new interfaces
a9f4fcf [Ilya Ganelin] [SPARK-5932] Added unit tests for unit conversion
747393a [Ilya Ganelin] [SPARK-5932] Added unit tests for ByteString conversion
09ea450 [Ilya Ganelin] [SPARK-5932] Added byte string conversion to Jav utils
5390fd9 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-5932
db9a963 [Ilya Ganelin] Closing second spark context
1dc0444 [Ilya Ganelin] Added ref equality check
8c884fa [Ilya Ganelin] Made getOrCreate synchronized
cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
0e1567c [Ilya Ganelin] Got rid of unnecessary option for AtomicReference
dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
733ec9f [Ilya Ganelin] Fixed some bugs in test code
8be2f83 [Ilya Ganelin] Replaced match with if
e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
a99032f [Ilya Ganelin] Spacing fix
d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
2015-04-28 12:18:55 -07:00
Iulian Dragos 8aab94d898 [SPARK-4286] Add an external shuffle service that can be run as a daemon.
This allows Mesos deployments to use the shuffle service (and implicitly dynamic allocation). It does so by adding a new "main" class and two corresponding scripts in `sbin`:

- `sbin/start-shuffle-service.sh`
- `sbin/stop-shuffle-service.sh`

Specific options can be passed in `SPARK_SHUFFLE_OPTS`.

This is picking up work from #3861 /cc tnachen

Author: Iulian Dragos <jaguarul@gmail.com>

Closes #4990 from dragos/feature/external-shuffle-service and squashes the following commits:

6c2b148 [Iulian Dragos] Import order and wrong name fixup.
07804ad [Iulian Dragos] Moved ExternalShuffleService to the `deploy` package + other minor tweaks.
4dc1f91 [Iulian Dragos] Reviewer’s comments:
8145429 [Iulian Dragos] Add an external shuffle service that can be run as a daemon.
2015-04-28 12:08:18 -07:00
Zhang, Liye 52ccf1d373 [Core][test][minor] replace try finally block with tryWithSafeFinally
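For context, a hedged sketch of the tryWithSafeFinally pattern: if the finally block itself throws while an earlier exception is in flight, the new exception is suppressed so the original failure is not masked. Spark's Utils has its own implementation; this is just the shape of it:

```scala
object TryWithSafeFinallySketch {
  def tryWithSafeFinally[T](block: => T)(finallyBlock: => Unit): T = {
    var original: Throwable = null
    try {
      block
    } catch {
      case t: Throwable =>
        original = t
        throw t
    } finally {
      try {
        finallyBlock
      } catch {
        // Attach rather than throw, so the first failure stays primary.
        case t: Throwable if original != null => original.addSuppressed(t)
      }
    }
  }
}
```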
Author: Zhang, Liye <liye.zhang@intel.com>

Closes #5739 from liyezhang556520/trySafeFinally and squashes the following commits:

55683e5 [Zhang, Liye] replace try finally block with tryWithSafeFinally
2015-04-28 10:24:00 -07:00
Sean Owen 7f3b3b7eb7 [SPARK-7168] [BUILD] Update plugin versions in Maven build and centralize versions
Update Maven build plugin versions and centralize plugin version management

Author: Sean Owen <sowen@cloudera.com>

Closes #5720 from srowen/SPARK-7168 and squashes the following commits:

98a8947 [Sean Owen] Make install, deploy plugin versions explicit
4ecf3b2 [Sean Owen] Update Maven build plugin versions and centralize plugin version management
2015-04-28 07:48:34 -04:00
Andrew Or bf35edd9d4 [SPARK-7187] SerializationDebugger should not crash user code
rxin
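A rough sketch of the guard this adds: if the debugger's attempt to improve a NotSerializableException itself fails, fall back to the original exception instead of crashing user code. The method shape here is illustrative:

```scala
import java.io.NotSerializableException
import scala.util.control.NonFatal

object ImproveExceptionSketch {
  def improveException(e: NotSerializableException,
                       findSerializationPath: => String): NotSerializableException = {
    try {
      new NotSerializableException(
        e.getMessage + "\nSerialization stack:\n" + findSerializationPath)
    } catch {
      // Debugging failed; return the original error untouched.
      case NonFatal(_) => e
    }
  }
}
```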

Author: Andrew Or <andrew@databricks.com>

Closes #5734 from andrewor14/ser-deb and squashes the following commits:

e8aad6c [Andrew Or] NonFatal
57d0ef4 [Andrew Or] try catch improveException
2015-04-28 00:38:14 -07:00
zsxwing 874a2ca93d [SPARK-7174][Core] Move calling TaskScheduler.executorHeartbeatReceived to another thread
`HeartbeatReceiver` will call `TaskScheduler.executorHeartbeatReceived`, which is a blocking operation because `TaskScheduler.executorHeartbeatReceived` will call

```Scala
    blockManagerMaster.driverEndpoint.askWithReply[Boolean](
      BlockManagerHeartbeat(blockManagerId), 600 seconds)
```

finally. Even when asking a local actor, this may block the current Akka thread: the reply may be dispatched to the same thread that issued the ask, in which case it can never be processed. An extreme case is setting the Akka dispatcher thread pool size to 1.

jstack log:

```
"sparkDriver-akka.actor.default-dispatcher-14" daemon prio=10 tid=0x00007f2a8c02d000 nid=0x725 waiting on condition [0x00007f2b1d6d0000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000006197a0868> (a scala.concurrent.impl.Promise$CompletionLatch)
	at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1033)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
	at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
	at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
	at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
	at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
	at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread$$anon$3.block(ThreadPoolBuilder.scala:169)
	at scala.concurrent.forkjoin.ForkJoinPool.managedBlock(ForkJoinPool.java:3640)
	at akka.dispatch.MonitorableThreadFactory$AkkaForkJoinWorkerThread.blockOn(ThreadPoolBuilder.scala:167)
	at scala.concurrent.Await$.result(package.scala:107)
	at org.apache.spark.rpc.RpcEndpointRef.askWithReply(RpcEnv.scala:355)
	at org.apache.spark.scheduler.DAGScheduler.executorHeartbeatReceived(DAGScheduler.scala:169)
	at org.apache.spark.scheduler.TaskSchedulerImpl.executorHeartbeatReceived(TaskSchedulerImpl.scala:367)
	at org.apache.spark.HeartbeatReceiver$$anonfun$receiveAndReply$1.applyOrElse(HeartbeatReceiver.scala:103)
	at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$processMessage(AkkaRpcEnv.scala:182)
	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1$$anonfun$applyOrElse$4.apply$mcV$sp(AkkaRpcEnv.scala:128)
	at org.apache.spark.rpc.akka.AkkaRpcEnv.org$apache$spark$rpc$akka$AkkaRpcEnv$$safelyCall(AkkaRpcEnv.scala:203)
	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1$$anonfun$receiveWithLogging$1.applyOrElse(AkkaRpcEnv.scala:127)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply$mcVL$sp(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:33)
	at scala.runtime.AbstractPartialFunction$mcVL$sp.apply(AbstractPartialFunction.scala:25)
	at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:59)
	at org.apache.spark.util.ActorLogReceive$$anon$1.apply(ActorLogReceive.scala:42)
	at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:118)
	at org.apache.spark.util.ActorLogReceive$$anon$1.applyOrElse(ActorLogReceive.scala:42)
	at akka.actor.Actor$class.aroundReceive(Actor.scala:465)
	at org.apache.spark.rpc.akka.AkkaRpcEnv$$anonfun$actorRef$lzycompute$1$1$$anon$1.aroundReceive(AkkaRpcEnv.scala:94)
	at akka.actor.ActorCell.receiveMessage(ActorCell.scala:516)
	at akka.actor.ActorCell.invoke(ActorCell.scala:487)
	at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:238)
	at akka.dispatch.Mailbox.run(Mailbox.scala:220)
	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:393)
	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
```

This PR moves this blocking operation to a separate thread.
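A hedged sketch of the fix's shape: hand the blocking work to a dedicated single-thread executor so the RPC dispatcher thread returns immediately. Names are illustrative:

```scala
import java.util.concurrent.Executors

object HeartbeatOffloadSketch {
  private val heartbeatExecutor = Executors.newSingleThreadExecutor()

  def onHeartbeat(blockingWork: () => Unit): Unit = {
    // Return right away; the potentially blocking ask runs elsewhere.
    heartbeatExecutor.submit(new Runnable {
      override def run(): Unit = blockingWork()
    })
  }
}
```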

Author: zsxwing <zsxwing@gmail.com>

Closes #5723 from zsxwing/SPARK-7174 and squashes the following commits:

98bfe48 [zsxwing] Use a single thread for checking timeout and reporting executorHeartbeatReceived
5b3b545 [zsxwing] Move calling `TaskScheduler.executorHeartbeatReceived` to another thread to avoid blocking the Akka thread pool
2015-04-27 21:45:40 -07:00
Sean Owen ab5adb7a97 [SPARK-7145] [CORE] commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency
Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava

Author: Sean Owen <sowen@cloudera.com>

Closes #5703 from srowen/SPARK-7145 and squashes the following commits:

21fbe03 [Sean Owen] Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava
2015-04-27 19:50:55 -04:00
Marcelo Vanzin 5d45e1f600 [SPARK-3090] [CORE] Stop SparkContext if user forgets to.
Set up a shutdown hook to try to stop the Spark context in
case the user forgets to do it. The main effect is that any
open log files are flushed and closed, which is particularly
interesting for event logs.
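A minimal sketch of the idea; the real patch is more careful about hook ordering, and the Stoppable trait below is a stand-in for SparkContext:

```scala
object StopContextHookSketch {
  trait Stoppable { def stop(): Unit } // stand-in for SparkContext

  def installHook(sc: Stoppable): Unit = {
    sys.addShutdownHook {
      sc.stop() // flushes and closes event logs, among other cleanup
    }
  }
}
```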

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5696 from vanzin/SPARK-3090 and squashes the following commits:

3b554b5 [Marcelo Vanzin] [SPARK-3090] [core] Stop SparkContext if user forgets to.
2015-04-27 19:46:17 -04:00
Hong Shen 8e1c00dbf4 [SPARK-6738] [CORE] Improve estimate the size of a large array
Currently, SizeEstimator.visitArray is not correct in the following case:
```
array size > 200,
elements share the same object
```

When I add a debug log in SizeTracker.scala:
```
 System.err.println(s"numUpdates:$numUpdates, size:$ts, bytesPerUpdate:$bytesPerUpdate, cost time:$b")
```
I get the following log:
```
 numUpdates:1, size:262448, bytesPerUpdate:0.0, cost time:35
 numUpdates:2, size:420698, bytesPerUpdate:158250.0, cost time:35
 numUpdates:4, size:420754, bytesPerUpdate:28.0, cost time:32
 numUpdates:7, size:420754, bytesPerUpdate:0.0, cost time:27
 numUpdates:12, size:420754, bytesPerUpdate:0.0, cost time:28
 numUpdates:20, size:420754, bytesPerUpdate:0.0, cost time:25
 numUpdates:32, size:420754, bytesPerUpdate:0.0, cost time:21
 numUpdates:52, size:420754, bytesPerUpdate:0.0, cost time:20
 numUpdates:84, size:420754, bytesPerUpdate:0.0, cost time:20
 numUpdates:135, size:420754, bytesPerUpdate:0.0, cost time:20
 numUpdates:216, size:420754, bytesPerUpdate:0.0, cost time:11
 numUpdates:346, size:420754, bytesPerUpdate:0.0, cost time:6
 numUpdates:554, size:488911, bytesPerUpdate:327.67788461538464, cost time:8
 numUpdates:887, size:2312259426, bytesPerUpdate:6942253.798798799, cost time:198
15/04/21 14:27:26 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 3.0 GB to disk (1 time so far)
15/04/21 14:27:26 INFO collection.ExternalAppendOnlyMap: /data11/yarnenv/local/usercache/spark/appcache/application_1426746631567_11745/spark-local-20150421142719-c001/30/temp_local_066af981-c2fc-4b70-a00e-110e23006fbc
```
But in fact the file size is only 162K:
```
$ ll -h /data11/yarnenv/local/usercache/spark/appcache/application_1426746631567_11745/spark-local-20150421142719-c001/30/temp_local_066af981-c2fc-4b70-a00e-110e23006fbc
-rw-r----- 1 spark users 162K Apr 21 14:27 /data11/yarnenv/local/usercache/spark/appcache/application_1426746631567_11745/spark-local-20150421142719-c001/30/temp_local_066af981-c2fc-4b70-a00e-110e23006fbc
```

To test this case, I changed visitArray to:
```
var size = 0L
for (i <- 0 until length) {
  val obj = JArray.get(array, i)
  size += SizeEstimator.estimate(obj, state.visited).toLong
}
state.size += size
```
I get the following log:
```
...
14895 277016088 566.9046118590662 time:8470
23832 281840544 552.3308270676691 time:8031
38132 289891824 539.8294729775092 time:7897
61012 302803640 563.0265734265735 time:13044
97620 322904416 564.3276223776223 time:13554
15/04/14 11:46:43 INFO collection.ExternalAppendOnlyMap: Thread 51 spilling in-memory map of 314.5 MB to disk (1 time so far)
15/04/14 11:46:43 INFO collection.ExternalAppendOnlyMap: /data1/yarnenv/local/usercache/spark/appcache/application_1426746631567_8477/spark-local-20150414114020-2fcb/14/temp_local_5b6b98d5-5bfa-47e2-8216-059482ccbda0
```
The file size is 85M:
```
$ ll -h /data1/yarnenv/local/usercache/spark/appcache/application_1426746631567_8477/spark-local-20150414114020-2fcb/14/
total 85M
-rw-r----- 1 spark users 85M Apr 14 11:46 temp_local_5b6b98d5-5bfa-47e2-8216-059482ccbda0
```

The following log is from a run with this patch:
```
....
numUpdates:32, size:365484, bytesPerUpdate:0.0, cost time:7
numUpdates:52, size:365484, bytesPerUpdate:0.0, cost time:5
numUpdates:84, size:365484, bytesPerUpdate:0.0, cost time:5
numUpdates:135, size:372208, bytesPerUpdate:131.84313725490196, cost time:86
numUpdates:216, size:379020, bytesPerUpdate:84.09876543209876, cost time:21
numUpdates:346, size:1865208, bytesPerUpdate:11432.215384615385, cost time:23
numUpdates:554, size:2052380, bytesPerUpdate:899.8653846153846, cost time:16
numUpdates:887, size:2142820, bytesPerUpdate:271.59159159159157, cost time:15
..
numUpdates:14895, size:251675500, bytesPerUpdate:438.5263157894737, cost time:13
numUpdates:23832, size:257010268, bytesPerUpdate:596.9305135951662, cost time:14
numUpdates:38132, size:263922396, bytesPerUpdate:483.3655944055944, cost time:15
numUpdates:61012, size:268962596, bytesPerUpdate:220.28846153846155, cost time:24
numUpdates:97620, size:286980644, bytesPerUpdate:492.1888111888112, cost time:22
15/04/21 14:45:12 INFO collection.ExternalAppendOnlyMap: Thread 53 spilling in-memory map of 328.7 MB to disk (1 time so far)
15/04/21 14:45:12 INFO collection.ExternalAppendOnlyMap: /data4/yarnenv/local/usercache/spark/appcache/application_1426746631567_11758/spark-local-20150421144456-a2a5/2a/temp_local_9c109510-af16-4468-8f23-48cad04da88f
```
The file size is 88M:
```
$ ll -h /data4/yarnenv/local/usercache/spark/appcache/application_1426746631567_11758/spark-local-20150421144456-a2a5/2a/
total 88M
-rw-r----- 1 spark users 88M Apr 21 14:45 temp_local_9c109510-af16-4468-8f23-48cad04da88f
```
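For context, a much-simplified sketch of sampling-based array size estimation; the actual patch also accounts for repeated references, and the threshold and seed here are illustrative:

```scala
import java.util.Random

object ArraySampleSketch {
  def estimateArraySize(array: Array[AnyRef], estimate: AnyRef => Long): Long = {
    val threshold = 200
    if (array.length <= threshold) {
      array.iterator.map(estimate).sum // small arrays: walk every element
    } else {
      val rand = new Random(42)
      val sample = Array.fill(threshold)(array(rand.nextInt(array.length)))
      val sampledSize = sample.iterator.map(estimate).sum
      sampledSize * array.length / threshold // extrapolate to the full array
    }
  }
}
```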

Author: Hong Shen <hongshen@tencent.com>

Closes #5608 from shenh062326/my_change5 and squashes the following commits:

5506bae [Hong Shen] Fix compile error
c275dd3 [Hong Shen] Alter code style
fe202a2 [Hong Shen] Change the code style and add documentation.
a9fca84 [Hong Shen] Add test case for SizeEstimator
4877eee [Hong Shen] Improve estimate the size of a large array
a2ea7ac [Hong Shen] Alter code style
4c28e36 [Hong Shen] Improve estimate the size of a large array
2015-04-27 18:59:45 -04:00
Steven She b9de9e040a [SPARK-7103] Fix crash with SparkContext.union when RDD has no partitioner
Added a check to SparkContext.union to ensure that a partitioner is defined on all RDDs before instantiating a PartitionerAwareUnionRDD.
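A hedged sketch of the guard; the types are simplified stand-ins, not Spark's actual signatures:

```scala
object UnionCheckSketch {
  case class Rdd(partitioner: Option[String]) // stand-in for RDD[T]

  def chooseUnion(rdds: Seq[Rdd]): String = {
    val partitioners = rdds.flatMap(_.partitioner).toSet
    if (rdds.forall(_.partitioner.isDefined) && partitioners.size == 1) {
      "PartitionerAwareUnionRDD" // safe: every input shares one partitioner
    } else {
      "UnionRDD" // at least one input lacks a partitioner
    }
  }
}
```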

Author: Steven She <steven@canopylabs.com>

Closes #5679 from stevencanopy/SPARK-7103 and squashes the following commits:

5a3d846 [Steven She] SPARK-7103: Fix crash with SparkContext.union when at least one RDD has no partitioner
2015-04-27 18:55:02 -04:00
Nishkam Ravi f5473c2bbf [SPARK-6014] [CORE] [HOTFIX] Add try-catch block around ShutDownHook
Add a try/catch block around removeShutdownHook; otherwise an IllegalStateException is thrown in YARN cluster mode (see https://github.com/apache/spark/pull/4690).
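A sketch of the hotfix's shape; the remove callback is a stand-in for Utils.removeShutdownHook:

```scala
object RemoveHookSketch {
  def safelyRemove(remove: () => Boolean): Unit = {
    try {
      remove()
    } catch {
      case e: IllegalStateException =>
        // Expected when the JVM is already shutting down (YARN cluster mode).
        Console.err.println(s"WARN: could not remove shutdown hook: ${e.getMessage}")
    }
  }
}
```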

cc andrewor14, srowen

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #5672 from nishkamravi2/master_nravi and squashes the following commits:

0f1abd0 [nishkamravi2] Update Utils.scala
474e3bf [nishkamravi2] Update DiskBlockManager.scala
97c383e [nishkamravi2] Update Utils.scala
8691e0c [Nishkam Ravi] Add a try/catch block around Utils.removeShutdownHook
2be1e76 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
1c13b79 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
bad4349 [nishkamravi2] Update Main.java
36a6f87 [Nishkam Ravi] Minor changes and bug fixes
b7f4ae7 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
4a45d6a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
458af39 [Nishkam Ravi] Locate the jar using getLocation, obviates the need to pass assembly path as an argument
d9658d6 [Nishkam Ravi] Changes for SPARK-6406
ccdc334 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
3faa7a4 [Nishkam Ravi] Launcher library changes (SPARK-6406)
345206a [Nishkam Ravi] spark-class merge Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ac58975 [Nishkam Ravi] spark-class changes
06bfeb0 [nishkamravi2] Update spark-class
35af990 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
32c3ab3 [nishkamravi2] Update AbstractCommandBuilder.java
4bd4489 [nishkamravi2] Update AbstractCommandBuilder.java
746f35b [Nishkam Ravi] "hadoop" string in the assembly name should not be mandatory (everywhere else in spark we mandate spark-assembly*hadoop*.jar)
bfe96e0 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
ee902fa [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
d453197 [nishkamravi2] Update NewHadoopRDD.scala
6f41a1d [nishkamravi2] Update NewHadoopRDD.scala
0ce2c32 [nishkamravi2] Update HadoopRDD.scala
f7e33c2 [Nishkam Ravi] Merge branch 'master_nravi' of https://github.com/nishkamravi2/spark into master_nravi
ba1eb8b [Nishkam Ravi] Try-catch block around the two occurrences of removeShutDownHook. Deletion of semi-redundant occurrences of expensive operation inShutDown.
71d0e17 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
494d8c0 [nishkamravi2] Update DiskBlockManager.scala
3c5ddba [nishkamravi2] Update DiskBlockManager.scala
f0d12de [Nishkam Ravi] Workaround for IllegalStateException caused by recent changes to BlockManager.stop
79ea8b4 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
b446edc [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2015-04-25 20:02:23 -04:00
Patrick Wendell a61d65fc8b Revert "[SPARK-6752][Streaming] Allow StreamingContext to be recreated from checkpoint and existing SparkContext"
This reverts commit 534f2a4362.
2015-04-25 10:37:34 -07:00
Calvin Jia 438859eb7c [SPARK-6122] [CORE] Upgrade tachyon-client version to 0.6.3
This is a reopening of #4867.
A short summary of the issues resolved from the previous PR:

1. HTTPClient version mismatch: Selenium (used for UI tests) requires version 4.3.x, and Tachyon included 4.2.5 through a transitive dependency of its shaded thrift jar. To address this, Tachyon 0.6.3 will promote the transitive dependencies of the shaded jar so they can be excluded in spark.

2. Jackson-Mapper-ASL version mismatch: In lower versions of hadoop-client (i.e. 1.0.4), version 1.0.1 is included. The parquet library used in Spark SQL requires version 1.8+. It's unclear to me why upgrading tachyon-client would cause this dependency to break. The solution was to exclude jackson-mapper-asl from hadoop-client.

It seems that the dependency management in spark-parent does not apply to transitive dependencies. One way to make sure jackson-mapper-asl is included with the correct version is to add it as a top-level dependency. The best solution would be to exclude the dependency in the modules that require a higher version, but that did not fix the unit tests. Any suggestions on the best way to solve this would be appreciated!

Author: Calvin Jia <jia.calvin@gmail.com>

Closes #5354 from calvinjia/upgrade_tachyon_0.6.3 and squashes the following commits:

0eefe4d [Calvin Jia] Handle httpclient version in maven dependency management. Remove httpclient version setting from profiles.
7c00dfa [Calvin Jia] Set httpclient version to 4.3.2 for selenium. Specify version of httpclient for sql/hive (previously 4.2.5 transitive dependency of libthrift).
9263097 [Calvin Jia] Merge master to test latest changes
dbfc1bd [Calvin Jia] Use Tachyon 0.6.4 for cleaner dependencies.
e2ff80a [Calvin Jia] Exclude the jetty and curator promoted dependencies from tachyon-client.
a3a29da [Calvin Jia] Update tachyon-client exclusions.
0ae6c97 [Calvin Jia] Change tachyon version to 0.6.3
a204df9 [Calvin Jia] Update make distribution tachyon version.
a93c94f [Calvin Jia] Exclude jackson-mapper-asl from hadoop client since it has a lower version than spark's expected version.
a8a923c [Calvin Jia] Exclude httpcomponents from Tachyon
910fabd [Calvin Jia] Update to master
eed9230 [Calvin Jia] Update tachyon version to 0.6.1.
11907b3 [Calvin Jia] Use TachyonURI for tachyon paths instead of strings.
71bf441 [Calvin Jia] Upgrade Tachyon client version to 0.6.0.
2015-04-24 17:57:41 -04:00
Cheolsoo Park 336f7f5373 [SPARK-7037] [CORE] Inconsistent behavior for non-spark config properties in spark-shell and spark-submit
When specifying non-spark properties (i.e. names that don't start with spark.) on the command line or in the config file, spark-submit and spark-shell behave differently, which confuses users.
Here is the summary:
* spark-submit
  * --conf k=v => silently ignored
  * spark-defaults.conf => applied
* spark-shell
  * --conf k=v => show a warning message and ignored
  *  spark-defaults.conf => show a warning message and ignored

I assume that ignoring non-spark properties is intentional. If so, they should always be ignored with a warning message, consistently across all cases.
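A sketch of the unified behavior proposed here; the helper is illustrative, not the patch itself:

```scala
object IgnoreNonSparkPropsSketch {
  // Drop any property whose name does not start with "spark." and warn,
  // whether it came from --conf or from spark-defaults.conf.
  def filterSparkProps(props: Map[String, String]): Map[String, String] =
    props.filter { case (k, _) =>
      val isSpark = k.startsWith("spark.")
      if (!isSpark) {
        Console.err.println(s"Warning: Ignoring non-spark config property: $k")
      }
      isSpark
    }
}
```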

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #5617 from piaozhexiu/SPARK-7037 and squashes the following commits:

8957950 [Cheolsoo Park] Add IgnoreNonSparkProperties method
fedd01c [Cheolsoo Park] Ignore non-spark properties with a warning message in all cases
2015-04-23 20:10:55 -04:00
WangTaoTheTonic baa83a9a67 [SPARK-6879] [HISTORYSERVER] check if app is completed before cleaning it up
https://issues.apache.org/jira/browse/SPARK-6879

Use `applications` to replace `FileStatus`, and check whether the app is completed before cleaning it up.
If an exception is thrown, keep the app in `applications` so it is retried on the next loop.

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5491 from WangTaoTheTonic/SPARK-6879 and squashes the following commits:

4a533eb [WangTaoTheTonic] treat ACE specially
cb45105 [WangTaoTheTonic] rebase
d4d5251 [WangTaoTheTonic] per Marcelo's comments
d7455d8 [WangTaoTheTonic] slightly change when delete file
b0abca5 [WangTaoTheTonic] use global var to store apps to clean
94adfe1 [WangTaoTheTonic] leave expired apps alone to be deleted
9872a9d [WangTaoTheTonic] use the right path
fdef4d6 [WangTaoTheTonic] check if app is completed before clean it up
2015-04-23 17:20:17 -04:00
Josh Rosen 6afde2c781 [SPARK-7058] Include RDD deserialization time in "task deserialization time" metric
The web UI's "task deserialization time" metric is slightly misleading because it does not capture the time taken to deserialize the broadcasted RDD.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5635 from JoshRosen/SPARK-7058 and squashes the following commits:

ed90f75 [Josh Rosen] Update UI tooltip
a3743b4 [Josh Rosen] Update comments.
4f52910 [Josh Rosen] Roll back whitespace change
e9cf9f4 [Josh Rosen] Remove unused variable
9f32e55 [Josh Rosen] Expose executorDeserializeTime on Task instead of pushing runtime calculation into Task.
21f5b47 [Josh Rosen] Don't double-count the broadcast deserialization time in task runtime
1752f0e [Josh Rosen] [SPARK-7058] Incorporate RDD deserialization time in task deserialization time metric
2015-04-23 13:19:03 -07:00
Tathagata Das 534f2a4362 [SPARK-6752][Streaming] Allow StreamingContext to be recreated from checkpoint and existing SparkContext
Currently, if you want to create a StreamingContext from checkpoint information, the system will create a new SparkContext. This prevents a StreamingContext from being recreated from checkpoints in managed environments where the SparkContext is pre-created.

The solution in this PR: Introduce the following methods on StreamingContext
1. `new StreamingContext(checkpointDirectory, sparkContext)`
   Recreate StreamingContext from checkpoint using the provided SparkContext
2. `StreamingContext.getOrCreate(checkpointDirectory, sparkContext, createFunction: SparkContext => StreamingContext)`
   If checkpoint file exists, then recreate StreamingContext using the provided SparkContext (that is, call 1.), else create StreamingContext using the provided createFunction

TODO: the corresponding Java and Python APIs have to be added as well. A usage sketch of the proposed API follows.
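The sketch follows the signatures described above. Note that a later commit in this log reverts this change, so treat it purely as an illustration of the proposed API:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GetOrCreateStreamingSketch {
  // Recreate from the checkpoint if one exists; otherwise build a fresh
  // StreamingContext on the pre-created SparkContext.
  def start(sc: SparkContext, checkpointDir: String): StreamingContext =
    StreamingContext.getOrCreate(checkpointDir, sc, (sc: SparkContext) => {
      val ssc = new StreamingContext(sc, Seconds(1))
      ssc.checkpoint(checkpointDir)
      ssc // used only when no checkpoint exists yet
    })
}
```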

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #5428 from tdas/SPARK-6752 and squashes the following commits:

94db63c [Tathagata Das] Fix long line.
524f519 [Tathagata Das] Many changes based on PR comments.
eabd092 [Tathagata Das] Added Function0, Java API and unit tests for StreamingContext.getOrCreate
36a7823 [Tathagata Das] Minor changes.
204814e [Tathagata Das] Added StreamingContext.getOrCreate with existing SparkContext
2015-04-23 11:29:34 -07:00
Kay Ousterhout 03e85b4a11 [SPARK-7046] Remove InputMetrics from BlockResult
This is a code cleanup.

The BlockResult class originally contained an InputMetrics object so that it could
be used directly as the InputMetrics for the whole task. Now we copy the fields out of it, and
the presence of this object is confusing because it holds only partial input metrics (it doesn't
include the records read). Because this object is no longer useful (and is confusing), it should
be removed.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #5627 from kayousterhout/SPARK-7046 and squashes the following commits:

bf64bbe [Kay Ousterhout] Import fix
a08ca19 [Kay Ousterhout] [SPARK-7046] Remove InputMetrics from BlockResult
2015-04-22 21:42:09 -07:00
zsxwing 33b85620f9 [SPARK-7052][Core] Add ThreadUtils and move thread methods from Utils to ThreadUtils
As per rxin 's suggestion in https://github.com/apache/spark/pull/5392/files#r28757176

What's more, there is a race condition in the globally shared `daemonThreadFactoryBuilder`: it may be modified by multiple threads. This PR removes the global `daemonThreadFactoryBuilder` and creates a new `ThreadFactoryBuilder` on each call.
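A hedged sketch of the race-free shape using Guava's ThreadFactoryBuilder; the helper name is illustrative:

```scala
import java.util.concurrent.ThreadFactory
import com.google.common.util.concurrent.ThreadFactoryBuilder

object ThreadUtilsSketch {
  // A fresh builder per call: no shared mutable state to race on.
  def namedDaemonThreadFactory(prefix: String): ThreadFactory =
    new ThreadFactoryBuilder()
      .setDaemon(true)
      .setNameFormat(prefix + "-%d")
      .build()
}
```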

Author: zsxwing <zsxwing@gmail.com>

Closes #5631 from zsxwing/thread-utils and squashes the following commits:

9fe5b0e [zsxwing] Add ThreadUtils and move thread methods from Utils to ThreadUtils
2015-04-22 11:08:59 -07:00
Patrick Wendell 70f9f8ff38 [MINOR] Comment improvements in ExternalSorter.
1. Clearly specifies the contract/interactions for users of this class.
2. Minor fix in one doc to avoid ambiguity.

Author: Patrick Wendell <patrick@databricks.com>

Closes #5620 from pwendell/cleanup and squashes the following commits:

8d8f44f [Patrick Wendell] [Minor] Comment improvements in ExternalSorter.
2015-04-21 21:04:04 -07:00
zsxwing 3a3f7100f4 [SPARK-6490][Docs] Add docs for rpc configurations
Added docs for rpc configurations and also fixed two places that should have been fixed in #5595.

Author: zsxwing <zsxwing@gmail.com>

Closes #5607 from zsxwing/SPARK-6490-docs and squashes the following commits:

25a6736 [zsxwing] Increase the default timeout to 120s
6e37c30 [zsxwing] Update docs
5577540 [zsxwing] Use spark.network.timeout as the default timeout if it presents
4f07174 [zsxwing] Fix unit tests
1c2cf26 [zsxwing] Add docs for rpc configurations
2015-04-21 18:37:53 -07:00
Marcelo Vanzin e72c16e30d [SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races.
This change adds some new utility code to handle shutdown hooks in
Spark. The main goal is to take advantage of Hadoop 2.x's API for
shutdown hooks, which allows Spark to register a hook that will
run before the one that cleans up HDFS clients, and thus avoids
some races that would cause exceptions to show up and other issues
such as failure to properly close event logs.

Unfortunately, Hadoop 1.x does not have such APIs, so in that case
correctness is still left to chance.
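A hedged sketch of the Hadoop 2.x mechanism this builds on; the priority value is illustrative, and the real code derives it from the FileSystem hook's priority:

```scala
import org.apache.hadoop.util.ShutdownHookManager

object HadoopHookSketch {
  def install(cleanup: () => Unit): Unit = {
    val priority = 50 // illustrative; must be higher than the HDFS client hook
    ShutdownHookManager.get().addShutdownHook(new Runnable {
      override def run(): Unit = cleanup()
    }, priority)
  }
}
```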

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5560 from vanzin/SPARK-6014 and squashes the following commits:

edfafb1 [Marcelo Vanzin] Better scaladoc.
fcaeedd [Marcelo Vanzin] Merge branch 'master' into SPARK-6014
e7039dc [Marcelo Vanzin] [SPARK-6014] [core] Revamp Spark shutdown hooks, fix shutdown races.
2015-04-21 20:33:57 -04:00
mweindel b063a61b98 Avoid warning message about invalid refuse_seconds value in Mesos >=0.21...
Starting with version 0.21.0, Apache Mesos is very noisy if the filter parameter refuse_seconds is set to an invalid value like `-1`.
I have seen systems with millions of log lines like
```
W0420 18:00:48.773059 32352 hierarchical_allocator_process.hpp:589] Using the default value of 'refuse_seconds' to create the refused resources filter because the input value is negative
```
in the Mesos master INFO and WARNING log files.
Therefore the CoarseMesosSchedulerBackend should set the default value for refuse seconds (i.e. 5 seconds) directly.
This is no problem for the fine-grained MesosSchedulerBackend, as it uses the value 1 second for this parameter.
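A sketch of the change's shape using the Mesos protobuf API, assuming the standard generated Filters builder:

```scala
import org.apache.mesos.Protos.Filters

object RefuseSecondsSketch {
  def defaultFilters: Filters =
    Filters.newBuilder()
      .setRefuseSeconds(5.0) // an explicit, valid value instead of -1
      .build()
}
```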

Author: mweindel <m.weindel@usu-software.de>

Closes #5597 from MartinWeindel/master and squashes the following commits:

2f99ffd [mweindel] Avoid warning message about invalid refuse_seconds value in Mesos >=0.21.
2015-04-21 20:19:33 -04:00
Josh Rosen f83c0f112d [SPARK-3386] Share and reuse SerializerInstances in shuffle paths
This patch modifies several shuffle-related code paths to share and re-use SerializerInstances instead of creating new ones.  Some serializers, such as KryoSerializer or SqlSerializer, can be fairly expensive to create or may consume moderate amounts of memory, so it's probably best to avoid unnecessary serializer creation in hot code paths.

The key change in this patch is modifying `getDiskWriter()` / `DiskBlockObjectWriter` to accept `SerializerInstance`s instead of `Serializer`s (which are factories for instances).  This allows the disk writer's creator to decide whether the serializer instance can be shared or re-used.

The rest of the patch modifies several write and read paths to use shared serializers.  One big win is in `ShuffleBlockFetcherIterator`, where we used to create a new serializer per received block.  Similarly, the shuffle write path used to create a new serializer per file even though in many cases only a single thread would be writing to a file at a time.

I made a small serializer reuse optimization in CoarseGrainedExecutorBackend as well, since it seemed like a small and obvious improvement.
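A hedged sketch of the reuse pattern; the interfaces are simplified stand-ins for Spark's Serializer classes:

```scala
object SerializerReuseSketch {
  trait SerializerInstance { def serialize(obj: Any): Array[Byte] }
  trait Serializer { def newInstance(): SerializerInstance }

  def writeBlocks(serializer: Serializer, blocks: Seq[Any]): Seq[Array[Byte]] = {
    val instance = serializer.newInstance() // expensive for Kryo; create once
    blocks.map(instance.serialize)          // reuse across every block
  }
}
```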

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5606 from JoshRosen/SPARK-3386 and squashes the following commits:

f661ce7 [Josh Rosen] Remove thread local; add comment instead
64f8398 [Josh Rosen] Use ThreadLocal for serializer instance in CoarseGrainedExecutorBackend
aeb680e [Josh Rosen] [SPARK-3386] Reuse SerializerInstance in shuffle code paths
2015-04-21 16:24:15 -07:00
Kay Ousterhout c035c0f2d7 [SPARK-5360] [SPARK-6606] Eliminate duplicate objects in serialized CoGroupedRDD
CoGroupPartition, part of CoGroupedRDD, includes references to each RDD that the CoGroupedRDD narrowly depends on, and a reference to the ShuffleHandle. The partition is serialized separately from the RDD, so when the RDD and partition arrive on the worker, the references in the partition and in the RDD no longer point to the same object.

This is a relatively minor performance issue (the closure can be 2x larger than it needs to be because the rdds and partitions are serialized twice; see numbers below) but is more annoying as a developer issue (this is where I ran into it): if any state is stored in the RDD or ShuffleHandle on the worker side, subtle bugs can appear due to the fact that the references to the RDD / ShuffleHandle in the RDD and in the partition point to separate objects. I'm not sure if this is enough of a potential future problem to fix this old and central part of the code, so I'm hoping to get input from others here.

I did some simple experiments to see how much this affects closure size. For this example:
```
val a = sc.parallelize(1 to 10).map((_, 1))
val b = sc.parallelize(1 to 2).map(x => (x, 2*x))
a.cogroup(b).collect()
```
the closure was 1902 bytes with current Spark, and 1129 bytes after my change. The difference comes from eliminating duplicate serialization of the shuffle handle.

For this example:
```
val sortedA = a.sortByKey()
val sortedB = b.sortByKey()
sortedA.cogroup(sortedB).collect()
```
the closure was 3491 bytes with current Spark, and 1333 bytes after my change. Here, the difference comes from eliminating duplicate serialization of the two RDDs for the narrow dependencies.

The ShuffleHandle includes the ShuffleDependency, so this difference will get larger if a ShuffleDependency includes a serializer, a key ordering, or an aggregator (all set to None by default). It would also get bigger for a big RDD -- although I can't think of any examples where the RDD object gets large.  The difference is not affected by the size of the function the user specifies, which (based on my understanding) is typically the source of large task closures.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #4145 from kayousterhout/SPARK-5360 and squashes the following commits:

85156c3 [Kay Ousterhout] Better comment the narrowDeps parameter
cff0209 [Kay Ousterhout] Fixed spelling issue
658e1af [Kay Ousterhout] [SPARK-5360] Eliminate duplicate objects in serialized CoGroupedRDD
2015-04-21 11:01:18 -07:00
zsxwing 8136810dfa [SPARK-6490][Core] Add spark.rpc.* and deprecate spark.akka.*
Deprecated `spark.akka.num.retries`, `spark.akka.retry.wait`, `spark.akka.askTimeout`,  `spark.akka.lookupTimeout`, and added `spark.rpc.num.retries`, `spark.rpc.retry.wait`, `spark.rpc.askTimeout`, `spark.rpc.lookupTimeout`.

Author: zsxwing <zsxwing@gmail.com>

Closes #5595 from zsxwing/SPARK-6490 and squashes the following commits:

e0d80a9 [zsxwing] Use getTimeAsMs and getTimeAsSeconds and other minor fixes
31dbe69 [zsxwing] Add spark.rpc.* and deprecate spark.akka.*
2015-04-20 23:18:42 -07:00
zsxwing d8e1b7b06c [SPARK-6983][Streaming] Update ReceiverTrackerActor to use the new Rpc interface
A subtask of [SPARK-5293](https://issues.apache.org/jira/browse/SPARK-5293)

Author: zsxwing <zsxwing@gmail.com>

Closes #5557 from zsxwing/SPARK-6983 and squashes the following commits:

e777e9f [zsxwing] Update ReceiverTrackerActor to use the new Rpc interface
2015-04-19 20:35:43 -07:00
GuoQiang Li 0424da68d4 [SPARK-6963][CORE]Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint
cc andrewor14

Author: GuoQiang Li <witgo@qq.com>

Closes #5548 from witgo/SPARK-6963 and squashes the following commits:

964aea7 [GuoQiang Li] review commits
b08b3c9 [GuoQiang Li] Flaky test: o.a.s.ContextCleanerSuite automatically cleanup checkpoint
2015-04-19 09:37:09 +01:00
Olivier Girardot 8fbd45c74e SPARK-6993 : Add default min, max methods for JavaDoubleRDD
The default methods will use Guava's Ordering instead of
java.util.Comparator.naturalOrder(), because the latter is not
available in Java 7, only in Java 8.
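A hedged sketch of the approach, using standalone list-based stand-ins rather than the actual JavaDoubleRDD methods:

```scala
import com.google.common.collect.Ordering

object DoubleDefaultsSketch {
  // Guava's Ordering.natural() substitutes for Comparator.naturalOrder(),
  // which only exists from Java 8 onward.
  def min(values: java.util.List[java.lang.Double]): java.lang.Double =
    Ordering.natural[java.lang.Double]().min(values)

  def max(values: java.util.List[java.lang.Double]): java.lang.Double =
    Ordering.natural[java.lang.Double]().max(values)
}
```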

Author: Olivier Girardot <o.girardot@lateral-thoughts.com>

Closes #5571 from ogirardot/master and squashes the following commits:

7fe2e9e [Olivier Girardot] SPARK-6993 : Add default min, max methods for JavaDoubleRDD
2015-04-18 18:21:44 -07:00
Marcelo Vanzin 327ebf0cb5 [core] [minor] Make sure ConnectionManager stops.
My previous fix (forcing a selector wakeup) didn't seem to work, since
I ran into the hang again. So this changes the code a bit to be more
explicit about the condition under which the selector thread should exit.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5566 from vanzin/conn-mgr-hang and squashes the following commits:

ddb2c03 [Marcelo Vanzin] [core] [minor] Make sure ConnectionManager stops.
2015-04-18 10:14:56 +01:00
Marcelo Vanzin 1991337336 [SPARK-5933] [core] Move config deprecation warnings to SparkConf.
I didn't find many deprecated configs after a grep-based search,
but the ones I could find were moved to the centralized location
in SparkConf.

While there, I deprecated a couple more HS configs that mentioned
time units.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5562 from vanzin/SPARK-5933 and squashes the following commits:

dcb617e7 [Marcelo Vanzin] [SPARK-5933] [core] Move config deprecation warnings to SparkConf.
2015-04-17 19:02:07 -07:00
Jongyoul Lee 6fbeb82e13 [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode
- Defined executorCores from "spark.mesos.executor.cores"
- Changed the amount of mesosExecutor's cores to executorCores.
- Added new configuration option on running-on-mesos.md

Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #5063 from jongyoul/SPARK-6350 and squashes the following commits:

9238d6e [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs - Changed configuration name - Made mesosExecutorCores private
2d41241 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
89edb4f [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
8ba7694 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Fixed docs
7549314 [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Fixed docs
4ae7b0c [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Removed TODO
c27efce [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Fixed Mesos*Suite for supporting integer WorkerOffers - Fixed Documentation
1fe4c03 [Jongyoul Lee] [SPARK-6453][Mesos] Some Mesos*Suite have a different package with their classes - Change available resources of cpus to integer value because WorkerOffer supports the amount of cpus as an integer value
5f3767e [Jongyoul Lee] Revert "[SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode"
4b7c69e [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Changed configuration name and description from "spark.mesos.executor.cores" to "spark.executor.frameworkCores"
0556792 [Jongyoul Lee] [SPARK-6350][Mesos] Make mesosExecutorCores configurable in mesos "fine-grained" mode - Defined executorCores from "spark.mesos.executor.cores" - Changed the amount of mesosExecutor's cores to executorCores. - Added new configuration option on running-on-mesos.md
2015-04-17 18:30:55 -07:00
Ilya Ganelin c5ed510135 [SPARK-6703][Core] Provide a way to discover existing SparkContext's
I've added a static getOrCreate method to the SparkContext companion object that allows one either to retrieve a previously created SparkContext or to instantiate a new one with the provided config. The method accepts an optional SparkConf to make usage intuitive.

Still working on a test for this; basically, I want to create a new context from scratch and then ensure that subsequent calls don't overwrite it.
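A short usage sketch of the new method as described, assuming local mode for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GetOrCreateContextSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("demo").setMaster("local[2]")
    val sc1 = SparkContext.getOrCreate(conf)
    val sc2 = SparkContext.getOrCreate() // returns the same active context
    assert(sc1 eq sc2)
    sc1.stop()
  }
}
```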

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #5501 from ilganeli/SPARK-6703 and squashes the following commits:

db9a963 [Ilya Ganelin] Closing second spark context
1dc0444 [Ilya Ganelin] Added ref equality check
8c884fa [Ilya Ganelin] Made getOrCreate synchronized
cb0c6b7 [Ilya Ganelin] Doc updates and code cleanup
270cfe3 [Ilya Ganelin] [SPARK-6703] Documentation fixes
15e8dea [Ilya Ganelin] Updated comments and added MiMa Exclude
0e1567c [Ilya Ganelin] Got rid of unnecessary option for AtomicReference
dfec4da [Ilya Ganelin] Changed activeContext to AtomicReference
733ec9f [Ilya Ganelin] Fixed some bugs in test code
8be2f83 [Ilya Ganelin] Replaced match with if
e92caf7 [Ilya Ganelin] [SPARK-6703] Added test to ensure that getOrCreate both allows creation, retrieval, and a second context if desired
a99032f [Ilya Ganelin] Spacing fix
d7a06b8 [Ilya Ganelin] Updated SparkConf class to add getOrCreate method. Started test suite implementation
2015-04-17 18:28:42 -07:00
cafreeman 59e206deb7 [SPARK-6807] [SparkR] Merge recent SparkR-pkg changes
This PR pulls in recent changes in SparkR-pkg, including

cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.

Author: cafreeman <cfreeman@alteryx.com>
Author: Davies Liu <davies@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Sun Rui <rui.sun@intel.com>

Closes #5436 from davies/R3 and squashes the following commits:

c2b09be [Davies Liu] SQLTypes -> schema
a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
168b7fe [Davies Liu] sort generics
b1fe460 [Davies Liu] fix conflict in README.md
e74c04e [Davies Liu] fix schema.R
4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
41f8184 [Davies Liu] rm man
ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
1bdcb63 [Zongheng Yang] Updates to README.md.
5a553e7 [cafreeman] Use object attribute instead of argument
71372d9 [cafreeman] Update docs and examples
8526d2e71 [cafreeman] Remove `tojson` functions
6ef5f2d [cafreeman] Fix spacing
7741d66 [cafreeman] Rename the SQL DataType function
141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
9387402 [Davies Liu] fix style
40199eb [Shivaram Venkataraman] Move except into sorted position
07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
f3ba785 [cafreeman] Fixed duplicate export
275deb4 [cafreeman] Update `NAMESPACE` and tests
1a3b63d [cafreeman] new version of `CreateDF`
836c4bf [cafreeman] Update `createDataFrame` and `toDF`
be5d5c1 [cafreeman] refactor schema functions
40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
cd66603 [cafreeman] new line at EOF
8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
7dd81b7 [cafreeman] Documentation
0e2a94f [cafreeman] Define functions for schema and fields
2015-04-17 13:42:19 -07:00
linweizhong dc48ba9f9f [SPARK-6604][PySpark] Specify IP of Python server socket
The driver currently starts a server socket bound to a wildcard IP; binding to the loopback address 127.0.0.1 is more reasonable, as the socket is only used by the local Python process.
/cc davies

Author: linweizhong <linweizhong@huawei.com>

Closes #5256 from Sephiroth-Lin/SPARK-6604 and squashes the following commits:

7b3c633 [linweizhong] rephrase
2015-04-17 12:04:02 +01:00
Marcelo Vanzin 4527761bcd [SPARK-6046] [core] Reorganize deprecated config support in SparkConf.
This change tries to follow the chosen way for handling deprecated
configs in SparkConf: all values (old and new) are kept in the conf
object, and newer names take precedence over older ones when
retrieving the value.

Warnings are logged when config options are set, which generally happens
on the driver node (where the logs are most visible).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5514 from vanzin/SPARK-6046 and squashes the following commits:

9371529 [Marcelo Vanzin] Avoid math.
6cf3f11 [Marcelo Vanzin] Review feedback.
2445d48 [Marcelo Vanzin] Fix (and cleanup) update interval initialization.
b6824be [Marcelo Vanzin] Clean up the other deprecated config use also.
ab20351 [Marcelo Vanzin] Update FsHistoryProvider to only retrieve new config key.
2c93209 [Marcelo Vanzin] [SPARK-6046] [core] Reorganize deprecated config support in SparkConf.
2015-04-17 11:06:01 +01:00
Sean Owen f7a25644ed SPARK-6846 [WEBUI] Stage kill URL easy to accidentally trigger and possibility for security issue
kill endpoints now only accept a POST (kill stage, master kill app, master kill driver); kill link now POSTs

Author: Sean Owen <sowen@cloudera.com>

Closes #5528 from srowen/SPARK-6846 and squashes the following commits:

137ac9f [Sean Owen] Oops, fix scalastyle line length problem
7c5f961 [Sean Owen] Add Imran's test of kill link
59f447d [Sean Owen] kill endpoints now only accept a POST (kill stage, master kill app, master kill driver); kill link now POSTs
2015-04-17 11:02:31 +01:00
zsxwing ef3fb801ae [SPARK-6934][Core] Use 'spark.akka.askTimeout' for the ask timeout
Fixed my mistake in #4588

Author: zsxwing <zsxwing@gmail.com>

Closes #5529 from zsxwing/SPARK-6934 and squashes the following commits:

9890b2d [zsxwing] Use 'spark.akka.askTimeout' for the ask timeout
2015-04-16 13:45:55 -05:00
Marcelo Vanzin de4fa6b6d1 [SPARK-4194] [core] Make SparkContext initialization exception-safe.
SparkContext has a very long constructor, where multiple things are
initialized, multiple threads are spawned, and multiple opportunities
for exceptions to be thrown exist. If one of these happens at an
inopportune time, lots of garbage tends to stick around.

This patch re-organizes SparkContext so that its internal state is
initialized in a big "try" block. The fields keeping state are now
completely private to SparkContext, and are "vars", because Scala
doesn't allow you to initialize a val later. The existing API
is kept by turning vals into defs (which works because Scala guarantees
the same binary interface for those).
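A hedged sketch of that construction pattern in miniature:

```scala
import scala.util.control.NonFatal

class SafeInitSketch {
  private var _started = false // vars, since a val cannot be set later

  try {
    // ... initialize state, spawn threads, open files ...
    _started = true
  } catch {
    case NonFatal(e) =>
      stopQuietly() // tear down whatever did get initialized
      throw e
  }

  def started: Boolean = _started // public API stays a def over private state

  private def stopQuietly(): Unit = {
    try { /* release resources */ } catch { case NonFatal(_) => () }
  }
}
```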

On top of that, a few things in other areas were changed to avoid more
things leaking:

- Executor was changed to explicitly wait for the heartbeat thread to
  stop. LocalBackend was changed to wait for the "StopExecutor"
  message to be received, since otherwise there could be a race
  between that message arriving and the actor system being shut down.
- ConnectionManager could possibly hang during shutdown, because an
  interrupt at the wrong moment could cause the selector thread to
  still call select and then wait forever. So also wake up the
  selector so that this situation is avoided.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5335 from vanzin/SPARK-4194 and squashes the following commits:

746b661 [Marcelo Vanzin] Fix borked merge.
80fc00e [Marcelo Vanzin] Merge branch 'master' into SPARK-4194
408dada [Marcelo Vanzin] Merge branch 'master' into SPARK-4194
2621609 [Marcelo Vanzin] Merge branch 'master' into SPARK-4194
6b73fcb [Marcelo Vanzin] Scalastyle.
c671c46 [Marcelo Vanzin] Fix merge.
3979aad [Marcelo Vanzin] Merge branch 'master' into SPARK-4194
8caa8b3 [Marcelo Vanzin] [SPARK-4194] [core] Make SparkContext initialization exception-safe.
071f16e [Marcelo Vanzin] Nits.
27456b9 [Marcelo Vanzin] More exception safety.
a0b0881 [Marcelo Vanzin] Stop alloc manager before scheduler.
5545d83 [Marcelo Vanzin] [SPARK-6650] [core] Stop ExecutorAllocationManager when context stops.
2015-04-16 10:48:31 +01:00
Sean Owen 6179a94837 SPARK-4783 [CORE] System.exit() calls in SparkContext disrupt applications embedding Spark
Avoid `System.exit(1)` in `TaskSchedulerImpl` and convert to `SparkException`; ensure scheduler calls `sc.stop()` even when this exception is thrown.

CC mateiz aarondav as those who may have last touched this code.

Author: Sean Owen <sowen@cloudera.com>

Closes #5492 from srowen/SPARK-4783 and squashes the following commits:

60dc682 [Sean Owen] Avoid System.exit(1) in TaskSchedulerImpl and convert to SparkException; ensure scheduler calls sc.stop() even when this exception is thrown
2015-04-16 10:45:32 +01:00
Isaias Barroso d5f1b9650b [SPARK-2312] Logging Unhandled messages
The previous solution was changed based on the discussion in https://github.com/apache/spark/pull/2048.

Author: Isaias Barroso <isaias.barroso@gmail.com>

Closes #2055 from isaias/SPARK-2312 and squashes the following commits:

f61d9e6 [Isaias Barroso] Change Log level for unhandled message to debug
f341777 [Isaias Barroso] [SPARK-2312] Logging Unhandled messages
2015-04-15 22:40:52 +01:00
Marcelo Vanzin 30a6e0dcc0 [SPARK-5634] [core] Show correct message in HS when no incomplete apps found.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #5515 from vanzin/SPARK-5634 and squashes the following commits:

f74ecf1 [Marcelo Vanzin] [SPARK-5634] [core] Show correct message in HS when no incomplete apps found.
2015-04-14 18:52:48 -07:00
Kousuke Saruta 4d4b249274 [SPARK-6769][YARN][TEST] Usage of the ListenerBus in YarnClusterSuite is wrong
In YarnClusterSuite, a test case uses `SaveExecutorInfo`  to handle ExecutorAddedEvent as follows.

```
private class SaveExecutorInfo extends SparkListener {
  val addedExecutorInfos = mutable.Map[String, ExecutorInfo]()

  override def onExecutorAdded(executor: SparkListenerExecutorAdded) {
    addedExecutorInfos(executor.executorId) = executor.executorInfo
  }
}

...

    listener = new SaveExecutorInfo
    val sc = new SparkContext(new SparkConf()
      .setAppName("yarn \"test app\" 'with quotes' and \\back\\slashes and $dollarSigns"))
    sc.addSparkListener(listener)
    val status = new File(args(0))
    var result = "failure"
    try {
      val data = sc.parallelize(1 to 4, 4).collect().toSet
      assert(sc.listenerBus.waitUntilEmpty(WAIT_TIMEOUT_MILLIS))
      data should be (Set(1, 2, 3, 4))
      result = "success"
    } finally {
      sc.stop()
      Files.write(result, status, UTF_8)
    }
```

But this usage is wrong: executors spawn while the SparkContext is being initialized, and SparkContext#addSparkListener is only invoked after that initialization, thus after the executors have spawned, so SaveExecutorInfo cannot handle the ExecutorAddedEvents.

The following code checks the result of handling ExecutorAddedEvent. For the reason above, the assertion can never pass.

```
    // verify log urls are present
    listener.addedExecutorInfos.values.foreach { info =>
      assert(info.logUrlMap.nonEmpty)
    }
```

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #5417 from sarutak/SPARK-6769 and squashes the following commits:

8adc8ba [Kousuke Saruta] Fixed compile error
e258530 [Kousuke Saruta] Fixed style
591cf3e [Kousuke Saruta] Fixed style
48ec89a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6769
860c965 [Kousuke Saruta] Simplified code
207d325 [Kousuke Saruta] Added findListenersByClass method to ListenerBus
2408c84 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6769
2d7e409 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-6769
3874adf [Kousuke Saruta] Fixed the usage of listener bus in LogUrlsStandaloneSuite
153a91b [Kousuke Saruta] Fixed the usage of listener bus in YarnClusterSuite
2015-04-14 14:01:55 -07:00
CodingCat 8f8dc45f6d SPARK-1706: Allow multiple executors per worker in Standalone mode
Resubmission of https://github.com/apache/spark/pull/636 with a totally different algorithm.

https://issues.apache.org/jira/browse/SPARK-1706

In the current implementation, the user has to start multiple workers on a server in order to run multiple executors on that server, which introduces additional overhead from the extra JVM processes...

In this patch, I changed the scheduling logic in the master to let the user start multiple executor processes under a single worker.

1. The user configures spark.executor.maxCoreNumPerExecutor to suggest the maximum number of cores he/she would like to allocate to each executor.

2. The master assigns executors to workers based mainly on memoryPerExecutor and worker.freeMemory, and tries to allocate as many cores as possible to each executor: ```min(min(memoryPerExecutor, worker.freeCore), maxLeftCoreToAssign)``` where ```maxLeftCoreToAssign = maxExecutorCanAssign * maxCoreNumPerExecutor``` (see the sketch below).
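
A hedged reading of that rule as a self-contained sketch (hypothetical names; the quoted formula mixes a memory quantity with core counts, so this sketch interprets the memory term as a bound on how many executors still fit in the worker's free memory):

```scala
object AllocationSketch {
  // Cores to give the next executor on a worker, capped by (a) the worker's
  // free cores, (b) the cores the application still needs, and (c)
  // maxCoresPerExecutor times the executors that still fit in free memory.
  def coresForNextExecutor(
      memoryPerExecutorMb: Int,
      workerFreeMemoryMb: Int,
      workerFreeCores: Int,
      maxCoresPerExecutor: Int,
      coresLeftToAssign: Int): Int = {
    val executorsThatFit = workerFreeMemoryMb / memoryPerExecutorMb
    val maxLeftCoreToAssign = executorsThatFit * maxCoresPerExecutor
    math.min(math.min(workerFreeCores, coresLeftToAssign), maxLeftCoreToAssign)
  }

  def main(args: Array[String]): Unit = {
    // A 16-core worker with 8 GB free, 2 GB per executor, at most 4 cores each:
    println(coresForNextExecutor(2048, 8192, 16, 4, 10)) // prints 10
  }
}
```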

---------------------------------------

Other small changes include

rename memoryPerSlave in ApplicationDescription to memoryPerExecutor, as "Slave" is overloaded to mean both worker and executor in the docs... (we had some discussion on this before?)

Author: CodingCat <zhunansjtu@gmail.com>

Closes #731 from CodingCat/SPARK-1706-2 and squashes the following commits:

6dee808 [CodingCat] change filter predicate
fbeb7e5 [CodingCat] address the comments
940cb42 [CodingCat] avoid unnecessary allocation
b8ca561 [CodingCat] revert a change
45967b4 [CodingCat] remove unused method
2eeff77 [CodingCat] stylistic fixes
12a1b32 [CodingCat] change the semantic of coresPerExecutor to exact core number
f035423 [CodingCat] stylistic fix
d9c1685 [CodingCat] remove unused var
f595bd6 [CodingCat] recover some unintentional changes
63b3df9 [CodingCat] change the description of the parameter in the submit script
4cf61f1 [CodingCat] improve the code and docs
ff011e2 [CodingCat] start multiple executors on the worker by rewriting startExeuctor logic
2c2bcc5 [CodingCat] fix wrong usage info
497ec2c [CodingCat] address andrew's comments
878402c [CodingCat] change the launching executor code
f64a28d [CodingCat] typo fix
387f4ec [CodingCat] bug fix
35c462c [CodingCat] address Andrew's comments
0b64fea [CodingCat] fix compilation issue
19d3da7 [CodingCat] address the comments
5b81466 [CodingCat] remove outdated comments
ec7d421 [CodingCat] test commit
e5efabb [CodingCat] more java docs and consolidate canUse function
a26096d [CodingCat] stylistic fix
a5d629a [CodingCat] java doc
b34ec0c [CodingCat] make master support multiple executors per worker
2015-04-14 13:32:06 -07:00
GuoQiang Li 25998e4d73 [SPARK-2033] Automatically cleanup checkpoint
Author: GuoQiang Li <witgo@qq.com>

Closes #855 from witgo/cleanup_checkpoint_date and squashes the following commits:

1649850 [GuoQiang Li] review commit
c0087e0 [GuoQiang Li] Automatically cleanup checkpoint
2015-04-14 12:56:47 -07:00
pankaj arora dcf8a9f331 [CORE] SPARK-6880: Fixed null check when all the dependent stages are cancelled due to previous stage failure
Fixed a null check when all the dependent stages are cancelled due to a previous stage failure. This happens when one of the executor nodes goes down and all the dependent stages are cancelled.

Author: pankaj arora <pankaj.arora@guavus.com>

Closes #5494 from pankajarora12/NEWBRANCH and squashes the following commits:

55ba5e3 [pankaj arora] [CORE] SPARK-6880: Fixed null check when all the dependent stages are cancelled due to previous stage failure
4575720 [pankaj arora] [CORE] SPARK-6880: Fixed null check when all the dependent stages are cancelled due to previous stage failure
2015-04-14 12:07:08 -07:00
Timothy Chen 320bca4508 [SPARK-6081] Support fetching http/https uris in driver runner.
Currently, when passed URIs such as http/https, the driver runner is not able to fetch them, as it only calls the Hadoop FS get.
This fix utilizes the existing util method to fetch remote URIs as well (illustrated below).
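
A hedged illustration of the fallback, using a hypothetical helper rather than the actual util method: when the scheme is http or https, stream the bytes over a plain URL connection instead of going through the Hadoop FileSystem.

```scala
import java.io.{File, FileOutputStream}
import java.net.URI

object FetchSketch {
  // Hypothetical helper; real code delegates other schemes to Hadoop FS.
  def fetch(uri: URI, dest: File): Unit = uri.getScheme match {
    case "http" | "https" =>
      val in = uri.toURL.openStream()
      try {
        val out = new FileOutputStream(dest)
        try {
          val buf = new Array[Byte](8192)
          Iterator.continually(in.read(buf)).takeWhile(_ != -1)
            .foreach(n => out.write(buf, 0, n))
        } finally out.close()
      } finally in.close()
    case other =>
      // hdfs://, file://, ... stay on the Hadoop FileSystem code path.
      sys.error(s"scheme $other is handled by the Hadoop FS code path")
  }
}
```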

Author: Timothy Chen <tnachen@gmail.com>

Closes #4832 from tnachen/driver_remote and squashes the following commits:

aa52cd6 [Timothy Chen] Support fetching remote uris in driver runner.
2015-04-14 11:49:04 -07:00
Erik van Oosten 51b306b930 SPARK-6878 [CORE] Fix for sum on empty RDD fails with exception
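
A minimal sketch of the fixed behavior (assumes Spark core on the classpath): `sum` on an empty RDD of doubles returns 0.0, matching `fold` with a zero element, rather than failing the way `reduce` does on an empty collection.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object EmptySumCheck {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local").setAppName("empty-sum"))
    try {
      val empty = sc.parallelize(Seq.empty[Double])
      assert(empty.sum() == 0.0)            // fixed: no exception on empty RDDs
      assert(empty.fold(0.0)(_ + _) == 0.0) // fold with a zero element is safe
      // empty.reduce(_ + _)                // reduce throws on an empty RDD
    } finally sc.stop()
  }
}
```
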
Author: Erik van Oosten <evanoosten@ebay.com>

Closes #5489 from erikvanoosten/master and squashes the following commits:

1c91954 [Erik van Oosten] Rewrote double range matcher to an exact equality assert (SPARK-6878)
f1708c9 [Erik van Oosten] Fix for sum on empty RDD fails with exception (SPARK-6878)
2015-04-14 12:39:56 +01:00
Ilya Ganelin c4ab255e94 [SPARK-5931][CORE] Use consistent naming for time properties
I've added new utility methods to convert times specified as e.g. 120s, 240ms, or 360us into a consistent internal representation. I've updated usage of these constants throughout the code to be consistent.

I believe I've captured all usages of time-based properties throughout the code. I've also updated variable names in a number of places to reflect their units for clarity and updated documentation where appropriate.
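
A hedged, self-contained sketch of the conversion (hypothetical parser; the real helpers live in SparkConf/JavaUtils and handle more cases): map the numeric value plus suffix onto a TimeUnit and normalize to milliseconds.

```scala
import java.util.concurrent.TimeUnit

object TimeStringSketch {
  private val Pattern = """(\d+)\s*(us|ms|s|m|h|d)?""".r

  // "120s" -> 120000, "240ms" -> 240, "360us" -> 0 (truncates toward zero).
  def timeStringAsMs(str: String,
                     defaultUnit: TimeUnit = TimeUnit.MILLISECONDS): Long =
    str.trim match {
      case Pattern(num, suffix) =>
        val unit = Option(suffix).map {
          case "us" => TimeUnit.MICROSECONDS
          case "ms" => TimeUnit.MILLISECONDS
          case "s"  => TimeUnit.SECONDS
          case "m"  => TimeUnit.MINUTES
          case "h"  => TimeUnit.HOURS
          case "d"  => TimeUnit.DAYS
        }.getOrElse(defaultUnit)
        TimeUnit.MILLISECONDS.convert(num.toLong, unit)
      case _ => throw new NumberFormatException(s"invalid time string: $str")
    }

  def main(args: Array[String]): Unit = {
    assert(timeStringAsMs("120s") == 120000L)
    assert(timeStringAsMs("240ms") == 240L)
    assert(timeStringAsMs("360us") == 0L)
  }
}
```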

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Author: Ilya Ganelin <ilganeli@gmail.com>

Closes #5236 from ilganeli/SPARK-5931 and squashes the following commits:

4526c81 [Ilya Ganelin] Update configuration.md
de3bff9 [Ilya Ganelin] Fixing style errors
f5fafcd [Ilya Ganelin] Doc updates
951ca2d [Ilya Ganelin] Made the most recent round of changes
bc04e05 [Ilya Ganelin] Minor fixes and doc updates
25d3f52 [Ilya Ganelin] Minor nit fixes
642a06d [Ilya Ganelin] Fixed logic for invalid suffixes and addid matching test
8927e66 [Ilya Ganelin] Fixed handling of -1
69fedcc [Ilya Ganelin] Added test for zero
dc7bd08 [Ilya Ganelin] Fixed error in exception handling
7d19cdd [Ilya Ganelin] Added fix for possible NPE
6f651a8 [Ilya Ganelin] Now using regexes to simplify code in parseTimeString. Introduces getTimeAsSec and getTimeAsMs methods in SparkConf. Updated documentation
cbd2ca6 [Ilya Ganelin] Formatting error
1a1122c [Ilya Ganelin] Formatting fixes and added m for use as minute formatter
4e48679 [Ilya Ganelin] Fixed priority order and mixed up conversions in a couple spots
d4efd26 [Ilya Ganelin] Added time conversion for yarn.scheduler.heartbeat.interval-ms
cbf41db [Ilya Ganelin] Got rid of thrown exceptions
1465390 [Ilya Ganelin] Nit
28187bf [Ilya Ganelin] Convert straight to seconds
ff40bfe [Ilya Ganelin] Updated tests to fix small bugs
19c31af [Ilya Ganelin] Added cleaner computation of time conversions in tests
6387772 [Ilya Ganelin] Updated suffix handling to handle overlap of units more gracefully
5193d5f [Ilya Ganelin] Resolved merge conflicts
76cfa27 [Ilya Ganelin] [SPARK-5931] Minor nit fixes'
bf779b0 [Ilya Ganelin] Special handling of overlapping usffixes for java
dd0a680 [Ilya Ganelin] Updated scala code to call into java
b2fc965 [Ilya Ganelin] replaced get or default since it's not present in this version of java
39164f9 [Ilya Ganelin] [SPARK-5931] Updated Java conversion to be similar to scala conversion. Updated conversions to clean up code a little using TimeUnit.convert. Added Unit tests
3b126e1 [Ilya Ganelin] Fixed conversion to US from seconds
1858197 [Ilya Ganelin] Fixed bug where all time was being converted to us instead of the appropriate units
bac9edf [Ilya Ganelin] More whitespace
8613631 [Ilya Ganelin] Whitespace
1c0c07c [Ilya Ganelin] Updated Java code to add day, minutes, and hours
647b5ac [Ilya Ganelin] Udpated time conversion to use map iterator instead of if fall through
70ac213 [Ilya Ganelin] Fixed remaining usages to be consistent. Updated Java-side time conversion
68f4e93 [Ilya Ganelin] Updated more files to clean up usage of default time strings
3a12dd8 [Ilya Ganelin] Updated host revceiver
5232a36 [Ilya Ganelin] [SPARK-5931] Changed default behavior of time string conversion.
499bdf0 [Ilya Ganelin] Merge branch 'SPARK-5931' of github.com:ilganeli/spark into SPARK-5931
9e2547c [Ilya Ganelin] Reverting doc changes
8f741e1 [Ilya Ganelin] Update JavaUtils.java
34f87c2 [Ilya Ganelin] Update Utils.scala
9a29d8d [Ilya Ganelin] Fixed misuse of time in streaming context test
42477aa [Ilya Ganelin] Updated configuration doc with note on specifying time properties
cde9bff [Ilya Ganelin] Updated spark.streaming.blockInterval
c6a0095 [Ilya Ganelin] Updated spark.core.connection.auth.wait.timeout
5181597 [Ilya Ganelin] Updated spark.dynamicAllocation.schedulerBacklogTimeout
2fcc91c [Ilya Ganelin] Updated spark.dynamicAllocation.executorIdleTimeout
6d1518e [Ilya Ganelin] Upated spark.speculation.interval
3f1cfc8 [Ilya Ganelin] Updated spark.scheduler.revive.interval
3352d34 [Ilya Ganelin] Updated spark.scheduler.maxRegisteredResourcesWaitingTime
272c215 [Ilya Ganelin] Updated spark.locality.wait
7320c87 [Ilya Ganelin] updated spark.akka.heartbeat.interval
064ebd6 [Ilya Ganelin] Updated usage of spark.cleaner.ttl
21ef3dd [Ilya Ganelin] updated spark.shuffle.sasl.timeout
c9f5cad [Ilya Ganelin] Updated spark.shuffle.io.retryWait
4933fda [Ilya Ganelin] Updated usage of spark.storage.blockManagerSlaveTimeout
7db6d2a [Ilya Ganelin] Updated usage of spark.akka.timeout
404f8c3 [Ilya Ganelin] Updated usage of spark.core.connection.ack.wait.timeout
59bf9e1 [Ilya Ganelin] [SPARK-5931] Updated Utils and JavaUtils classes to add helper methods to handle time strings. Updated time strings in a few places to properly parse time
2015-04-13 16:28:07 -07:00
Cheolsoo Park 6cc5b3ed3c [SPARK-6662][YARN] Allow variable substitution in spark.yarn.historyServer.address
In Spark on YARN, an explicit hostname and port number must be set for "spark.yarn.historyServer.address" in SparkConf to make the HISTORY link work. If the history server address is known and static, this is usually not a problem.

But in the cloud, that is usually not true. In EMR in particular, the history server always runs on the same node as the RM, so I could simply set the property to ${yarn.resourcemanager.hostname}:18080 if variable substitution were allowed.

In fact, Hadoop's Configuration already implements variable substitution, so if this property is read via the YARN configuration, this is easily achievable (see the sketch below).
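
A hedged sketch of the mechanism (illustrative values; requires hadoop-common on the classpath): Hadoop's Configuration expands ${var} references on read, so reading the property through a YARN-aware configuration substitutes the RM hostname.

```scala
import org.apache.hadoop.conf.Configuration

object SubstitutionSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration(false)
    conf.set("yarn.resourcemanager.hostname", "rm.example.com") // illustrative
    conf.set("spark.yarn.historyServer.address",
      "${yarn.resourcemanager.hostname}:18080")
    // Configuration.get performs the variable expansion on read:
    assert(conf.get("spark.yarn.historyServer.address") ==
      "rm.example.com:18080")
  }
}
```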

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #5321 from piaozhexiu/SPARK-6662 and squashes the following commits:

e37de75 [Cheolsoo Park] Preserve the space between the Hadoop and Spark imports
79757c6 [Cheolsoo Park] Incorporate review comments
10e2917 [Cheolsoo Park] Add helper function that substitutes hadoop vars to SparkHadoopUtil
589b52c [Cheolsoo Park] Revert "Allow variable substitution for spark.yarn. properties"
ff9c35d [Cheolsoo Park] Allow variable substitution for spark.yarn. properties
2015-04-13 13:45:10 -05:00
Reynold Xin c5b0b296b8 [SPARK-6765] Enable scalastyle on test code.
Turn scalastyle on for all test code. Most of the violations have been resolved in my previous pull requests:

Core: https://github.com/apache/spark/pull/5484
SQL: https://github.com/apache/spark/pull/5412
MLlib: https://github.com/apache/spark/pull/5411
GraphX: https://github.com/apache/spark/pull/5410
Streaming: https://github.com/apache/spark/pull/5409

Author: Reynold Xin <rxin@databricks.com>

Closes #5486 from rxin/test-style-enable and squashes the following commits:

01683de [Reynold Xin] Fixed new code.
a4ab46e [Reynold Xin] Fixed tests.
20adbc8 [Reynold Xin] Missed one violation.
5e36521 [Reynold Xin] [SPARK-6765] Enable scalastyle on test code.
2015-04-13 09:29:04 -07:00
nyaapa 9d117cee0b [SPARK-6440][CORE] Handle IPv6 addresses properly when constructing URI
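
A small sketch of the underlying issue (hypothetical helper, not the actual Utils change): an IPv6 literal must be bracketed inside a URI authority, otherwise its colons are parsed as the port separator.

```scala
import java.net.URI

object Ipv6UriSketch {
  // Wrap a raw IPv6 literal in brackets before embedding it in a URI.
  def hostToUriAuthority(host: String): String =
    if (host.contains(":") && !host.startsWith("[")) s"[$host]" else host

  def main(args: Array[String]): Unit = {
    val uri = new URI(s"spark://${hostToUriAuthority("2001:db8::1")}:7077")
    assert(uri.getPort == 7077) // parses cleanly; unbracketed it would not
    println(uri)                // spark://[2001:db8::1]:7077
  }
}
```
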
Author: nyaapa <nyaapa@gmail.com>

Closes #5424 from nyaapa/master and squashes the following commits:

6b717aa [nyaapa] [SPARK-6440][CORE] Remove Utils.localIpAddressHostname, Utils.localIpAddressURI and Utils.getAddressHostName; make Utils.localIpAddress private; rename Utils.localHostURI into Utils.localHostNameForURI; use Utils.localHostName in org.apache.spark.streaming.kinesis.KinesisReceiver and org.apache.spark.sql.hive.thriftserver.SparkSQLEnv
2098081 [nyaapa] [SPARK-6440][CORE] style fixes and use getHostAddress instead of getHostName
84763d7 [nyaapa] [SPARK-6440][CORE]Handle IPv6 addresses properly when constructing URI
2015-04-13 12:55:25 +01:00
Reynold Xin a1fe59dae5 [SPARK-6765] Fix test code style for core.
Author: Reynold Xin <rxin@databricks.com>

Closes #5484 from rxin/test-style-core and squashes the following commits:

e0b0100 [Reynold Xin] [SPARK-6765] Fix test code style for core.
2015-04-12 20:50:49 -07:00
lisurprise ddc17431a4 [SPARK-6843][core] Add volatile for the "state"
Fix a potential visibility problem for the "state" of the executor.

The "state" field is shared and modified by multiple threads, i.e.:

```scala
Within ExecutorRunner.scala

(1) workerThread = new Thread("ExecutorRunner for " + fullId) {
  override def run() { fetchAndRunExecutor() }
}
workerThread.start()

(2) the shutdown hook that kills actors on shutdown:
shutdownHook = new Thread() {
  override def run() {
    killProcess(Some("Worker shutting down"))
  }
}

(3) and also the "Actor thread" for the worker.

```
I think we should at least add volatile to ensure visibility among threads; otherwise the worker might send an out-of-date status to the master (sketched below).
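
A minimal sketch of the fix on a hypothetical class (not the real ExecutorRunner): marking the shared field @volatile makes a write from one thread, such as the worker thread or a shutdown hook, visible to the others.

```scala
object StateVisibilitySketch {
  object ExecutorState extends Enumeration { val RUNNING, KILLED = Value }

  class Runner {
    // Without @volatile another thread could legally observe a stale value.
    @volatile var state: ExecutorState.Value = ExecutorState.RUNNING
  }

  def main(args: Array[String]): Unit = {
    val runner = new Runner
    val killer = new Thread(new Runnable {
      def run(): Unit = { runner.state = ExecutorState.KILLED }
    })
    killer.start(); killer.join()
    assert(runner.state == ExecutorState.KILLED) // visible to the main thread
  }
}
```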

https://issues.apache.org/jira/browse/SPARK-6843

Author: lisurprise <zhichao.li@intel.com>

Closes #5448 from zhichao-li/state and squashes the following commits:

a2386e7 [lisurprise] add volatile for state field
2015-04-12 13:41:44 +01:00
Liang-Chi Hsieh 95a07591b3 [Minor][Core] Fix typo
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #5466 from viirya/fix_ShuffleMapTask_typo and squashes the following commits:

2789fd5 [Liang-Chi Hsieh] fix typo.
2015-04-11 13:07:41 +01:00
WangTaoTheTonic 7d92db342e [SPARK-6758]block the right jetty package in log
https://issues.apache.org/jira/browse/SPARK-6758

I am not sure if it is ok to block them in test resources too (as we shade jetty in assembly?).

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #5406 from WangTaoTheTonic/SPARK-6758 and squashes the following commits:

e09605b [WangTaoTheTonic] block the right jetty package
2015-04-09 17:44:08 -04:00
Shivaram Venkataraman 2fe0a1aaee [SPARK-5654] Integrate SparkR
This pull requests integrates SparkR, an R frontend for Spark. The SparkR package contains both RDD and DataFrame APIs in R and is integrated with Spark's submission scripts to work on different cluster managers.

Some integration points that would be great to get feedback on:

1. Build procedure: SparkR requires R to be installed on the machine to be built. Right now we have a new Maven profile `-PsparkR` that can be used to enable SparkR builds

2. YARN cluster mode: The R package that is built needs to be present on the driver and all the worker nodes during execution. The R package location is currently set using SPARK_HOME, but this might not work in YARN cluster mode.

The SparkR package represents the work of many contributors and attached below is a list of people along with areas they worked on

edwardt (edwart) - Documentation improvements
Felix Cheung (felixcheung) - Documentation improvements
Hossein Falaki (falaki)  - Documentation improvements
Chris Freeman (cafreeman) - DataFrame API, Programming Guide
Todd Gao (7c00) - R worker Internals
Ryan Hafen (hafen) - SparkR Internals
Qian Huang (hqzizania) - RDD API
Hao Lin (hlin09) - RDD API, Closure cleaner
Evert Lammerts (evertlammerts) - DataFrame API
Davies Liu (davies) - DataFrame API, R worker internals, Merging with Spark
Yi Lu (lythesia) - RDD API, Worker internals
Matt Massie (massie) - Jenkins build
Harihar Nahak (hnahak87) - SparkR examples
Oscar Olmedo (oscaroboto) - Spark configuration
Antonio Piccolboni (piccolbo) - SparkR examples, Namespace bug fixes
Dan Putler (dputler) - Dataframe API, SparkR Install Guide
Ashutosh Raina (ashutoshraina) - Build improvements
Josh Rosen (joshrosen) - Travis CI build
Sun Rui (sun-rui)- RDD API, JVM Backend, Shuffle improvements
Shivaram Venkataraman (shivaram) - RDD API, JVM Backend, Worker Internals
Zongheng Yang (concretevitamin) - RDD API, Pipelined RDDs, Examples and EC2 guide

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: cafreeman <cfreeman@alteryx.com>
Author: Shivaram Venkataraman <shivaram@eecs.berkeley.edu>
Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>
Author: hlin09 <hlin09pu@gmail.com>
Author: Sun Rui <rui.sun@intel.com>
Author: lythesia <iranaikimi@gmail.com>
Author: oscaroboto <oscarjr@gmail.com>
Author: Antonio Piccolboni <antonio@piccolboni.info>
Author: root <edward>
Author: edwardt <edwardt.tril@gmail.com>
Author: hqzizania <qian.huang@intel.com>
Author: dputler <dan.putler@gmail.com>
Author: Todd Gao <todd.gao.2013@gmail.com>
Author: Chris Freeman <cfreeman@alteryx.com>
Author: Felix Cheung <fcheung@AVVOMAC-119.local>
Author: Hossein <hossein@databricks.com>
Author: Evert Lammerts <evert@apache.org>
Author: Felix Cheung <fcheung@avvomac-119.t-mobile.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Author: Ryan Hafen <rhafen@gmail.com>
Author: Ashutosh Raina <ashutoshraina@users.noreply.github.com>
Author: Oscar Olmedo <oscarjr@gmail.com>
Author: Josh Rosen <rosenville@gmail.com>
Author: Yi Lu <iranaikimi@gmail.com>
Author: Harihar Nahak <hnahak87@users.noreply.github.com>

Closes #5096 from shivaram/R and squashes the following commits:

da64742 [Davies Liu] fix Date serialization
59266d1 [Davies Liu] check exclusive of primary-py-file and primary-r-file
55808e4 [Davies Liu] fix tests
5581c75 [Davies Liu] update author of SparkR
f731b48 [Shivaram Venkataraman] Only run SparkR tests if R is installed
64eda24 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
d7c3f22 [Shivaram Venkataraman] Address code review comments Changes include 1. Adding SparkR docs to API docs generated 2. Style fixes in SparkR scala files 3. Clean up of shell scripts and explanation of install-dev.sh
377151f [Shivaram Venkataraman] Merge remote-tracking branch 'apache/master' into R
eb5da53 [Shivaram Venkataraman] Merge pull request #3 from davies/R2
a18ff5c [Davies Liu] Update sparkR.R
5133f3a [Shivaram Venkataraman] Merge pull request #7 from hqzizania/R3
940b631 [hqzizania] [SPARKR-92] Phase 2: implement sum(rdd)
0e788c0 [Shivaram Venkataraman] Merge pull request #5 from hlin09/doc-fix
3487461 [hlin09] Add tests log in .gitignore.
1d1802e [Shivaram Venkataraman] Merge pull request #4 from felixcheung/r-require
11981b7 [felixcheung] Update R to fail early if SparkR package is missing
c300e08 [Davies Liu] remove duplicated file
b045701 [Davies Liu] Merge branch 'remote_r' into R
19c9368 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into remote_r
f8fa8af [Davies Liu] mute logging when start/stop context
e7104b6 [Davies Liu] remove ::: in SparkR
a1777eb [Davies Liu] move rules into R/.gitignore
e88b649 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
6e20e71 [Davies Liu] address comments
b433817 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R
a1cedad [Shivaram Venkataraman] Merge pull request #228 from felixcheung/doc
e089151 [Davies Liu] Merge pull request #225 from sun-rui/SPARKR-154_2
463e28c [Davies Liu] Merge pull request #2 from shivaram/doc-fixes
bc2d6d8 [Shivaram Venkataraman] Remove arg from sparkR.stop and update docs
d425363 [Shivaram Venkataraman] Some doc fixes for column, generics, group
1f1a7e0 [Shivaram Venkataraman] Some fixes to DataFrame, RDD, SQLContext docs
104ad4e [Shivaram Venkataraman] Check the right env in exists
cf5cd99 [Shivaram Venkataraman] Remove unused numCols argument
85a50ec [Shivaram Venkataraman] Merge pull request #226 from RevolutionAnalytics/master
3eacfc0 [Davies Liu] fix flaky test
733380d [Davies Liu] update R examples (remove master from args)
b21a0da [Davies Liu] Merge pull request #1 from shivaram/log4j-tests
a1493d7 [Shivaram Venkataraman] Address comments
e1f83ab [Shivaram Venkataraman] Send Spark INFO logs to a file in SparkR tests
58276f5 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
52cc92d [Shivaram Venkataraman] Add license to create-docs.sh
6ff5ea2 [Shivaram Venkataraman] Add instructions to generate docs
1f478c5 [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
02b4833 [Shivaram Venkataraman] Add a script to generate R docs (Rd, html) Also fix some issues with our documentation
d6d3729 [Davies Liu] enable spark and pyspark tests
0e5a83f [Davies Liu] fix code style
afd8a77 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
d87a181 [Davies Liu] fix flaky tests
7100fb9 [Shivaram Venkataraman] Fix libPaths in README
bdf3a14 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
05e7375 [Davies Liu] sort generics
b44e371 [Shivaram Venkataraman] Include RStudio instructions in README
855537f [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
9fb6af3 [Davies Liu] mark R classes/objects are private
423ea3c [Shivaram Venkataraman] Ignore unknown jobj in cleanup
974e4ea [Davies Liu] fix flaky test
410ec18 [Davies Liu] fix zipRDD() tests
d8b24fc [Davies Liu] disable spark and python tests temporary
ce3ca62 [Davies Liu] fix license check
7da0049 [Davies Liu] fix build
2892e29 [Davies Liu] support R in YARN cluster
ebd4d07 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
38cbf59 [Davies Liu] fix test of zipRDD()
756ece0 [Shivaram Venkataraman] Update README remove outdated TODO
d436f26 [Davies Liu] add missing files
40d193a [Shivaram Venkataraman] Merge pull request #224 from sun-rui/SPARKR-224-new
1a16cd6 [Davies Liu] rm PROJECT_HOME
56670ef [Davies Liu] rm man page
ba4b80b [Davies Liu] Merge branch 'remote_r' into R
f04080c [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into remote_r
028cbfb [Davies Liu] fix exit code of sparkr unit test
42d8b4c [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
ef26015 [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
a1870e8 [Shivaram Venkataraman] Merge pull request #214 from sun-rui/SPARKR-156_3
cb6e5e3 [Shivaram Venkataraman] Add scripts to start SparkR on windows
8030847 [Shivaram Venkataraman] Set windows file separators, install dirs
05afef0 [Shivaram Venkataraman] Only stop backend JVM if R launched it
95d2de3 [Davies Liu] fix spark-submit with R scripot
baefd9e [Shivaram Venkataraman] Make bin/sparkR use spark-submit As a part of this move the R initialization functions into first.R and first-submit.R
d6f2bdd [Shivaram Venkataraman] Fix run-tests path
ea90fab [Davies Liu] fix spark-submit with R path and sparkR -h
0e2412c [Davies Liu] fix bin/sparkR
9f6aa1f [Davies Liu] Merge branch 'R' of github.com:amplab-extras/spark into R
479e3fe [Davies Liu] change println() to logging
52ca6e5 [Shivaram Venkataraman] Add missing comma
716b16f [Shivaram Venkataraman] Merge branch 'R' of https://github.com/amplab-extras/spark into R
2d235d4 [Shivaram Venkataraman] Build SparkR with Maven profile
aae881b [Davies Liu] fix rat
ff776aa [Shivaram Venkataraman] Fix style
e4f1937 [Shivaram Venkataraman] Remove DFC example
f7b6936 [Davies Liu] remove Spark prefix for class
043959e [Davies Liu] cleanup
ba53b09 [Davies Liu] support R in spark-submit
f403b4a [Davies Liu] rm .travis.yml
c4a5bdf [Davies Liu] run sparkr tests in Spark
e8fc7ca [Davies Liu] fix .gitignore
35e5755 [Davies Liu] reduce size of example data
50bff63 [Davies Liu] add LICENSE header for R sources
facb6e0 [Davies Liu] add .gitignore for .o, .so, .Rd
18e5eed [Davies Liu] update docs
0a0e632 [Davies Liu] move sparkR into bin/
a76472f [Davies Liu] fix path of assembly jar
df3eeea [Davies Liu] move R/examples into examples/src/main/r
3415cc7 [Davies Liu] move Scala source into core/ and sql/
180fc9c [Davies Liu] move scala
014d253 [Davies Liu] delete man pages
49a8133 [Davies Liu] Merge branch 'remote_r' into R
44994c2 [Davies Liu] Moved files to R/
2fc553f [Shivaram Venkataraman] Merge pull request #222 from davies/column2
b043876 [Davies Liu] fix test
5e610cb [Davies Liu] add more API for Column
6f95d49 [Shivaram Venkataraman] Merge pull request #221 from shivaram/sparkr-stop-start
3214c6d [Shivaram Venkataraman] Merge pull request #217 from hlin09/cleanClosureFix
f5d3355 [Shivaram Venkataraman] Merge pull request #218 from davies/merge
70f620c [Davies Liu] address comments
4b1628d [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into merge
3139325 [Shivaram Venkataraman] Merge pull request #212 from davies/toDF
6122e0e [Davies Liu] handle NULL
bc2ff38 [Davies Liu] handle NULL
7f5e70c [Davies Liu] Update SerDe.scala
46454e4 [Davies Liu] address comments
dd52cbc [Shivaram Venkataraman] Merge pull request #220 from shivaram/sparkr-utils-include
662938a [Shivaram Venkataraman] Include utils before SparkR for `head` to work Before this change calling `head` on a DataFrame would not work from the sparkR script as utils would be loaded after SparkR and placed ahead in the search list. This change requires utils to be loaded before SparkR
1bc2998 [Shivaram Venkataraman] Merge pull request #179 from evertlammerts/sparkr-sql
7695d36 [Evert Lammerts] added tests
8190127 [Evert Lammerts] fixed parquetFile signature
d8c8fcc [Shivaram Venkataraman] Merge pull request #219 from shivaram/sparkr-build-final
963c7ee [Davies Liu] Merge branch 'master' into merge
8bff523 [Shivaram Venkataraman] Remove staging repo now that 1.3 is released
e52258f [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into toDF
05b9126 [Shivaram Venkataraman] Merge pull request #215 from davies/agg
8e1497d [Davies Liu] Update DataFrame.R
72adb14 [Davies Liu] Update SQLContext.R
66cc92a [Davies Liu] address commets
55c38bc [Shivaram Venkataraman] Merge pull request #216 from davies/select2
3e0555d [Shivaram Venkataraman] Merge pull request #193 from davies/daemon
0467474 [Davies Liu] add more selecter for DataFrame
9a6be74 [Davies Liu] include grouping columns in agg()
e87bb98 [Davies Liu] improve comment and logging
a6dc435 [Davies Liu] remove dependency of jsonlite
26a3621 [Davies Liu] support date.frame and Date/Time
4e4908a [Davies Liu] createDataFrame from rdd
5757b95 [Shivaram Venkataraman] Merge pull request #196 from davies/die
90f2692 [Shivaram Venkataraman] Merge pull request #211 from hlin09/generics
8583968 [Davies Liu] readFully()
46cea3d [Davies Liu] retry
01aa5ee [Davies Liu] add config for using daemon, refactor
ff948db [hlin09] Remove missingOrInteger.
ecdfda1 [hlin09] Remove duplication.
411b751 [Davies Liu] make RStudio happy
8f8813f [Davies Liu] switch back to use parallel
6bccbbf [hlin09] Move roxygen doc back to implementation.
ffd6e8e [Shivaram Venkataraman] Merge pull request #210 from hlin09/hlin09
471c794 [hlin09] Move getJRDD and broadcast's value to 00-generic.R.
89b886d [hlin09] Move setGeneric() to 00-generics.R.
97dde1a [hlin09] Add a test for access operators.
09ff163 [Shivaram Venkataraman] Merge pull request #204 from cafreeman/sparkr-sql
15a713f [cafreeman] Fix example for `dropTempTable`
dc1291b [hlin09] Add checks for namespace access operators in cleanClosure.
b4c0b2e [Davies Liu] use fork package
3db5649 [cafreeman] Merge branch 'sparkr-sql' of https://github.com/amplab-extras/SparkR-pkg into sparkr-sql
789be97 [Shivaram Venkataraman] Merge pull request #207 from shivaram/err-remove
e60578a [cafreeman] update tests to guarantee row order
5eec6fc [Shivaram Venkataraman] Merge pull request #206 from sun-rui/SPARKR-156_2
3f7aed6 [Sun Rui] Fix minor typos in the function description.
a8cebf0 [Shivaram Venkataraman] Remove print statement in SparkRBackendHandler This print statement is noisy for SQL methods which have multiple APIs (like loadDF). We already have a better error message when no valid methods are found
5e3a576 [Sun Rui] Fix indentation.
f3d99a6 [Sun Rui] [SPARKR-156] phase 2: implement zipWithIndex() of the RDD class.
a582810 [cafreeman] Merge branch 'dfMethods' into sparkr-sql
7a5d6fd [cafreeman] `withColumn` and `withColumnRenamed`
c5fa3b9 [cafreeman] New `select` method
bcb0bf5 [Shivaram Venkataraman] Merge pull request #180 from davies/group
9dd6a5a [Davies Liu] Update SparkRBackendHandler.scala
e6fb8d8 [Davies Liu] improve logging
428a99a [Davies Liu] remove test, catch exception
fef99de [cafreeman] `intersect`, `subtract`, `unionAll`
befbd32 [cafreeman] `insertInto`
9d01bcd [cafreeman] `dropTempTable`
d8c1c09 [Davies Liu] add test to start and stop context multiple times
18c6004 [Shivaram Venkataraman] Merge pull request #201 from sun-rui/SPARKR-156_1
dfb399a [Davies Liu] address comments
f06ccec [Sun Rui] Use mapply() instead of for statement.
3c7674f [Davies Liu] Merge branch 'die' of github.com:davies/SparkR-pkg into die
ac8a852 [Davies Liu] close monitor connection in sparkR.stop()
4d0fb56 [Shivaram Venkataraman] Merge pull request #203 from shivaram/sparkr-hive-fix
62b0760 [Shivaram Venkataraman] Fix test hive context package name
47a613f [Shivaram Venkataraman] Fix HiveContext package name
fb3b139 [Davies Liu] fix tests
d0d4626 [Shivaram Venkataraman] Merge pull request #199 from davies/load
8b7fb67 [Davies Liu] fix HiveContext
bb46832 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
e9e2a03 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group
b875b4f [Davies Liu] fix style
de2abfa [Shivaram Venkataraman] Merge pull request #202 from cafreeman/sparkr-sql
3675fcf [cafreeman] Update `explain` and fixed doc for `toJSON`
5fd9575 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
6fac596 [Davies Liu] support Column expression in agg()
f10a24e [Davies Liu] address comments
ff8b005 [cafreeman] 'saveAsParquetFile`
a5c2887 [cafreeman] fix test
3fab0f8 [cafreeman] `showDF`
779c102 [cafreeman] `isLocal`
68b11cf [cafreeman] `toJSON`
0ac4abc [cafreeman] 'explain`
20242c4 [cafreeman] clean up docs
6a1fe64 [Shivaram Venkataraman] Merge pull request #198 from cafreeman/sparkr-sql
198c130 [Shivaram Venkataraman] Merge pull request #200 from shivaram/sparkr-sql-build
870acd4 [Shivaram Venkataraman] Use rc2 explicitly
8b9a963 [cafreeman] Merge branch 'sparkr-sql' of https://github.com/amplab-extras/SparkR-pkg into sparkr-sql
bc90115 [cafreeman] Fixed docs
3865f39 [Sun Rui] [SPARKR-156] phase 1: implement zipWithUniqueId() of the RDD class.
a37fd80 [Davies Liu] Update sparkR.R
d18f9d3 [Shivaram Venkataraman] Remove SparkR snapshot build We now have 1.3.0 RC2 on Apache Staging
8de958d [Davies Liu] Update SparkRBackend.scala
4e0becc [Shivaram Venkataraman] Merge pull request #194 from davies/api
197a79b [Davies Liu] add HiveContext (commented)
32aa01d [Shivaram Venkataraman] Merge pull request #191 from felixcheung/doc
5073e07 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into load
7918634 [cafreeman] Fix test
acea146 [cafreeman] remove extra line
74269f3 [cafreeman] Merge branch 'dfMethods' into sparkr-sql
cd7ac8a [Shivaram Venkataraman] Merge pull request #197 from cafreeman/sparkr-sql
494a4dd [cafreeman] update export
e14c328 [cafreeman] `selectExpr`
32b37d1 [cafreeman] Fixed indent in `join` test.
2e7b190 [Felix Cheung] small update on yarn deploy mode.
8ff29d6 [Davies Liu] fix tests
12a6db2 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api
294ca4a [cafreeman] `join`, `sort`, and `filter`
4fa6343 [cafreeman] Refactor `join` generic for use with `DataFrame`
3f22c8d [Shivaram Venkataraman] Merge pull request #195 from cafreeman/sparkr-sql
2b6f980 [Davies Liu] shutdown the JVM after R process die
e8639c3 [cafreeman] New 1.3 repo and updates to `column.R`
ed9a89f [Davies Liu] address comments
03bcf20 [Davies Liu] Merge branch 'group' of github.com:davies/SparkR-pkg into group
39c253d [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into group
98cc97a [Davies Liu] fix test and docs
e2d144a [Felix Cheung] Fixed small typos
3beadcf [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into api
06cbc2d [Davies Liu] launch R worker by a daemon
8a676b1 [Shivaram Venkataraman] Merge pull request #188 from davies/column
524c122 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column
f798402 [Davies Liu] Update column.R
1d0f2ae [Davies Liu] Update DataFrame.R
03402eb [Felix Cheung] Updates as per feedback on sparkR-submit
76cf2e0 [Shivaram Venkataraman] Merge pull request #192 from cafreeman/sparkr-sql
1955a09 [cafreeman] return object instead of a list of one object
f585929 [cafreeman] Fix brackets
e998356 [cafreeman] define generic for 'first' in RDD API
71d66a1 [Davies Liu] fix first(0
8ec21af [Davies Liu] fix signature
acae527 [Davies Liu] refactor
d7b17a4 [Davies Liu] fix approxCountDistinct
7dfe27d [Davies Liu] fix cyclic namespace dependency
8caf5bb [Davies Liu] use S4 methods
5c0bb24 [Felix Cheung] Doc updates: build and running on YARN
773baf0 [Zongheng Yang] Merge pull request #178 from davies/random
862f07c [Shivaram Venkataraman] Merge pull request #190 from shivaram/SPARKR-79
b457833 [Shivaram Venkataraman] Merge pull request #189 from shivaram/stdErrFix
f7caeb8 [Davies Liu] Update SparkRBackend.scala
8c4deae [Shivaram Venkataraman] Remove unused function
6e51c7f [Shivaram Venkataraman] Fix stderr redirection on executors
7afa4c9 [Shivaram Venkataraman] Merge pull request #186 from hlin09/funcDep3
4d36ab1 [hlin09] Add tests for broadcast variables.
3f57e56 [hlin09] Fix comments.
7b72487 [hlin09] Fix comments.
ae05bf1 [Davies Liu] Merge branch 'sparkr-sql' of github.com:amplab-extras/SparkR-pkg into column
abb4bb9 [Davies Liu] add Column and expression
eb8ac11 [Shivaram Venkataraman] Set Spark version 1.3.0 in Windows build
5c72e73 [Davies Liu] wait atmost 100 seconds
e425437 [Shivaram Venkataraman] Merge pull request #177 from lythesia/master
a00f502 [lythesia] fix indents
0346e5f [Davies Liu] address comment
6134649 [Shivaram Venkataraman] Merge pull request #187 from cafreeman/sparkr-sql
ad0935e [lythesia] minor fixes
b0e7f73 [cafreeman] Update `sampleDF` test
7b0d070 [lythesia] keep partitions check
889c265 [cafreeman] numToInt utility function
27dd3a0 [lythesia] modify tests for repartition
cad0f0c [cafreeman] Fix docs and indents
2808dcf [cafreeman] Three more DataFrame methods
5ef66fb [Davies Liu] send back the port via temporary file
3b46429 [Davies Liu] Merge branch 'master' of github.com:amplab-extras/SparkR-pkg into random
798f453 [cafreeman] Merge branch 'sparkr-sql' into dev
9aa4acf [Shivaram Venkataraman] Merge pull request #184 from davies/socket
020bce8 [Shivaram Venkataraman] Merge pull request #183 from cafreeman/sparkr-sql
222e06b [cafreeman] Lazy evaluation and formatting changes
e776324 [Davies Liu] fix import
211cc15 [cafreeman] Merge branch 'sparkr-sql' into dev
3351afd [hlin09] Replaces getDependencies with cleanClosure, to serialize UDFs to workers.
e7c56d6 [lythesia] fix random partition key
50c74b1 [Davies Liu] address comments
083c89f [cafreeman] Remove commented lines an unused import
dfa119b [hlin09] Improve the coverage of processClosure.
a41c9b9 [cafreeman] Merge branch 'wrapper' into sparkr-sql
1cd714f [cafreeman] Wrapper function docs.
db0cd9e [cafreeman] Clean up for wrapper functions
818c19f [cafreeman] Update schema-related functions
a57884e [cafreeman] Remove unused import
d72e830 [cafreeman] Add wrapper for `StructField` and `StructType`
2ea2ecf [lythesia] use generic arg
09b9512 [hlin09] add docs
f4f077c [hlin09] Add recursive cleanClosure for function access.
f84ad27 [hlin09] Merge remote-tracking branch 'upstream/master' into funcDep2
5300766 [Shivaram Venkataraman] Merge pull request #185 from hlin09/hlin09
07aa7c0 [hlin09] Unifies the implementation of lapply with lapplyParitionsWithIndex.
f4dbb0b [Davies Liu] use socket in worker
8282c59 [Davies Liu] Update DataFrame.R
ba495a8 [Davies Liu] Update NAMESPACE
36dffb3 [cafreeman] Add 'head` and `first`
534a95f [cafreeman] Schema-related methods
64f488d [cafreeman] Cache and Persist Methods
30d71fd [cafreeman] Standardize method arguments for DataFrame methods
785898b [Shivaram Venkataraman] Merge pull request #182 from cafreeman/sparkr-sql
2619003 [Shivaram Venkataraman] Merge pull request #181 from cafreeman/master
a9bbe0b [cafreeman] Update existing SparkSQL functions
8c241a3 [cafreeman] Merge with master, include changes to method args
68d6de4 [cafreeman] Fix typos
8d2ec6e [Davies Liu] add sum/max/min/avg/mean
774e687 [Davies Liu] add missing API in SQLContext
1e72b4b [Davies Liu] missing API in SQLContext
3294949 [Chris Freeman] Restore `rdd` argument to `getJRDD`
3a58ebc [Davies Liu] rm unrelated file
8bd93b5 [Davies Liu] fix signature
c652b4c [cafreeman] Update method signatures to use generic arg
48c8827 [Davies Liu] update NAMESPACE
84e2d8c [Davies Liu] groupBy and agg()
7c3ddbd [Davies Liu] create jmode in JVM
9465426 [Davies Liu] load and save
982f342 [lythesia] fix numeric issue
7651d84 [lythesia] fix coalesce
4e712e1 [Davies Liu] use random port in backend
041d22b [Shivaram Venkataraman] Merge pull request #172 from cafreeman/sparkr-sql
0d07770 [cafreeman] Added `limit` and updated `take`
301d8e5 [cafreeman] Remove extraneous map functions
0387db2 [cafreeman] Remove colNames
04c4b65 [lythesia] add repartition/coalesce
231deab [cafreeman] Change reserialize to serializeToBytes
acf7e1a [cafreeman] Rework the Scala to R DataFrame Conversion
481ae37 [cafreeman] Updated stale comments and standardized arg names
21d4a97 [hlin09] Adds cleanClosure to capture the function closures.
d24ffb4 [hlin09] Merge remote-tracking branch 'upstream/master' into funcDep2
8be02de [hlin09] Revert "loop 1-12 test pass."
fddb9cc [hlin09] Revert "add docs"
f8ef0ab [hlin09] Revert "More docs"
8e4b3da [hlin09] Revert "More docs"
57e005b [hlin09] Revert "fix tests."
c10148e [Shivaram Venkataraman] Merge pull request #174 from shivaram/sparkr-runner
910e3be [Shivaram Venkataraman] Add a timeout for initialization Also move sparkRBackend.stop into a finally block
bf52b17 [Shivaram Venkataraman] Merge remote-tracking branch 'amplab-sparkr/master' into sparkr-runner
08102b0 [Shivaram Venkataraman] Merge pull request #176 from lythesia/master
9c77b20 [Chris Freeman] Merge pull request #2 from shivaram/sparkr-sql
179ab38 [lythesia] add try counts and increase time interval
71a73b2 [Shivaram Venkataraman] Use a getter for serialization mode This change encapsulates the semantics of serialization mode for RDDs inside a getter function. For PipelinedRDDs if a backing JavaRDD is available we use that else we fall back to a default serialization mode
06bf250 [Shivaram Venkataraman] Merge pull request #173 from shivaram/windows-space-fix
88bf97f [Shivaram Venkataraman] Create SparkContext for R shell launch
f9268d9 [Shivaram Venkataraman] Fix code review comments
e6ad12d [Shivaram Venkataraman] Update comment describing sparkR-submit
17eda4c [Shivaram Venkataraman] Merge pull request #175 from falaki/docfix
ba2b72b [Hossein] Spark 1.1.0 is default
4cd7d3f [lythesia] retry backend connection
749e2d0 [Hossein] Updated README
bc04cf4 [Shivaram Venkataraman] Use SPARKR_BACKEND_PORT in sparkR.R as default Change SparkRRunner to use EXISTING_SPARKR_BACKEND_PORT to differentiate between the two
22a19ac [Shivaram Venkataraman] Use a semaphore to wait for backend to initalize Also pick a random port to avoid collisions
7f1f0f8 [cafreeman] Move comments to fit 100 char line length
8b84e4e [cafreeman] Make if statements more explicit
ce5d5ab [cafreeman] New tests for Union and Object File
b063320 [cafreeman] Changed 'serialized' to 'serializedMode'
0981dff [Zongheng Yang] Merge pull request #168 from sun-rui/SPARKR-153_2
86fc639 [Shivaram Venkataraman] Move sparkR-submit into pkg/inst
fd8f8a9 [Shivaram Venkataraman] Merge branch 'hqzizania-master'
a33dbea [Shivaram Venkataraman] Merge branch 'master' of https://github.com/hqzizania/SparkR-pkg into hqzizania-master
384e6e2 [Shivaram Venkataraman] Merge pull request #171 from hlin09/hlin09
1f5a6ac [hlin09] fixed comments
7f7596a [cafreeman] Additional handling for "row" serialization
8c3b8c5 [cafreeman] Add test for UnionRDD on "row" serialization
b1141f8 [cafreeman] Fixed formatting issues.
5db30bf [cafreeman] Changed serialized from bool to string
2f0c0b8 [cafreeman] Add check for serialized type
d243dfb [cafreeman] Clean up code
5ff63a2 [cafreeman] Change test from boolean to string
77fec1a [cafreeman] Updated .Rd files
9224989 [cafreeman] Various updates for DataFrame to RRDD
26af62b [cafreeman] DataFrame to RRDD
e004481 [cafreeman] Update UnionRDD test
5292be7 [hlin09] Adds support of pipeRDD().
e2a7560 [Shivaram Venkataraman] Merge pull request #170 from cafreeman/sparkr-sql
5d537f4 [cafreeman] Add pairRDD to Description
b6fa88e [cafreeman] Updating to current master
0cda231 [Sun Rui] [SPARKR-153] phase 2: implement aggregateByKey() and foldByKey().
95ee6b4 [Shivaram Venkataraman] Merge remote-tracking branch 'amplab-sparkr/master' into sparkr-runner
67fbc60 [Shivaram Venkataraman] Add support for SparkR shell to use spark-submit This ensures that SparkConf options are read in both in batch and interactive modes
2271030 [Shivaram Venkataraman] Merge pull request #167 from sun-rui/removePartionByInRDD
7fcb46a [Sun Rui] Remove partitionBy() in RDD.
52f94c4 [Shivaram Venkataraman] Merge pull request #160 from lythesia/master
59e2d54 [lythesia] merge with upstream
5836650 [Zongheng Yang] Merge pull request #163 from sun-rui/SPARKR-153_1
141723e [Sun Rui] fix comments.
f73a07e [Shivaram Venkataraman] Merge pull request #165 from shivaram/sparkr-sql-build
10ffc6d [Shivaram Venkataraman] Set Spark version to 1.3 using staging dependency Also fix the maven build
c91ede2 [Shivaram Venkataraman] Merge pull request #164 from hlin09/hlin09
9d335a9 [hlin09] Makes git to ignore Eclipse meta files.
94066bf [Sun Rui] [SPARKR-153] phase 1: implement fold() and aggregate().
9c391c7 [hqzizania] Merge remote-tracking branch 'upstream/master'
5f29551 [hqzizania] 	modified:   pkg/R/RDD.R 	modified:   pkg/R/context.R
d968664 [lythesia] fix comment
7972858 [Shivaram Venkataraman] Merge pull request #159 from sun-rui/SPARKR-150_2
7690878 [lythesia] separate out pair RDD functions
f4573c1 [Sun Rui] Use reduce() instead of sortBy().take() to get the ordered elements.
63e62ed [Sun Rui] [SPARKR-150] phase 2: implement takeOrdered() and top().
050390b [Shivaram Venkataraman] Fix bugs in inferring R file
8398f2e [Shivaram Venkataraman] Add sparkR-submit helper script Also adjust R file path for YARN cluster mode
bd6705b [Zongheng Yang] Merge pull request #154 from sun-rui/SPARKR-150
c7964c9 [Sun Rui] Merge with upstream master.
7feac38 [Sun Rui] Use default arguments for sortBy() and sortKeyBy().
de2bfb3 [Sun Rui] Fix minor comments and add more test cases.
0c6e071 [Zongheng Yang] Merge pull request #157 from lythesia/master
f5038c0 [lythesia] pull out anonymous functions in groupByKey
ba6f044 [lythesia] fixes for reduceByKeyLocally
343b6ab [Oscar Olmedo] Export sparkR.stop Closes #156 from oscaroboto/master
25639cf [Shivaram Venkataraman] Replace tabs with spaces
bb25920 [Shivaram Venkataraman] Merge branch 'dputler-master'
fd836db [hlin09] fix tests.
24a7f13 [hlin09] More docs
a465165 [hlin09] More docs
6ad4fc3 [hlin09] add docs
b082a35 [lythesia] add reduceByKeyLocally
7ca6512 [Shivaram Venkataraman] First cut of SparkRRunner
193f5fe [hlin09] loop 1-12 test pass.
345f1b8 [dputler] [SPARKR-195] Implemented project style guidelines for if-else statements
8043559 [Sun Rui] Add a TODO to use binary search in the range partitioner.
91b2fd6 [Sun Rui] Add more test cases.
e8ebbe4 [Shivaram Venkataraman] Merge pull request #152 from cafreeman/sparkr-sql
0c53d6c [dputler] Data frames now coerced to lists, and messages issued for a data frame or matrix on how they are parallelized
6d57ec0 [cafreeman] Remove json test file since we're using a temp
ac1ef09 [cafreeman] Update registerTempTable test
d9da451 [Sun Rui] [SPARKR-150] phase 1: implement sortBy() and sortByKey().
08ff30b [Shivaram Venkataraman] Merge pull request #153 from hqzizania/master
9767e8e [hqzizania] 	modified:   pkg/man/collect-methods.Rd
5d69f0a [hqzizania] 	modified:   pkg/R/RDD.R
4914091 [hqzizania] 	modified:   pkg/inst/tests/test_rdd.R
742a68b [cafreeman] Update test_sparkRSQL.R
a95823e [hqzizania] 	modified:   pkg/R/RDD.R
2d04526 [cafreeman] Formatting
fae9bdd [cafreeman] Renamed to SQLUtils.scala
39888ea [Chris Freeman] Update test_sparkSQL.R
fce2453 [cafreeman] Updated documentation for SQLContext
13fbf12 [cafreeman] Regenerated .Rd files
51ecf41 [cafreeman] Updated Scala object
30d7337 [cafreeman] Added SparkSQL test
74b3ed6 [cafreeman] Incorporate code feedback
554bda0 [Zongheng Yang] Merge pull request #147 from shivaram/sparkr-ec2-fixes
a5f4f8f [cafreeman] Squashed commit of the following:
f34bb88 [Shivaram Venkataraman] Remove profiling information from this PR
c662f29 [Zongheng Yang] Merge pull request #146 from shivaram/spark-1.2-build
21e9b74 [Zongheng Yang] Merge pull request #145 from lythesia/master
76f6b9e [Shivaram Venkataraman] Merge pull request #149 from hqzizania/master
1c2dbec [lythesia] minor fix for refactoring join code
5b380d3 [hqzizania] 	modified:   pkg/man/combineByKey.Rd 	modified:   pkg/man/groupByKey.Rd 	modified:   pkg/man/partitionBy.Rd 	modified:   pkg/man/reduceByKey.Rd
98794fe [hqzizania] 	modified:   pkg/R/RDD.R
b66534d [Zongheng Yang] Merge pull request #144 from shivaram/fix-rd-files
60da1df [Shivaram Venkataraman] Initialize timing variables
179aa75 [Shivaram Venkataraman] Bunch of fixes for longer running jobs 1. Increase the timeout for socket connection to wait for long jobs 2. Add some profiling information in worker.R 3. Put temp file writes before stdin writes in RRDD.scala
06d99f0 [Shivaram Venkataraman] Fix URI to have right number of slashes
add97f5 [Shivaram Venkataraman] Use URL encode to create valid URIs for jars
4eec962 [lythesia] refactor join functions
73430c6 [Shivaram Venkataraman] Make SparkR work on paths with spaces on Windows
aaf8f47 [Shivaram Venkataraman] Exclude hadoop client from Spark dependency
227ee42 [Zongheng Yang] Merge pull request #141 from shivaram/SPARKR-140
ac5ceb1 [Shivaram Venkataraman] Fix code review comments
32394de [Shivaram Venkataraman] Regenerate Rd files for SparkR This fixes a number of issues in SparkR man pages. The main changes are 1. Don't export or generate docs for PipelineRDD 2. Fix variable names for Filter, count to match base methods 3. Document missing arguments for sparkR.init, print.jobj etc.
e157bf6 [Shivaram Venkataraman] Use prev_serialized to track if JRDD is serialized This changes introduces a new variable in PipelineRDD environment to track if the prev_jrdd is serialized or not.
7428a7e [Zongheng Yang] Merge pull request #143 from shivaram/SPARKR-181
7dd1797 [Shivaram Venkataraman] Address code review comments
8f81c45 [Shivaram Venkataraman] Remove roxygen export for PipelinedRDD
0cb90f1 [Zongheng Yang] Merge pull request #142 from shivaram/SPARKR-169
d1c6e6c [Shivaram Venkataraman] Buffer stderr from R and return it on Exception This change buffers the last 100 lines from R process and passes these lines back to the driver if we have an exception. This will help users debug why their tasks failed on the cluster
d6c1393 [Shivaram Venkataraman] Suppress warnings from normalizePath
a382835 [Shivaram Venkataraman] Fix serialization tracking in pipelined RDDs When creating a pipeline RDD, we need to check if the JavaRDD belonging to the parent is serialized.
da39529 [Zongheng Yang] Merge pull request #140 from sun-rui/SPARKR-183
2814caa [Sun Rui] Merge with upstream master.
cd2a5b3 [Sun Rui] Add reference to Nagle's algorithm and clean code.
52356b6 [Shivaram Venkataraman] Merge pull request #139 from shivaram/fix-backend-exit
97e5a1f [Sun Rui] [SPARKR-183] Fix the issue that parallelize collect tests are slow.
a9f8e8e [Shivaram Venkataraman] Merge pull request #138 from concretevitamin/fix-collect-test
125ae43 [Shivaram Venkataraman] Fix SparkR backend to exit in more cases This change has two fixes 1. When the workspace is saved (from R or RStudio) the backend connection seems to be closed before the finalizer is run. In such cases we reopen the connection and stop the backend 2. With RStudio when R is restarted, there are port-conflicts which appear due to a race condition between the JVM and rsession restart. This change adds a 1 sec sleep to avoid this race.
12c102a [Zongheng Yang] Simplify a unit test.
9c0637a [Zongheng Yang] Merge pull request #137 from shivaram/fix-docs
0df0e18 [Shivaram Venkataraman] Fix documentation for includePackage
7549f88 [Zongheng Yang] Merge pull request #136 from shivaram/man-updates
7edbe46 [Shivaram Venkataraman] Add missing man pages
9cb9567 [Shivaram Venkataraman] Merge pull request #131 from shivaram/rJavaExpt
1fa722e [Shivaram Venkataraman] Rename to SerDe now
2fcb051 [Shivaram Venkataraman] Rename to SerDeJVMR
d112cf0 [Shivaram Venkataraman] Style fixes
9fd01cc [Shivaram Venkataraman] Remove unnecessary braces
0881931 [Shivaram Venkataraman] Some more style fixes
f00b531 [Shivaram Venkataraman] Address code review comments. Big changes include style fixes throughout for named arguments
c09ba05 [Shivaram Venkataraman] Change jobj id to be just an integer Add a new print.jobj that gets the class name and prints it Also add a utility function isInstanceOf
be05b16 [Shivaram Venkataraman] Check if context, connection exist before stopping
d596a23 [Shivaram Venkataraman] Address code review comments
396e7ac [Shivaram Venkataraman] Changes to make new backend work on Windows This change uses file.path to construct the Java binary path in a OS agnostic way and uses system2 to handle quoting binary paths correctly. Tests pass on Mac OSX and a Windows EC2 instance.
e7a4e03 [Shivaram Venkataraman] Remove unused file BACKEND.md
62f380b [Shivaram Venkataraman] Update worker.R to use new deserialization call
8b9c4e6 [Shivaram Venkataraman] Change RDD name, setName to use new backend
6dcd5c5 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into rJavaExpt
0873397 [Shivaram Venkataraman] Refactor java object tracking into a new singleton. Also add comments describing each class
95db964 [Shivaram Venkataraman] Add comments, cleanup new R code
bcd4258 [Zongheng Yang] Merge pull request #130 from lythesia/master
74dbc5e [Sun Rui] Match method using parameter types.
7ad4a4d [Sun Rui] Use 1 char to represent types on the backend->client direction.
bace887 [Sun Rui] Use an integer count for the backend java object ID because Uniqueness isn't guaranteed by System.identityHashCode().
b38d04f [Sun Rui] Use 1 char to represent types on the client -> backend direction.
f88bc68 [lythesia] Merge branch 'master' of github.com:lythesia/SparkR-pkg
71d41f5 [lythesia] add test case for fullOuterJoin
eb4f423 [lythesia] --amend
cffecc5 [lythesia] add test case for fullOuterJoin
a547dd2 [Shivaram Venkataraman] Move classTag, rddRef into newJObject call This avoids them getting eagerly garbage collected
1255391 [Shivaram Venkataraman] Add a finalizer for jobj objects This enables Java objects to be garbage collected on the backend when they are no longer referenced in R. Also rename newJava to newJObject to be more consistent with callJMethod
70fa409 [Sun Rui] Add YARN Conf Dir to the class path when launching the backend.
a1108ca [lythesia] add fullOuterJoin in RDD.R
2152727 [Shivaram Venkataraman] Remove empty file
cd08bee [Shivaram Venkataraman] Update all functions to use new backend All unit tests pass.
9de49b7 [Shivaram Venkataraman] Add high level calls for methods, constructors Also update BACKEND.md
5a97ea4 [Shivaram Venkataraman] Add jobj S3 class that holds backend refs
e071d3e [Shivaram Venkataraman] Change SparkRBackend to use general method calls This change uses a custom protocl + JNI to invoke any method on a given object type. Also update serializers, deserializers to make code more concise
49f0404 [Shivaram Venkataraman] Merge pull request #129 from lythesia/master
7f8cd82 [lythesia] update man
4715ed2 [Yi Lu] Update RDD.R
5a53801 [lythesia] fix name,setName
4f3870b [lythesia] add name,setName in RDD.R
1c25700 [Shivaram Venkataraman] Merge pull request #128 from sun-rui/SPARKR-165
c8507d8 [Sun Rui] [SPARKR-165] IS_SCALAR is not present in R before 3.1
2cff2bd [Sun Rui] Add function to invoke Java method.
7a31da1 [Shivaram Venkataraman] Merge branch 'dputler-master'. Closes #119
0ceba82 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/dputler/SparkR-pkg into dputler-master
735f70c [Shivaram Venkataraman] Merge pull request #125 from 7c00/rawcon
fccfe6c [Shivaram Venkataraman] Merge pull request #127 from sun-rui/SPARKR-164
387bd57 [Sun Rui] [SPARKR-164] Temporary files used by SparkR accumulat as time goes on.
5f2268f [Shivaram Venkataraman] Add support to stop backend
5f745c0 [Shivaram Venkataraman] Update notes in backend
22015c1 [Shivaram Venkataraman] Add first cut of SparkR Backend
52821da [Todd Gao] switch the order of packages and function deps
d7b0007 [Todd Gao] remove memCompress
cb6873e [Shivaram Venkataraman] Merge pull request #126 from sun-rui/SPARKR-147
c5962eb [Todd Gao] further optimize using rawConnection
f04c6e0 [Sun Rui] [SPARKR-147] Support multiple directories as input to textFile.
b7de604 [Todd Gao] optimize execFunctionDeps loading in worker.R
4d4fc30 [Shivaram Venkataraman] Merge pull request #122 from cafreeman/master
b508877 [cafreeman] Update SparkR_IDE_Setup.sh
21ed9d7 [cafreeman] Update build.sbt
f73ec16 [cafreeman] Delete SparkR_IDE_Setup_Guide.md
d63b026 [cafreeman] Delete SparkR_Quick_Start_Guide.md
6e6cb62 [cafreeman] Update SparkR_IDE_Setup.sh
bc6042b [cafreeman] Update build.sbt
a8197d5 [cafreeman] Merge remote-tracking branch 'upstream/master'
d671564 [Zongheng Yang] Merge pull request #123 from shivaram/jcheck-void
76b8d00 [Zongheng Yang] Merge pull request #124 from shivaram/master
b690d58 [Shivaram Venkataraman] Specify how to change Spark versions in README
0fb003d [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into jcheck-void
1c227b4 [Shivaram Venkataraman] Also add a check in context.R
96812b6 [Shivaram Venkataraman] Check for exceptions after void method calls
f5c216d [cafreeman] Merge remote-tracking branch 'upstream/master'
90c8933 [Zongheng Yang] Merge pull request #121 from shivaram/fix-sort-order
bd0e3b4 [Shivaram Venkataraman] Fix saveAsTextFile test case
2e55f67 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into fix-sort-order
f10c607 [Shivaram Venkataraman] Merge pull request #118 from sun-rui/saveAsTextFile
6c9bfc0 [Sun Rui] Merge remote-tracking branch 'SparkR_upstream/master' into saveAsTextFile
6faedbe [cafreeman] Update SparkR_IDE_Setup_Guide.md
57008bc [cafreeman] Update SparkR_IDE_Setup.sh
bb1c17d [cafreeman] Update SparkR_IDE_Setup.sh
538bfdb [cafreeman] Update SparkR_Quick_Start_Guide.md
31322c6 [cafreeman] Update SparkR_IDE_Setup.sh
ca3f593 [Sun Rui] Refactor RRDD code.
df58d95 [cafreeman] Update SparkR_Quick_Start_Guide.md
b488c88 [cafreeman] Rename Spark_IDE_Setup.sh to SparkR_IDE_Setup.sh
b2545a4 [cafreeman] Added IDE Setup Guide
0ffb5de [cafreeman] Merge branch 'master' of https://github.com/cafreeman/SparkR-pkg
bd8fbfb [cafreeman] Merge remote-tracking branch 'upstream/master'
98efa5b [cafreeman] Added Quick Start Guide
3cf88f2 [Shivaram Venkataraman] Sort lists before comparing in unit tests Since Spark doesn't guarantee that shuffle results will always be in the same order, we need to sort the results before comparing for deterministic behavior
d621dbc [Shivaram Venkataraman] Merge pull request #120 from sun-rui/objectFile
c4a44d7 [Sun Rui] Add @seealso in comments and extract some common code into a function.
724e3a4 [cafreeman] Update Spark_IDE_Setup.sh
8153e5a [Sun Rui] [SPARKR-146] Support read/save object files in SparkR.
17f9909 [cafreeman] Update Spark_IDE_Setup.sh
a9eb080 [cafreeman] IDE Shell Script
64d800c [dputler] Merge remote branch 'upstream/master'
1fbdb2e [dputler] Added the ability for the user to specify a text file location through tilde expansion, or just the file name if it is in the working directory.
d83c017 [Shivaram Venkataraman] Merge pull request #113 from sun-rui/stringHashCodeInC
a7d9cdb [Sun Rui] Fix build on Windows.
7d81b05 [Shivaram Venkataraman] Merge pull request #114 from hlin09/hlin09
47c4bb7 [hlin09] fix reviews
a457f7f [Shivaram Venkataraman] Merge pull request #116 from dputler/master
0fa48d1 [Shivaram Venkataraman] Merge pull request #117 from sun-rui/keyBy
85cfeb4 [Sun Rui] [SPARKR-144] Implement saveAsTextFile() in the RDD class.
09083d9 [Sun Rui] Add keyBy() to the RDD class.
caad5d7 [dputler] Adding the script to install software on the Cloudera Quick Start VM.
dca3d05 [hlin09] Minor fix.
ece5f7d [hlin09] Merge remote-tracking branch 'upstream/master' into hlin09
a40874b [hlin09] Use extendible accumulators to aggregate the cogroup values.
d0347ce [Zongheng Yang] Merge pull request #112 from sun-rui/outer_join
492f76e [Sun Rui] Refine code and add description.
ba01358 [Shivaram Venkataraman] Merge pull request #115 from sun-rui/SPARKR-130
5c8e46e [Sun Rui] Fix per the review comments.
7190a2c [Sun Rui] Update comment to add a reference to storage levels.
1da705e [hlin09] Fix the review comments.
c4b77be [Sun Rui] [SPARKR-130] Add persist(storageLevel) API to RDD.
b424a1a [hlin09] Add function cogroup().
9770312 [Shivaram Venkataraman] Merge pull request #111 from hlin09/hlin09
cead7df [hlin09] fix review comments.
54f712e [Sun Rui] Implement string hash code in C.
425f0c6 [Sun Rui] Add leftOuterJoin() and rightOuterJoin() to the RDD class.
39509c7 [hlin09] add Rd file for foreach and foreachPartition.
63d6ac7 [hlin09] Adds function foreach() and foreachPartition().
9c954df [Zongheng Yang] Merge pull request #105 from sun-rui/join
c71228d [Sun Rui] Pre-allocate list with fixed length. Add test case for join() using string key.
bc3e9f6 [Shivaram Venkataraman] Merge pull request #108 from concretevitamin/take-optimize
c06fc90 [Zongheng Yang] Fix: only optimize for unserialized dataset case.
d399aeb [Zongheng Yang] Apply size-capping on logical representation instead of physical.
e4217dd [Zongheng Yang] Merge pull request #107 from shivaram/master
7952180 [Shivaram Venkataraman] Copy, use getLocalDirs from Spark Utils.scala
08e24c3 [Zongheng Yang] Merge pull request #109 from hlin09/hlin09
97d4e02 [Zongheng Yang] Min() upper-bound size with actual size.
bb779bf [hlin09] Rename the filter function to filterRDD for API consistency. Filter() is also kept.
ce1661f [Zongheng Yang] Fix slow take(): deserialize only up to necessary # of elements.
4dca9b1 [Shivaram Venkataraman] Merge pull request #106 from hlin09/hlin09
1220d92 [hlin09] Adds function numPartitions().
2326a65 [Shivaram Venkataraman] Use SPARK_LOCAL_DIRS to create tmp files
e119757 [hlin09] Minor fix.
9c24c8b [hlin09] Adds function countByKey().
48fce67 [hlin09] Adds countByValue().
6679eef [Sun Rui] Update documentation for join().
70586b4 [Sun Rui] Add join() to the RDD class.
e6fb999 [Zongheng Yang] Merge pull request #103 from shivaram/rlibdir-fix
a21f146 [Shivaram Venkataraman] Merge pull request #102 from hlin09/hlin09
32eb619 [Shivaram Venkataraman] Merge pull request #104 from sun-rui/add_keys_values
d8692e9 [Sun Rui] Add keys() and values() for the RDD class.
18b9be1 [Shivaram Venkataraman] Allow users to set where SparkR is installed This also adds a warning if somebody tries to call sparkR.init multiple times.
a17f135 [hlin09] Adds tests for flatMap and flatMapValues.
4bcf59b [hlin09] Adds function flatMapValues.
4a193ef [Zongheng Yang] Merge pull request #101 from ashutoshraina/master
60d22f2 [Ashutosh Raina] changed sbt version
5400793 [Zongheng Yang] Merge pull request #98 from shivaram/windows-fixes-build
36d61a7 [Shivaram Venkataraman] Merge pull request #97 from hlin09/hlin09
f7d7d89 [hlin09] Remove redundant code in test.
6bbe823 [hlin09] minor style fix.
9b47f3a [Shivaram Venkataraman] Merge pull request #100 from hnahak87/patch-1
7f6e4ea [Harihar Nahak] Update logistic_regression.R
a605047 [Shivaram Venkataraman] Merge pull request #99 from hlin09/makefile
323151d [hlin09] Fix yarn flag in Makefile to remove build error in Maven.
8911897 [hlin09] Make reserialize() private function in package.
79aee73 [Shivaram Venkataraman] Add notes on how to build SparkR on windows
49a99e7 [Shivaram Venkataraman] Clean up some commented code
ddc271b [Shivaram Venkataraman] Only append file:/// to non empty jar paths
a53952e [Shivaram Venkataraman] Add windows build scripts
325b179 [hlin09] Merge remote-tracking branch 'upstream/master' into hlin09
daf5040 [hlin09] Add reserialize() before union if two RDDs are not both serialized.
536afb1 [hlin09] Add new function of union().
7044677 [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into windows-fixes
d22a02d [Zongheng Yang] Merge pull request #94 from shivaram/windows-fixes-stdin
51924f7 [Shivaram Venkataraman] Merge pull request #90 from oscaroboto/master
eb97d85 [Shivaram Venkataraman] Merge pull request #96 from sun-rui/add_clarification_readme
5a128f4 [Sun Rui] Add clarification on setting Spark master when launching the SparkR shell.
187526a [oscaroboto] Update sparkR.R
32c567b [Shivaram Venkataraman] Merge pull request #95 from concretevitamin/master
4cd2d5e [Zongheng Yang] Notes about spark-ec2.
1c28e3b [Shivaram Venkataraman] Merge branch 'master' of https://github.com/amplab-extras/SparkR-pkg into windows-fixes
8e8a029 [Zongheng Yang] Merge pull request #92 from shivaram/sparkr-yarn
721043b [Zongheng Yang] Update README.md with YARN instructions.
1681f58 [Shivaram Venkataraman] Use temporary files for input instead of stdin This fixes a bug for Windows where stdin would get truncated
b084314 [oscaroboto] removed ... from example
44c93d4 [oscaroboto] Added example to SparkR.R
be82dcc [Shivaram Venkataraman] Merge pull request #93 from hlin09/hlin09
868554d [oscaroboto] Update sparkR.R
488ac47 [hlin09] Add generated Rd file of previous added functions, distinct() and mapValues().
b2740ad [hlin09] Add test for filter all elements. Add filter() as alias.
08d3631 [hlin09] Minor style fixes.
2c0e34f [hlin09] Adds function Filter(), which extracts the elements that satisfy a predicate.
5951d3b [Shivaram Venkataraman] Remove SBT plugin
4e70ced [oscaroboto] changed ExecutorEnv to sparkExecutorEnvMap, to make it consistent with sparkEnvirMap
903d18a [oscaroboto] changed executorEnv to sparkExecutorEnvMap, will do the same in R
f97346e [oscaroboto] executorEnv to lower-case e
88a524e [oscaroboto] Added LD_LIBRARY_PATH to the ExecutorEnv. This is needed so that the nodes can find libjvm.so, e.g. if the master has a different LD_LIBRARY_PATH than the nodes. Make sure to export an LD_LIBRARY_PATH that includes the path to libjvm.so on the nodes.
1d208ae [oscaroboto] added the YARN_CONF_DIR to the classpath
8a9b75c [oscaroboto] forgot to change hm and ee inside the for loops
579db58 [Shivaram Venkataraman] Merge pull request #91 from sun-rui/add_max_min
4381efa [Sun Rui] use reduce() to implement max() and min().
a5459c5 [Shivaram Venkataraman] Consolidate yarn flags
86b04eb [Shivaram Venkataraman] Don't use quotes around yarn
bf0797f [Shivaram Venkataraman] Add dependency on spark yarn module
af5fe77 [Shivaram Venkataraman] Fix SBT build, add dependency tree plugin
4917607 [Sun Rui] Add maximum() and minimum() API to RDD.
51bbbe4 [Shivaram Venkataraman] Changes to make SparkR work with YARN
9d5e3ab [oscaroboto] a few stylistic changes. Also change vars to sparkEnvirMap and eevars to ExecutorEnv, to match sparkR.R
578f545 [oscaroboto] a few stylistic changes
39eea2f [oscaroboto] Modification to dynamically create a sparkContext with YARN. Added .setExecutorEnv to the sparkConf in createSparkContext within the RRDD object. This modification was made together with sparkR.R
17ec42e [oscaroboto] A modification to dynamically create a sparkContext with YARN. sparkR.R modified to pass custom Jar file names and ExecutorEnv to the sparkConf. RRDD.scala was also modified to accept the new inputs to createSparkContext.
624ac9d [Shivaram Venkataraman] Merge pull request #87 from sun-rui/SPARKR-125
4f213db [Shivaram Venkataraman] Merge pull request #89 from sun-rui/SPARKR-108
eb833c5 [Shivaram Venkataraman] Merge pull request #88 from hlin09/hlin09
07bf971 [Sun Rui] [SPARKR-108] Implement map-side reduction for reduceByKey().
4accba1 [hlin09] Fixes style and adds an optional param 'numPartition' in distinct().
80d303a [hlin09] typo fixed.
e37a9b5 [hlin09] Adds function distinct() and mapValues().
08dac06 [Sun Rui] [SPARKR-125] Get the iterator of the parent RDD before launching a R worker process in compute() of RRDD/PairwiseRRDD
c4ba53c [Shivaram Venkataraman] Merge pull request #85 from edwardt/master
72a9d27 [root] reorder to keep relative ordering the same
f3fcb10 [root] fix up build.sbt also to match pom.xml
5ecbe3e [root] Make spark version configurable in build script per ISSUE122
a44e63d [Shivaram Venkataraman] Merge pull request #84 from sun-rui/SPARKR-94
fbb5663 [Sun Rui] Add {} to one-line functions and add a test case for lookup where no match is found.
95beb4e [Shivaram Venkataraman] Merge pull request #82 from edwardt/master
36776c5 [edwardt] missed one 0.9.0 revert
b26deec [Sun Rui] [SPARKR-94] Add a  method to get an element of a pair RDD object by key.
1ba256e [edwardt] Keep 0.9.0 and says uses 1.1.0 by default
5380c43 [root] missed one version
21f74da [root] upgrade to spark version 1.1.0 to match latest merge list
ddfcde9 [root] merge
67d067a [Shivaram Venkataraman] Merge pull request #81 from sun-rui/SparkR-117
993868f [Sun Rui] [SPARKR-117] Update Spark dependency to 1.1.0
d20661a [Zongheng Yang] Merge pull request #80 from sun-rui/master
0b2da9f [Sun Rui] Update Rd file and add a test case for mapPartitions.
5879648 [Sun Rui] Add mapPartitions() method to RDD for API consistency.
c033461 [Shivaram Venkataraman] Merge pull request #79 from sun-rui/fix-kmeans
f62b77e [Sun Rui] Adjust coding style.
b40911d [Sun Rui] Fix syntax error in examples/kmeans.R.
5304451 [Shivaram Venkataraman] Merge pull request #78 from sun-rui/master
70ffbfb [Sun Rui] Fix a bug that modifications to build.sbt won't trigger rebuilding.
a25696c [Shivaram Venkataraman] Merge pull request #76 from edwardt/addjira
b8bbd93 [edwardt] Update README.md
615d930 [edwardt] Update README.md
e522e69 [edwardt] Update README.md
03e6ced [edwardt] Update README.md
3007015 [root] don't check in gedit buffer file
c35c9a6 [root] Add where to enter bugs and feedback
469eae3 [edwardt] Update README.md
61b4a43 [edwardt] Update Makefile (style uniformity)
ce3337d [edwardt] Update README.md
7ff68fc [root] Merge branch 'master' of https://github.com/edwardt/SparkR-pkg
16353f5 [root] add links to devtools and install_github
513b9e5 [Shivaram Venkataraman] Merge pull request #72 from edwardt/master
31608a4 [edwardt] Update Makefile (style uniformity)
4ffe146 [root] Makefile: factor out SPARKR_VERSION to reduce potential copy&paste error; cp & rm called with -f in build/clean phase; .gitignore includes checkpoints and unit test log generated by run-tests.sh
715275f [Zongheng Yang] Merge pull request #68 from shivaram/master
90e2083 [Shivaram Venkataraman] Add return type to hasNext
8eb983d [Shivaram Venkataraman] Fix up comment
2206164 [Shivaram Venkataraman] Delete temporary files after they are read This change deletes temporary files used for communication between Rscript and the JVM once they have been completely read.
5881da7 [Zongheng Yang] Merge pull request #67 from shivaram/improve-shuffle
81251e2 [Shivaram Venkataraman] Address code review comments
a5f573f [Shivaram Venkataraman] Use a better list append in shuffles This is helpful in scenarios where we have a large number of values in a bucket
388e64d [Shivaram Venkataraman] Merge pull request #55 from RevolutionAnalytics/master
e1f95b6 [Zongheng Yang] Merge pull request #65 from concretevitamin/parallelize-fix
fc1a71a [Zongheng Yang] Fix that collect(parallelize(sc,1:72,15)) drops elements.
b8204c5 [Zongheng Yang] Minor: update a URL in README.
86f30c3 [Antonio Piccolboni] better fix for amplab-extras/SparkR-pkg#53
b3c318d [Antonio Piccolboni] delayed loading to have all namespaces available.
f323e97 [Antonio Piccolboni] tentative fix for amplab-extras/SparkR-pkg#53
6f82269 [Zongheng Yang] Merge pull request #48 from shivaram/master
8f433e5 [Shivaram Venkataraman] Move up Hadoop in pom.xml and add back protobufs As Hadoop 1.0.4 doesn't use protobufs, we can't always exclude protobufs from Spark. This change tries to order the dependencies so that the shader picks up Hadoop's protobufs over Mesos's.
bfe7e26 [Shivaram Venkataraman] Merge pull request #36 from RevolutionAnalytics/vectorize-examples
059ae41 [Antonio Piccolboni] and more formatting
9dbd531 [Antonio Piccolboni] more formatting per committer request
948738a [Antonio Piccolboni] converted tabs to spaces per project request
49f5f5a [Shivaram Venkataraman] Merge pull request #35 from shivaram/master
3eb5ad3 [Shivaram Venkataraman] on_failure -> after_failure in travis.yml
139bdee [Shivaram Venkataraman] Cache sbt, maven, ivy dependencies
4ebced2 [Shivaram Venkataraman] Merge pull request #34 from shivaram/master
8437061 [Shivaram Venkataraman] Exclude protobuf from Spark dependency in Maven This avoids pulling in multiple versions of protobuf from Mesos and Hadoop.
91aa527 [Antonio Piccolboni] vectorized version: 36s for 10 slices, 10^6 per slice. The older version takes 30 sec on 1/10th of the data.
f137a57 [Antonio Piccolboni] for rstudio users
1f7ffb0 [Antonio Piccolboni] implemented using matrices and vectorized calls wherever possible
46b23df [Antonio Piccolboni] replace require with library
b15d7db [Antonio Piccolboni] faster parsing
8b7aeb3 [Antonio Piccolboni] 22x speed improvement, 3x mem improvement
c5bce07 [Zongheng Yang] Merge pull request #30 from shivaram/string-tests
21fa2d8 [Shivaram Venkataraman] Fix bug where serialized was not changed for RRDD Reason: When an RRDD is created in getJRDD we have converted any possibly unserialized RDD to a serialized RDD.
9d1ea20 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg into string-tests
7b9348c [Shivaram Venkataraman] Add tests for partition with string keys Add two tests, one with a string array and one from a textFile, to test both codepaths
aacd726 [Shivaram Venkataraman] Update README with maven proxy instructions
803e62c [Shivaram Venkataraman] Merge pull request #28 from concretevitamin/master
7c093e6 [Zongheng Yang] Use inherits() to test an object's class.
061c591 [Shivaram Venkataraman] Merge pull request #26 from hafen/master
90f9fda [Ryan Hafen] Fix isRdd() to properly check for class
5b10cc7 [Zongheng Yang] Merge pull request #24 from shivaram/master
7014f83 [Shivaram Venkataraman] Remove unused transformers in maven's pom.xml
b00cea5 [Shivaram Venkataraman] Add support for a Maven build
11ec9b2 [Shivaram Venkataraman] Merge pull request #12 from concretevitamin/pipelined
6b18a90 [Zongheng Yang] Merge branch 'master' into pipelined
57127b8 [Zongheng Yang] Merge pull request #23 from shivaram/master
1ac3940 [Zongheng Yang] Review feedback.
a06fb34 [Zongheng Yang] Remove outdated comment.
0a1fc13 [Shivaram Venkataraman] Fixes for using SparkR with Hadoop2. 1. Exclude ASM, Netty from Hadoop similar to Spark. 2. Concat services files to ensure HDFS filesystems work. 3. Update README with an example
9a1db44 [Zongheng Yang] Merge pull request #22 from shivaram/master
e462448 [Shivaram Venkataraman] Use `$` for calling `put` instead of .jrcall
ed4559a [Shivaram Venkataraman] Add support for passing Spark environment vars This change creates a new `createSparkContext` method in RRDD as we can't pass Map<String, String> through rJava. Also use SPARK_MEM in local mode to increase heap size and update the README with some examples.
10228fb [Shivaram Venkataraman] Merge pull request #20 from concretevitamin/digit-ex
1398d9f [Zongheng Yang] Add linear_solver_mnist to examples/.
d484c2a [Zongheng Yang] Add tests for actions on PipelinedRDD.
d9cb95c [Zongheng Yang] Add setCheckpointDir() to context.R; comment fix.
f8bc8a9 [Zongheng Yang] Minor edits per Shivaram's comments.
8cd67f7 [Shivaram Venkataraman] Merge pull request #15 from shivaram/master
d4468a9 [Shivaram Venkataraman] Remove trailing comma
e2714b8 [Shivaram Venkataraman] Remove Apache Staging repo and update README
334eace [Zongheng Yang] Add a multi-transformation test to benchmark on pipelining.
5650ad7 [Zongheng Yang] Put serialized field inside env for both RDD and PipelinedRDD.
0b9e8bb [Zongheng Yang] First cut at PipelinedRDD.
a4c431e [Zongheng Yang] Add `isCheckpointed` field and checkpoint().
dac0795 [Zongheng Yang] Minor inline comment style fix.
bfb8e26 [Zongheng Yang] Add isCached field (inside an env) and unpersist().
295bff6 [Zongheng Yang] Merge pull request #11 from shivaram/master
4cb209c [Shivaram Venkataraman] Search rLibDir in worker before libPaths This ensures we pick up the intended SparkR and not an older version installed on the same machine
ef198ff [Zongheng Yang] Merge pull request #10 from shivaram/unit-tests
e0557a8 [Shivaram Venkataraman] Update travis to install plyr
8b18bc1 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg into unit-tests
4a9ca31 [Shivaram Venkataraman] Use smaller broadcast and plyr instead of Matrix The Matrix package takes around 2s to load and slows down unit tests.
21c6a61 [Zongheng Yang] Merge pull request #8 from shivaram/master
08c2947 [Shivaram Venkataraman] Move dev install directory to front of libPaths
bda42ee [Shivaram Venkataraman] Merge pull request #7 from JoshRosen/travis
cc5f5c0 [Josh Rosen] Add Travis CI integration (using craigcitro/r-travis)
b6c864b [Shivaram Venkataraman] Merge pull request #6 from concretevitamin/env-style-fix
4fcef22 [Zongheng Yang] Use one style ($) for accessing names in environments.
8a948c6 [Shivaram Venkataraman] Merge pull request #4 from shivaram/master
24978eb [Shivaram Venkataraman] Update README to use install_github
8899db4 [Shivaram Venkataraman] Update TODO.md
91792de [Shivaram Venkataraman] Update Spark requirements
f34f4bf [Shivaram Venkataraman] Check tests for failures and output error msg
cd750d3 [Shivaram Venkataraman] Update run-tests to use new path
1877b7c [Shivaram Venkataraman] Unset R_TESTS to make tests work with R CMD check Also silence Akka remoting logs and update Makefile to build on log4j changes
e60e18a [Shivaram Venkataraman] Update README to remove Spark installation notes
4450189 [Shivaram Venkataraman] Add Spark 0.9 dependency from Apache Staging Also clean up assembly jar from inst on make clean
5eb2131 [Shivaram Venkataraman] Update repo path in README
ec8210e [Shivaram Venkataraman] Remove broadcastId hack as it is public in Spark
9f0e080 [Shivaram Venkataraman] Merge branch 'install-github'
5c88fbd [Shivaram Venkataraman] Add helper script to run tests
77450a1 [Shivaram Venkataraman] Remove dependency on Spark Logging
6cb00d1 [Shivaram Venkataraman] Update README and add helper script install-dev.sh
28346ca [Shivaram Venkataraman] Only normalize if SPARK_HOME is not empty
0fd6571 [Shivaram Venkataraman] Normalize SPARK_HOME before passing it
ff96d5c [Shivaram Venkataraman] Pass in SPARK_HOME and jar file path
34c4dce [Shivaram Venkataraman] Move src into pkg and update Makefile This enables the package to be installed with install_github from devtools and automates the build procedure.
b25afed [Shivaram Venkataraman] Change package name to edu.berkeley.cs.amplab
c691464 [Shivaram Venkataraman] Add Apache 2.0 License file
27a4a4b [Shivaram Venkataraman] Add notes on how to compile roxygen2 docs
ca63844 [Shivaram Venkataraman] Add broadcast documentation Also generate documentation for sample, takeSample etc.
e4dd976 [Shivaram Venkataraman] Update TODO.md
e42d435 [Shivaram Venkataraman] Add support for broadcast variables
6b638e7 [Shivaram Venkataraman] Add the assembly jar to SparkContext
bf24e32 [Shivaram Venkataraman] Merge branch 'master' of github.com:amplab/SparkR-pkg
43c05ce [Zongheng Yang] Fix a flaky/incorrect test for sampleRDD().
c6a9dfc [Zongheng Yang] Initial port of the kmeans example.
6885581 [Zongheng Yang] Implement element-level sampleRDD() and takeSample() with tests.
d3a4987 [Zongheng Yang] Add a test for lapplyPartitionsWithIndex on pairwise RDD.
c7899c1 [Zongheng Yang] Add lapplyPartitionsWithIndex, with a test and an alias function.
a9a7436 [Shivaram Venkataraman] Add DFC example from Tselil, Benjamin and Jonah
fbc5a95 [Zongheng Yang] Implement take() and takeSample().
c4a3409 [Shivaram Venkataraman] Use RDD instead of RRDD
dfad3f5 [Zongheng Yang] Add test_utils.R: a unit test for convertJListToRList().
a45227d [Zongheng Yang] Update .gitignore.
238fe6e [Zongheng Yang] Add a unit test for textFile().
a88898b [Zongheng Yang] Rename test_rrd to test_rrdd
10c8baa [Shivaram Venkataraman] Make SparkR work as a standalone package. Changes include: 1. Adding a new `sbt` project that builds RRDD.scala 2. Change the onLoad functions to load the assembly jar for SparkR 3. Set rLibDir in RRDD.scala and worker.R to load things correctly
78adcd8 [Shivaram Venkataraman] Add a gitignore
ca6108f [Shivaram Venkataraman] Merge branch 'SparkR-scalacode' of ../SparkR
999bd61 [Shivaram Venkataraman] Update collectPartition in R and use ClassTag
c58f63e [Shivaram Venkataraman] Update collectPartition in R and use ClassTag
48265fd [Shivaram Venkataraman] Use new version of collectPartitions in take
d4fe086 [Shivaram Venkataraman] Move collectPartitions to JavaRDDLike Also remove numPartitions in JavaRDD and update R code
bfecd7b [Shivaram Venkataraman] Scala 2.10 changes 1. Update sparkR script 2. Use classTag instead of classManifest
092a4b3 [Shivaram Venkataraman] Add combineByKey, update TODO
ac0d81d [Shivaram Venkataraman] Add more documentation
d1dc3fa [Shivaram Venkataraman] Add more documentation
c515e3a [Shivaram Venkataraman] Update TODO
db56a34 [Shivaram Venkataraman] Add a test case for include package
41cea51 [Shivaram Venkataraman] Ensure all parent environments are serialized. Also add a test case with an inline function
a978e84 [Shivaram Venkataraman] Add support to include packages in the worker
12bf8ce [Shivaram Venkataraman] Add support to include packages in the worker
fb7e72c [Shivaram Venkataraman] Cleanup TODO
16ac314 [Shivaram Venkataraman] Add documentation for functions in context, sparkR
85b1d25 [Shivaram Venkataraman] Set license to Apache
88f1101 [Shivaram Venkataraman] Add unit test running instructions
c40768e [Shivaram Venkataraman] Update TODO
0c7efbf [Shivaram Venkataraman] Refactor RRDD.scala and add comments to functions
5880d42 [Shivaram Venkataraman] Refactor RRDD.scala and add comments to functions
2dee36c [Shivaram Venkataraman] Remove empty test file
a82219b [Shivaram Venkataraman] Update TODOs
5db00dc [Shivaram Venkataraman] Add reduceByKey, groupByKey and refactor shuffle Other changes include 1. Adding unit tests for basic RDD functions and shuffle 2. Add a word count example 3. Change the dependency serialization to handle double loading of SparkR package 4. Allow partitionBy to operate on any RDDs to create pair-wise RDD.
f196479 [Shivaram Venkataraman] Add reduceByKey, groupByKey and refactor shuffle Other changes include 1. Adding unit tests for basic RDD functions and shuffle 2. Add a word count example 3. Change the dependency serialization to handle double loading of SparkR package 4. Allow partitionBy to operate on any RDDs to create pair-wise RDD.
987e36f [Shivaram Venkataraman] Add perf todo
0b03265 [Shivaram Venkataraman] Update TODO with testing, docs todo
685aaad [Zongheng Yang] First cut at refactoring worker.R. Remove pairwiseWorker.R.
95b9ddc [Zongheng Yang] First cut at refactoring worker.R. Remove pairwiseWorker.R.
4f00895 [Zongheng Yang] Remove the unnecessary `pairwise' flag in RRDD class. Reasons:
75d36d9 [Zongheng Yang] Working versions: partitionBy() and collectPartition() for RRDD.
e3fbd9d [Zongheng Yang] Working versions: partitionBy() and collectPartition() for RRDD.
67a4335 [Zongheng Yang] Add unit test for parallelize() and collect() pairwise data.
100ae65 [Zongheng Yang] Properly parallelize() and collect() pairwise data.
cd0a5e2 [Zongheng Yang] Properly parallelize() and collect() pairwise data.
aea16c3 [Zongheng Yang] WIP: second cut at partitionBy. Running into R/Scala communication issues.
45eb943 [Zongheng Yang] WIP: second cut at partitionBy. Running into R/Scala communication issues.
11c893b [Zongheng Yang] WIP: need to figure out the logic of (whether or not) shipping a hash func
82c201a [Zongheng Yang] WIP: need to figure out the logic of (whether or not) shipping a hash func
b3bfad2 [Zongheng Yang] Update TODO: take() done.
0e45293 [Zongheng Yang] Add ability to parallelize key-val collections in R.
f60406a [Zongheng Yang] Add ability to parallelize key-val collections in R.
7d7fe3b [Zongheng Yang] Re-implement take(): take a partition at a time and append.
a054e55 [Zongheng Yang] Fix take() tests: mode difference.
9de0935 [Zongheng Yang] Implement take() for RRDD.
1e4427e [Zongheng Yang] Implement take() for RRDD.
ec3cd67 [Shivaram Venkataraman] Use temp file in Spark to pipe output
417aaed [Shivaram Venkataraman] Use temp file in Spark to pipe output
bb0a3c3 [Shivaram Venkataraman] Add conf directory to classpath
9594d8a [Shivaram Venkataraman] Clean up LR example
3b26b58 [Shivaram Venkataraman] Add a list of things to do.
cabce68 [Shivaram Venkataraman] Fix warnings from package check
fde3f9c [Shivaram Venkataraman] Flatten by default and disable recursive unlist
ab2e061 [Shivaram Venkataraman] Create LIB_DIR before installing SparkR package
555220a [Shivaram Venkataraman] Add readme and update Makefile
1319cda [Shivaram Venkataraman] Make standalone programs run with sparkR
ae19fa8 [Shivaram Venkataraman] Add support for cache and use `tempfile`
4e89ca4 [Shivaram Venkataraman] Add support for apply, reduce, count Also serialize closures using `save` and add two examples
25a0bea [Shivaram Venkataraman] Add support for apply, reduce, count Also serialize closures using `save` and add two examples
f50223f [Zongheng Yang] Make parallelize() and collect() use lists. Add a few more tests for them.
fc7693f [Zongheng Yang] Refactor and enhance the previously added unit test a little bit.
6de9b81 [Zongheng Yang] Add a simple unit test for parallelize().
8b95155 [Zongheng Yang] Add testthat skeleton infrastructure
ef305bf [Zongheng Yang] parallelize() followed by collect() now work for vectors/lists of strings and numerics (should work for other primitives as well).
dc16af4 [Zongheng Yang] Comment: toArray() allocates memory for a copy
f50121e [Zongheng Yang] Make parallelize() return JavaRDD[Array[Byte]]. Add RRDD.scala with a helper function in the singleton object.
46eb063 [Zongheng Yang] Make parallelize() return JavaRDD[Array[Byte]]. Add RRDD.scala with a helper function in the singleton object.
6b4938a [Zongheng Yang] parallelize(): a raw can be parallelized by JavaSparkContext and get back JavaRDD
978aa0f [Zongheng Yang] Add parallelize() skeleton: only return serialized slices now
84c1fd2 [Zongheng Yang] Use .jsimplify() to get around generic List's get() type erasure problem
f16b891 [Zongheng Yang] Convert a few reflection calls to .jcall
1284c13 [Zongheng Yang] WIP on collect(): JavaListToRList() failed with errors.
4c2e516 [Zongheng Yang] Add simple prototype of S4 class RRDD. Make TextFile() returns an RRDD.
82aa17a [Zongheng Yang] Add textFile()
83ce63f [Zongheng Yang] Create a JavaSparkContext and save it in .sparkEnv using sparkR.init()
01cdf0e [Zongheng Yang] Add Makefile for SparkR
fc9cae2 [Shivaram Venkataraman] Add skeleton R package
2015-04-08 22:45:40 -07:00
Kay Ousterhout 9d44ddce1d [SPARK-6753] Clone SparkConf in ShuffleSuite tests
Prior to this change, the unit test for SPARK-3426 did not clone the
original SparkConf, which meant that the test did not use the options
set by suites that subclass ShuffleSuite.scala. This commit fixes that
problem.
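
A minimal sketch of the pattern the fix adopts, assuming a ScalaTest suite that exposes a shared `conf: SparkConf` field and an `sc` var (as Spark's test suites do); the config key here is illustrative:

    import org.apache.spark.SparkContext

    test("uses options set by the subclass suite") {
      // clone() rather than `new SparkConf`, so settings a subclass applied
      // to the shared conf are not silently dropped by this test.
      val testConf = conf.clone().set("spark.illustrative.option", "true")
      sc = new SparkContext("local", "test", testConf)
      // ... exercise the shuffle behavior under testConf ...
    }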

JoshRosen would be great if you could take a look at this, since you wrote this
test originally.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #5401 from kayousterhout/SPARK-6753 and squashes the following commits:

368c540 [Kay Ousterhout] [SPARK-6753] Clone SparkConf in ShuffleSuite tests
2015-04-08 10:26:45 -07:00
Kay Ousterhout 8d2a36c0fd [SPARK-6754] Remove unnecessary TaskContextHelper
The TaskContextHelper was originally necessary because TaskContext was written in Java, which does
not have a way to specify that classes are package-private, so TaskContextHelper existed to work
around this. Now that TaskContext has been re-written in Scala, this class is no longer necessary.
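
For context, a minimal sketch of the Scala visibility that makes the helper redundant (method name illustrative):

    package org.apache.spark

    abstract class TaskContext {
      // Visible only within the org.apache.spark package. Java has no
      // equivalent modifier, so while TaskContext was Java code a separate
      // Scala helper had to forward such calls on its behalf.
      private[spark] def markInterrupted(): Unit
    }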

rxin can you look at this? It looks like you missed this bit of cleanup when you moved TaskContext from Java to Scala in #4324

cc ScrapCodes and pwendell who added this originally.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #5402 from kayousterhout/SPARK-6754 and squashes the following commits:

f089800 [Kay Ousterhout] [SPARK-6754] Remove unnecessary TaskContextHelper
2015-04-07 22:40:42 -07:00
Josh Rosen c83e03948b [SPARK-6737] Fix memory leak in OutputCommitCoordinator
This patch fixes a memory leak in the DAGScheduler, which caused us to leak a map entry per submitted stage.  The problem is that the OutputCommitCoordinator needs to be informed when stages end in order to remove entries from its `authorizedCommitters` map, but the DAGScheduler only called it in one of the four code paths that are used to mark stages as completed.

This patch fixes this issue by consolidating the processing of stage completion into a new `markStageAsFinished` method and updates DAGSchedulerSuite's `assertDataStructuresEmpty` assertion to also check the OutputCommitCoordinator data structures.  I've also added a comment at the top of DAGScheduler so that we remember to update this test when adding new data structures.
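
A self-contained sketch of the shape of the fix, with illustrative names rather than Spark's actual classes: every completion path funnels through one method that also notifies the coordinator, so its per-stage entry is always freed:

    import scala.collection.mutable

    class Coordinator {
      private val authorizedCommitters = mutable.Map[Int, String]()
      def stageStart(stageId: Int): Unit = { authorizedCommitters(stageId) = "" }
      // Skipping this call on any completion path is exactly the leak:
      // the map would keep one entry per submitted stage forever.
      def stageEnd(stageId: Int): Unit = { authorizedCommitters -= stageId }
    }

    class Scheduler(coordinator: Coordinator) {
      // Single choke point invoked by every code path that ends a stage.
      def markStageAsFinished(stageId: Int): Unit = coordinator.stageEnd(stageId)
    }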

Author: Josh Rosen <joshrosen@databricks.com>

Closes #5397 from JoshRosen/SPARK-6737 and squashes the following commits:

af3b02f [Josh Rosen] Consolidate stage completion handling code in a single method.
e96ce3a [Josh Rosen] Consolidate stage completion handling code in a single method.
3052aea [Josh Rosen] Comment update
7896899 [Josh Rosen] Fix SPARK-6737 by informing OutputCommitCoordinator of all stage end events.
4ead1dc [Josh Rosen] Add regression tests for SPARK-6737
2015-04-07 16:18:55 -07:00
Xiangrui Meng e6f08fb42f Revert "[SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path"
This reverts commit 596ba77c5f.
2015-04-07 14:34:15 -07:00
Masayoshi TSUZUKI 596ba77c5f [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path
escape spaces in the arguments.
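
An illustrative invocation of the reported failure (the path is hypothetical): before this fix, a jar path containing a space was mishandled as the Windows launch scripts forwarded their arguments, even when quoted:

    spark-shell.cmd --jars "C:\path with spaces\mylib.jar"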

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #5347 from tsudukim/feature/SPARK-6568 and squashes the following commits:

9180aaf [Masayoshi TSUZUKI] [SPARK-6568] spark-shell.cmd --jars option does not accept the jar that has space in its path
2015-04-07 14:29:53 -07:00
sksamuel 2c32bef179 Replace use of .size with .length for Arrays
Invoking .size on arrays is valid, but requires an implicit conversion to SeqLike. This incurs a compile time overhead and more importantly a runtime overhead, as the Array must be wrapped before the method can be invoked. For example, the difference in generated byte code is:

  public int withSize();
    Code:
       0: getstatic     #23                 // Field scala/Predef$.MODULE$:Lscala/Predef$;
       3: aload_0
       4: invokevirtual #25                 // Method array:()[I
       7: invokevirtual #29                 // Method scala/Predef$.intArrayOps:([I)Lscala/collection/mutable/ArrayOps;
      10: invokeinterface #34,  1           // InterfaceMethod scala/collection/mutable/ArrayOps.size:()I
      15: ireturn

  public int withLength();
    Code:
       0: aload_0
       1: invokevirtual #25                 // Method array:()[I
       4: arraylength
       5: ireturn
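
The Scala source behind those two bodies looks roughly like this (illustrative class, not from the patch):

    class Sizes {
      def array: Array[Int] = Array(1, 2, 3)  // stands in for whatever supplies the array
      def withSize: Int = array.size          // boxed via Predef.intArrayOps, then ArrayOps.size
      def withLength: Int = array.length      // compiles to the arraylength instruction
    }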

Author: sksamuel <sam@sksamuel.com>

Closes #5376 from sksamuel/master and squashes the following commits:

77ec261 [sksamuel] Replace use of .size with .length for Arrays.
2015-04-07 10:43:22 -07:00