Commit graph

3642 commits

Author SHA1 Message Date
Reynold Xin 96ba04bbf9 [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.
The pull request includes two changes:

1. Removes SortOrder introduced by SPARK-2125. The key ordering already includes the SortOrder information since an Ordering can be reverse. This is similar to Java's Comparator interface. Rarely does an API accept both a Comparator as well as a SortOrder.

2. Replaces the sortWith call in HashShuffleReader with an in-place quick sort.

Author: Reynold Xin <rxin@apache.org>

Closes #1631 from rxin/sortOrder and squashes the following commits:

c9d37e1 [Reynold Xin] [SPARK-2726] and [SPARK-2727] Remove SortOrder and do in-place sort.
2014-07-29 01:12:44 -07:00
Aaron Davidson 39ab87b924 Use commons-lang3 in SignalLogger rather than commons-lang
Spark only transitively depends on the latter, based on the Hadoop version.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1621 from aarondav/lang3 and squashes the following commits:

93c93bf [Aaron Davidson] Use commons-lang3 in SignalLogger rather than commons-lang
2014-07-28 13:37:44 -07:00
Cheng Lian a7a9d14479 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)
JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Another try for #1399 & #1600. Those two PR breaks Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every time a pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, but tests fail because related .class files are not included in the assembly jar.

In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:

629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-28 12:07:30 -07:00
Patrick Wendell e5bbce9a60 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit f6ff2a61d0.
2014-07-27 18:46:58 -07:00
Andrew Or ecf30ee7e7 [SPARK-1777] Prevent OOMs from single partitions
**Problem.** When caching, we currently unroll the entire RDD partition before making sure we have enough free memory. This is a common cause for OOMs especially when (1) the BlockManager has little free space left in memory, and (2) the partition is large.

**Solution.** We maintain a global memory pool of `M` bytes shared across all threads, similar to the way we currently manage memory for shuffle aggregation. Then, while we unroll each partition, periodically check if there is enough space to continue. If not, drop enough RDD blocks to ensure we have at least `M` bytes to work with, then try again. If we still don't have enough space to unroll the partition, give up and drop the block to disk directly if applicable.

**New configurations.**
- `spark.storage.bufferFraction` - the value of `M` as a fraction of the storage memory. (default: 0.2)
- `spark.storage.safetyFraction` - a margin of safety in case size estimation is slightly off. This is the equivalent of the existing `spark.shuffle.safetyFraction`. (default 0.9)

For more detail, see the [design document](https://issues.apache.org/jira/secure/attachment/12651793/spark-1777-design-doc.pdf). Tests pending for performance and memory usage patterns.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1165 from andrewor14/them-rdd-memories and squashes the following commits:

e77f451 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
c7c8832 [Andrew Or] Simplify logic + update a few comments
269d07b [Andrew Or] Very minor changes to tests
6645a8a [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
b7e165c [Andrew Or] Add new tests for unrolling blocks
f12916d [Andrew Or] Slightly clean up tests
71672a7 [Andrew Or] Update unrollSafely tests
369ad07 [Andrew Or] Correct ensureFreeSpace and requestMemory behavior
f4d035c [Andrew Or] Allow one thread to unroll multiple blocks
a66fbd2 [Andrew Or] Rename a few things + update comments
68730b3 [Andrew Or] Fix weird scalatest behavior
e40c60d [Andrew Or] Fix MIMA excludes
ff77aa1 [Andrew Or] Fix tests
1a43c06 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
b9a6eee [Andrew Or] Simplify locking behavior on unrollMemoryMap
ed6cda4 [Andrew Or] Formatting fix (super minor)
f9ff82e [Andrew Or] putValues -> putIterator + putArray
beb368f [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
8448c9b [Andrew Or] Fix tests
a49ba4d [Andrew Or] Do not expose unroll memory check period
69bc0a5 [Andrew Or] Always synchronize on putLock before unrollMemoryMap
3f5a083 [Andrew Or] Simplify signature of ensureFreeSpace
dce55c8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
8288228 [Andrew Or] Synchronize put and unroll properly
4f18a3d [Andrew Or] bufferFraction -> unrollFraction
28edfa3 [Andrew Or] Update a few comments / log messages
728323b [Andrew Or] Do not synchronize every 1000 elements
5ab2329 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
129c441 [Andrew Or] Fix bug: Use toArray rather than array
9a65245 [Andrew Or] Update a few comments + minor control flow changes
57f8d85 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
abeae4f [Andrew Or] Add comment clarifying the MEMORY_AND_DISK case
3dd96aa [Andrew Or] AppendOnlyBuffer -> Vector (+ a few small changes)
f920531 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
0871835 [Andrew Or] Add an effective storage level interface to BlockManager
64e7d4c [Andrew Or] Add/modify a few comments (minor)
8af2f35 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
4f4834e [Andrew Or] Use original storage level for blocks dropped to disk
ecc8c2d [Andrew Or] Fix binary incompatibility
24185ea [Andrew Or] Avoid dropping a block back to disk if reading from disk
2b7ee66 [Andrew Or] Fix bug in SizeTracking*
9b9a273 [Andrew Or] Fix tests
20eb3e5 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
649bdb3 [Andrew Or] Document spark.storage.bufferFraction
a10b0e7 [Andrew Or] Add initial memory request threshold + rename a few things
e9c3cb0 [Andrew Or] cacheMemoryMap -> unrollMemoryMap
198e374 [Andrew Or] Unfold -> unroll
0d50155 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
d9d02a8 [Andrew Or] Remove unused param in unfoldSafely
ec728d8 [Andrew Or] Add tests for safe unfolding of blocks
22b2209 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
078eb83 [Andrew Or] Add check for hasNext in PrimitiveVector.iterator
0871535 [Andrew Or] Fix tests in BlockManagerSuite
d68f31e [Andrew Or] Safely unfold blocks for all memory puts
5961f50 [Andrew Or] Fix tests
195abd7 [Andrew Or] Refactor: move unfold logic to MemoryStore
1e82d00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
3ce413e [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
d5dd3b4 [Andrew Or] Free buffer memory in finally
ea02eec [Andrew Or] Fix tests
b8e1d9c [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
a8704c1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
e1b8b25 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
87aa75c [Andrew Or] Fix mima excludes again (typo)
11eb921 [Andrew Or] Clarify comment (minor)
50cae44 [Andrew Or] Remove now duplicate mima exclude
7de5ef9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
df47265 [Andrew Or] Fix binary incompatibility
6d05a81 [Andrew Or] Merge branch 'master' of github.com:apache/spark into them-rdd-memories
f94f5af [Andrew Or] Update a few comments (minor)
776aec9 [Andrew Or] Prevent OOM if a single RDD partition is too large
bbd3eea [Andrew Or] Fix CacheManagerSuite to use Array
97ea499 [Andrew Or] Change BlockManager interface to use Arrays
c12f093 [Andrew Or] Add SizeTrackingAppendOnlyBuffer and tests
2014-07-27 16:08:16 -07:00
Cheng Lian f6ff2a61d0 [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
(This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)

JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)

Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1600 from liancheng/jdbc and squashes the following commits:

ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-27 13:03:38 -07:00
Cheng Lian 2bbf235376 [SPARK-2705][CORE] Fixed stage description in stage info page
Stage description should be a `String`, but was changed to an `Option[String]` by mistake:

![stage-desc-small](https://cloud.githubusercontent.com/assets/230655/3655611/f6d0b0f6-117b-11e4-83ed-71000dcd5009.png)

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1524 from liancheng/fix-stage-desc and squashes the following commits:

3c69327 [Cheng Lian] Fixed stage description object type in Web UI stage table
2014-07-27 12:35:21 -07:00
Matei Zaharia 985705301e SPARK-2684: Update ExternalAppendOnlyMap to take an iterator as input
This will decrease object allocation from the "update" closure used in map.changeValue.

Author: Matei Zaharia <matei@databricks.com>

Closes #1607 from mateiz/spark-2684 and squashes the following commits:

b7d89e6 [Matei Zaharia] Add insertAll for Iterables too, and fix some code style
561fc97 [Matei Zaharia] Update ExternalAppendOnlyMap to take an iterator as input
2014-07-27 11:20:20 -07:00
Matei Zaharia b547f69bdb SPARK-2680: Lower spark.shuffle.memoryFraction to 0.2 by default
Author: Matei Zaharia <matei@databricks.com>

Closes #1593 from mateiz/spark-2680 and squashes the following commits:

3c949c4 [Matei Zaharia] Lower spark.shuffle.memoryFraction to 0.2 by default
2014-07-26 22:44:17 -07:00
Josh Rosen ba46bbed5d [SPARK-2601] [PySpark] Fix Py4J error when transforming pickleFiles
Similar to SPARK-1034, the problem was that Py4J didn’t cope well with the fake ClassTags used in the Java API.  It doesn’t look like there’s any reason why PythonRDD needs to take a ClassTag, since it just ignores the type of the previous RDD, so I removed the type parameter and we no longer pass ClassTags from Python.

Author: Josh Rosen <joshrosen@apache.org>

Closes #1605 from JoshRosen/spark-2601 and squashes the following commits:

b68e118 [Josh Rosen] Fix Py4J error when transforming pickleFiles [SPARK-2601]
2014-07-26 17:37:05 -07:00
Reynold Xin 12901643b7 [SPARK-2704] Name threads in ConnectionManager and mark them as daemon.
handleMessageExecutor, handleReadWriteExecutor, and handleConnectExecutor are not marked as daemon and not named. I think there exists some condition in which Spark programs won't terminate because of this.

Stack dump attached in https://issues.apache.org/jira/browse/SPARK-2704

Author: Reynold Xin <rxin@apache.org>

Closes #1604 from rxin/daemon and squashes the following commits:

98d6a6c [Reynold Xin] [SPARK-2704] Name threads in ConnectionManager and mark them as daemon.
2014-07-26 15:00:32 -07:00
bpaulin c183b92c3c [SPARK-2279] Added emptyRDD method to Java API
Added emptyRDD method to Java API with tests.

Author: bpaulin <bob@bobpaulin.com>

Closes #1597 from bobpaulin/SPARK-2279 and squashes the following commits:

5ad57c2 [bpaulin] [SPARK-2279] Added emptyRDD method to Java API
2014-07-26 10:27:09 -07:00
Hossein 66f26a4610 [SPARK-2696] Reduce default value of spark.serializer.objectStreamReset
The current default value of spark.serializer.objectStreamReset is 10,000.
When trying to re-partition (e.g., to 64 partitions) a large file (e.g., 500MB), containing 1MB records, the serializer will cache 10000 x 1MB x 64 ~= 640 GB which will cause out of memory errors.

This patch sets the default value to a more reasonable default value (100).

Author: Hossein <hossein@databricks.com>

Closes #1595 from falaki/objectStreamReset and squashes the following commits:

650a935 [Hossein] Updated documentation
1aa0df8 [Hossein] Reduce default value of spark.serializer.objectStreamReset
2014-07-26 01:04:56 -07:00
Josh Rosen cf3e9fd84d [SPARK-1458] [PySpark] Expose sc.version in Java and PySpark
Author: Josh Rosen <joshrosen@apache.org>

Closes #1596 from JoshRosen/spark-1458 and squashes the following commits:

fdbb0bf [Josh Rosen] Add SparkContext.version to Python & Java [SPARK-1458]
2014-07-26 00:54:05 -07:00
Reynold Xin 9d8666cac8 Part of [SPARK-2456] Removed some HashMaps from DAGScheduler by storing information in Stage.
This is part of the scheduler cleanup/refactoring effort to make the scheduler code easier to maintain.

@kayousterhout @markhamstra please take a look ...

Author: Reynold Xin <rxin@apache.org>

Closes #1561 from rxin/dagSchedulerHashMaps and squashes the following commits:

1c44e15 [Reynold Xin] Clear pending tasks in submitMissingTasks.
620a0d1 [Reynold Xin] Use filterKeys.
5b54404 [Reynold Xin] Code review feedback.
c1e9a1c [Reynold Xin] Removed some HashMaps from DAGScheduler by storing information in Stage.
2014-07-25 18:45:02 -07:00
Michael Armbrust afd757a241 Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
This reverts commit 06dc0d2c6b.

#1399 is making Jenkins fail.  We should investigate and put this back after its passing tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #1594 from marmbrus/revertJDBC and squashes the following commits:

59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
2014-07-25 15:36:57 -07:00
Kay Ousterhout 37ad3b7245 [SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI.
Due to problems with when we update runningStages (in DAGScheduler.scala)
and how we decide to send a SparkListenerStageCompleted message to
SparkListeners, sometimes stages can be shown as "running" in the UI forever
(even after they have failed).  This issue can manifest when stages are
resubmitted with 0 tasks, or when the DAGScheduler catches non-serializable
tasks. The problem also resulted in a (small) memory leak in the DAGScheduler,
where stages can stay in runningStages forever. This commit fixes
that problem and adds a unit test.

Thanks tsudukim for helping to look into this issue!

cc markhamstra rxin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1566 from kayousterhout/dag_fix and squashes the following commits:

217d74b [Kay Ousterhout] [SPARK-1726] [SPARK-2567] Eliminate zombie stages in UI.
2014-07-25 15:14:13 -07:00
jerryshao 47b6b38ca8 [SPARK-2125] Add sort flag and move sort into shuffle implementations
This patch adds a sort flag into ShuffleDependecy and moves sort into hash shuffle implementation.

Moving sort into shuffle implementation can give space for other shuffle implementations (like sort-based shuffle) to better optimize sort through shuffle.

Author: jerryshao <saisai.shao@intel.com>

Closes #1210 from jerryshao/SPARK-2125 and squashes the following commits:

2feaf7b [jerryshao] revert MimaExcludes
ceddf75 [jerryshao] add MimaExeclude
f674ff4 [jerryshao] Add missing Scope restriction
b9fe0dd [jerryshao] Fix some style issues according to comments
ef6b729 [jerryshao] Change sort flag into Option
3f6eeed [jerryshao] Fix issues related to unit test
2f552a5 [jerryshao] Minor changes about naming and order
c92a281 [jerryshao] Move sort into shuffle implementations
2014-07-25 14:34:38 -07:00
Cheng Lian 06dc0d2c6b [SPARK-2410][SQL] Merging Hive Thrift/JDBC server
JIRA issue:

- Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
- Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)

Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).

(Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)

TODO

- [x] Use `spark-submit` to launch the server, the CLI and beeline
- [x] Migration guideline draft for Shark users

----

Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:

```bash
$ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
```

This actually shows usage information of `SparkSubmit` rather than `BeeLine`.

~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~

**UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1399 from liancheng/thriftserver and squashes the following commits:

090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
1083e9d [Cheng Lian] Fixed failed test suites
7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
9cc0f06 [Cheng Lian] Starts beeline with spark-submit
cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
061880f [Cheng Lian] Addressed all comments by @pwendell
7755062 [Cheng Lian] Adapts test suites to spark-submit settings
40bafef [Cheng Lian] Fixed more license header issues
e214aab [Cheng Lian] Added missing license headers
b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
2014-07-25 12:20:49 -07:00
Yin Huai 32bcf9af94 [SPARK-2683] unidoc failed because org.apache.spark.util.CallSite uses Java keywords as value names
Renaming `short` to `shortForm` and `long` to `longForm`.

JIRA: https://issues.apache.org/jira/browse/SPARK-2683

Author: Yin Huai <huai@cse.ohio-state.edu>

Closes #1585 from yhuai/SPARK-2683 and squashes the following commits:

5ddb843 [Yin Huai] "short" and "long" are Java keyworks. In order to generate javadoc, renaming "short" to "shortForm" and "long" to "longForm".
2014-07-25 11:14:51 -07:00
Reynold Xin eb82abd8e3 [SPARK-2529] Clean closures in foreach and foreachPartition.
Author: Reynold Xin <rxin@apache.org>

Closes #1583 from rxin/closureClean and squashes the following commits:

8982fe6 [Reynold Xin] [SPARK-2529] Clean closures in foreach and foreachPartition.
2014-07-25 01:10:05 -07:00
Matei Zaharia 8529ced35c SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup
JIRA: https://issues.apache.org/jira/browse/SPARK-2657

Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.

There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We can also decide not to do that but CoGroupedRDD is a `DeveloperAPI` so I think it's okay to change it here.

Author: Matei Zaharia <matei@databricks.com>

Closes #1555 from mateiz/compact-groupby and squashes the following commits:

845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
07621a7 [Matei Zaharia] Review comments
0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
f61f040 [Matei Zaharia] Fix line lengths
59da88b0 [Matei Zaharia] Fix line lengths
197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers
2014-07-25 00:32:32 -07:00
Doris Xin 2f75a4a30e [SPARK-2656] Python version of stratified sampling
exact sample size not supported for now.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1554 from dorx/pystratified and squashes the following commits:

4ba927a [Doris Xin] use rel diff (+- 50%) instead of abs diff (+- 50)
bdc3f8b [Doris Xin] updated unit to check sample holistically
7713c7b [Doris Xin] Python version of stratified sampling
2014-07-24 23:42:08 -07:00
Davies Liu 14174abd42 [SPARK-2538] [PySpark] Hash based disk spilling aggregation
During aggregation in Python worker, if the memory usage is above spark.executor.memory, it will do disk spilling aggregation.

It will split the aggregation into multiple stage, in each stage, it will partition the aggregated data by hash and dump them into disks. After all the data are aggregated, it will merge all the stages together (partition by partition).

Author: Davies Liu <davies.liu@gmail.com>

Closes #1460 from davies/spill and squashes the following commits:

cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
37d71f7 [Davies Liu] balance the partitions
902f036 [Davies Liu] add shuffle.py into run-tests
dcf03a9 [Davies Liu] fix memory_info() of psutil
67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
400be01 [Davies Liu] address all the comments
6178844 [Davies Liu] refactor and improve docs
fdd0a49 [Davies Liu] add long doc string for ExternalMerger
1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
e6cc7f9 [Davies Liu] Merge branch 'master' into spill
3652583 [Davies Liu] address comments
e78a0a0 [Davies Liu] fix style
24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
57ee7ef [Davies Liu] update docs
286aaff [Davies Liu] let spilled aggregation in Python configurable
e9a40f6 [Davies Liu] recursive merger
6edbd1f [Davies Liu] Hash based disk spilling aggregation
2014-07-24 22:53:47 -07:00
Neville Li fec641b84d SPARK-2250: show stage RDDs in UI
Author: Neville Li <neville@spotify.com>

Closes #1188 from nevillelyh/neville/ui and squashes the following commits:

d3ac425 [Neville Li] SPARK-2250: show persisted RDD in stage UI
f075db9 [Neville Li] SPARK-2035: show call stack even when description is available
2014-07-24 14:13:00 -07:00
Rahul Singhal 46e224aaa2 SPARK-2150: Provide direct link to finished application UI in yarn resou...
...rce manager UI

Use the event logger directory to provide a direct link to finished
application UI in yarn resourcemanager UI.

Author: Rahul Singhal <rahul.singhal@guavus.com>

Closes #1094 from rahulsinghaliitd/SPARK-2150 and squashes the following commits:

95f230c [Rahul Singhal] SPARK-2150: Provide direct link to finished application UI in yarn resource manager UI
2014-07-24 09:31:04 -05:00
Sandy Ryza e34922a221 SPARK-2310. Support arbitrary Spark properties on the command line with ...
...spark-submit

The PR allows invocations like
  spark-submit --class org.MyClass --spark.shuffle.spill false myjar.jar

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1253 from sryza/sandy-spark-2310 and squashes the following commits:

1dc9855 [Sandy Ryza] More doc and cleanup
00edfb9 [Sandy Ryza] Review comments
91b244a [Sandy Ryza] Change format to --conf PROP=VALUE
8fabe77 [Sandy Ryza] SPARK-2310. Support arbitrary Spark properties on the command line with spark-submit
2014-07-23 23:11:26 -07:00
GuoQiang Li 9e7725c86e SPARK-2662: Fix NPE for JsonProtocol
Author: GuoQiang Li <witgo@qq.com>

Closes #1511 from witgo/JsonProtocol and squashes the following commits:

2b6227f [GuoQiang Li] Fix NPE for JsonProtocol
2014-07-23 22:50:39 -07:00
Ian O Connell efdaeb1119 [SPARK-2102][SQL][CORE] Add option for kryo registration required and use a resource pool in Spark SQL for Kryo instances.
Author: Ian O Connell <ioconnell@twitter.com>

Closes #1377 from ianoc/feature/SPARK-2102 and squashes the following commits:

5498566 [Ian O Connell] Docs update suggested by Patrick
20e8555 [Ian O Connell] Slight style change
f92c294 [Ian O Connell] Add docs for new KryoSerializer option
f3735c8 [Ian O Connell] Add using a kryo resource pool for the SqlSerializer
4e5c342 [Ian O Connell] Register the SparkConf for kryo, it gets swept into serialization
665805a [Ian O Connell] Add a spark.kryo.registrationRequired option for configuring the Kryo Serializer
2014-07-23 16:30:11 -07:00
Rui Li 91903e0a50 SPARK-2277: clear host->rack info properly
Hi mridulm, I just think of this issue of [#1212](https://github.com/apache/spark/pull/1212): I added FakeRackUtil to hold the host -> rack mapping. It should be cleaned up after use so that it won't mess up with test cases others may add later.
Really sorry about this.

Author: Rui Li <rui.li@intel.com>

Closes #1454 from lirui-intel/SPARK-2277-fix-UT and squashes the following commits:

f8ea25c [Rui Li] SPARK-2277: clear host->rack info properly
2014-07-23 16:23:24 -07:00
woshilaiceshide f776bc9887 [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.
Make spark's "local[N]" better.
In our company, we use "local[N]" in production. It works exellentlly. It's our best choice.

Author: woshilaiceshide <woshilaiceshide@qq.com>

Closes #1544 from woshilaiceshide/localX and squashes the following commits:

6c85154 [woshilaiceshide] [CORE] SPARK-2640: In "local[N]", free cores of the only executor should be touched by "spark.task.cpus" for every finish/start-up of tasks.
2014-07-23 11:05:41 -07:00
Andrew Or 25921110fc [SPARK-2609] Log thread ID when spilling ExternalAppendOnlyMap
It's useful to know whether one thread is constantly spilling or multiple threads are spilling relatively infrequently. Right now everything looks a little jumbled and we can't tell which lines belong to the same thread. For instance:

```
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (194 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (198 times so far)
06:14:37 ExternalAppendOnlyMap: Spilling in-memory map of 10 MB to disk (197 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 9 MB to disk (45 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 23 MB to disk (198 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 38 MB to disk (25 times so far)
06:14:38 ExternalAppendOnlyMap: Spilling in-memory map of 161 MB to disk (25 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 0 MB to disk (199 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (166 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (199 times so far)
06:14:39 ExternalAppendOnlyMap: Spilling in-memory map of 4 MB to disk (200 times so far)
```

Author: Andrew Or <andrewor14@gmail.com>

Closes #1517 from andrewor14/external-log and squashes the following commits:

90e48bb [Andrew Or] Log thread ID when spilling
2014-07-23 10:31:45 -07:00
Xiangrui Meng 4c7243e109 [SPARK-2617] Correct doc and usages of preservesPartitioning
The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, this is already part of the API and we cannot change. We should be clear in the doc and fix wrong usages.

This PR

1. adds notes in `maPartitions*`,
2. makes `RDD.sample` preserve partitioner,
3. changes `preservesPartitioning` to false in  `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
4. fixes some wrong usages in MLlib.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:

b361e65 [Xiangrui Meng] update doc based on pwendell's comments
3b1ba19 [Xiangrui Meng] update doc
357575c [Xiangrui Meng] fix unit test
20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning
2014-07-23 00:58:55 -07:00
GuoQiang Li ddadf1b004 [YARN][SPARK-2606]:In some cases,the spark UI pages display incorrect
The issue is caused by #1112 .

Author: GuoQiang Li <witgo@qq.com>

Closes #1501 from witgo/webui_style and squashes the following commits:

4b34998 [GuoQiang Li] In some cases, pages display incorrect in WebUI
2014-07-22 20:34:40 -05:00
Aaron Davidson 85d3596e65 SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
### Why and what?
Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into a an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.

This patch adds a Sorter API, intended for in memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point.

Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl

### Memory implications
An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.

Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).

The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
This results in a worst-case sorting overhead of 4N bytes.

Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.

### Performance implications
As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.

Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".

<table>
<tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
<tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
<tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr>
<tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr>
<tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr>
</table>

The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.

In short, this patch should significantly improve performance for users running either Java 6 or 7.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1502 from aarondav/sort and squashes the following commits:

652d936 [Aaron Davidson] Update license, move Sorter to java src
a7b5b1c [Aaron Davidson] fix licenses
5c0efaf [Aaron Davidson] Update tmpLength
ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
034bf10 [Aaron Davidson] Change to Apache v2 Timsort
b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
2014-07-22 11:58:53 -07:00
Gregory Owen c3462c6568 [SPARK-2086] Improve output of toDebugString to make shuffle boundaries more clear
Changes RDD.toDebugString() to show hierarchy and shuffle transformations more clearly

New output:

```
(3) FlatMappedValuesRDD[325] at apply at Transformer.scala:22
 |  MappedValuesRDD[324] at apply at Transformer.scala:22
 |  CoGroupedRDD[323] at apply at Transformer.scala:22
 +-(5) MappedRDD[320] at apply at Transformer.scala:22
 |  |  MappedRDD[319] at apply at Transformer.scala:22
 |  |  MappedValuesRDD[318] at apply at Transformer.scala:22
 |  |  MapPartitionsRDD[317] at apply at Transformer.scala:22
 |  |  ShuffledRDD[316] at apply at Transformer.scala:22
 |  +-(10) MappedRDD[315] at apply at Transformer.scala:22
 |     |   ParallelCollectionRDD[314] at apply at Transformer.scala:22
 +-(100) MappedRDD[322] at apply at Transformer.scala:22
     |   ParallelCollectionRDD[321] at apply at Transformer.scala:22
```

Author: Gregory Owen <greowen@gmail.com>

Closes #1364 from GregOwen/to-debug-string and squashes the following commits:

08f5c78 [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
1603f7b [Gregory Owen] toDebugString: prettier debug printing to show shuffles and joins more clearly
2014-07-21 18:55:01 -07:00
Kay Ousterhout f6e7302cb4 Improve scheduler delay tooltip.
As a result of shivaram's experience debugging long scheduler delay, I think we should improve the tooltip to point people in the right direction if scheduler delay is large.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1488 from kayousterhout/better_tooltips and squashes the following commits:

22176fd [Kay Ousterhout] Improve scheduler delay tooltip.
2014-07-20 20:18:18 -07:00
Sandy Ryza 9564f85489 SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant
Author: Sandy Ryza <sandy@cloudera.com>

Closes #1474 from sryza/sandy-spark-2564 and squashes the following commits:

35b8388 [Sandy Ryza] Fix compile error on upmerge
7b985fb [Sandy Ryza] Fix test compile error
43f79e6 [Sandy Ryza] SPARK-2564. ShuffleReadMetrics.totalBlocksRead is redundant
2014-07-20 14:45:34 -07:00
Reynold Xin fa51b0fb5b [SPARK-2598] RangePartitioner's binary search does not use the given Ordering
We should fix this in branch-1.0 as well.

Author: Reynold Xin <rxin@apache.org>

Closes #1500 from rxin/rangePartitioner and squashes the following commits:

c0a94f5 [Reynold Xin] [SPARK-2598] RangePartitioner's binary search does not use the given Ordering.
2014-07-20 11:06:06 -07:00
Sandy Ryza 98ab411225 SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical section...
...s of CoGroupedRDD and PairRDDFunctions

This also removes an unnecessary tuple creation in cogroup.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1447 from sryza/sandy-spark-2519-2 and squashes the following commits:

b6d9699 [Sandy Ryza] Remove missed Tuple2 match in CoGroupedRDD
a109828 [Sandy Ryza] Remove another pattern matching in MappedValuesRDD and revert some changes in PairRDDFunctions
be10f8a [Sandy Ryza] SPARK-2519 part 2. Remove pattern matching on Tuple2 in critical sections of CoGroupedRDD and PairRDDFunctions
2014-07-20 01:24:32 -07:00
Reynold Xin 1efb3698b6 Revert "[SPARK-2521] Broadcast RDD object (instead of sending it along with every task)."
This reverts commit 7b8cd17525.
2014-07-19 16:56:22 -07:00
Lijie Xu 805f329bb1 put 'curRequestSize = 0' after 'logDebug' it
This is a minor change. We should first logDebug($curRequestSize) and then set it to 0.

Author: Lijie Xu <csxulijie@gmail.com>

Closes #1477 from JerryLead/patch-1 and squashes the following commits:

aed722d [Lijie Xu] put 'curRequestSize = 0' after 'logDebug' it
2014-07-19 01:27:26 -07:00
Reynold Xin 7b8cd17525 [SPARK-2521] Broadcast RDD object (instead of sending it along with every task).
Currently (as of Spark 1.0.1), Spark sends RDD object (which contains closures) using Akka along with the task itself to the executors. This is inefficient because all tasks in the same stage use the same RDD object, but we have to send RDD object multiple times to the executors. This is especially bad when a closure references some variable that is very large. The current design led to users having to explicitly broadcast large variables.

The patch uses broadcast to send RDD objects and the closures to executors, and use Akka to only send a reference to the broadcast RDD/closure along with the partition specific information for the task. For those of you who know more about the internals, Spark already relies on broadcast to send the Hadoop JobConf every time it uses the Hadoop input, because the JobConf is large.

The user-facing impact of the change include:

1. Users won't need to decide what to broadcast anymore, unless they would want to use a large object multiple times in different operations
2. Task size will get smaller, resulting in faster scheduling and higher task dispatch throughput.

In addition, the change will simplify some internals of Spark, eliminating the need to maintain task caches and the complex logic to broadcast JobConf (which also led to a deadlock recently).

A simple way to test this:
```scala
val a = new Array[Byte](1000*1000); scala.util.Random.nextBytes(a);
sc.parallelize(1 to 1000, 1000).map { x => a; x }.groupBy { x => a; x }.count
```

Numbers on 3 r3.8xlarge instances on EC2
```
master branch: 5.648436068 s, 4.715361895 s, 5.360161877 s
with this change: 3.416348793 s, 1.477846558 s, 1.553432156 s
```

Author: Reynold Xin <rxin@apache.org>

Closes #1452 from rxin/broadcast-task and squashes the following commits:

762e0be [Reynold Xin] Warn large broadcasts.
ade6eac [Reynold Xin] Log broadcast size.
c3b6f11 [Reynold Xin] Added a unit test for clean up.
754085f [Reynold Xin] Explain why broadcasting serialized copy of the task.
04b17f0 [Reynold Xin] [SPARK-2521] Broadcast RDD object once per TaskSet (instead of sending it for every task).
2014-07-18 23:52:47 -07:00
Kay Ousterhout 7b971b91ca [SPARK-2571] Correctly report shuffle read metrics.
Currently, shuffle read metrics are incorrectly reported when stages have multiple shuffle dependencies (they are set to be the metrics from just one of the shuffle dependencies, rather than the accumulated metrics from all of the shuffle dependencies).  This fixes that problem, and should probably be back-ported to the 0.9 branch.

Thanks ryanra for discovering this problem!

cc rxin andrewor14

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1476 from kayousterhout/join_bug and squashes the following commits:

0203a16 [Kay Ousterhout] Fix broken unit tests.
f463c2e [Kay Ousterhout] [SPARK-2571] Correctly report shuffle read metrics.
2014-07-18 14:40:32 -07:00
Reynold Xin 586e716e47 Reservoir sampling implementation.
This is going to be used in https://issues.apache.org/jira/browse/SPARK-2568

Author: Reynold Xin <rxin@apache.org>

Closes #1478 from rxin/reservoirSample and squashes the following commits:

17bcbf3 [Reynold Xin] Added seed.
badf20d [Reynold Xin] Renamed the method.
6940010 [Reynold Xin] Reservoir sampling implementation.
2014-07-18 12:41:50 -07:00
Sandy Ryza 30b8d369d4 SPARK-2553. Fix compile error
Author: Sandy Ryza <sandy@cloudera.com>

Closes #1479 from sryza/sandy-spark-2553 and squashes the following commits:

2cb5ed8 [Sandy Ryza] SPARK-2553. Fix compile error
2014-07-18 00:47:43 -07:00
Sandy Ryza e52b8719cf SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency...
... per key

My humble opinion is that avoiding allocations in this performance-critical section is worth the extra code.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1461 from sryza/sandy-spark-2553 and squashes the following commits:

7eaf7f2 [Sandy Ryza] SPARK-2553. CoGroupedRDD unnecessarily allocates a Tuple2 per dependency per key
2014-07-17 23:57:08 -07:00
Andrew Or 6afca2d107 [SPARK-2411] Add a history-not-found page to standalone Master
**Problem.** Right now, if you click on an application after it has finished, it simply refreshes the page if there are no event logs for the application. This is not super intuitive especially because event logging is not enabled by default. We should direct the user to enable this if they attempt to view a SparkUI after the fact without event logs.

**Fix.** The new page conveys different messages in each of the following scenarios:
(1) Application did not enable event logging,
(2) Event logs are not found in the specified directory, and
(3) Exception is thrown while replaying the logs

Here are screenshots of what the page looks like in each of the above scenarios:

(1)
<img src="https://issues.apache.org/jira/secure/attachment/12656204/Event%20logging%20not%20enabled.png" width="75%">

(2)
<img src="https://issues.apache.org/jira/secure/attachment/12656203/Application%20history%20not%20found.png">

(3)
<img src="https://issues.apache.org/jira/secure/attachment/12656202/Application%20history%20load%20error.png" width="95%">

Author: Andrew Or <andrewor14@gmail.com>

Closes #1336 from andrewor14/master-link and squashes the following commits:

2f06206 [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link
97cddc0 [Andrew Or] Add different severity levels
832b687 [Andrew Or] Mention spark.eventLog.dir in error message
51980c3 [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link
ded208c [Andrew Or] Merge branch 'master' of github.com:apache/spark into master-link
89d6405 [Andrew Or] Reword message
e7df7ed [Andrew Or] Add a history not found page to standalone Master
2014-07-17 19:45:59 -07:00
Reynold Xin 72e9021eaf [SPARK-2299] Consolidate various stageIdTo* hash maps in JobProgressListener
This should reduce memory usage for the web ui as well as slightly increase its speed in draining the UI event queue.

@andrewor14

Author: Reynold Xin <rxin@apache.org>

Closes #1262 from rxin/ui-consolidate-hashtables and squashes the following commits:

1ac3f97 [Reynold Xin] Oops. Properly handle description.
f5736ad [Reynold Xin] Code review comments.
b8828dc [Reynold Xin] Merge branch 'master' into ui-consolidate-hashtables
7a7b6c4 [Reynold Xin] Revert css change.
f959bb8 [Reynold Xin] [SPARK-2299] Consolidate various stageIdTo* hash maps in JobProgressListener to speed it up.
63256f5 [Reynold Xin] [SPARK-2320] Reduce <pre> block font size.
2014-07-17 18:58:48 -07:00
Reynold Xin d988d345d5 [SPARK-2534] Avoid pulling in the entire RDD in various operators
This should go into both master and branch-1.0.

Author: Reynold Xin <rxin@apache.org>

Closes #1450 from rxin/agg-closure and squashes the following commits:

e40f363 [Reynold Xin] Mima check excludes.
9186364 [Reynold Xin] Define the return type more explicitly.
38e348b [Reynold Xin] Fixed the cases in RDD.scala.
ea6b34d [Reynold Xin] Blah
89b9c43 [Reynold Xin] Fix other instances of accidentally pulling in extra stuff in closures.
73b2783 [Reynold Xin] [SPARK-2534] Avoid pulling in the entire RDD in groupByKey.
2014-07-17 10:54:53 -07:00
Andrew Or 9c73822a08 [SPARK-2423] Clean up SparkSubmit for readability
It is currently non-trivial to trace through how different combinations of cluster managers (e.g. yarn) and deploy modes (e.g. cluster) are processed in SparkSubmit. Moving forward, it will be easier to extend SparkSubmit if we first re-organize the code by grouping related logic together.

This is a precursor to fixing standalone-cluster mode, which is currently broken (SPARK-2260).

Author: Andrew Or <andrewor14@gmail.com>

Closes #1349 from andrewor14/submit-cleanup and squashes the following commits:

8f99200 [Andrew Or] script -> program (minor)
30f2e65 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-cleanup
fe484a1 [Andrew Or] Move deploy mode checks after yarn code
7167824 [Andrew Or] Re-order config options and update comments
0b01ff8 [Andrew Or] Clean up SparkSubmit for readability
2014-07-17 01:13:32 -07:00
Aaron Davidson 7c23c0dc3e [SPARK-2412] CoalescedRDD throws exception with certain pref locs
If the first pass of CoalescedRDD does not find the target number of locations AND the second pass finds new locations, an exception is thrown, as "groupHash.get(nxt_replica).get" is not valid.

The fix is just to add an ArrayBuffer to groupHash for that replica if it didn't already exist.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1337 from aarondav/2412 and squashes the following commits:

f587b5d [Aaron Davidson] getOrElseUpdate
3ad8a3c [Aaron Davidson] [SPARK-2412] CoalescedRDD throws exception with certain pref locs
2014-07-17 01:01:14 -07:00
Aaron Davidson 9c249743ea [SPARK-2154] Schedule next Driver when one completes (standalone mode)
Author: Aaron Davidson <aaron@databricks.com>

Closes #1405 from aarondav/2154 and squashes the following commits:

24e9ef9 [Aaron Davidson] [SPARK-2154] Schedule next Driver when one completes (standalone mode)
2014-07-16 14:16:48 -07:00
Aaron Davidson 8867cd0bc2 SPARK-1097: Do not introduce deadlock while fixing concurrency bug
We recently added this lock on 'conf' in order to prevent concurrent creation. However, it turns out that this can introduce a deadlock because Hadoop also synchronizes on the Configuration objects when creating new Configurations (and they do so via a static REGISTRY which contains all created Configurations).

This fix forces all Spark initialization of Configuration objects to occur serially by using a static lock that we control, and thus also prevents introducing the deadlock.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1409 from aarondav/1054 and squashes the following commits:

7d1b769 [Aaron Davidson] SPARK-1097: Do not introduce deadlock while fixing concurrency bug
2014-07-16 14:10:17 -07:00
Reynold Xin 7c8d123225 [SPARK-2317] Improve task logging.
We use TID to indicate task logging. However, TID itself does not capture stage or retries, making it harder to correlate with the application itself. This pull request changes all logging messages for tasks to include both the TID and the stage id, stage attempt, task id, and task attempt.  I've consulted various people but unfortunately this is a really hard task.

Driver log looks like:

```
14/06/28 18:53:29 INFO DAGScheduler: Submitting 10 missing tasks from Stage 0 (MappedRDD[1] at map at <console>:13)
14/06/28 18:53:29 INFO TaskSchedulerImpl: Adding task set 0.0 with 10 tasks
14/06/28 18:53:29 INFO TaskSetManager: Re-computing pending task lists.
14/07/15 19:44:40 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 0, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 2, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 3, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 4, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 5, localhost, PROCESS_LOCAL, 1855 bytes)
14/07/15 19:44:40 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 6, localhost, PROCESS_LOCAL, 1855 bytes)
...
14/07/15 19:44:40 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 1) in 64 ms on localhost (4/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 4) in 63 ms on localhost (5/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 2) in 63 ms on localhost (6/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 7.0 in stage 1.0 (TID 7) in 62 ms on localhost (7/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 6) in 63 ms on localhost (8/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 9.0 in stage 1.0 (TID 9) in 8 ms on localhost (9/10)
14/07/15 19:44:40 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 8) in 9 ms on localhost (10/10)

```

Executor log looks like
```
14/07/15 19:44:40 INFO Executor: Running task 0.0 in stage 1.0 (TID 0)
14/07/15 19:44:40 INFO Executor: Running task 3.0 in stage 1.0 (TID 3)
14/07/15 19:44:40 INFO Executor: Running task 1.0 in stage 1.0 (TID 1)
14/07/15 19:44:40 INFO Executor: Running task 4.0 in stage 1.0 (TID 4)
14/07/15 19:44:40 INFO Executor: Running task 2.0 in stage 1.0 (TID 2)
14/07/15 19:44:40 INFO Executor: Running task 5.0 in stage 1.0 (TID 5)
14/07/15 19:44:40 INFO Executor: Running task 6.0 in stage 1.0 (TID 6)
14/07/15 19:44:40 INFO Executor: Running task 7.0 in stage 1.0 (TID 7)
14/07/15 19:44:40 INFO Executor: Finished task 3.0 in stage 1.0 (TID 3). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 2.0 in stage 1.0 (TID 2). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 0.0 in stage 1.0 (TID 0). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 1.0 in stage 1.0 (TID 1). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 5.0 in stage 1.0 (TID 5). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 4.0 in stage 1.0 (TID 4). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 6.0 in stage 1.0 (TID 6). 847 bytes result sent to driver
14/07/15 19:44:40 INFO Executor: Finished task 7.0 in stage 1.0 (TID 7). 847 bytes result sent to driver
```

Author: Reynold Xin <rxin@apache.org>

Closes #1259 from rxin/betterTaskLogging and squashes the following commits:

c28ada1 [Reynold Xin] Fix unit test failure.
987d043 [Reynold Xin] Updated log messages.
c6cfd46 [Reynold Xin] Merge branch 'master' into betterTaskLogging
b7b1bcc [Reynold Xin] Fixed a typo.
f9aba3c [Reynold Xin] Made it compile.
f8a5c06 [Reynold Xin] Merge branch 'master' into betterTaskLogging
07264e6 [Reynold Xin] Defensive check against unknown TaskEndReason.
76bbd18 [Reynold Xin] FailureSuite not serializable reporting.
4659b20 [Reynold Xin] Remove unused variable.
53888e3 [Reynold Xin] [SPARK-2317] Improve task logging.
2014-07-16 11:50:49 -07:00
Xiangrui Meng 96f28c9726 [SPARK-2522] set default broadcast factory to torrent
HttpBroadcastFactory is the current default broadcast factory. It sends the broadcast data to each worker one by one, which is slow when the cluster is big. TorrentBroadcastFactory scales much better than http. Maybe we should make torrent the default broadcast method.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1437 from mengxr/bt-broadcast and squashes the following commits:

ed492fe [Xiangrui Meng] set default broadcast factory to torrent
2014-07-16 11:27:51 -07:00
Reynold Xin ef48222c10 [SPARK-2517] Remove some compiler warnings.
Author: Reynold Xin <rxin@apache.org>

Closes #1433 from rxin/compile-warning and squashes the following commits:

8d0b890 [Reynold Xin] Remove some compiler warnings.
2014-07-16 11:15:07 -07:00
Sandy Ryza fc7edc9e76 SPARK-2519. Eliminate pattern-matching on Tuple2 in performance-critical...
... aggregation code

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1435 from sryza/sandy-spark-2519 and squashes the following commits:

640706a [Sandy Ryza] SPARK-2519. Eliminate pattern-matching on Tuple2 in performance-critical aggregation code
2014-07-16 11:07:16 -07:00
Reynold Xin efe2a8b126 Tightening visibility for various Broadcast related classes.
In preparation for SPARK-2521.

Author: Reynold Xin <rxin@apache.org>

Closes #1438 from rxin/broadcast and squashes the following commits:

432f1cc [Reynold Xin] Tightening visibility for various Broadcast related classes.
2014-07-16 10:44:54 -07:00
Rui Li 33e64ecacb SPARK-2277: make TaskScheduler track hosts on rack
Hi mateiz, I've created [SPARK-2277](https://issues.apache.org/jira/browse/SPARK-2277) to make TaskScheduler track hosts on each rack. Please help to review, thanks.

Author: Rui Li <rui.li@intel.com>

Closes #1212 from lirui-intel/trackHostOnRack and squashes the following commits:

2b4bd0f [Rui Li] SPARK-2277: refine UT
fbde838 [Rui Li] SPARK-2277: add UT
7bbe658 [Rui Li] SPARK-2277: rename the method
5e4ef62 [Rui Li] SPARK-2277: remove unnecessary import
79ac750 [Rui Li] SPARK-2277: make TaskScheduler track hosts on rack
2014-07-16 22:53:37 +05:30
Henry Saputra 9c12de5092 [SPARK-2500] Move the logInfo for registering BlockManager to BlockManagerMasterActor.register method
PR for SPARK-2500

Move the logInfo call for BlockManager to BlockManagerMasterActor.register instead of BlockManagerInfo constructor.

Previously the loginfo call for registering the registering a BlockManager is happening in the BlockManagerInfo constructor. This kind of confusing because the code could call "new BlockManagerInfo" without actually registering a BlockManager and could confuse when reading the log files.

Author: Henry Saputra <henry.saputra@gmail.com>

Closes #1424 from hsaputra/move_registerblockmanager_log_to_registration_method and squashes the following commits:

3370b4a [Henry Saputra] Move the loginfo for BlockManager to BlockManagerMasterActor.register instead of BlockManagerInfo constructor.
2014-07-15 21:21:52 -07:00
Reynold Xin 4576d80a51 [SPARK-2469] Use Snappy (instead of LZF) for default shuffle compression codec
This reduces shuffle compression memory usage by 3x.

Author: Reynold Xin <rxin@apache.org>

Closes #1415 from rxin/snappy and squashes the following commits:

06c1a01 [Reynold Xin] SPARK-2469: Use Snappy (instead of LZF) for default shuffle compression codec.
2014-07-15 18:47:39 -07:00
witgo 72ea56da8e SPARK-1291: Link the spark UI to RM ui in yarn-client mode
Author: witgo <witgo@qq.com>

Closes #1112 from witgo/SPARK-1291 and squashes the following commits:

6022bcd [witgo] review commit
1fbb925 [witgo] add addAmIpFilter to yarn alpha
210299c [witgo] review commit
1b92a07 [witgo] review commit
6896586 [witgo] Add comments to addWebUIFilter
3e9630b [witgo] review commit
142ee29 [witgo] review commit
1fe7710 [witgo] Link the spark UI to RM ui in yarn-client mode
2014-07-15 13:52:56 -05:00
William Benton cb09e93c1d Reformat multi-line closure argument.
Author: William Benton <willb@redhat.com>

Closes #1419 from willb/reformat-2486 and squashes the following commits:

2676231 [William Benton] Reformat multi-line closure argument.
2014-07-15 09:13:39 -07:00
Reynold Xin dd95abada7 [SPARK-2399] Add support for LZ4 compression.
Based on Greg Bowyer's patch from JIRA https://issues.apache.org/jira/browse/SPARK-2399

Author: Reynold Xin <rxin@apache.org>

Closes #1416 from rxin/lz4 and squashes the following commits:

6c8fefe [Reynold Xin] Fixed typo.
8a14d38 [Reynold Xin] [SPARK-2399] Add support for LZ4 compression.
2014-07-15 01:46:57 -07:00
lianhuiwang 7446f5ff93 discarded exceeded completedDrivers
When completedDrivers number exceeds the threshold, the first Max(spark.deploy.retainedDrivers, 1) will be discarded.

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes #1114 from lianhuiwang/retained-drivers and squashes the following commits:

8789418 [lianhuiwang] discarded exceeded completedDrivers
2014-07-15 00:22:06 -07:00
Kousuke Saruta c6d75745de [SPARK-2390] Files in staging directory cannot be deleted and wastes the space of HDFS
When running jobs with YARN Cluster mode and using HistoryServer, the files in the Staging Directory (~/.sparkStaging on HDFS) cannot be deleted.
HistoryServer uses directory where event log is written, and the directory is represented as a instance of o.a.h.f.FileSystem created by using FileSystem.get.

On the other hand, ApplicationMaster has a instance named fs, which also created by using FileSystem.get.

FileSystem.get returns cached same instance when URI passed to the method represents same file system and the method is called by same user.
Because of the behavior, when the directory for event log is on HDFS, fs of ApplicationMaster and fileSystem of FileLogger is same instance.
When shutting down ApplicationMaster, fileSystem.close is called in FileLogger#stop, which is invoked by SparkContext#stop indirectly.

And ApplicationMaster#cleanupStagingDir also called by JVM shutdown hook. In this method, fs.delete(stagingDirPath) is invoked.
Because fs.delete in ApplicationMaster is called after fileSystem.close in FileLogger, fs.delete fails and results not deleting files in the staging directory.

I think, calling fileSystem.delete is not needed.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1326 from sarutak/SPARK-2390 and squashes the following commits:

10e1a88 [Kousuke Saruta] Removed fileSystem.close from FileLogger.scala not to prevent any other FileSystem operation
2014-07-14 23:55:39 -07:00
Aaron Davidson a2aa7bebae Add/increase severity of warning in documentation of groupBy()
groupBy()/groupByKey() is notorious for being a very convenient API that can lead to poor performance when used incorrectly.

This PR just makes it clear that users should be cautious not to rely on this API when they really want a different (more performant) one, such as reduceByKey().

(Note that one source of confusion is the name; this groupBy() is not the same as a SQL GROUP-BY, which is used for aggregation and is more similar in nature to Spark's reduceByKey().)

Author: Aaron Davidson <aaron@databricks.com>

Closes #1380 from aarondav/warning and squashes the following commits:

f60da39 [Aaron Davidson] Give better advice
d0afb68 [Aaron Davidson] Add/increase severity of warning in documentation of groupBy()
2014-07-14 23:38:12 -07:00
William Benton 1f99fea53b SPARK-2486: Utils.getCallSite is now resilient to bogus frames
When running Spark under certain instrumenting profilers,
Utils.getCallSite could crash with an NPE.  This commit
makes it more resilient to failures occurring while inspecting
stack frames.

Author: William Benton <willb@redhat.com>

Closes #1413 from willb/spark-2486 and squashes the following commits:

b7c0274 [William Benton] Use explicit null checks instead of Try()
0f0c1ae [William Benton] Utils.getCallSite is now resilient to bogus frames
2014-07-14 23:09:13 -07:00
li-zhihui 3dd8af7a66 [SPARK-1946] Submit tasks after (configured ratio) executors have been registered
Because submitting tasks and registering executors are asynchronous, in most situation, early stages' tasks run without preferred locality.

A simple solution is sleeping few seconds in application, so that executors have enough time to register.

The PR add 2 configuration properties to make TaskScheduler submit tasks after a few of executors have been registered.

\# Submit tasks only after (registered executors / total executors) arrived the ratio, default value is 0
spark.scheduler.minRegisteredExecutorsRatio = 0.8

\# Whatever minRegisteredExecutorsRatio is arrived, submit tasks after the maxRegisteredWaitingTime(millisecond), default value is 30000
spark.scheduler.maxRegisteredExecutorsWaitingTime = 5000

Author: li-zhihui <zhihui.li@intel.com>

Closes #900 from li-zhihui/master and squashes the following commits:

b9f8326 [li-zhihui] Add logs & edit docs
1ac08b1 [li-zhihui] Add new configs to user docs
22ead12 [li-zhihui] Move waitBackendReady to postStartHook
c6f0522 [li-zhihui] Bug fix: numExecutors wasn't set & use constant DEFAULT_NUMBER_EXECUTORS
4d6d847 [li-zhihui] Move waitBackendReady to TaskSchedulerImpl.start & some code refactor
0ecee9a [li-zhihui] Move waitBackendReady from DAGScheduler.submitStage to TaskSchedulerImpl.submitTasks
4261454 [li-zhihui] Add docs for new configs & code style
ce0868a [li-zhihui] Code style, rename configuration property name of minRegisteredRatio & maxRegisteredWaitingTime
6cfb9ec [li-zhihui] Code style, revert default minRegisteredRatio of yarn to 0, driver get --num-executors in yarn/alpha
812c33c [li-zhihui] Fix driver lost --num-executors option in yarn-cluster mode
e7b6272 [li-zhihui] support yarn-cluster
37f7dc2 [li-zhihui] support yarn mode(percentage style)
3f8c941 [li-zhihui] submit stage after (configured ratio of) executors have been registered
2014-07-14 15:32:49 -05:00
Daoyuan 38ccd6ebd4 move some test file to match src code
Just move some test suite to corresponding package

Author: Daoyuan <daoyuan.wang@intel.com>

Closes #1401 from adrian-wang/movetestfiles and squashes the following commits:

d1a6803 [Daoyuan] move some test file to match src code
2014-07-14 10:40:44 -07:00
Daniel Darabos 2245c87af4 Use the Executor's ClassLoader in sc.objectFile().
This makes it possible to read classes from the object file which were specified in the user-provided jars. (By default ObjectInputStream uses latestUserDefinedLoader, which may or may not be the right one.)

I created this because I ran into the following problem. I have x:RDD[X] with X being defined in the jar that I provide to SparkContext. I save it with x.saveAsObjectFile("x"). I try to load it with sc.objectFile\[X\]("x"). It fails with ClassNotFoundException.

After a good while of debugging I figured out that Utils.deserialize() most likely uses the ClassLoader of Utils. This is the bootstrap ClassLoader, so it is not aware of the dynamically added jars. This patch fixes the issue.

A more robust fix would be to always default to Thread.currentThread.getContextClassLoader. This would prevent this problem from biting anyone in the future. It would be a bit harder to test though. On the topic of testing, if you'd like to see tests for this, I will need some hand-holding. Thanks!

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #181 from darabos/master and squashes the following commits:

45a011a [Daniel Darabos] Add test for SPARK-1877. (Fixed in 52eb54d.)
e13e090 [Daniel Darabos] Merge branch 'master' of https://github.com/apache/spark
61fe0d0 [Daniel Darabos] Fix style (line too long).
1b5df2c [Daniel Darabos] Use the Executor's ClassLoader in sc.objectFile(). This makes it possible to read classes from the object file which were specified in the user-provided jars. (By default ObjectInputStream uses latestUserDefinedLoader, which may or may not be the right one.)
2014-07-12 00:07:42 -07:00
Andrew Or f4f46dec5a [Minor] Remove unused val in Master
Author: Andrew Or <andrewor14@gmail.com>

Closes #1365 from andrewor14/master-fs and squashes the following commits:

497f100 [Andrew Or] Sneak in a space and hope no one will notice
05ba6da [Andrew Or] Remove unused val
2014-07-11 00:21:16 -07:00
Prashant Sharma 628932b8d0 [SPARK-1776] Have Spark's SBT build read dependencies from Maven.
Patch introduces the new way of working also retaining the existing ways of doing things.

For example build instruction for yarn in maven is
`mvn -Pyarn -PHadoop2.2 clean package -DskipTests`
in sbt it can become
`MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
Also supports
`sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:

a8ac951 [Prashant Sharma] Updated sbt version.
62b09bb [Prashant Sharma] Improvements.
fa6221d [Prashant Sharma] Excluding sql from mima
4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
72651ca [Prashant Sharma] Addresses code reivew comments.
acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
ac4312c [Prashant Sharma] Revert "minor fix"
6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path.
446768e [Prashant Sharma] minor fix
89b9777 [Prashant Sharma] Merge conflicts
d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups.
dccc8ac [Prashant Sharma] updated mima to check against 1.0
a49c61b [Prashant Sharma] Fix for tools jar
a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
cf88758 [Prashant Sharma] cleanup
9439ea3 [Prashant Sharma] Small fix to run-examples script.
96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
4973dbd [Patrick Wendell] Example build using pom reader.
2014-07-10 11:03:37 -07:00
Masayoshi TSUZUKI c2babc089b SPARK-2115: Stage kill link is too close to stage details link
Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #1350 from tsudukim/feature/SPARK-2115 and squashes the following commits:

e2263b0 [Masayoshi TSUZUKI] Moved (kill) link to the right side. Add confirmation dialog when (kill) link is clicked.
2014-07-10 01:18:37 -07:00
Kay Ousterhout 339441f545 [SPARK-2384] Add tooltips to UI.
This patch adds tooltips to clarify some points of confusion in the UI.  When users mouse over some of the table headers (shuffle read, write, and input size) as well as over the "scheduler delay" metric shown for each stage, a black tool tip (see image below) pops up describing the metric in more detail.  After the tooltip mechanism is added by this commit, I imagine others may want to add more tooltips for other things in the UI, but I think this is a good starting point.

![tooltip](https://cloud.githubusercontent.com/assets/1108612/3491905/994e179e-059f-11e4-92f2-c6c12d248d81.jpg)

This looks scary-big but much of it is adding the bootstrap tool tip JavaScript.

Also I have no idea what to put for the license in tooltip (I left it the same -- the Twitter apache header) or for JQuery (left it as nothing) -- @mateiz what's the right thing here?

cc @pwendell @andrewor14 @rxin

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1314 from kayousterhout/tooltips and squashes the following commits:

19981b5 [Kay Ousterhout] Exclude non-licensed javascript files from style check
d9ab5a9 [Kay Ousterhout] Response to Andrew's review
7752449 [Kay Ousterhout] [SPARK-2384] Add tooltips to UI.
2014-07-08 22:57:21 -07:00
Andrew Or bf04a390e4 [SPARK-2392] Executors should not start their own HTTP servers
Executors currently start their own unused HTTP file servers. This is because we use the same SparkEnv class for both executors and drivers, and we do not distinguish this case.

In the longer term, we should separate out SparkEnv for the driver and SparkEnv for the executors.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1335 from andrewor14/executor-http-server and squashes the following commits:

46ef263 [Andrew Or] Start HTTP server only on the driver
2014-07-08 17:35:31 -07:00
Daniel Darabos c8a2313cdf [SPARK-2403] Catch all errors during serialization in DAGScheduler
https://issues.apache.org/jira/browse/SPARK-2403

Spark hangs for us whenever we forget to register a class with Kryo. This should be a simple fix for that. But let me know if you have a better suggestion.

I did not write a new test for this. It would be pretty complicated and I'm not sure it's worthwhile for such a simple change. Let me know if you disagree.

Author: Daniel Darabos <darabos.daniel@gmail.com>

Closes #1329 from darabos/spark-2403 and squashes the following commits:

3aceaad [Daniel Darabos] Print full stack trace for miscellaneous exceptions during serialization.
52c22ba [Daniel Darabos] Only catch NonFatal exceptions.
361e962 [Daniel Darabos] Catch all errors during serialization in DAGScheduler.
2014-07-08 10:43:46 -07:00
witgo 3cd5029be7 Resolve sbt warnings during build Ⅱ
Author: witgo <witgo@qq.com>

Closes #1153 from witgo/expectResult and squashes the following commits:

97541d8 [witgo] merge master
ead26e7 [witgo] Resolve sbt warnings during build
2014-07-08 00:31:42 -07:00
ankit.bhardwaj 42f3abd529 [SPARK-2306]:BoundedPriorityQueue is private and not registered with Kry...
Due to the non registration of BoundedPriorityQueue  with kryoserializer, operations which are dependend on BoundedPriorityQueue are giving exceptions.One such instance is using top along with kryo serialization.
Fixed the issue by registering BoundedPriorityQueue with kryoserializer.

Author: ankit.bhardwaj <ankit.bhardwaj@guavus.com>

Closes #1299 from AnkitBhardwaj12/BoundedPriorityQueueWithKryoIssue and squashes the following commits:

a4ae8ed [ankit.bhardwaj] [SPARK-2306]:BoundedPriorityQueue is private and not registered with Kryo
2014-07-04 22:06:10 -07:00
Reynold Xin 0db5d5a22e Added SignalLogger to HistoryServer.
This was omitted in #1260. @aarondav

Author: Reynold Xin <rxin@apache.org>

Closes #1300 from rxin/historyServer and squashes the following commits:

af720a3 [Reynold Xin] Added SignalLogger to HistoryServer.
2014-07-04 17:33:07 -07:00
Aaron Davidson 97a0bfe1c0 SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
JIRA: https://issues.apache.org/jira/browse/SPARK-2282

This issue is caused by a buildup of sockets in the TIME_WAIT stage of TCP, which is a stage that lasts for some period of time after the communication closes.

This solution simply allows us to reuse sockets that are in TIME_WAIT, to avoid issues with the buildup of the rapid creation of these sockets.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1220 from aarondav/SPARK-2282 and squashes the following commits:

2e5cab3 [Aaron Davidson] SPARK-2282: Reuse PySpark Accumulator sockets to avoid crashing Spark
2014-07-03 23:02:36 -07:00
Andrew Or 3894a49be9 [SPARK-2307][Reprise] Correctly report RDD blocks on SparkUI
**Problem.** The existing code in `ExecutorPage.scala` requires a linear scan through all the blocks to filter out the uncached ones. Every refresh could be expensive if there are many blocks and many executors.

**Solution.** The proper semantics should be the following: `StorageStatusListener` should contain only block statuses that are cached. This means as soon as a block is unpersisted by any mean, its status should be removed. This is reflected in the changes made in `StorageStatusListener.scala`.

Further, the `StorageTab` must stop relying on the `StorageStatusListener` changing a dropped block's status to `StorageLevel.NONE` (which no longer happens). This is reflected in the changes made in `StorageTab.scala` and `StorageUtils.scala`.

----------

If you have been following this chain of PRs like pwendell, you will quickly notice that this reverts the changes in #1249, which reverts the changes in #1080. In other words, we are adding back the changes from #1080, and fixing SPARK-2307 on top of those changes. Please ask questions if you are confused.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1255 from andrewor14/storage-ui-fix-reprise and squashes the following commits:

45416fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into storage-ui-fix-reprise
a82ea25 [Andrew Or] Add tests for StorageStatusListener
8773b01 [Andrew Or] Update comment / minor changes
3afde3f [Andrew Or] Correctly report the number of blocks on SparkUI
2014-07-03 22:48:23 -07:00
Aaron Davidson 586feb5c95 [SPARK-2350] Don't NPE while launching drivers
Prior to this change, we could throw a NPE if we launch a driver while another one is waiting, because removing from an iterator while iterating over it is not safe.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1289 from aarondav/master-fail and squashes the following commits:

1cf1cf4 [Aaron Davidson] SPARK-2350: Don't NPE while launching drivers
2014-07-03 22:31:41 -07:00
Raymond Liu 5fa0a05763 [SPARK-1097] Workaround Hadoop conf ConcurrentModification issue
Workaround Hadoop conf ConcurrentModification issue

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1273 from colorant/hadoopRDD and squashes the following commits:

994e98b [Raymond Liu] Address comments
e2cda3d [Raymond Liu] Workaround Hadoop conf ConcurrentModification issue
2014-07-03 19:24:22 -07:00
Andrew Or c480537739 [SPARK] Fix NPE for ExternalAppendOnlyMap
It did not handle null keys very gracefully before.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1288 from andrewor14/fix-external and squashes the following commits:

312b8d8 [Andrew Or] Abstract key hash code
ed5adf9 [Andrew Or] Fix NPE for ExternalAppendOnlyMap
2014-07-03 10:26:50 -07:00
yantangzhai 3bbeca6489 [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
The spark.local.dir is configured as a list of multiple paths as follows /data1/sparkenv/local,/data2/sparkenv/local. If the disk data2 of the driver node has error, the application will exit since DiskBlockManager exits directly at createLocalDirs. If the disk data2 of the worker node has error, the executor will exit either.
DiskBlockManager should not exit directly at createLocalDirs if one of spark.local.dir has error. Since spark.local.dir has multiple paths, a problem should not affect the overall situation.
I think DiskBlockManager could ignore the bad directory at createLocalDirs.

Author: yantangzhai <tyz0303@163.com>

Closes #1274 from YanTangZhai/SPARK-2324 and squashes the following commits:

609bf48 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
df08673 [yantangzhai] [SPARK-2324] SparkContext should not exit directly when spark.local.dir is a list of multiple paths and one of them has error
2014-07-03 10:14:35 -07:00
Kay Ousterhout 05c3d90e35 [SPARK-2185] Emit warning when task size exceeds a threshold.
This functionality was added in an earlier commit but shortly
after was removed due to a bad git merge (totally my fault).

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1149 from kayousterhout/warning_bug and squashes the following commits:

3f1bb00 [Kay Ousterhout] Fixed Json tests
462a664 [Kay Ousterhout] Removed task set name from warning message
e89b2f6 [Kay Ousterhout] Fixed Json tests.
7af424c [Kay Ousterhout] Emit warning when task size exceeds a threshold.
2014-07-01 01:56:51 -07:00
Peter MacKinnon 3319a3e3c6 SPARK-2332 [build] add exclusion for old servlet-api on hadoop-client in core
Fix for class of test suite failures in jenkins

Author: Peter MacKinnon <pmackinn@redhat.com>

Closes #1271 from pdmack/master and squashes the following commits:

cfe59fd [Peter MacKinnon] exclude servlet-api in hadoop-client for sbt
6f39fec [Peter MacKinnon] add exclusion for old servlet-api on hadoop-client in core
2014-07-01 00:28:38 -07:00
Reynold Xin 5fccb567b3 [SPARK-2318] When exiting on a signal, print the signal name first.
Author: Reynold Xin <rxin@apache.org>

Closes #1260 from rxin/signalhandler1 and squashes the following commits:

8e73552 [Reynold Xin] Uh add Logging back in ApplicationMaster.
0402ba8 [Reynold Xin] Synchronize SignalLogger.register.
dc70705 [Reynold Xin] Added SignalLogger to YARN ApplicationMaster.
79a21b4 [Reynold Xin] Added license header.
0da052c [Reynold Xin] Added the SignalLogger itself.
e587d2e [Reynold Xin] [SPARK-2318] When exiting on a signal, print the signal name first.
2014-06-30 15:12:38 -07:00
Reynold Xin 358ae1534d [SPARK-2322] Exception in resultHandler should NOT crash DAGScheduler and shutdown SparkContext.
This should go into 1.0.1.

Author: Reynold Xin <rxin@apache.org>

Closes #1264 from rxin/SPARK-2322 and squashes the following commits:

c77c07f [Reynold Xin] Added comment to SparkDriverExecutionException and a test case for accumulator.
5d8d920 [Reynold Xin] [SPARK-2322] Exception in resultHandler could crash DAGScheduler and shutdown SparkContext.
2014-06-30 11:50:22 -07:00
Andrew Ash 6803642253 SPARK-2077 Log serializer that actually ends up being used
I could settle with this being a debug also if we provided an example of how to turn it on in `log4j.properties`

https://issues.apache.org/jira/browse/SPARK-2077

Author: Andrew Ash <andrew@andrewash.com>

Closes #1017 from ash211/SPARK-2077 and squashes the following commits:

580f680 [Andrew Ash] Drop to debug
0266415 [Andrew Ash] SPARK-2077 Log serializer that actually ends up being used
2014-06-29 23:29:05 -07:00
William Benton a484030dae SPARK-897: preemptively serialize closures
These commits cause `ClosureCleaner.clean` to attempt to serialize the cleaned closure with the default closure serializer and throw a `SparkException` if doing so fails.  This behavior is enabled by default but can be disabled at individual callsites of `SparkContext.clean`.

Commit 98e01ae8 fixes some no-op assertions in `GraphSuite` that this work exposed; I'm happy to put that in a separate PR if that would be more appropriate.

Author: William Benton <willb@redhat.com>

Closes #143 from willb/spark-897 and squashes the following commits:

bceab8a [William Benton] Commented DStream corner cases for serializability checking.
64d04d2 [William Benton] FailureSuite now checks both messages and causes.
3b3f74a [William Benton] Stylistic and doc cleanups.
b215dea [William Benton] Fixed spurious failures in ImplicitOrderingSuite
be1ecd6 [William Benton] Don't check serializability of DStream transforms.
abe816b [William Benton] Make proactive serializability checking optional.
5bfff24 [William Benton] Adds proactive closure-serializablilty checking
ed2ccf0 [William Benton] Test cases for SPARK-897.
2014-06-29 23:27:34 -07:00
jerryshao 66135a341d [SPARK-2104] Fix task serializing issues when sort with Java non serializable class
Details can be see in [SPARK-2104](https://issues.apache.org/jira/browse/SPARK-2104). This work is based on Reynold's work, add some unit tests to validate the issue.

@rxin , would you please take a look at this PR, thanks a lot.

Author: jerryshao <saisai.shao@intel.com>

Closes #1245 from jerryshao/SPARK-2104 and squashes the following commits:

c8ee362 [jerryshao] Make field partitions transient
2b41917 [jerryshao] Minor changes
47d763c [jerryshao] Fix task serializing issue when sort with Java non serializable class
2014-06-29 23:00:00 -07:00
Kay Ousterhout 7b71a0e096 [SPARK-1683] Track task read metrics.
This commit adds a new metric in TaskMetrics to record
the input data size and displays this information in the UI.

An earlier version of this commit also added the read time,
which can be useful for diagnosing straggler problems,
but unfortunately that change introduced a significant performance
regression for jobs that don't do much computation. In order to
track read time, we'll need to do sampling.

The screenshots below show the UI with the new "Input" field,
which I added to the stage summary page, the executor summary page,
and the per-stage page.

![image](https://cloud.githubusercontent.com/assets/1108612/3167930/2627f92a-eb77-11e3-861c-98ea5bb7a1a2.png)

![image](https://cloud.githubusercontent.com/assets/1108612/3167936/475a889c-eb77-11e3-9706-f11c48751f17.png)

![image](https://cloud.githubusercontent.com/assets/1108612/3167948/80ebcf12-eb77-11e3-87ed-349fce6a770c.png)

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #962 from kayousterhout/read_metrics and squashes the following commits:

f13b67d [Kay Ousterhout] Correctly format input bytes on executor page
8b70cde [Kay Ousterhout] Added comment about potential inaccuracy of bytesRead
d1016e8 [Kay Ousterhout] Udated SparkListenerSuite test
8461492 [Kay Ousterhout] Miniscule style fix
ae04d99 [Kay Ousterhout] Remove input metrics for parallel collections
719f19d [Kay Ousterhout] Style fixes
bb6ec62 [Kay Ousterhout] Small fixes
869ac7b [Kay Ousterhout] Updated Json tests
44a0301 [Kay Ousterhout] Fixed accidentally added line
4bd0568 [Kay Ousterhout] Added input source, renamed Hdfs to Hadoop.
f27e535 [Kay Ousterhout] Updates based on review comments and to fix rebase
bf41029 [Kay Ousterhout] Updated Json tests to pass
0fc33e0 [Kay Ousterhout] Added explicit backward compatibility test
4e52925 [Kay Ousterhout] Added Json output and associated tests.
365400b [Kay Ousterhout] [SPARK-1683] Track task read metrics.
2014-06-29 22:01:42 -07:00
Reynold Xin cdf613fc52 [SPARK-2320] Reduce exception/code block font size in web ui
Author: Reynold Xin <rxin@apache.org>

Closes #1261 from rxin/ui-pre-size and squashes the following commits:

7ab1a69 [Reynold Xin] [SPARK-2320] Reduce exception/code block font size in web ui
2014-06-29 16:46:28 -07:00
Reynold Xin 2053d793cc Improve MapOutputTracker error logging.
Author: Reynold Xin <rxin@apache.org>

Closes #1258 from rxin/mapOutputTracker and squashes the following commits:

a7c95b6 [Reynold Xin] Improve MapOutputTracker error logging.
2014-06-28 21:05:03 -07:00
Andrew Or f17510e371 [SPARK-2259] Fix highly misleading docs on cluster / client deploy modes
The existing docs are highly misleading. For standalone mode, for example, it encourages the user to use standalone-cluster mode, which is not officially supported. The safeguards have been added in Spark submit itself to prevent bad documentation from leading users down the wrong path in the future.

This PR is prompted by countless headaches users of Spark have run into on the mailing list.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1200 from andrewor14/submit-docs and squashes the following commits:

5ea2460 [Andrew Or] Rephrase cluster vs client explanation
c827f32 [Andrew Or] Clarify spark submit messages
9f7ed8f [Andrew Or] Clarify client vs cluster deploy mode + add safeguards
2014-06-27 16:11:31 -07:00
Andrew Or 21e0f77b63 [SPARK-2307] SparkUI - storage tab displays incorrect RDDs
The issue here is that the `StorageTab` listens for updates from the `StorageStatusListener`, but when a block is kicked out of the cache, `StorageStatusListener` removes it from its list. Thus, there is no way for the `StorageTab` to know whether a block has been dropped.

This issue was introduced in #1080, which was itself a bug fix. Here we revert that PR and offer a different fix for the original bug (SPARK-2144).

Author: Andrew Or <andrewor14@gmail.com>

Closes #1249 from andrewor14/storage-ui-fix and squashes the following commits:

af019ce [Andrew Or] Fix SPARK-2307
2014-06-27 15:23:25 -07:00
witgo 18f29b96c7 SPARK-2181:The keys for sorting the columns of Executor page in SparkUI are incorrect
Author: witgo <witgo@qq.com>

Closes #1135 from witgo/SPARK-2181 and squashes the following commits:

39dad90 [witgo] The keys for sorting the columns of Executor page in SparkUI are incorrect
2014-06-26 21:59:21 -07:00
Xiangrui Meng c23f5db32b [SPARK-2251] fix concurrency issues in random sampler
The following code is very likely to throw an exception:

~~~
val rdd = sc.parallelize(0 until 111, 10).sample(false, 0.1)
rdd.zip(rdd).count()
~~~

because the same random number generator is used in compute partitions.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1229 from mengxr/fix-sample and squashes the following commits:

f1ee3d7 [Xiangrui Meng] fix concurrency issues in random sampler
2014-06-26 21:46:55 -07:00
Reynold Xin d1636dd72f [SPARK-2297][UI] Make task attempt and speculation more explicit in UI.
New UI:

![screen shot 2014-06-26 at 1 43 52 pm](https://cloud.githubusercontent.com/assets/323388/3404643/82b9ddc6-fd73-11e3-96f9-f7592a7aee79.png)

Author: Reynold Xin <rxin@apache.org>

Closes #1236 from rxin/ui-task-attempt and squashes the following commits:

3b645dd [Reynold Xin] Expose attemptId in Stage.
c0474b1 [Reynold Xin] Beefed up unit test.
c404bdd [Reynold Xin] Fix ReplayListenerSuite.
f56be4b [Reynold Xin] Fixed JsonProtocolSuite.
e29e0f7 [Reynold Xin] Minor update.
5e4354a [Reynold Xin] [SPARK-2297][UI] Make task attempt and speculation more explicit in UI.
2014-06-26 21:13:26 -07:00
Reynold Xin bf578deaf2 Removed throwable field from FetchFailedException and added MetadataFetchFailedException
FetchFailedException used to have a Throwable field, but in reality we never propagate any of the throwable/exceptions back to the driver because Executor explicitly looks for FetchFailedException and then sends FetchFailed as the TaskEndReason.

This pull request removes the throwable and adds a MetadataFetchFailedException that extends FetchFailedException (so now MapOutputTracker throws MetadataFetchFailedException instead).

Author: Reynold Xin <rxin@apache.org>

Closes #1227 from rxin/metadataFetchException and squashes the following commits:

5cb1e0a [Reynold Xin] MetadataFetchFailedException extends FetchFailedException.
8861ee2 [Reynold Xin] Throw MetadataFetchFailedException in MapOutputTracker.
2014-06-26 21:12:16 -07:00
Reynold Xin 6587ef7c17 [SPARK-2286][UI] Report exception/errors for failed tasks that are not ExceptionFailure
Also added inline doc for each TaskEndReason.

Author: Reynold Xin <rxin@apache.org>

Closes #1225 from rxin/SPARK-2286 and squashes the following commits:

6a7959d [Reynold Xin] Fix unit test failure.
cf9d5eb [Reynold Xin] Merge branch 'master' into SPARK-2286
a61fae1 [Reynold Xin] Move to line above ...
38c7391 [Reynold Xin] [SPARK-2286][UI] Report exception/errors for failed tasks that are not ExceptionFailure.
2014-06-26 14:03:22 -07:00
Reynold Xin 4a346e242c [SPARK-2284][UI] Mark all failed tasks as failures.
Previously only tasks failed with ExceptionFailure reason was marked as failure.

Author: Reynold Xin <rxin@apache.org>

Closes #1224 from rxin/SPARK-2284 and squashes the following commits:

be79dbd [Reynold Xin] [SPARK-2284][UI] Mark all failed tasks as failures.
2014-06-25 22:35:03 -07:00
Mark Hamstra b88a59a668 [SPARK-1749] Job cancellation when SchedulerBackend does not implement killTask
This is a fixed up version of #686 (cc @markhamstra @pwendell).  The last commit (the only one I authored) reflects the changes I made from Mark's original patch.

Author: Mark Hamstra <markhamstra@gmail.com>
Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1219 from kayousterhout/mark-SPARK-1749 and squashes the following commits:

42dfa7e [Kay Ousterhout] Got rid of terrible double-negative name
80b3205 [Kay Ousterhout] Don't notify listeners of job failure if it wasn't successfully cancelled.
d156d33 [Mark Hamstra] Do nothing in no-kill submitTasks
9312baa [Mark Hamstra] code review update
cc353c8 [Mark Hamstra] scalastyle
e61f7f8 [Mark Hamstra] Catch UnsupportedOperationException when DAGScheduler tries to cancel a job on a SchedulerBackend that does not implement killTask
2014-06-25 20:57:48 -07:00
Sebastien Rainville 1132e472ec [SPARK-2204] Launch tasks on the proper executors in mesos fine-grained mode
The scheduler for Mesos in fine-grained mode launches tasks on the wrong executors. `MesosSchedulerBackend.resourceOffers(SchedulerDriver, List[Offer])` is assuming that `TaskSchedulerImpl.resourceOffers(Seq[WorkerOffer])` is returning task lists in the same order as the offers it was passed, but in the current implementation `TaskSchedulerImpl.resourceOffers` shuffles the offers to avoid assigning the tasks always to the same executors. The result is that the tasks are launched on the wrong executors. The jobs are sometimes able to complete, but most of the time they fail. It seems that as soon as something goes wrong with a task for some reason Spark is not able to recover since it's mistaken as to where the tasks are actually running. Also, it seems that the more the cluster is under load the more likely the job is to fail because there's a higher probability that Spark is trying to launch a task on a slave that doesn't actually have enough resources, again because it's using the wrong offers.

The solution is to not assume that the order in which the tasks are returned is the same as the offers, and simply launch the tasks on the executor decided by `TaskSchedulerImpl.resourceOffers`. What I am not sure about is that I considered slaveId and executorId to be the same, which is true at least in my setup, but I don't know if that is always true.

I tested this on top of the 1.0.0 release and it seems to work fine on our cluster.

Author: Sebastien Rainville <sebastien@hopper.com>

Closes #1140 from sebastienrainville/fine-grained-mode-fix-master and squashes the following commits:

a98b0e0 [Sebastien Rainville] Use a HashMap to retrieve the offer indices
d6ffe54 [Sebastien Rainville] Launch tasks on the proper executors in mesos fine-grained mode
2014-06-25 13:21:18 -07:00
Reynold Xin 7ff2c754f3 [SPARK-2270] Kryo cannot serialize results returned by asJavaIterable
and thus groupBy/cogroup are broken in Java APIs when Kryo is used).

@pwendell this should be merged into 1.0.1.

Thanks @sorenmacbeth for reporting this & helping out with the fix.

Author: Reynold Xin <rxin@apache.org>

Closes #1206 from rxin/kryo-iterable-2270 and squashes the following commits:

09da0aa [Reynold Xin] Updated the comment.
009bf64 [Reynold Xin] [SPARK-2270] Kryo cannot serialize results returned by asJavaIterable (and thus groupBy/cogroup are broken in Java APIs when Kryo is used).
2014-06-25 12:43:22 -07:00
Andrew Or 9aa603296c [SPARK-2258 / 2266] Fix a few worker UI bugs
**SPARK-2258.** Worker UI displays zombie processes if the executor throws an exception before a process is launched. This is because we only inform the Worker of the change if the process is already launched, which in this case it isn't.

**SPARK-2266.** We expose "Some(app-id)" on the log page. This is fairly minor.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1213 from andrewor14/fix-worker-ui and squashes the following commits:

c1223fe [Andrew Or] Fix worker UI bugs
2014-06-25 12:23:08 -07:00
CodingCat acc01ab326 SPARK-2038: rename "conf" parameters in the saveAsHadoop functions with source-compatibility
https://issues.apache.org/jira/browse/SPARK-2038

to differentiate with SparkConf object and at the same time keep the source level compatibility

Author: CodingCat <zhunansjtu@gmail.com>

Closes #1137 from CodingCat/SPARK-2038 and squashes the following commits:

11abeba [CodingCat] revise the comments
7ee5712 [CodingCat] to keep the source-compatibility
763975f [CodingCat] style fix
d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
2014-06-25 00:23:32 -07:00
witgo b6b44853cd SPARK-2248: spark.default.parallelism does not apply in local mode
Author: witgo <witgo@qq.com>

Closes #1194 from witgo/SPARK-2248 and squashes the following commits:

6ac950b [witgo] spark.default.parallelism does not apply in local mode
2014-06-24 19:45:03 -07:00
Michael Armbrust 2714968e1b Fix possible null pointer in acumulator toString
Author: Michael Armbrust <michael@databricks.com>

Closes #1204 from marmbrus/nullPointerToString and squashes the following commits:

35b5fce [Michael Armbrust] Fix possible null pointer in acumulator toString
2014-06-24 19:39:19 -07:00
Xiangrui Meng 8ca41769fb [SPARK-1112, 2156] Bootstrap to fetch the driver's Spark properties.
This is an alternative solution to #1124 . Before launching the executor backend, we first fetch driver's spark properties and use it to overwrite executor's spark properties. This should be better than #1124.

@pwendell Are there spark properties that might be different on the driver and on the executors?

Author: Xiangrui Meng <meng@databricks.com>

Closes #1132 from mengxr/akka-bootstrap and squashes the following commits:

77ff32d [Xiangrui Meng] organize imports
68e1dfb [Xiangrui Meng] use timeout from AkkaUtils; remove props from RegisteredExecutor
46d332d [Xiangrui Meng] fix a test
7947c18 [Xiangrui Meng] increase slack size for akka
4ab696a [Xiangrui Meng] bootstrap to retrieve driver spark conf
2014-06-24 19:06:07 -07:00
Kay Ousterhout 1978a9033e Fix broken Json tests.
The assertJsonStringEquals method was missing an "assert" so
did not actually check that the strings were equal. This commit
adds the missing assert and fixes subsequently revealed problems
with the JsonProtocolSuite.

@andrewor14 I changed some of the test functionality to match what it
looks like you intended based on the expected strings -- let me know if
anything here looks wrong.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #1198 from kayousterhout/json_test_fix and squashes the following commits:

77f858f [Kay Ousterhout] Fix broken Json tests.
2014-06-24 16:54:50 -07:00
Rui Li 924b7082b1 SPARK-1937: fix issue with task locality
Don't check executor/host availability when creating a TaskSetManager. Because the executors may haven't been registered when the TaskSetManager is created, in which case all tasks will be considered "has no preferred locations", and thus losing data locality in later scheduling.

Author: Rui Li <rui.li@intel.com>
Author: lirui-intel <rui.li@intel.com>

Closes #892 from lirui-intel/delaySchedule and squashes the following commits:

8444d7c [Rui Li] fix code style
fafd57f [Rui Li] keep locality constraints within the valid levels
18f9e05 [Rui Li] restrict allowed locality
5b3fb2f [Rui Li] refine UT
99f843e [Rui Li] add unit test and fix bug
fff4123 [Rui Li] fix computing valid locality levels
685ed3d [Rui Li] remove delay shedule for pendingTasksWithNoPrefs
7b0177a [Rui Li] remove redundant code
c7b93b5 [Rui Li] revise patch
3d7da02 [lirui-intel] Update TaskSchedulerImpl.scala
cab4c71 [Rui Li] revised patch
539a578 [Rui Li] fix code style
cf0d6ac [Rui Li] fix code style
3dfae86 [Rui Li] re-compute pending tasks when new host is added
a225ac2 [Rui Li] SPARK-1937: fix issue with task locality
2014-06-24 11:40:37 -07:00
jerryshao 56eb8af187 [SPARK-2124] Move aggregation into shuffle implementations
This PR is a sub-task of SPARK-2044 to move the execution of aggregation into shuffle implementations.

I leave `CoGoupedRDD` and `SubtractedRDD` unchanged because they have their implementations of aggregation. I'm not sure is it suitable to change these two RDDs.

Also I do not move sort related code of `OrderedRDDFunctions` into shuffle, this will be solved in another sub-task.

Author: jerryshao <saisai.shao@intel.com>

Closes #1064 from jerryshao/SPARK-2124 and squashes the following commits:

4a05a40 [jerryshao] Modify according to comments
1f7dcc8 [jerryshao] Style changes
50a2fd6 [jerryshao] Fix test suite issue after moving aggregator to Shuffle reader and writer
1a96190 [jerryshao] Code modification related to the ShuffledRDD
308f635 [jerryshao] initial works of move combiner to ShuffleManager's reader and writer
2014-06-23 20:25:46 -07:00
Henry Saputra 383bf72c11 Cleanup on Connection, ConnectionManagerId, ConnectionManager classes part 2
Cleanup on Connection, ConnectionManagerId, and ConnectionManager classes part 2 while I was working at the code there to help IDE:
1. Remove unused imports
2. Remove parentheses in method calls that do not have side affect.
3. Add parentheses in method calls that do have side effect or not simple get to object properties.
4. Change if-else check (via isInstanceOf) for Connection class type with Scala expression for consistency and cleanliness.
5. Remove semicolon
6. Remove extra spaces.
7. Remove redundant return for consistency

Author: Henry Saputra <henry.saputra@gmail.com>

Closes #1157 from hsaputra/cleanup_connection_classes_part2 and squashes the following commits:

4be6906 [Henry Saputra] Fix Spark Scala style for line over 100 chars.
85b24f7 [Henry Saputra] Cleanup on Connection and ConnectionManager classes part 2 while I was working at the code there to help IDE: 1. Remove unused imports 2. Remove parentheses in method calls that do not have side affect. 3. Add parentheses in method calls that do have side effect. 4. Change if-else check (via isInstanceOf) for Connection class type with Scala expression for consitency and cleanliness. 5. Remove semicolon 6. Remove extra spaces.
2014-06-23 17:13:26 -07:00
Marcelo Vanzin 21ddd7d1e9 [SPARK-1768] History server enhancements.
Two improvements to the history server:

- Separate the HTTP handling from history fetching, so that it's easy to add
  new backends later (thinking about SPARK-1537 in the long run)

- Avoid loading all UIs in memory. Do lazy loading instead, keeping a few in
  memory for faster access. This allows the app limit to go away, since holding
  just the listing in memory shouldn't be too expensive unless the user has millions
  of completed apps in the history (at which point I'd expect other issues to arise
  aside from history server memory usage, such as FileSystem.listStatus()
  starting to become ridiculously expensive).

I also fixed a few minor things along the way which aren't really worth mentioning.
I also removed the app's log path from the UI since that information may not even
exist depending on which backend is used (even though there is only one now).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #718 from vanzin/hist-server and squashes the following commits:

53620c9 [Marcelo Vanzin] Add mima exclude, fix scaladoc wording.
c21f8d8 [Marcelo Vanzin] Feedback: formatting, docs.
dd8cc4b [Marcelo Vanzin] Standardize on using spark.history.* configuration.
4da3a52 [Marcelo Vanzin] Remove UI from ApplicationHistoryInfo.
2a7f68d [Marcelo Vanzin] Address review feedback.
4e72c77 [Marcelo Vanzin] Remove comment about ordering.
249bcea [Marcelo Vanzin] Remove offset / count from provider interface.
ca5d320 [Marcelo Vanzin] Remove code that deals with unfinished apps.
6e2432f [Marcelo Vanzin] Second round of feedback.
b2c570a [Marcelo Vanzin] Make class package-private.
4406f61 [Marcelo Vanzin] Cosmetic change to listing header.
e852149 [Marcelo Vanzin] Initialize new app array to expected size.
e8026f4 [Marcelo Vanzin] Review feedback.
49d2fd3 [Marcelo Vanzin] Fix a comment.
91e96ca [Marcelo Vanzin] Fix scalastyle issues.
6fbe0d8 [Marcelo Vanzin] Better handle failures when loading app info.
eee2f5a [Marcelo Vanzin] Ensure server.stop() is called when shutting down.
bda2fa1 [Marcelo Vanzin] Rudimentary paging support for the history UI.
b284478 [Marcelo Vanzin] Separate history server from history backend.
2014-06-23 13:53:44 -07:00
witgo 409d24e2b2 SPARK-2229: FileAppender throw an llegalArgumentException in jdk6
Author: witgo <witgo@qq.com>

Closes #1174 from witgo/SPARK-2229 and squashes the following commits:

f85f321 [witgo] FileAppender throw anIllegalArgumentException in JDK6
e1a8da8 [witgo] SizeBasedRollingPolicy throw an java.lang.IllegalArgumentException in JDK6
2014-06-22 18:25:16 -07:00
Sean Owen 9fe28c35df SPARK-1316. Remove use of Commons IO
Commons IO is actually barely used, and is not a declared dependency. This just replaces with equivalents from the JDK and Guava.

Author: Sean Owen <sowen@cloudera.com>

Closes #1173 from srowen/SPARK-1316 and squashes the following commits:

2eb53db [Sean Owen] Reorder Guava import
8fde404 [Sean Owen] Remove use of Commons IO, which is not actually a dependency
2014-06-22 11:47:49 -07:00
Marcelo Vanzin 648553d48e Fix some tests.
- JavaAPISuite was trying to compare a bare path with a URI. Fix by
  extracting the path from the URI, since we know it should be a
  local path anyway/

- b9be1609 excluded the ASM dependency everywhere, but easymock needs
  it (because cglib needs it). So re-add the dependency, with test
  scope this time.

The second one above actually uncovered a weird situation: the maven
test target works, even though I can't find the class sbt complains
about in its classpath. sbt complains with:

  [error] Uncaught exception when running org.apache.spark.util
  .random.RandomSamplerSuite: java.lang.NoClassDefFoundError:
  org/objectweb/asm/Type

To avoid more weirdness caused by that, I explicitly added the asm
dependency to both maven and sbt (for tests only), and verified
the classes don't end up in the final assembly.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #917 from vanzin/flaky-tests and squashes the following commits:

d022320 [Marcelo Vanzin] Fix some tests.
2014-06-20 20:05:12 -07:00
Anant 010c460d62 [SPARK-2061] Made splits deprecated in JavaRDDLike
The jira for the issue can be found at: https://issues.apache.org/jira/browse/SPARK-2061
Most of spark has used over to consistently using `partitions` instead of `splits`. We should do likewise and add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all cases where other API's (e.g. Python) call `splits` and we should change those to use the newer API.

Author: Anant <anant.asty@gmail.com>

Closes #1062 from anantasty/SPARK-2061 and squashes the following commits:

b83ce6b [Anant] Fixed syntax issue
21f9210 [Anant] Fixed version number in deprecation string
9315b76 [Anant] made related changes to use partitions in python api
8c62dd1 [Anant] Made splits deprecated in JavaRDDLike
2014-06-20 18:57:24 -07:00
Patrick Wendell a678642495 HOTFIX: Fixing style error introduced by 08d0ac 2014-06-20 18:44:54 -07:00
Doris Xin e99903b84a [SPARK-1970] Update unit test in XORShiftRandomSuite to use ChiSquareTest from commons-math3
Updating the chisquare unit test in XORShiftRandomSuite to use the ChiSquareTest in commons-math3 instead of hardcoding the chisquare statistic for the desired confidence interval.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1073 from dorx/math3Unit and squashes the following commits:

da0e891 [Doris Xin] remove math3 from common pom
9954143 [Doris Xin] merge master
c19948f [Doris Xin] Merge branch 'master' into math3Unit
8f84f19 [Doris Xin] [SPARK-1970] unit test in XORShiftRandomSuite
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
2014-06-20 18:42:02 -07:00
Andrew Ash 08d0aca78c SPARK-1902 Silence stacktrace from logs when doing port failover to port n+1
Before:

```
14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED SelectChannelConnector@0.0.0.0:4040: java.net.BindException: Address already in use
java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:444)
	at sun.nio.ch.Net.bind(Net.java:436)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
	at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
	at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
	at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
	at org.eclipse.jetty.server.Server.doStart(Server.java:293)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
	at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
	at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
	at $line3.$read$$iwC$$iwC.<init>(<console>:8)
	at $line3.$read$$iwC.<init>(<console>:14)
	at $line3.$read.<init>(<console>:16)
	at $line3.$read$.<init>(<console>:20)
	at $line3.$read$.<clinit>(<console>)
	at $line3.$eval$.<init>(<console>:7)
	at $line3.$eval$.<clinit>(<console>)
	at $line3.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
14/06/08 23:58:23 WARN AbstractLifeCycle: FAILED org.eclipse.jetty.server.Server@7439e55a: java.net.BindException: Address already in use
java.net.BindException: Address already in use
	at sun.nio.ch.Net.bind0(Native Method)
	at sun.nio.ch.Net.bind(Net.java:444)
	at sun.nio.ch.Net.bind(Net.java:436)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:214)
	at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
	at org.eclipse.jetty.server.nio.SelectChannelConnector.open(SelectChannelConnector.java:187)
	at org.eclipse.jetty.server.AbstractConnector.doStart(AbstractConnector.java:316)
	at org.eclipse.jetty.server.nio.SelectChannelConnector.doStart(SelectChannelConnector.java:265)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
	at org.eclipse.jetty.server.Server.doStart(Server.java:293)
	at org.eclipse.jetty.util.component.AbstractLifeCycle.start(AbstractLifeCycle.java:64)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply$mcV$sp(JettyUtils.scala:192)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
	at org.apache.spark.ui.JettyUtils$$anonfun$1.apply(JettyUtils.scala:192)
	at scala.util.Try$.apply(Try.scala:161)
	at org.apache.spark.ui.JettyUtils$.connect$1(JettyUtils.scala:191)
	at org.apache.spark.ui.JettyUtils$.startJettyServer(JettyUtils.scala:205)
	at org.apache.spark.ui.WebUI.bind(WebUI.scala:99)
	at org.apache.spark.SparkContext.<init>(SparkContext.scala:223)
	at org.apache.spark.repl.SparkILoop.createSparkContext(SparkILoop.scala:957)
	at $line3.$read$$iwC$$iwC.<init>(<console>:8)
	at $line3.$read$$iwC.<init>(<console>:14)
	at $line3.$read.<init>(<console>:16)
	at $line3.$read$.<init>(<console>:20)
	at $line3.$read$.<clinit>(<console>)
	at $line3.$eval$.<init>(<console>:7)
	at $line3.$eval$.<clinit>(<console>)
	at $line3.$eval.$print(<console>)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:788)
	at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1056)
	at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:614)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:645)
	at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:609)
	at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:796)
	at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:841)
	at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:753)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:121)
	at org.apache.spark.repl.SparkILoopInit$$anonfun$initializeSpark$1.apply(SparkILoopInit.scala:120)
	at org.apache.spark.repl.SparkIMain.beQuietDuring(SparkIMain.scala:263)
	at org.apache.spark.repl.SparkILoopInit$class.initializeSpark(SparkILoopInit.scala:120)
	at org.apache.spark.repl.SparkILoop.initializeSpark(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1$$anonfun$apply$mcZ$sp$5.apply$mcV$sp(SparkILoop.scala:913)
	at org.apache.spark.repl.SparkILoopInit$class.runThunks(SparkILoopInit.scala:142)
	at org.apache.spark.repl.SparkILoop.runThunks(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoopInit$class.postInitialization(SparkILoopInit.scala:104)
	at org.apache.spark.repl.SparkILoop.postInitialization(SparkILoop.scala:56)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply$mcZ$sp(SparkILoop.scala:930)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
	at org.apache.spark.repl.SparkILoop$$anonfun$process$1.apply(SparkILoop.scala:884)
	at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:884)
	at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:982)
	at org.apache.spark.repl.Main$.main(Main.scala:31)
	at org.apache.spark.repl.Main.main(Main.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:292)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
14/06/08 23:58:23 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
14/06/08 23:58:23 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use)
14/06/08 23:58:23 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041
````

After:
```
14/06/09 00:04:12 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
14/06/09 00:04:12 INFO JettyUtils: Error was: Failure(java.net.BindException: Address already in use)
14/06/09 00:04:12 INFO Server: jetty-8.y.z-SNAPSHOT
14/06/09 00:04:12 INFO AbstractConnector: Started SelectChannelConnector@0.0.0.0:4041
14/06/09 00:04:12 INFO SparkUI: Started SparkUI at http://aash-mbp.local:4041
```

Lengthy logging comes from this line of code in Jetty: http://grepcode.com/file/repo1.maven.org/maven2/org.eclipse.jetty.aggregate/jetty-all/9.1.3.v20140225/org/eclipse/jetty/util/component/AbstractLifeCycle.java#210

Author: Andrew Ash <andrew@andrewash.com>

Closes #1019 from ash211/SPARK-1902 and squashes the following commits:

0dd02f7 [Andrew Ash] Leave old org.eclipse.jetty silencing in place
1e2866b [Andrew Ash] Address CR comments
9d85eed [Andrew Ash] SPARK-1902 Silence stacktrace from logs when doing port failover to port n+1
2014-06-20 18:26:10 -07:00
Andrew Or 01125a1162 Clean up CacheManager et al.
**UPDATE**

I have removed the special handling for `StorageLevel.MEMORY_*_SER` for now, because it introduces a potential performance regression. With the latest changes, this PR should include mainly style (code readability) fixes. The only functionality change is the update in `MemoryStore#putBytes` to actually return updated blocks, though this is a minor bug fix.

Now this is mainly a precursor to another PR (once again).

---------
*Old comment*

The deserialized version of a partition may occupy much more space than the serialized version. Therefore, if a partition is to be cached with `StorageLevel.MEMORY_*_SER`, we don't need to fully unroll it into an `ArrayBuffer`, but instead we can unroll it into a potentially much smaller `ByteBuffer`. This may save us from OOMs in this case.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1083 from andrewor14/unroll-them-partitions and squashes the following commits:

7048aa0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
3d9a366 [Andrew Or] Minor change for readability
d12b95f [Andrew Or] Remove unused imports (minor)
a4c387b [Andrew Or] Merge branch 'master' of github.com:apache/spark into unroll-them-partitions
cf5f565 [Andrew Or] Remove special handling for MEM_*_SER
0091ec0 [Andrew Or] Address review feedback
44ef282 [Andrew Or] Actually return updated blocks in putBytes
2941c89 [Andrew Or] Clean up BlockStore (minor)
a8f181d [Andrew Or] Add special handling for StorageLevel.MEMORY_*_SER
2014-06-20 17:14:33 -07:00
Allan Douglas R. de Oliveira 6a224c31e8 SPARK-1868: Users should be allowed to cogroup at least 4 RDDs
Adds cogroup for 4 RDDs.

Author: Allan Douglas R. de Oliveira <allandouglas@gmail.com>

Closes #813 from douglaz/more_cogroups and squashes the following commits:

f8d6273 [Allan Douglas R. de Oliveira] Test python groupWith for one more case
0e9009c [Allan Douglas R. de Oliveira] Added scala tests
c3ffcdd [Allan Douglas R. de Oliveira] Added java tests
517a67f [Allan Douglas R. de Oliveira] Added tests for python groupWith
2f402d5 [Allan Douglas R. de Oliveira] Removed TODO
17474f4 [Allan Douglas R. de Oliveira] Use new cogroup function
7877a2a [Allan Douglas R. de Oliveira] Fixed code
ba02414 [Allan Douglas R. de Oliveira] Added varargs cogroup to pyspark
c4a8a51 [Allan Douglas R. de Oliveira] Added java cogroup 4
e94963c [Allan Douglas R. de Oliveira] Fixed spacing
f1ee57b [Allan Douglas R. de Oliveira] Fixed scala style issues
d7196f1 [Allan Douglas R. de Oliveira] Allow the cogroup of 4 RDDs
2014-06-20 11:03:03 -07:00
Patrick Wendell e5514790d7 HOTFIX: SPARK-2208 local metrics tests can fail on fast machines
Author: Patrick Wendell <pwendell@gmail.com>

Closes #1141 from pwendell/hotfix and squashes the following commits:

83e4c79 [Patrick Wendell] HOTFIX: SPARK-2208 local metrics tests can fail on fast machines
2014-06-19 21:06:28 -07:00
nravi f14b00a9c6 [SPARK-2151] Recognize memory format for spark-submit
int format expected for input memory parameter when spark-submit is invoked in standalone cluster mode. Make it consistent with rest of Spark.

Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #1095 from nishkamravi2/master and squashes the following commits:

2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2014-06-19 17:11:06 -07:00
WangTao 67fca189c9 Minor fix
The value "env" is never used in SparkContext.scala.
Add detailed comment for method setDelaySeconds in MetadataCleaner.scala instead of the unsure one.

Author: WangTao <barneystinson@aliyun.com>

Closes #1105 from WangTaoTheTonic/master and squashes the following commits:

688358e [WangTao] Minor fix
2014-06-18 23:24:57 -07:00
Doris Xin 45a95f82ca Remove unicode operator from RDD.scala
Some IDEs don’t support unicode characters in source code. Check if this breaks binary compatibility.

Author: Doris Xin <doris.s.xin@gmail.com>

Closes #1119 from dorx/unicode and squashes the following commits:

05618c3 [Doris Xin] Remove unicode operator from RDD.scala
2014-06-18 15:01:29 -07:00
Mark Hamstra 4cbeea83e0 SPARK-2158 Clean up core/stdout file from FileAppenderSuite
@tdas

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #1100 from markhamstra/SPARK-2158 and squashes the following commits:

ae8e069 [Mark Hamstra] Response to TD's review
2f1e201 [Mark Hamstra] Cleanup 'stdout' file within FileAppenderSuite
2014-06-18 14:56:41 -07:00
Reynold Xin dd96fcda01 Updated the comment for SPARK-2162.
A follow up on #1103

@andrewor14

Author: Reynold Xin <rxin@apache.org>

Closes #1117 from rxin/SPARK-2162 and squashes the following commits:

a4231de [Reynold Xin] Updated the comment for SPARK-2162.
2014-06-18 12:48:58 -07:00
Raymond Liu 5ad5e3486a [SPARK-2162] Double check in doGetLocal to avoid read on removed block.
other wise, it will either read in vain in memory level case, or throw exception in disk level case when it believe the block is there while actually it had been removed.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1103 from colorant/bm and squashes the following commits:

daac114 [Raymond Liu] Address comments
d1ea287 [Raymond Liu] Double check in doGetLocal to avoid read on removed block.
2014-06-18 10:57:45 -07:00
Patrick Wendell 9e4b4bd083 Revert "SPARK-2038: rename "conf" parameters in the saveAsHadoop functions"
This reverts commit 443f5e1bbc.

This commit unfortunately would break source compatibility if users have named
the hadoopConf parameter.
2014-06-17 19:34:17 -07:00
Andrew Or a14807e84c [SPARK-2147 / 2161] Show removed executors on the UI
This PR includes two changes
- **[SPARK-2147]** When an application finishes cleanly (i.e. `sc.stop()` is called), all of its executors used to disappear from the Master UI. This no longer happens.
- **[SPARK-2161]** This adds a "Removed Executors" table to Master UI, so the user can find out why their executors died from the logs, for instance. The equivalent table already existed in the Worker UI, but was hidden because of a bug (the comment `//scalastyle:off` disconnected the `Seq[Node]` that represents the HTML for table).

This should go into 1.0.1 if possible.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1102 from andrewor14/remember-removed-executors and squashes the following commits:

2e2298f [Andrew Or] Add hash code method to ExecutorInfo (minor)
abd72e0 [Andrew Or] Merge branch 'master' of github.com:apache/spark into remember-removed-executors
792f992 [Andrew Or] Add missing equals method in ExecutorInfo
3390b49 [Andrew Or] Add executor state column to WorkerPage
161f8a2 [Andrew Or] Display finished executors table (fix bug)
fbb65b8 [Andrew Or] Removed unused method
c89bb6e [Andrew Or] Add table for removed executors in MasterWebUI
fe47402 [Andrew Or] Show exited executors on the Master UI
2014-06-17 12:25:55 -07:00
CodingCat 443f5e1bbc SPARK-2038: rename "conf" parameters in the saveAsHadoop functions
to distinguish with SparkConf object

https://issues.apache.org/jira/browse/SPARK-2038

Author: CodingCat <zhunansjtu@gmail.com>

Closes #1087 from CodingCat/SPARK-2038 and squashes the following commits:

763975f [CodingCat] style fix
d91288d [CodingCat] rename "conf" parameters in the saveAsHadoop functions
2014-06-17 12:17:48 -07:00
Sandy Ryza 2794990e9e SPARK-2146. Fix takeOrdered doc
Removes Python syntax in Scaladoc, corrects result in Scaladoc, and removes irrelevant cache() call in Python doc.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1086 from sryza/sandy-spark-2146 and squashes the following commits:

185ff18 [Sandy Ryza] Use Seq instead of Array
c996120 [Sandy Ryza] SPARK-2146.  Fix takeOrdered doc
2014-06-17 12:03:22 -07:00
Andrew Ash b92d16b114 SPARK-1063 Add .sortBy(f) method on RDD
This never got merged from the apache/incubator-spark repo (which is now deleted) but there had been several rounds of code review on this PR there.

I think this is ready for merging.

Author: Andrew Ash <andrew@andrewash.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@apache.org>

Closes #369 from ash211/sortby and squashes the following commits:

d09147a [Andrew Ash] Fix Ordering import
43d0a53 [Andrew Ash] Fix missing .collect()
29a54ed [Andrew Ash] Re-enable test by converting to a closure
5a95348 [Andrew Ash] Add license for RDDSuiteUtils
64ed6e3 [Andrew Ash] Remove leaked diff
d4de69a [Andrew Ash] Remove scar tissue
63638b5 [Andrew Ash] Add Python version of .sortBy()
45e0fde [Andrew Ash] Add Java version of .sortBy()
adf84c5 [Andrew Ash] Re-indent to keep line lengths under 100 chars
9d9b9d8 [Andrew Ash] Use parentheses on .collect() calls
0457b69 [Andrew Ash] Ignore failing test
99f0baf [Andrew Ash] Merge branch 'master' into sortby
222ae97 [Andrew Ash] Try moving Ordering objects out to a different class
3fd0dd3 [Andrew Ash] Add (failing) test for sortByKey with explicit Ordering
b8b5bbc [Andrew Ash] Align remove extra spaces that were used to align ='s in test code
8c53298 [Andrew Ash] Actually use ascending and numPartitions parameters
381eef2 [Andrew Ash] Correct silly typo
7db3e84 [Andrew Ash] Support ascending and numPartitions params in sortBy()
0f685fd [Andrew Ash] Merge remote-tracking branch 'origin/master' into sortby
ca4490d [Andrew Ash] Add .sortBy(f) method on RDD
2014-06-17 11:47:48 -07:00
Andrew Or 09deb3eee0 [SPARK-2144] ExecutorsPage reports incorrect # of RDD blocks
This is reproducible whenever we drop a block because of memory pressure.

This is because StorageStatusListener actually never removes anything from the block maps of its StorageStatuses. Instead, when a block is dropped, it sets the block's storage level to `StorageLevel.NONE`, when it should just remove it from the map.

This PR includes this simple fix.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1080 from andrewor14/ui-blocks and squashes the following commits:

fcf9f1a [Andrew Or] Remove BlockStatus if it is no longer cached
2014-06-17 01:28:22 -07:00
Daniel Darabos 23a12ce20c SPARK-2035: Store call stack for stages, display it on the UI.
I'm not sure about the test -- I get a lot of unrelated failures for some reason. I'll try to sort it out. But hopefully the automation will test this for me if I send a pull request :).

I'll attach a demo HTML in [Jira](https://issues.apache.org/jira/browse/SPARK-2035).

Author: Daniel Darabos <darabos.daniel@gmail.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #981 from darabos/darabos-call-stack and squashes the following commits:

f7c6bfa [Daniel Darabos] Fix bad merge. I undid 83c226d454 by Doris.
3d0a48d [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
b857849 [Daniel Darabos] Style: Break long line.
ecb5690 [Daniel Darabos] Include the last Spark method in the full stack trace. Otherwise it is not visible if the stage name is overridden.
d00a85b [Patrick Wendell] Make call sites for stages non-optional and well defined
b9eba24 [Daniel Darabos] Make StageInfo.details non-optional. Add JSON serialization code for the new field. Verify JSON backward compatibility.
4312828 [Daniel Darabos] Remove Mima excludes for CallSite. They should be unnecessary now, with SPARK-2070 fixed.
0920750 [Daniel Darabos] Merge remote-tracking branch 'upstream/master' into darabos-call-stack
a4b1faf [Daniel Darabos] Add Mima exclusions for the CallSite changes it has picked up. They are private methods/classes, so we ought to be safe.
932f810 [Daniel Darabos] Use empty CallSite instead of null in DAGSchedulerSuite. Outside of testing, this parameter always originates in SparkContext.scala, and will never be null.
ccd89d1 [Daniel Darabos] Fix long lines.
ac173e4 [Daniel Darabos] Hide "show details" if there are no details to show.
6182da6 [Daniel Darabos] Set a configurable limit on maximum call stack depth. It can be useful in memory-constrained situations with large numbers of stages.
8fe2e34 [Daniel Darabos] Store call stack for stages, display it on the UI.
2014-06-17 00:08:05 -07:00
CodingCat 716c88aa14 SPARK-2039: apply output dir existence checking for all output formats
https://issues.apache.org/jira/browse/SPARK-2039

apply output dir existence checking for all output formats

Author: CodingCat <zhunansjtu@gmail.com>

Closes #1088 from CodingCat/SPARK-2039 and squashes the following commits:

c52747a [CodingCat] apply output dir existence checking for all output formats
2014-06-15 23:47:58 -07:00
CrazyJvm a63aa1adb2 SPARK-1999: StorageLevel in storage tab and RDD Storage Info never changes
StorageLevel in 'storage tab' and 'RDD Storage Info' never changes even if you call rdd.unpersist() and then you give the rdd another different storage level.

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #968 from CrazyJvm/ui-storagelevel and squashes the following commits:

62555fa [CrazyJvm] change RDDInfo constructor param 'storageLevel' to var, so there's need to add another variable _storageLevel。
9f1571e [CrazyJvm] JIRA https://issues.apache.org/jira/browse/SPARK-1999 UI : StorageLevel in storage tab and RDD Storage Info never changes
2014-06-15 23:23:26 -07:00
Kan Zhang ca5d9d43b9 [SPARK-937] adding EXITED executor state and not relaunching cleanly exited executors
There seems to be 2 issues.

1. When job is done, driver asks executor to shutdown. However, this clean exit was assigned FAILED executor state by Worker. I introduced EXITED executor state for executors who voluntarily exit (both normal and abnormal exit depending on the exit code).

2. When Master gets notified an executor has exited, it launches another one to replace it, regardless of reason why the executor had exited. When the reason was job has finished, the unnecessary replacement got subsequently killed when App disassociates. This launching and killing of unnecessary executors shows up in the log and is confusing to users. I added check for executor exit status and avoid launching (and subsequent killing) of unnecessary replacements when executors exit cleanly.

One could ask the scheduler to tell Master job is done so that Master wouldn't launch the replacement executor. However, there is a race condition between App telling Master job is done and Worker telling Master an executor had exited. There is no guarantee the former will happen before the later. Instead, I chose to check the exit code when executor exits. If the exit code is 0, I assume executor has been asked to shutdown by driver and Master will not launch replacements.

Due to race condition, it could also happen that (although didn't happen on my local cluster), Master detects App disassociation event before the executor exits by itself. In such cases, the executor will be rightfully killed and labeled as KILLED, while the App state will show FINISHED.

Author: Kan Zhang <kzhang@apache.org>

Closes #306 from kanzhang/SPARK-1118 and squashes the following commits:

cb0cc86 [Kan Zhang] [SPARK-937] adding EXITED executor state and not relaunching cleanly exited executors
2014-06-15 14:55:34 -07:00
Kan Zhang 7dd9fc67a6 [SPARK-1837] NumericRange should be partitioned in the same way as other...
... sequences

Author: Kan Zhang <kzhang@apache.org>

Closes #776 from kanzhang/SPARK-1837 and squashes the following commits:

e48f018 [Kan Zhang] [SPARK-1837] code refactoring
67c33b5 [Kan Zhang] minor change
403f9b1 [Kan Zhang] [SPARK-1837] NumericRange should be partitioned in the same way as other sequences
2014-06-14 14:31:28 -07:00
nravi 70c8116c0a Workaround in Spark for ConcurrentModification issue (JIRA Hadoop-10456, Spark-1097)
This fix has gone into Hadoop 2.4.1. For developers using <  2.4.1, it would be good to have a workaround in Spark as well.

Fix has been tested for performance as well, no regressions found.

Author: nravi <nravi@c1704.halxg.cloudera.com>

Closes #1000 from nishkamravi2/master and squashes the following commits:

eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles
2014-06-13 10:52:21 -07:00
Xiangrui Meng b3736e3d2f [HOTFIX] add math3 version to pom
Passed `mvn package`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1075 from mengxr/takeSample-fix and squashes the following commits:

45b4590 [Xiangrui Meng] add math3 version to pom
2014-06-13 02:59:38 -07:00
Andrew Or 44daec5abd [Minor] Fix style, formatting and naming in BlockManager etc.
This is a precursor to a bigger change. I wanted to separate out the relatively insignificant changes so the ultimate PR is not inflated.

(Warning: this PR is full of unimportant nitpicks)

Author: Andrew Or <andrewor14@gmail.com>

Closes #1058 from andrewor14/bm-minor and squashes the following commits:

8e12eaf [Andrew Or] SparkException -> BlockException
c36fd53 [Andrew Or] Make parts of BlockManager more readable
0a5f378 [Andrew Or] Entry -> MemoryEntry
e9762a5 [Andrew Or] Tone down string interpolation (minor reverts)
c4de9ac [Andrew Or] Merge branch 'master' of github.com:apache/spark into bm-minor
b3470f1 [Andrew Or] More string interpolation (minor)
7f9dcab [Andrew Or] Use string interpolation (minor)
94a425b [Andrew Or] Refactor against duplicate code + minor changes
8a6a7dc [Andrew Or] Exception -> SparkException
97c410f [Andrew Or] Deal with MIMA excludes
2480f1d [Andrew Or] Fixes in StorgeLevel.scala
abb0163 [Andrew Or] Style, formatting and naming fixes
2014-06-12 20:40:58 -07:00
Doris Xin 1de1d703bf SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
Modified the takeSample method in RDD to use the ScaSRS sampling technique to improve performance. Added a private method that computes sampling rate > sample_size/total to ensure sufficient sample size with success rate >= 0.9999. Added a unit test for the private method to validate choice of sampling rate.

Author: Doris Xin <doris.s.xin@gmail.com>
Author: dorx <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #916 from dorx/takeSample and squashes the following commits:

5b061ae [Doris Xin] merge master
444e750 [Doris Xin] edge cases
3de882b [dorx] Merge pull request #2 from mengxr/SPARK-1939
82dde31 [Xiangrui Meng] update pyspark's takeSample
48d954d [Doris Xin] remove unused imports from RDDSuite
fb1452f [Doris Xin] allowing num to be greater than count in all cases
1481b01 [Doris Xin] washing test tubes and making coffee
dc699f3 [Doris Xin] give back imports removed by accident in rdd.py
64e445b [Doris Xin] logwarnning as soon as it enters the while loop
55518ed [Doris Xin] added TODO for logging in rdd.py
eff89e2 [Doris Xin] addressed reviewer comments.
ecab508 [Doris Xin] "fixed checkstyle violation
0a9b3e3 [Doris Xin] "reviewer comment addressed"
f80f270 [Doris Xin] Merge branch 'master' into takeSample
ae3ad04 [Doris Xin] fixed edge cases to prevent overflow
065ebcd [Doris Xin] Merge branch 'master' into takeSample
9bdd36e [Doris Xin] Check sample size and move computeFraction
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
2014-06-12 19:44:27 -07:00
Ariel Rabkin 0154587ab7 document laziness of parallelize
Took me several hours to figure out this behavior. It would be good to highlight it in the documentation.

Author: Ariel Rabkin <asrabkin@cs.princeton.edu>

Closes #1070 from asrabkin/master and squashes the following commits:

29a076e [Ariel Rabkin] doc fix
2014-06-12 17:51:33 -07:00