Commit graph

3383 commits

Author SHA1 Message Date
uncleGen d8298c46b7 [SPARK-3170][CORE][BUG]:RDD info loss in "StorageTab" and "ExecutorTab"
compeleted stage only need to remove its own partitions that are no longer cached. However, "StorageTab" may lost some rdds which are cached actually. Not only in "StorageTab", "ExectutorTab" may also lose some rdd info which have been overwritten by last rdd in a same task.
1. "StorageTab": when multiple stages run simultaneously, completed stage will remove rdd info which belong to other stages that are still running.
2. "ExectutorTab": taskcontext may lose some "updatedBlocks" info of  rdds  in a dependency chain. Like the following example:
         val r1 = sc.paralize(..).cache()
         val r2 = r1.map(...).cache()
         val n = r2.count()

When count the r2, r1 and r2 will be cached finally. So in CacheManager.getOrCompute, the taskcontext should contain "updatedBlocks" of r1 and r2. Currently, the "updatedBlocks" only contain the info of r2.

Author: uncleGen <hustyugm@gmail.com>

Closes #2131 from uncleGen/master_ui_fix and squashes the following commits:

a6a8a0b [uncleGen] fix some coding style
3a1bc15 [uncleGen] fix some error in unit test
56ea488 [uncleGen] there's some line too long
c82ba82 [uncleGen] Bug Fix: RDD info loss in "StorageTab" and "ExecutorTab"
2014-08-27 10:33:01 -07:00
Tathagata Das 3e2864e404 [SPARK-3139] Made ContextCleaner to not block on shuffles
As a workaround for SPARK-3015, the ContextCleaner was made "blocking", that is, it cleaned items one-by-one. But shuffles can take a long time to be deleted. Given that the RC for 1.1 is imminent, this PR makes a narrow change in the context cleaner - not wait for shuffle cleanups to complete. Also it changes the error messages on failure to delete to be milder warnings, as exceptions in the delete code path for one item does not really stop the actual functioning of the system.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #2143 from tdas/cleaner-shuffle-fix and squashes the following commits:

9c84202 [Tathagata Das] Restoring default blocking behavior in ContextCleanerSuite, and added docs to identify that spark.cleaner.referenceTracking.blocking does not control shuffle.
2181329 [Tathagata Das] Mark shuffle cleanup as non-blocking.
e337cc2 [Tathagata Das] Changed semantics based on PR comments.
387b578 [Tathagata Das] Made ContextCleaner to not block on shuffles
2014-08-27 00:13:38 -07:00
Andrew Or 7557c4cfef [SPARK-3167] Handle special driver configs in Windows
This is an effort to bring the Windows scripts up to speed after recent splashing changes in #1845.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2129 from andrewor14/windows-config and squashes the following commits:

881a8f0 [Andrew Or] Add reference to Windows taskkill
92e6047 [Andrew Or] Update a few comments (minor)
22b1acd [Andrew Or] Fix style again (minor)
afcffea [Andrew Or] Fix style (minor)
72004c2 [Andrew Or] Actually respect --driver-java-options
803218b [Andrew Or] Actually respect SPARK_*_CLASSPATH
eeb34a0 [Andrew Or] Update outdated comment (minor)
35caecc [Andrew Or] In Windows, actually kill Java processes on exit
f97daa2 [Andrew Or] Fix Windows spark shell stdin issue
83ebe60 [Andrew Or] Parse special driver configs in Windows (broken)
2014-08-26 22:52:16 -07:00
Reynold Xin bf719056b7 [SPARK-3224] FetchFailed reduce stages should only show up once in failed stages (in UI)
This is a HOTFIX for 1.1.

Author: Reynold Xin <rxin@apache.org>
Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #2127 from rxin/SPARK-3224 and squashes the following commits:

effb1ce [Reynold Xin] Move log message.
49282b3 [Reynold Xin] Kay's feedback.
3f01847 [Reynold Xin] Merge pull request #2 from kayousterhout/SPARK-3224
796d282 [Kay Ousterhout] Added unit test for SPARK-3224
3d3d356 [Reynold Xin] Remove map output loc even for repeated FetchFaileds.
1dd3eb5 [Reynold Xin] [SPARK-3224] FetchFailed reduce stages should only show up once in the failed stages UI.
2014-08-26 22:12:37 -07:00
Cheng Lian faeb9c0e14 [SPARK-2964] [SQL] Remove duplicated code from spark-sql and start-thriftserver.sh
Author: Cheng Lian <lian.cs.zju@gmail.com>
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1886 from sarutak/SPARK-2964 and squashes the following commits:

8ef8751 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2964
26e7c95 [Kousuke Saruta] Revert "Shorten timeout to more reasonable value"
ffb68fa [Kousuke Saruta] Modified spark-sql and start-thriftserver.sh to use bin/utils.sh
8c6f658 [Kousuke Saruta] Merge branch 'spark-3026' of https://github.com/liancheng/spark into SPARK-2964
81b43a8 [Cheng Lian] Shorten timeout to more reasonable value
a89e66d [Cheng Lian] Fixed command line options quotation in scripts
9c894d3 [Cheng Lian] Fixed bin/spark-sql -S option typo
be4736b [Cheng Lian] Report better error message when running JDBC/CLI without hive-thriftserver profile enabled
2014-08-26 17:33:40 -07:00
Andrew Or b21ae5bbb9 [SPARK-2886] Use more specific actor system name than "spark"
As of #1777 we log the name of the actor system when it binds to a port. The current name "spark" is super general and does not convey any meaning. For instance, the following line is taken from my driver log after setting `spark.driver.port` to 5001.
```
14/08/13 19:33:29 INFO Remoting: Remoting started; listening on addresses:
[akka.tcp://sparkandrews-mbp:5001]
14/08/13 19:33:29 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkandrews-mbp:5001]
14/08/06 13:40:05 INFO Utils: Successfully started service 'spark' on port 5001.
```
This commit renames this to "sparkDriver" and "sparkExecutor". The goal of this unambitious PR is simply to make the logged information more explicit without introducing any change in functionality.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1810 from andrewor14/service-name and squashes the following commits:

8c459ed [Andrew Or] Use a common variable for driver/executor actor system names
3a92843 [Andrew Or] Change actor name to sparkDriver and sparkExecutor
921363e [Andrew Or] Merge branch 'master' of github.com:apache/spark into service-name
c8c6a62 [Andrew Or] Do not include hyphens in actor name
1c1b42e [Andrew Or] Avoid spaces in akka system name
f644b55 [Andrew Or] Use more specific service name
2014-08-25 23:36:09 -07:00
Kousuke Saruta 62f5009f67 [SPARK-2976] Replace tabs with spaces
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1895 from sarutak/SPARK-2976 and squashes the following commits:

1cf7e69 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2976
d1e0666 [Kousuke Saruta] Modified styles
c5e80a4 [Kousuke Saruta] Remove tab from JavaPageRank.java and JavaKinesisWordCountASL.java
c003b36 [Kousuke Saruta] Removed tab from sorttable.js
2014-08-25 19:40:23 -07:00
Xiangrui Meng fd8ace2d9a [FIX] fix error message in sendMessageReliably
rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #2120 from mengxr/sendMessageReliably and squashes the following commits:

b14400c [Xiangrui Meng] fix error message in sendMessageReliably
2014-08-25 14:55:20 -07:00
Raymond Liu 8861cdf112 Clean unused code in SortShuffleWriter
Just clean unused code which have been moved into ExternalSorter.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1882 from colorant/sortShuffleWriter and squashes the following commits:

e6337be [Raymond Liu] Clean unused code in SortShuffleWriter
2014-08-23 19:47:14 -07:00
Davies Liu 8df4dad495 [SPARK-2871] [PySpark] add approx API for RDD
RDD.countApprox(self, timeout, confidence=0.95)

        :: Experimental ::
        Approximate version of count() that returns a potentially incomplete
        result within a timeout, even if not all tasks have finished.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> rdd.countApprox(1000, 1.0)
        1000

RDD.sumApprox(self, timeout, confidence=0.95)

        Approximate operation to return the sum within a timeout
        or meet the confidence.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> r = sum(xrange(1000))
        >>> (rdd.sumApprox(1000) - r) / r < 0.05

RDD.meanApprox(self, timeout, confidence=0.95)

        :: Experimental ::
        Approximate operation to return the mean within a timeout
        or meet the confidence.

        >>> rdd = sc.parallelize(range(1000), 10)
        >>> r = sum(xrange(1000)) / 1000.0
        >>> (rdd.meanApprox(1000) - r) / r < 0.05
        True

Author: Davies Liu <davies.liu@gmail.com>

Closes #2095 from davies/approx and squashes the following commits:

e8c252b [Davies Liu] add approx API for RDD
2014-08-23 19:33:34 -07:00
Liang-Chi Hsieh 76bb044b9e [Minor] fix typo
Fix a typo in comment.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #2105 from viirya/fix_typo and squashes the following commits:

6596a80 [Liang-Chi Hsieh] fix typo.
2014-08-23 10:08:25 -07:00
Daoyuan Wang f3d65cd0bf [SPARK-3068]remove MaxPermSize option for jvm 1.8
In JVM 1.8.0, MaxPermSize is no longer supported.
In spark `stderr` output, there would be a line of

    Java HotSpot(TM) 64-Bit Server VM warning: ignoring option MaxPermSize=128m; support was removed in 8.0

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2011 from adrian-wang/maxpermsize and squashes the following commits:

ef1d660 [Daoyuan Wang] direct get java version in runtime
37db9c1 [Daoyuan Wang] code refine
3c1d554 [Daoyuan Wang] remove MaxPermSize option for jvm 1.8
2014-08-23 08:09:30 -07:00
Reynold Xin fb60bec34e [SPARK-2298] Encode stage attempt in SparkListener & UI.
Simple way to reproduce this in the UI:

```scala
val f = new java.io.File("/tmp/test")
f.delete()
sc.parallelize(1 to 2, 2).map(x => (x,x )).repartition(3).mapPartitionsWithContext { case (context, iter) =>
  if (context.partitionId == 0) {
    val f = new java.io.File("/tmp/test")
    if (!f.exists) {
      f.mkdir()
      System.exit(0);
    }
  }
  iter
}.count()
```

Author: Reynold Xin <rxin@apache.org>

Closes #1545 from rxin/stage-attempt and squashes the following commits:

3ee1d2a [Reynold Xin] - Rename attempt to retry in UI. - Properly report stage failure in FetchFailed.
40a6bd5 [Reynold Xin] Updated test suites.
c414c36 [Reynold Xin] Fixed the hanging in JobCancellationSuite.
b3e2eed [Reynold Xin] Oops previous code didn't compile.
0f36075 [Reynold Xin] Mark unknown stage attempt with id -1 and drop that in JobProgressListener.
6c08b07 [Reynold Xin] Addressed code review feedback.
4e5faa2 [Reynold Xin] [SPARK-2298] Encode stage attempt in SparkListener & UI.
2014-08-20 15:37:27 -07:00
Andrew Or b3ec51bfd7 [SPARK-2849] Handle driver configs separately in client mode
In client deploy mode, the driver is launched from within `SparkSubmit`'s JVM. This means by the time we parse Spark configs from `spark-defaults.conf`, it is already too late to control certain properties of the driver's JVM. We currently ignore these configs in client mode altogether.
```
spark.driver.memory
spark.driver.extraJavaOptions
spark.driver.extraClassPath
spark.driver.extraLibraryPath
```
This PR handles these properties before launching the driver JVM. It achieves this by spawning a separate JVM that runs a new class called `SparkSubmitDriverBootstrapper`, which spawns `SparkSubmit` as a sub-process with the appropriate classpath, library paths, java opts and memory.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1845 from andrewor14/handle-configs-bash and squashes the following commits:

bed4bdf [Andrew Or] Change a few comments / messages (minor)
24dba60 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
08fd788 [Andrew Or] Warn against external usages of SparkSubmitDriverBootstrapper
ff34728 [Andrew Or] Minor comments
51aeb01 [Andrew Or] Filter out JVM memory in Scala rather than Bash (minor)
9a778f6 [Andrew Or] Fix PySpark: actually kill driver on termination
d0f20db [Andrew Or] Don't pass empty library paths, classpath, java opts etc.
a78cb26 [Andrew Or] Revert a few changes in utils.sh (minor)
9ba37e2 [Andrew Or] Don't barf when the properties file does not exist
8867a09 [Andrew Or] A few more naming things (minor)
19464ad [Andrew Or] SPARK_SUBMIT_JAVA_OPTS -> SPARK_SUBMIT_OPTS
d6488f9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
1ea6bbe [Andrew Or] SparkClassLauncher -> SparkSubmitDriverBootstrapper
a91ea19 [Andrew Or] Fix precedence of library paths, classpath, java opts and memory
158f813 [Andrew Or] Remove "client mode" boolean argument
c84f5c8 [Andrew Or] Remove debug print statement (minor)
b71f52b [Andrew Or] Revert a few more changes (minor)
7d94a8d [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
3a8235d [Andrew Or] Only parse the properties file if special configs exist
c37e08d [Andrew Or] Revert a few more changes
a396eda [Andrew Or] Nullify my own hard work to simplify bash
0effa1e [Andrew Or] Add code in Scala that handles special configs
c886568 [Andrew Or] Fix lines too long + a few comments / style (minor)
7a4190a [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
7396be2 [Andrew Or] Explicitly comment that multi-line properties are not supported
fa11ef8 [Andrew Or] Parse the properties file only if the special configs exist
371cac4 [Andrew Or] Add function prefix (minor)
be99eb3 [Andrew Or] Fix tests to not include multi-line configs
bd0d468 [Andrew Or] Simplify parsing config file by ignoring multi-line arguments
56ac247 [Andrew Or] Use eval and set to simplify splitting
8d4614c [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
aeb79c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into handle-configs-bash
2732ac0 [Andrew Or] Integrate BASH tests into dev/run-tests + log error properly
8d26a5c [Andrew Or] Add tests for bash/utils.sh
4ae24c3 [Andrew Or] Fix bug: escape properly in quote_java_property
b3c4cd5 [Andrew Or] Fix bug: count the number of quotes instead of detecting presence
c2273fc [Andrew Or] Fix typo (minor)
e793e5f [Andrew Or] Handle multi-line arguments
5d8f8c4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
c7b9926 [Andrew Or] Minor changes to spark-defaults.conf.template
a992ae2 [Andrew Or] Escape spark.*.extraJavaOptions correctly
aabfc7e [Andrew Or] escape -> split (minor)
45a1eb9 [Andrew Or] Fix bug: escape escaped backslashes and quotes properly...
1cdc6b1 [Andrew Or] Fix bug: escape escaped double quotes properly
c854859 [Andrew Or] Add small comment
c13a2cb [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
8e552b7 [Andrew Or] Include an example of spark.*.extraJavaOptions
de765c9 [Andrew Or] Print spark-class command properly
a4df3c4 [Andrew Or] Move parsing and escaping logic to utils.sh
dec2343 [Andrew Or] Only export variables if they exist
fa2136e [Andrew Or] Escape Java options + parse java properties files properly
ef12f74 [Andrew Or] Minor formatting
4ec22a1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
e5cfb46 [Andrew Or] Collapse duplicate code + fix potential whitespace issues
4edcaa8 [Andrew Or] Redirect stdout to stderr for python
130f295 [Andrew Or] Handle spark.driver.memory too
98dd8e3 [Andrew Or] Add warning if properties file does not exist
8843562 [Andrew Or] Fix compilation issues...
75ee6b4 [Andrew Or] Remove accidentally added file
63ed2e9 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-driver-extra
0025474 [Andrew Or] Revert SparkSubmit handling of --driver-* options for only cluster mode
a2ab1b0 [Andrew Or] Parse spark.driver.extra* in bash
250cb95 [Andrew Or] Do not ignore spark.driver.extra* for client mode
2014-08-20 15:01:47 -07:00
Kousuke Saruta c1ba4cd6b4 [SPARK-3149] Connection establishment information is not enough.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2060 from sarutak/SPARK-3149 and squashes the following commits:

1cc89af [Kousuke Saruta] Modified log message of accepting connection
2014-08-20 14:04:39 -07:00
Kousuke Saruta 0ea46ac800 [SPARK-3062] [SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled
#1891 was to avoid IOException when EventLogging is enabled.
The solution used ShutdownHookManager but it was defined only Hadoop 2.x. Hadoop 1.x don't have ShutdownHookManager so #1891 doesn't compile on Hadoop 1.x

Now, I had a compromised solution for both Hadoop 1.x and 2.x.
Only for FileLogger, an unique FileSystem object is created.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1970 from sarutak/SPARK-2970 and squashes the following commits:

240c91e [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2970
0e7b45d [Kousuke Saruta] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
e1262ec [Kousuke Saruta] Modified Filelogger to use unique FileSystem instance
2014-08-20 13:26:11 -07:00
Josh Rosen ebcb94f701 [SPARK-2974] [SPARK-2975] Fix two bugs related to spark.local.dirs
This PR fixes two bugs related to `spark.local.dirs` and `SPARK_LOCAL_DIRS`, one where `Utils.getLocalDir()` might return an invalid directory (SPARK-2974) and another where the `SPARK_LOCAL_DIRS` override didn't affect the driver, which could cause problems when running tasks in local mode (SPARK-2975).

This patch fixes both issues: the new `Utils.getOrCreateLocalRootDirs(conf: SparkConf)` utility method manages the creation of local directories and handles the precedence among the different configuration options, so we should see the same behavior whether we're running in local mode or on a worker.

It's kind of a pain to mock out environment variables in tests (no easy way to mock System.getenv), so I added a `private[spark]` method to SparkConf for accessing environment variables (by default, it just delegates to System.getenv).  By subclassing SparkConf and overriding this method, we can mock out SPARK_LOCAL_DIRS in tests.

I also fixed a typo in PySpark where we used `SPARK_LOCAL_DIR` instead of `SPARK_LOCAL_DIRS` (I think this was technically innocuous, but it seemed worth fixing).

Author: Josh Rosen <joshrosen@apache.org>

Closes #2002 from JoshRosen/local-dirs and squashes the following commits:

efad8c6 [Josh Rosen] Address review comments:
1dec709 [Josh Rosen] Minor updates to Javadocs.
7f36999 [Josh Rosen] Use env vars to detect if running in YARN container.
399ac25 [Josh Rosen] Update getLocalDir() documentation.
bb3ad89 [Josh Rosen] Remove duplicated YARN getLocalDirs() code.
3e92d44 [Josh Rosen] Move local dirs override logic into Utils; fix bugs:
b2c4736 [Josh Rosen] Add failing tests for SPARK-2974 and SPARK-2975.
007298b [Josh Rosen] Allow environment variables to be mocked in tests.
6d9259b [Josh Rosen] Fix typo in PySpark: SPARK_LOCAL_DIR should be SPARK_LOCAL_DIRS
2014-08-19 22:42:50 -07:00
Reynold Xin 8adfbc2b6b [SPARK-3119] Re-implementation of TorrentBroadcast.
This is a re-implementation of TorrentBroadcast, with the following changes:

1. Removes most of the mutable, transient state from TorrentBroadcast (e.g. totalBytes, num of blocks fetched).
2. Removes TorrentInfo and TorrentBlock
3. Replaces the BlockManager.getSingle call in readObject with a getLocal, resuling in one less RPC call to the BlockManagerMasterActor to find the location of the block.
4. Removes the metadata block, resulting in one less block to fetch.
5. Removes an extra memory copy for deserialization (by using Java's SequenceInputStream).

Basically for a regular broadcasted object with only one block, the number of RPC calls goes from 5+1 to 2+1).

Old TorrentBroadcast for object of a single block:
1 RPC to ask for location of the broadcast variable
1 RPC to ask for location of the metadata block
1 RPC to fetch the metadata block
1 RPC to ask for location of the first data block
1 RPC to fetch the first data block
1 RPC to tell the driver we put the first data block in
i.e. 5 + 1

New TorrentBroadcast for object of a single block:
1 RPC to ask for location of the first data block
1 RPC to get the first data block
1 RPC to tell the driver we put the first data block in
i.e. 2 + 1

Author: Reynold Xin <rxin@apache.org>

Closes #2030 from rxin/torrentBroadcast and squashes the following commits:

5bacb9d [Reynold Xin] Always add the object to driver's block manager.
0d8ed5b [Reynold Xin] Added getBytes to BlockManager and uses that in TorrentBroadcast.
2d6a5fb [Reynold Xin] Use putBytes/getRemoteBytes throughout.
3670f00 [Reynold Xin] Code review feedback.
c1185cd [Reynold Xin] [SPARK-3119] Re-implementation of TorrentBroadcast.
2014-08-19 22:11:13 -07:00
Reynold Xin 8b9dc99101 [SPARK-2468] Netty based block server / client module
Previous pull request (#1907) was reverted. This brings it back. Still looking into the hang.

Author: Reynold Xin <rxin@apache.org>

Closes #1971 from rxin/netty1 and squashes the following commits:

b0be96f [Reynold Xin] Added test to make sure outstandingRequests are cleaned after firing the events.
4c6d0ee [Reynold Xin] Pass callbacks cleanly.
603dce7 [Reynold Xin] Upgrade Netty to 4.0.23 to fix the DefaultFileRegion bug.
88be1d4 [Reynold Xin] Downgrade to 4.0.21 to work around a bug in writing DefaultFileRegion.
002626a [Reynold Xin] Remove netty-test-file.txt.
db6e6e0 [Reynold Xin] Revert "Revert "[SPARK-2468] Netty based block server / client module""
2014-08-19 17:40:35 -07:00
hzw19900416 76eaeb4523 Move a bracket in validateSettings of SparkConf
Move a bracket in validateSettings of SparkConf

Author: hzw19900416 <carlmartinmax@gmail.com>

Closes #2012 from hzw19900416/codereading and squashes the following commits:

e717fb6 [hzw19900416] Move a bracket in validateSettings of SparkConf
2014-08-19 14:04:49 -07:00
Kousuke Saruta cbfc26ba45 [SPARK-3089] Fix meaningless error message in ConnectionManager
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2000 from sarutak/SPARK-3089 and squashes the following commits:

02dfdea [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3089
e759ce7 [Kousuke Saruta] Improved error message when closing SendingConnection
2014-08-19 10:15:11 -07:00
Reynold Xin 82577339dd [SPARK-3116] Remove the excessive lockings in TorrentBroadcast
Author: Reynold Xin <rxin@apache.org>

Closes #2028 from rxin/torrentBroadcast and squashes the following commits:

92c62a5 [Reynold Xin] Revert the MEMORY_AND_DISK_SER changes.
03a5221 [Reynold Xin] [SPARK-3116] Remove the excessive lockings in TorrentBroadcast
2014-08-18 20:51:41 -07:00
Marcelo Vanzin 6201b27643 [SPARK-2718] [yarn] Handle quotes and other characters in user args.
Due to the way Yarn runs things through bash, normal quoting doesn't
work as expected. This change applies the necessary voodoo to the user
args to avoid issues with bash and special characters.

The change also uncovered an issue with the event logger app name
sanitizing code; it wasn't cleaning up all "bad" characters, so
sometimes it would fail to create the log dirs. I just added some
more bad character replacements.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1724 from vanzin/SPARK-2718 and squashes the following commits:

cc84b89 [Marcelo Vanzin] Review feedback.
c1a257a [Marcelo Vanzin] Add test for backslashes.
55571d4 [Marcelo Vanzin] Unbreak yarn-client.
515613d [Marcelo Vanzin] [SPARK-2718] [yarn] Handle quotes and other characters in user args.
2014-08-18 14:10:10 -07:00
Marcelo Vanzin 66ade00f91 [SPARK-2169] Don't copy appName / basePath everywhere.
Instead of keeping copies in all pages, just reference the values
kept in the base SparkUI instance (by making them available via
getters).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #1252 from vanzin/SPARK-2169 and squashes the following commits:

4412fc6 [Marcelo Vanzin] Simplify UIUtils.headerSparkPage signature.
4e5d35a [Marcelo Vanzin] [SPARK-2169] Don't copy appName / basePath everywhere.
2014-08-18 13:25:30 -07:00
Chandan Kumar f45efbb8aa [SPARK-2862] histogram method fails on some choices of bucketCount
Author: Chandan Kumar <chandan.kumar@imaginea.com>

Closes #1787 from nrchandan/spark-2862 and squashes the following commits:

a76bbf6 [Chandan Kumar] [SPARK-2862] Fix for a broken test case and add new test cases
4211eea [Chandan Kumar] [SPARK-2862] Add Scala bug id
13854f1 [Chandan Kumar] [SPARK-2862] Use shorthand range notation to avoid Scala bug
2014-08-18 09:52:25 -07:00
CrazyJvm c0cbbdeaf4 SPARK-3093 : masterLock in Worker is no longer need
there's no need to use masterLock in Worker now since all communications are within Akka actor

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #2008 from CrazyJvm/no-need-master-lock and squashes the following commits:

dd39e20 [CrazyJvm] fix format
58e7fa5 [CrazyJvm] there's no need to use masterLock now since all communications are within Akka actor
2014-08-18 09:34:36 -07:00
Sandy Ryza df652ea02a SPARK-2900. aggregate inputBytes per stage
Author: Sandy Ryza <sandy@cloudera.com>

Closes #1826 from sryza/sandy-spark-2900 and squashes the following commits:

43f9091 [Sandy Ryza] SPARK-2900
2014-08-17 22:39:06 -07:00
GuoQiang Li bc95fe08df In the stop method of ConnectionManager to cancel the ackTimeoutMonitor
cc JoshRosen sarutak

Author: GuoQiang Li <witgo@qq.com>

Closes #1989 from witgo/cancel_ackTimeoutMonitor and squashes the following commits:

4a700fa [GuoQiang Li] In the stop method of ConnectionManager to cancel the ackTimeoutMonitor
2014-08-16 20:05:55 -07:00
Davies Liu 2fc8aca086 [SPARK-1065] [PySpark] improve supporting for large broadcast
Passing large object by py4j is very slow (cost much memory), so pass broadcast objects via files (similar to parallelize()).

Add an option to keep object in driver (it's False by default) to save memory in driver.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1912 from davies/broadcast and squashes the following commits:

e06df4a [Davies Liu] load broadcast from disk in driver automatically
db3f232 [Davies Liu] fix serialization of accumulator
631a827 [Davies Liu] Merge branch 'master' into broadcast
c7baa8c [Davies Liu] compress serrialized broadcast and command
9a7161f [Davies Liu] fix doc tests
e93cf4b [Davies Liu] address comments: add test
6226189 [Davies Liu] improve large broadcast
2014-08-16 16:59:34 -07:00
Kousuke Saruta 76fa0eaf51 [SPARK-2677] BasicBlockFetchIterator#next can wait forever
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1632 from sarutak/SPARK-2677 and squashes the following commits:

cddbc7b [Kousuke Saruta] Removed Exception throwing when ConnectionManager#handleMessage receives ack for non-referenced message
d3bd2a8 [Kousuke Saruta] Modified configuration.md for spark.core.connection.ack.timeout
e85f88b [Kousuke Saruta] Removed useless synchronized blocks
7ed48be [Kousuke Saruta] Modified ConnectionManager to use ackTimeoutMonitor ConnectionManager-wide
9b620a6 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
0dd9ad3 [Kousuke Saruta] Modified typo in ConnectionManagerSuite.scala
7cbb8ca [Kousuke Saruta] Modified to match with scalastyle
8a73974 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
ade279a [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
0174d6a [Kousuke Saruta] Modified ConnectionManager.scala to handle the case remote Executor cannot ack
a454239 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2677
9b7b7c1 [Kousuke Saruta] (WIP) Modifying ConnectionManager.scala
2014-08-16 14:15:58 -07:00
Josh Rosen 20fcf3d0b7 [SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager
This is intended to fix SPARK-2977.  Before, there was an implicit ordering dependency where we needed to know the ShuffleManager implementation before creating the ShuffleBlockManager.  This patch makes that dependency explicit by adding ShuffleManager to a bunch of constructors.

I think it's a little odd for BlockManager to take a ShuffleManager only to pass it to ShuffleBlockManager without using it itself; there's an opportunity to clean this up later if we sever the circular dependencies between BlockManager and other components and pass those components to BlockManager's constructor.

Author: Josh Rosen <joshrosen@apache.org>

Closes #1976 from JoshRosen/SPARK-2977 and squashes the following commits:

a9cd1e1 [Josh Rosen] [SPARK-2977] Ensure ShuffleManager is created before ShuffleBlockManager.
2014-08-16 00:04:55 -07:00
Reynold Xin a83c7723bf [SPARK-3045] Make Serializer interface Java friendly
Author: Reynold Xin <rxin@apache.org>

Closes #1948 from rxin/kryo and squashes the following commits:

a3a80d8 [Reynold Xin] [SPARK-3046] use executor's class loader as the default serializer classloader
3d13277 [Reynold Xin] Reverted that in TestJavaSerializerImpl too.
196f3dc [Reynold Xin] Ok one more commit to revert the classloader change.
c49b50c [Reynold Xin] Removed JavaSerializer change.
afbf37d [Reynold Xin] Moved the test case also.
a2e693e [Reynold Xin] Removed the Kryo bug fix from this pull request.
c81bd6c [Reynold Xin] Use defaultClassLoader when executing user specified custom registrator.
68f261e [Reynold Xin] Added license check excludes.
0c28179 [Reynold Xin] [SPARK-3045] Make Serializer interface Java friendly [SPARK-3046] Set executor's class loader as the default serializer class loader
2014-08-15 23:12:34 -07:00
Andrew Or c9da466edb [SPARK-3015] Block on cleaning tasks to prevent Akka timeouts
More detail on the issue is described in [SPARK-3015](https://issues.apache.org/jira/browse/SPARK-3015), but the TLDR is if we send too many blocking Akka messages that are dependent on each other in quick successions, then we end up causing a few of these messages to time out and ultimately kill the executors. As of #1498, we broadcast each RDD whether or not it is persisted. This means if we create many RDDs (each of which becomes a broadcast) and the driver performs a GC that cleans up all of these broadcast blocks, then we end up sending many `RemoveBroadcast` messages in parallel and trigger the chain of blocking messages at high frequencies.

We do not know of the Akka-level root cause yet, so this is intended to be a temporary solution until we identify the real issue. I have done some preliminary testing of enabling blocking and observed that the queue length remains quite low (< 1000) even under very intensive workloads.

In the long run, we should do something more sophisticated to allow a limited degree of parallelism through batching clean up tasks or processing them in a sliding window. In the longer run, we should clean up the whole `BlockManager*` message passing interface to avoid unnecessarily awaiting on futures created from Akka asks.

tdas pwendell mengxr

Author: Andrew Or <andrewor14@gmail.com>

Closes #1931 from andrewor14/reference-blocking and squashes the following commits:

d0f7195 [Andrew Or] Merge branch 'master' of github.com:apache/spark into reference-blocking
ce9daf5 [Andrew Or] Remove logic for logging queue length
111192a [Andrew Or] Add missing space in log message (minor)
a183b83 [Andrew Or] Switch order of code blocks (minor)
9fd1fe6 [Andrew Or] Remove outdated log
104b366 [Andrew Or] Use the actual reference queue length
0b7e768 [Andrew Or] Block on cleaning tasks by default + log error on queue full
2014-08-15 22:55:32 -07:00
Reynold Xin cc3648774e [SPARK-3046] use executor's class loader as the default serializer classloader
The serializer is not always used in an executor thread (e.g. connection manager, broadcast), in which case the classloader might not have the user jar set, leading to corruption in deserialization.

https://issues.apache.org/jira/browse/SPARK-3046

https://issues.apache.org/jira/browse/SPARK-2878

Author: Reynold Xin <rxin@apache.org>

Closes #1972 from rxin/kryoBug and squashes the following commits:

c1c7bf0 [Reynold Xin] Made change to JavaSerializer.
7204c33 [Reynold Xin] Added imports back.
d879e67 [Reynold Xin] [SPARK-3046] use executor's class loader as the default serializer class loader.
2014-08-15 17:04:15 -07:00
Sandy Ryza 0afe5cb65a SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetrics...
...Update

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1961 from sryza/sandy-spark-3028 and squashes the following commits:

dccdff5 [Sandy Ryza] Fix compile error
f883ded [Sandy Ryza] SPARK-3028. sparkEventToJson should support SparkListenerExecutorMetricsUpdate
2014-08-15 11:35:08 -07:00
Patrick Wendell fd9fcd25e9 Revert "[SPARK-2468] Netty based block server / client module"
This reverts commit 3a8b68b735.
2014-08-15 09:01:04 -07:00
Anand Avati 7589c39d39 [SPARK-2924] remove default args to overloaded methods
Not supported in Scala 2.11. Split them into separate methods instead.

Author: Anand Avati <avati@redhat.com>

Closes #1704 from avati/SPARK-1812-default-args and squashes the following commits:

3e3924a [Anand Avati] SPARK-1812: Add Mima excludes for the broken ABI
901dfc7 [Anand Avati] SPARK-1812: core - Fix overloaded methods with default arguments
07f00af [Anand Avati] SPARK-1812: streaming - Fix overloaded methods with default arguments
2014-08-15 08:53:52 -07:00
Nathan Kronenfeld fba8ec39cc Add caching information to rdd.toDebugString
I find it useful to see where in an RDD's DAG data is cached, so I figured others might too.

I've added both the caching level, and the actual memory state of the RDD.

Some of this is redundant with the web UI (notably the actual memory state), but (a) that is temporary, and (b) putting it in the DAG tree shows some context that can help a lot.

For example:
```
(4) ShuffledRDD[3] at reduceByKey at <console>:14
 +-(4) MappedRDD[2] at map at <console>:14
    |  MapPartitionsRDD[1] at mapPartitions at <console>:12
    |  ParallelCollectionRDD[0] at parallelize at <console>:12
```
should change to
```
(4) ShuffledRDD[3] at reduceByKey at <console>:14 [Memory Deserialized 1x Replicated]
 |       CachedPartitions: 4; MemorySize: 50.8 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
 +-(4) MappedRDD[2] at map at <console>:14 [Memory Deserialized 1x Replicated]
    |  MapPartitionsRDD[1] at mapPartitions at <console>:12 [Memory Deserialized 1x Replicated]
    |      CachedPartitions: 4; MemorySize: 109.1 MB; TachyonSize: 0.0 B; DiskSize: 0.0 B
    |  ParallelCollectionRDD[0] at parallelize at <console>:12 [Memory Deserialized 1x Replicated]
```

Author: Nathan Kronenfeld <nkronenfeld@oculusinfo.com>

Closes #1535 from nkronenfeld/feature/debug-caching2 and squashes the following commits:

40490bc [Nathan Kronenfeld] Back out DeveloperAPI and arguments to RDD.toDebugString, reinstate memory output
794e6a3 [Nathan Kronenfeld] Attempt to merge mima changes from master
6fe9e80 [Nathan Kronenfeld] Add exclusions to allow for signature change in toDebugString (will back out if necessary)
31d6769 [Nathan Kronenfeld] Attempt to get rid of style errors.  Add comments for the new memory usage parameter.
a0f6f76 [Nathan Kronenfeld] Add parameter to RDD.toDebugString to allow detailed memory info to be shown or not.  Default is for it not to be shown.
f8f565a [Nathan Kronenfeld] Fix code style error
8f54287 [Nathan Kronenfeld] Changed string addition to string interpolation as per PR comments
2a0cd4d [Nathan Kronenfeld] Fixed a small formatting issue I forgot to copy over from the old branch
8fbecb6 [Nathan Kronenfeld] Add caching information to rdd.toDebugString
2014-08-14 22:15:33 -07:00
Kan Zhang 9422a9b084 [SPARK-2736] PySpark converter and example script for reading Avro files
JIRA: https://issues.apache.org/jira/browse/SPARK-2736

This patch includes:
1. An Avro converter that converts Avro data types to Python. It handles all 3 Avro data mappings (Generic, Specific and Reflect).
2. An example Python script for reading Avro files using AvroKeyInputFormat and the converter.
3. Fixing a classloading issue.

cc @MLnick @JoshRosen @mateiz

Author: Kan Zhang <kzhang@apache.org>

Closes #1916 from kanzhang/SPARK-2736 and squashes the following commits:

02443f8 [Kan Zhang] [SPARK-2736] Adding .avsc files to .rat-excludes
f74e9a9 [Kan Zhang] [SPARK-2736] nit: clazz -> className
82cc505 [Kan Zhang] [SPARK-2736] Update data sample
0be7761 [Kan Zhang] [SPARK-2736] Example pyspark script and data files
c8e5881 [Kan Zhang] [SPARK-2736] Trying to work with all 3 Avro data models
2271a5b [Kan Zhang] [SPARK-2736] Using the right class loader to find Avro classes
536876b [Kan Zhang] [SPARK-2736] Adding Avro to Java converter
2014-08-14 19:03:51 -07:00
Reynold Xin 3a8b68b735 [SPARK-2468] Netty based block server / client module
This is a rewrite of the original Netty module that was added about 1.5 years ago. The old code was turned off by default and didn't really work because it lacked a frame decoder (only worked with very very small blocks).

For this pull request, I tried to make the changes non-instrusive to the rest of Spark. I only added an init and shutdown to BlockManager/DiskBlockManager, and a bunch of comments to help me understand the existing code base.

Compared with the old Netty module, this one features:
- It appears to work :)
- SPARK-2941: option to specicy nio vs oio vs epoll for channel/transport. By default nio is used. (Not using Epoll yet because I have found some bugs with its implementation)
- SPARK-2943: options to specify send buf and receive buf for users who want to do hyper tuning
- SPARK-2942: io errors are reported from server to client (the protocol uses negative length to indicate error)
- SPARK-2940: fetching multiple blocks in a single request to reduce syscalls
- SPARK-2959: clients share a single thread pool
- SPARK-2990: use PooledByteBufAllocator to reduce GC (basically a Netty managed pool of buffers with jmalloc)
- SPARK-2625: added fetchWaitTime metric and fixed thread-safety issue in metrics update.
- SPARK-2367: bump Netty version to 4.0.21.Final to address an Epoll bug (https://groups.google.com/forum/#!topic/netty/O7m-HxCJpCA)

Compared with the existing communication manager, this one features:
- IMO it is substantially easier to understand
- zero-copy send for the server for on-disk blocks
- one-copy receive (due to a frame decoder)
- don't quote me on this, but I think a lot less sys calls
- SPARK-2990: use PooledByteBufAllocator to reduce GC (basically a Netty managed pool of buffers with jmalloc)
- SPARK-2941: option to specicy nio vs oio vs epoll for channel/transport. By default nio is used. (Not using Epoll yet because I have found some bugs with its implementation)
- SPARK-2943: options to specify send buf and receive buf for users who want to do hyper tuning

TODOs before it can fully replace the existing ConnectionManager, if that ever happens (most of them should probably be done in separate PRs since this needs to be turned on explicitly)
- [x] Basic test cases
- [ ] More unit/integration tests for failures
- [ ] Performance analysis
- [ ] Support client connection reuse so we don't need to keep opening new connections (not sure how useful this would be)
- [ ] Support putting blocks in addition to fetching blocks (i.e. two way transfer)
- [x] Support serving non-disk blocks
- [ ] Support SASL authentication

For a more comprehensive list, see https://issues.apache.org/jira/browse/SPARK-2468

Thanks to @coderplay for peer coding with me on a Sunday.

Author: Reynold Xin <rxin@apache.org>

Closes #1907 from rxin/netty and squashes the following commits:

f921421 [Reynold Xin] Upgrade Netty to 4.0.22.Final to fix another Epoll bug.
4b174ca [Reynold Xin] Shivaram's code review comment.
4a3dfe7 [Reynold Xin] Switched to nio for default (instead of epoll on Linux).
56bfb9d [Reynold Xin] Bump Netty version to 4.0.21.Final for some bug fixes.
b443a4b [Reynold Xin] Added debug message to help debug Jenkins failures.
57fc4d7 [Reynold Xin] Added test cases for BlockHeaderEncoder and BlockFetchingClientHandlerSuite.
22623e9 [Reynold Xin] Added exception handling and test case for BlockServerHandler and BlockFetchingClientHandler.
6550dd7 [Reynold Xin] Fixed block mgr init bug.
60c2edf [Reynold Xin] Beefed up server/client integration tests.
38d88d5 [Reynold Xin] Added missing test files.
6ce3f3c [Reynold Xin] Added some basic test cases.
47f7ce0 [Reynold Xin] Created server and client packages and moved files there.
b16f412 [Reynold Xin] Added commit count.
f13022d [Reynold Xin] Remove unused clone() in BlockFetcherIterator.
c57d68c [Reynold Xin] Added back missing files.
842dfa7 [Reynold Xin] Made everything work with proper reference counting.
3fae001 [Reynold Xin] Connected the new netty network module with rest of Spark.
1a8f6d4 [Reynold Xin] Completed protocol documentation.
2951478 [Reynold Xin] New Netty implementation.
cc7843d [Reynold Xin] Basic skeleton.
2014-08-14 19:01:33 -07:00
Reynold Xin 655699f8b7 [SPARK-3027] TaskContext: tighten visibility and provide Java friendly callback API
Note this also passes the TaskContext itself to the TaskCompletionListener. In the future we can mark TaskContext with the exception object if exception occurs during task execution.

Author: Reynold Xin <rxin@apache.org>

Closes #1938 from rxin/TaskContext and squashes the following commits:

145de43 [Reynold Xin] Added JavaTaskCompletionListenerImpl for Java API friendly guarantee.
f435ea5 [Reynold Xin] Added license header for TaskCompletionListener.
dc4ed27 [Reynold Xin] [SPARK-3027] TaskContext: tighten the visibility and provide Java friendly callback API
2014-08-14 18:37:02 -07:00
Jacek Lewandowski a75bc7a21d SPARK-3009: Reverted readObject method in ApplicationInfo so that Applic...
...ationInfo is initialized properly after deserialization

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #1947 from jacek-lewandowski/master and squashes the following commits:

713b2f1 [Jacek Lewandowski] SPARK-3009: Reverted readObject method in ApplicationInfo so that ApplicationInfo is initialized properly after deserialization
2014-08-14 15:02:26 -07:00
Reynold Xin eaeb0f76fa Minor cleanup of metrics.Source
- Added override.
- Marked some variables as private.

Author: Reynold Xin <rxin@apache.org>

Closes #1943 from rxin/metricsSource and squashes the following commits:

fbfa943 [Reynold Xin] Minor cleanup of metrics.Source. - Added override. - Marked some variables as private.
2014-08-14 11:22:41 -07:00
Graham Dennis 6b8de0e36c SPARK-2893: Do not swallow Exceptions when running a custom kryo registrator
The previous behaviour of swallowing ClassNotFound exceptions when running a custom Kryo registrator could lead to difficult to debug problems later on at serialisation / deserialisation time, see SPARK-2878.  Instead it is better to fail fast.

Added test case.

Author: Graham Dennis <graham.dennis@gmail.com>

Closes #1827 from GrahamDennis/feature/spark-2893 and squashes the following commits:

fbe4cb6 [Graham Dennis] [SPARK-2878]: Update the test case to match the updated exception message
65e53c5 [Graham Dennis] [SPARK-2893]: Improve message when a spark.kryo.registrator fails.
f480d85 [Graham Dennis] [SPARK-2893] Fix typo.
b59d2c2 [Graham Dennis] SPARK-2893: Do not swallow Exceptions when running a custom spark.kryo.registrator
2014-08-14 02:24:18 -07:00
Aaron Davidson d069c5d9d2 [SPARK-3029] Disable local execution of Spark jobs by default
Currently, local execution of Spark jobs is only used by take(), and it can be problematic as it can load a significant amount of data onto the driver. The worst case scenarios occur if the RDD is cached (guaranteed to load whole partition), has very large elements, or the partition is just large and we apply a filter with high selectivity or computational overhead.

Additionally, jobs that run locally in this manner do not show up in the web UI, and are thus harder to track or understand what is occurring.

This PR adds a flag to disable local execution, which is turned OFF by default, with the intention of perhaps eventually removing this functionality altogether. Removing it now is a tougher proposition since it is part of the public runJob API. An alternative solution would be to limit the flag to take()/first() to avoid impacting any external users of this API, but such usage (or, at least, reliance upon the feature) is hopefully minimal.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1321 from aarondav/allowlocal and squashes the following commits:

136b253 [Aaron Davidson] Fix DAGSchedulerSuite
5599d55 [Aaron Davidson] [RFC] Disable local execution of Spark jobs by default
2014-08-14 01:37:38 -07:00
Patrick Wendell 0c7b452904 SPARK-3020: Print completed indices rather than tasks in web UI
Author: Patrick Wendell <pwendell@gmail.com>

Closes #1933 from pwendell/speculation and squashes the following commits:

33a3473 [Patrick Wendell] Use OpenHashSet
8ce2ff0 [Patrick Wendell] SPARK-3020: Print completed indices rather than tasks in web UI
2014-08-13 18:08:38 -07:00
Zhang, Liye 2bd812639c [SPARK-1777 (partial)] bugfix: make size of requested memory correctly
Author: Zhang, Liye <liye.zhang@intel.com>

Closes #1892 from liyezhang556520/lazy_memory_request and squashes the following commits:

335ab61 [Zhang, Liye] [SPARK-1777 (partial)] bugfix: make size of requested memory correctly
2014-08-12 23:43:36 -07:00
Raymond Liu 246cb3f158 Use transferTo when copy merge files in ExternalSorter
Since this is a file to file copy, using transferTo should be faster.

Author: Raymond Liu <raymond.liu@intel.com>

Closes #1884 from colorant/externalSorter and squashes the following commits:

6e42f3c [Raymond Liu] More code into copyStream
bfb496b [Raymond Liu] Use transferTo when copy merge files in ExternalSorter
2014-08-12 23:19:35 -07:00
Reynold Xin 676f98289d [SPARK-2953] Allow using short names for io compression codecs
Instead of requiring "org.apache.spark.io.LZ4CompressionCodec", it is easier for users if Spark just accepts "lz4", "lzf", "snappy".

Author: Reynold Xin <rxin@apache.org>

Closes #1873 from rxin/compressionCodecShortForm and squashes the following commits:

9f50962 [Reynold Xin] Specify short-form compression codec names first.
63f78ee [Reynold Xin] Updated configuration documentation.
47b3848 [Reynold Xin] [SPARK-2953] Allow using short names for io compression codecs
2014-08-12 22:50:29 -07:00
Josh Rosen 7712e724ad [SPARK-2931] In TaskSetManager, reset currentLocalityIndex after recomputing locality levels
This addresses SPARK-2931, a bug where getAllowedLocalityLevel() could throw ArrayIndexOutOfBoundsException.  The fix here is to reset currentLocalityIndex after recomputing the locality levels.

Thanks to kayousterhout, mridulm, and lirui-intel for helping me to debug this.

Author: Josh Rosen <joshrosen@apache.org>

Closes #1896 from JoshRosen/SPARK-2931 and squashes the following commits:

48b60b5 [Josh Rosen] Move FakeRackUtil.cleanUp() info beforeEach().
6fec474 [Josh Rosen] Set currentLocalityIndex after recomputing locality levels.
9384897 [Josh Rosen] Update SPARK-2931 test to reflect changes in 63bdb1f41b.
9ecd455 [Josh Rosen] Apply @mridulm's patch for reproducing SPARK-2931.
2014-08-11 19:15:01 -07:00
Reynold Xin 3733866665 [SPARK-2952] Enable logging actor messages at DEBUG level
Example messages:
```
14/08/09 21:37:01 DEBUG BlockManagerMasterActor: [actor] received message RegisterBlockManager(BlockManagerId(0, rxin-mbp, 58092, 0),278302556,Actor[akka.tcp://spark@rxin-mbp:58088/user/BlockManagerActor1#-63596539]) from Actor[akka.tcp://spark@rxin-mbp:58088/temp/$c]

14/08/09 21:37:01 DEBUG BlockManagerMasterActor: [actor] handled message (0.279 ms) RegisterBlockManager(BlockManagerId(0, rxin-mbp, 58092, 0),278302556,Actor[akka.tcp://spark@rxin-mbp:58088/user/BlockManagerActor1#-63596539]) from Actor[akka.tcp://spark@rxin-mbp:58088/temp/$c]
```

cc @mengxr @tdas @pwendell

Author: Reynold Xin <rxin@apache.org>

Closes #1870 from rxin/actorLogging and squashes the following commits:

c531ee5 [Reynold Xin] Added license header for ActorLogReceive.
f6b1ebe [Reynold Xin] [SPARK-2952] Enable logging actor messages at DEBUG level
2014-08-11 15:25:21 -07:00
Reynold Xin ba28a8fcbc [SPARK-2936] Migrate Netty network module from Java to Scala
The Netty network module was originally written when Scala 2.9.x had a bug that prevents a pure Scala implementation, and a subset of the files were done in Java. We have since upgraded to Scala 2.10, and can migrate all Java files now to Scala.

https://github.com/netty/netty/issues/781

https://github.com/mesos/spark/pull/522

Author: Reynold Xin <rxin@apache.org>

Closes #1865 from rxin/netty and squashes the following commits:

332422f [Reynold Xin] Code review feedback
ca9eeee [Reynold Xin] Minor update.
7f1434b [Reynold Xin] [SPARK-2936] Migrate Netty network module from Java to Scala
2014-08-10 20:36:54 -07:00
Doris Xin b715aa0c80 [SPARK-2937] Separate out samplyByKeyExact as its own API in PairRDDFunction
To enable Python consistency and `Experimental` label of the `sampleByKeyExact` API.

Author: Doris Xin <doris.s.xin@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #1866 from dorx/stratified and squashes the following commits:

0ad97b2 [Doris Xin] reviewer comments.
2948aae [Doris Xin] remove unrelated changes
e990325 [Doris Xin] Merge branch 'master' into stratified
555a3f9 [Doris Xin] separate out sampleByKeyExact as its own API
616e55c [Doris Xin] merge master
245439e [Doris Xin] moved minSamplingRate to getUpperBound
eaf5771 [Doris Xin] bug fixes.
17a381b [Doris Xin] fixed a merge issue and a failed unit
ea7d27f [Doris Xin] merge master
b223529 [Xiangrui Meng] use approx bounds for poisson fix poisson mean for waitlisting add unit tests for Java
b3013a4 [Xiangrui Meng] move math3 back to test scope
eecee5f [Doris Xin] Merge branch 'master' into stratified
f4c21f3 [Doris Xin] Reviewer comments
a10e68d [Doris Xin] style fix
a2bf756 [Doris Xin] Merge branch 'master' into stratified
680b677 [Doris Xin] use mapPartitionWithIndex instead
9884a9f [Doris Xin] style fix
bbfb8c9 [Doris Xin] Merge branch 'master' into stratified
ee9d260 [Doris Xin] addressed reviewer comments
6b5b10b [Doris Xin] Merge branch 'master' into stratified
254e03c [Doris Xin] minor fixes and Java API.
4ad516b [Doris Xin] remove unused imports from PairRDDFunctions
bd9dc6e [Doris Xin] unit bug and style violation fixed
1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check
944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate
0214a76 [Doris Xin] cleanUp
90d94c0 [Doris Xin] merge master
9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey
7327611 [Doris Xin] merge master
50581fc [Doris Xin] added a TODO for logging in python
46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function
7e1a481 [Doris Xin] changed the permission on SamplingUtil
1d413ce [Doris Xin] fixed checkstyle issues
9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size
e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
7cab53a [Doris Xin] fixed import bug in rdd.py
ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
2014-08-10 16:31:07 -07:00
Davies Liu 28dcbb531a [SPARK-2898] [PySpark] fix bugs in deamon.py
1. do not use signal handler for SIGCHILD, it's easy to cause deadlock
2. handle EINTR during accept()
3. pass errno into JVM
4. handle EAGAIN during fork()

Now, it can pass 50k tasks tests in 180 seconds.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1842 from davies/qa and squashes the following commits:

f0ea451 [Davies Liu] fix lint
03a2e8c [Davies Liu] cleanup dead children every seconds
32cb829 [Davies Liu] fix lint
0cd0817 [Davies Liu] fix bugs in deamon.py
2014-08-10 13:00:38 -07:00
Shivaram Venkataraman 1d03a26a48 [SPARK-2950] Add gc time and shuffle write time to JobLogger
The JobLogger is very useful for performing offline performance profiling of Spark jobs. GC Time and Shuffle Write time are available in TaskMetrics but are currently missed from the JobLogger output. This patch adds these two fields.

~~Since this is a small change, I didn't create a JIRA. Let me know if I should do that.~~

cc kayousterhout

Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

Closes #1869 from shivaram/job-logger and squashes the following commits:

1b709fc [Shivaram Venkataraman] Add a space before GC_TIME
c418105 [Shivaram Venkataraman] Add gc time and shuffle write time to JobLogger
2014-08-10 12:44:17 -07:00
GuoQiang Li 3570119c34 Remove extra semicolon in Task.scala
Author: GuoQiang Li <witgo@qq.com>

Closes #1876 from witgo/remove_semicolon_in_Task_scala and squashes the following commits:

c6ea732 [GuoQiang Li] Remove extra semicolon in Task.scala
2014-08-10 12:12:22 -07:00
Reynold Xin 482c5afbf6 Turn UpdateBlockInfo into case class.
This helps us log UpdateBlockInfo properly once #1870 is merged.

Author: Reynold Xin <rxin@apache.org>

Closes #1872 from rxin/UpdateBlockInfo and squashes the following commits:

0cee1c2 [Reynold Xin] Turn UpdateBlockInfo into case class.
2014-08-09 23:06:54 -07:00
Kousuke Saruta 4f4a9884d9 [SPARK-2894] spark-shell doesn't accept flags
As sryza reported, spark-shell doesn't accept any flags.
The root cause is wrong usage of spark-submit in spark-shell and it come to the surface by #1801

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1715, Closes #1864, and Closes #1861

Closes #1825 from sarutak/SPARK-2894 and squashes the following commits:

47f3510 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2894
2c899ed [Kousuke Saruta] Removed useless code from java_gateway.py
98287ed [Kousuke Saruta] Removed useless code from java_gateway.py
513ad2e [Kousuke Saruta] Modified util.sh to enable to use option including white spaces
28a374e [Kousuke Saruta] Modified java_gateway.py to recognize arguments
5afc584 [Cheng Lian] Filter out spark-submit options when starting Python gateway
e630d19 [Cheng Lian] Fixing pyspark and spark-shell CLI options
2014-08-09 21:11:00 -07:00
Chris Cope e45daf226d [SPARK-1766] sorted functions to meet pedantic requirements
Pedantry is underrated

Author: Chris Cope <ccope@resilientscience.com>

Closes #1859 from copester/master and squashes the following commits:

0fb4499 [Chris Cope] [SPARK-1766] sorted functions to meet pedantic requirements
2014-08-09 20:58:56 -07:00
Chandan Kumar b431e6747f [SPARK-2861] Fix Doc comment of histogram method
Tested and ready to merge.

Author: Chandan Kumar <chandan.kumar@imaginea.com>

Closes #1786 from nrchandan/spark-2861 and squashes the following commits:

cb0bc1e [Chandan Kumar] [SPARK-2861] Fix a typo in the histogram doc comment
6a2a71b [Chandan Kumar] SPARK-2861. Fix Doc comment of histogram method
2014-08-09 00:45:54 -07:00
li-zhihui 28dbae85aa [SPARK-2635] Fix race condition at SchedulerBackend.isReady in standalone mode
In SPARK-1946(PR #900), configuration <code>spark.scheduler.minRegisteredExecutorsRatio</code> was introduced. However, in standalone mode, there is a race condition where isReady() can return true because totalExpectedExecutors has not been correctly set.

Because expected executors is uncertain in standalone mode, the PR try to use CPU cores(<code>--total-executor-cores</code>) as expected resources to judge whether SchedulerBackend is ready.

Author: li-zhihui <zhihui.li@intel.com>
Author: Li Zhihui <zhihui.li@intel.com>

Closes #1525 from li-zhihui/fixre4s and squashes the following commits:

e9a630b [Li Zhihui] Rename variable totalExecutors and clean codes
abf4860 [Li Zhihui] Push down variable totalExpectedResources to children classes
ca54bd9 [li-zhihui] Format log with String interpolation
88c7dc6 [li-zhihui] Few codes and docs refactor
41cf47e [li-zhihui] Fix race condition at SchedulerBackend.isReady in standalone mode
2014-08-08 22:52:56 -07:00
Erik Erlandson 43af281700 [SPARK-2911] apply parent[T](j) to clarify UnionRDD code
References to dependencies(j) for actually obtaining RDD parents are less common than I originally estimated.   It does clarify UnionRDD (also will clarify some of my other PRs)

Use of firstParent[T] is ubiquitous, but not as sure that benefits from being replaced with parent(0)[T].

Author: Erik Erlandson <eerlands@redhat.com>

Closes #1858 from erikerlandson/spark-2911-pr2 and squashes the following commits:

7ffea74 [Erik Erlandson] [SPARK-2911] apply parent[T](j) to clarify UnionRDD code
2014-08-08 20:58:44 -07:00
WangTao 1c84dba988 [Web UI]Make decision order of Worker's WebUI port consistent with Master's
The decision order of Worker's WebUI port is "--webui-port", SPARK_WORKER_WEBUI_POR, 8081(default), spark.worker.ui.port. But in Master, the order is "--webui-port", spark.master.ui.port, SPARK_MASTER_WEBUI_PORT and 8080(default).

So we change the order in Worker's to keep it consistent with Master.

Author: WangTao <barneystinson@aliyun.com>

Closes #1838 from WangTaoTheTonic/reOrder and squashes the following commits:

460f4d4 [WangTao] Make decision order of Worker's WebUI consistent with Master's
2014-08-08 20:53:21 -07:00
GuoQiang Li ec79063fad [SPARK-2897][SPARK-2920]TorrentBroadcast does use the serializer class specified in the spark option "spark.serializer"
Author: GuoQiang Li <witgo@qq.com>

Closes #1836 from witgo/SPARK-2897 and squashes the following commits:

23cdc5b [GuoQiang Li] review commit
ada4fba [GuoQiang Li] TorrentBroadcast does not support broadcast compression
fb91792 [GuoQiang Li] org.apache.spark.broadcast.TorrentBroadcast does use the serializer class specified in the spark option "spark.serializer"
2014-08-08 16:57:26 -07:00
Erik Erlandson 9a54de16ed [SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD
Author: Erik Erlandson <eerlands@redhat.com>

Closes #1841 from erikerlandson/spark-2911-pr and squashes the following commits:

4699e2f [Erik Erlandson] [SPARK-2911]: provide rdd.parent[T](j) to obtain jth parent RDD
2014-08-07 23:45:16 -07:00
Kousuke Saruta 9de6a42bb3 [SPARK-2904] Remove non-used local variable in SparkSubmitArguments
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #1834 from sarutak/SPARK-2904 and squashes the following commits:

38e7d45 [Kousuke Saruta] Removed non-used variable in SparkSubmitArguments
2014-08-07 18:53:15 -07:00
Sandy Ryza 4c51098f32 SPARK-2565. Update ShuffleReadMetrics as blocks are fetched
Author: Sandy Ryza <sandy@cloudera.com>

Closes #1507 from sryza/sandy-spark-2565 and squashes the following commits:

74dad41 [Sandy Ryza] SPARK-2565. Update ShuffleReadMetrics as blocks are fetched
2014-08-07 18:09:19 -07:00
Matei Zaharia 6906b69cf5 SPARK-2787: Make sort-based shuffle write files directly when there's no sorting/aggregation and # partitions is small
As described in https://issues.apache.org/jira/browse/SPARK-2787, right now sort-based shuffle is more expensive than hash-based for map operations that do no partial aggregation or sorting, such as groupByKey. This is because it has to serialize each data item twice (once when spilling to intermediate files, and then again when merging these files object-by-object). This patch adds a code path to just write separate files directly if the # of output partitions is small, and concatenate them at the end to produce a sorted file.

On the unit test side, I added some tests that force or don't force this bypass path to be used, and checked that our tests for other features (e.g. all the operations) cover both cases.

Author: Matei Zaharia <matei@databricks.com>

Closes #1799 from mateiz/SPARK-2787 and squashes the following commits:

88cf26a [Matei Zaharia] Fix rebase
10233af [Matei Zaharia] Review comments
398cb95 [Matei Zaharia] Fix looking up shuffle manager in conf
ca3efd9 [Matei Zaharia] Add docs for shuffle manager properties, and allow short names for them
d0ae3c5 [Matei Zaharia] Fix some comments
90d084f [Matei Zaharia] Add code path to bypass merge-sort in ExternalSorter, and tests
31e5d7c [Matei Zaharia] Move existing logic for writing partitioned files into ExternalSorter
2014-08-07 18:04:49 -07:00
Davies Liu ffd1f59a62 [SPARK-2887] fix bug of countApproxDistinct() when have more than one partition
fix bug of countApproxDistinct() when have more than one partition

Author: Davies Liu <davies.liu@gmail.com>

Closes #1812 from davies/approx and squashes the following commits:

bf757ce [Davies Liu] fix bug of countApproxDistinct() when have more than one partition
2014-08-06 21:22:13 -07:00
Kousuke Saruta 17caae48b3 [SPARK-2583] ConnectionManager error reporting
This patch modifies the ConnectionManager so that error messages are sent in reply when uncaught exceptions occur during message processing.  This prevents message senders from hanging while waiting for an acknowledgment if the remote message processing failed.

This is an updated version of sarutak's PR, #1490.  The main change is to use Futures / Promises to signal errors.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: Josh Rosen <joshrosen@apache.org>

Closes #1758 from JoshRosen/connection-manager-fixes and squashes the following commits:

68620cb [Josh Rosen] Fix test in BlockFetcherIteratorSuite:
83673de [Josh Rosen] Error ACKs should trigger IOExceptions, so catch only those exceptions in the test.
b8bb4d4 [Josh Rosen] Fix manager.id vs managerServer.id typo that broke security tests.
659521f [Josh Rosen] Include previous exception when throwing new one
a2f745c [Josh Rosen] Remove sendMessageReliablySync; callers can wait themselves.
c01c450 [Josh Rosen] Return Try[Message] from sendMessageReliablySync.
f1cd1bb [Josh Rosen] Clean up @sarutak's PR #1490 for [SPARK-2583]: ConnectionManager error reporting
7399c6b [Josh Rosen] Merge remote-tracking branch 'origin/pr/1490' into connection-manager-fixes
ee91bb7 [Kousuke Saruta] Modified BufferMessage.scala to keep the spark code style
9dfd0d8 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2583
e7d9aa6 [Kousuke Saruta] rebase to master
326a17f [Kousuke Saruta] Add test cases to ConnectionManagerSuite.scala for SPARK-2583
2a18d6b [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2583
22d7ebd [Kousuke Saruta] Add test cases to BlockManagerSuite for SPARK-2583
e579302 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2583
281589c [Kousuke Saruta] Add a test case to BlockFetcherIteratorSuite.scala for fetching block from remote from successfully
0654128 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2583
ffaa83d [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2583
12d3de8 [Kousuke Saruta] Added BlockFetcherIteratorSuite.scala
4117b8f [Kousuke Saruta] Modified ConnectionManager to be alble to handle error during processing message
717c9c3 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2583
6635467 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-2583
e2b8c4a [Kousuke Saruta] Modify to propagete error using ConnectionManager
2014-08-06 17:27:55 -07:00
Sandy Ryza 4e98236442 SPARK-2566. Update ShuffleWriteMetrics incrementally
I haven't tested this out on a cluster yet, but wanted to make sure the approach (passing ShuffleWriteMetrics down to DiskBlockObjectWriter) was ok

Author: Sandy Ryza <sandy@cloudera.com>

Closes #1481 from sryza/sandy-spark-2566 and squashes the following commits:

8090d88 [Sandy Ryza] Fix ExternalSorter
b2a62ed [Sandy Ryza] Fix more test failures
8be6218 [Sandy Ryza] Fix test failures and mark a couple variables private
c5e68e5 [Sandy Ryza] SPARK-2566. Update ShuffleWriteMetrics incrementally
2014-08-06 13:10:33 -07:00
Cheng Lian a6cd31108f [SPARK-2678][Core][SQL] A workaround for SPARK-2678
JIRA issues:

- Main: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
- Related: [SPARK-2874](https://issues.apache.org/jira/browse/SPARK-2874)

Related PR:

- #1715

This PR is both a fix for SPARK-2874 and a workaround for SPARK-2678. Fixing SPARK-2678 completely requires some API level changes that need further discussion, and we decided not to include it in Spark 1.1 release. As currently SPARK-2678 only affects Spark SQL scripts, this workaround is enough for Spark 1.1. Command line option handling logic in bash scripts looks somewhat dirty and duplicated, but it helps to provide a cleaner user interface as well as retain full downward compatibility for now.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #1801 from liancheng/spark-2874 and squashes the following commits:

8045d7a [Cheng Lian] Make sure test suites pass
8493a9e [Cheng Lian] Using eval to retain quoted arguments
aed523f [Cheng Lian] Fixed typo in bin/spark-sql
f12a0b1 [Cheng Lian] Worked arount SPARK-2678
daee105 [Cheng Lian] Fixed usage messages of all Spark SQL related scripts
2014-08-06 12:28:35 -07:00
Andrew Or 09f7e4587b [SPARK-2157] Enable tight firewall rules for Spark
The goal of this PR is to allow users of Spark to write tight firewall rules for their clusters. This is currently not possible because Spark uses random ports in many places, notably the communication between executors and drivers. The changes in this PR are based on top of ash211's changes in #1107.

The list covered here may or may not be the complete set of port needed for Spark to operate perfectly. However, as of the latest commit there are no known sources of random ports (except in tests). I have not documented a few of the more obscure configs.

My spark-env.sh looks like this:
```
export SPARK_MASTER_PORT=6060
export SPARK_WORKER_PORT=7070
export SPARK_MASTER_WEBUI_PORT=9090
export SPARK_WORKER_WEBUI_PORT=9091
```
and my spark-defaults.conf looks like this:
```
spark.master spark://andrews-mbp:6060
spark.driver.port 5001
spark.fileserver.port 5011
spark.broadcast.port 5021
spark.replClassServer.port 5031
spark.blockManager.port 5041
spark.executor.port 5051
```

Author: Andrew Or <andrewor14@gmail.com>
Author: Andrew Ash <andrew@andrewash.com>

Closes #1777 from andrewor14/configure-ports and squashes the following commits:

621267b [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
8a6b820 [Andrew Or] Use a random UI port during tests
7da0493 [Andrew Or] Fix tests
523c30e [Andrew Or] Add test for isBindCollision
b97b02a [Andrew Or] Minor fixes
c22ad00 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
93d359f [Andrew Or] Executors connect to wrong port when collision occurs
d502e5f [Andrew Or] Handle port collisions when creating Akka systems
a2dd05c [Andrew Or] Patrick's comment nit
86461e2 [Andrew Or] Remove spark.executor.env.port and spark.standalone.client.port
1d2d5c6 [Andrew Or] Fix ports for standalone cluster mode
cb3be88 [Andrew Or] Various doc fixes (broken link, format etc.)
e837cde [Andrew Or] Remove outdated TODOs
bfbab28 [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
de1b207 [Andrew Or] Update docs to reflect new ports
b565079 [Andrew Or] Add spark.ports.maxRetries
2551eb2 [Andrew Or] Remove spark.worker.watcher.port
151327a [Andrew Or] Merge branch 'master' of github.com:apache/spark into configure-ports
9868358 [Andrew Or] Add a few miscellaneous ports
6016e77 [Andrew Or] Add spark.executor.port
8d836e6 [Andrew Or] Also document SPARK_{MASTER/WORKER}_WEBUI_PORT
4d9e6f3 [Andrew Or] Fix super subtle bug
3f8e51b [Andrew Or] Correct erroneous docs...
e111d08 [Andrew Or] Add names for UI services
470f38c [Andrew Or] Special case non-"Address already in use" exceptions
1d7e408 [Andrew Or] Treat 0 ports specially + return correct ConnectionManager port
ba32280 [Andrew Or] Minor fixes
6b550b0 [Andrew Or] Assorted fixes
73fbe89 [Andrew Or] Move start service logic to Utils
ec676f4 [Andrew Or] Merge branch 'SPARK-2157' of github.com:ash211/spark into configure-ports
038a579 [Andrew Ash] Trust the server start function to report the port the service started on
7c5bdc4 [Andrew Ash] Fix style issue
0347aef [Andrew Ash] Unify port fallback logic to a single place
24a4c32 [Andrew Ash] Remove type on val to match surrounding style
9e4ad96 [Andrew Ash] Reformat for style checker
5d84e0e [Andrew Ash] Document new port configuration options
066dc7a [Andrew Ash] Fix up HttpServer port increments
cad16da [Andrew Ash] Add fallover increment logic for HttpServer
c5a0568 [Andrew Ash] Fix ConnectionManager to retry with increment
b80d2fd [Andrew Ash] Make Spark's block manager port configurable
17c79bb [Andrew Ash] Add a configuration option for spark-shell's class server
f34115d [Andrew Ash] SPARK-1176 Add port configuration for HttpBroadcast
49ee29b [Andrew Ash] SPARK-1174 Add port configuration for HttpFileServer
1c0981a [Andrew Ash] Make port in HttpServer configurable
2014-08-06 00:07:40 -07:00
CodingCat 63bdb1f41b SPARK-2294: fix locality inversion bug in TaskManager
copied from original JIRA (https://issues.apache.org/jira/browse/SPARK-2294):

If an executor E is free, a task may be speculatively assigned to E when there are other tasks in the job that have not been launched (at all) yet. Similarly, a task without any locality preferences may be assigned to E when there was another NODE_LOCAL task that could have been scheduled.
This happens because TaskSchedulerImpl calls TaskSetManager.resourceOffer (which in turn calls TaskSetManager.findTask) with increasing locality levels, beginning with PROCESS_LOCAL, followed by NODE_LOCAL, and so on until the highest currently allowed level. Now, supposed NODE_LOCAL is the highest currently allowed locality level. The first time findTask is called, it will be called with max level PROCESS_LOCAL; if it cannot find any PROCESS_LOCAL tasks, it will try to schedule tasks with no locality preferences or speculative tasks. As a result, speculative tasks or tasks with no preferences may be scheduled instead of NODE_LOCAL tasks.

----

I added an additional parameter in resourceOffer and findTask, maxLocality, indicating when we should consider the tasks without locality preference

Author: CodingCat <zhunansjtu@gmail.com>

Closes #1313 from CodingCat/SPARK-2294 and squashes the following commits:

bf3f13b [CodingCat] rollback some forgotten changes
89f9bc0 [CodingCat] address matei's comments
18cae02 [CodingCat] add test case for node-local tasks
2ba6195 [CodingCat] fix failed test cases
87dd09e [CodingCat] fix style
9b9432f [CodingCat] remove hasNodeLocalOnlyTasks
fdd1573 [CodingCat] fix failed test cases
941a4fd [CodingCat] see my shocked face..........
f600085 [CodingCat] remove hasNodeLocalOnlyTasks checking
0b8a46b [CodingCat] test whether hasNodeLocalOnlyTasks affect the results
73ceda8 [CodingCat] style fix
b3a430b [CodingCat] remove fine granularity tracking for node-local only tasks
f9a2ad8 [CodingCat] simplify the logic in TaskSchedulerImpl
c8c1de4 [CodingCat] simplify the patch
be652ed [CodingCat] avoid unnecessary delay when we only have nopref tasks
dee9e22 [CodingCat] fix locality inversion bug in TaskManager by moving nopref branch
2014-08-05 23:02:58 -07:00
Stephen Boesch 2643e66008 SPARK-2869 - Fix tiny bug in JdbcRdd for closing jdbc connection
I inquired on  dev mailing list about the motivation for checking the jdbc statement instead of the connection in the close() logic of JdbcRDD. Ted Yu believes there essentially is none-  it is a simple cut and paste issue. So here is the tiny fix to patch it.

Author: Stephen Boesch <javadba>
Author: Stephen Boesch <javadba@gmail.com>

Closes #1792 from javadba/closejdbc and squashes the following commits:

363be4f [Stephen Boesch] SPARK-2869 - Fix tiny bug in JdbcRdd for closing jdbc connection (reformat with braces)
6518d36 [Stephen Boesch] SPARK-2689 Fix tiny bug in JdbcRdd for closing jdbc connection
3fb23ed [Stephen Boesch] SPARK-2689 Fix potential leak of connection/PreparedStatement in case of error in JdbcRDD
095b2c9 [Stephen Boesch] Fix tiny bug (likely copy and paste error) in closing jdbc connection
2014-08-05 18:18:08 -07:00
Reynold Xin acff9a7f13 [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.
This can substantially reduce memory usage during shuffle.

Author: Reynold Xin <rxin@apache.org>

Closes #1781 from rxin/SPARK-2503-spark.shuffle.file.buffer.kb and squashes the following commits:

104b8d8 [Reynold Xin] [SPARK-2503] Lower shuffle output buffer (spark.shuffle.file.buffer.kb) to 32KB.
2014-08-05 16:24:50 -07:00
Patrick Wendell 74f82c71b0 SPARK-2380: Support displaying accumulator values in the web UI
This patch adds support for giving accumulators user-visible names and displaying accumulator values in the web UI. This allows users to create custom counters that can display in the UI. The current approach displays both the accumulator deltas caused by each task and a "current" value of the accumulator totals for each stage, which gets update as tasks finish.

Currently in Spark developers have been extending the `TaskMetrics` functionality to provide custom instrumentation for RDD's. This provides a potentially nicer alternative of going through the existing accumulator framework (actually `TaskMetrics` and accumulators are on an awkward collision course as we add more features to the former). The current patch demo's how we can use the feature to provide instrumentation for RDD input sizes. The nice thing about going through accumulators is that users can read the current value of the data being tracked in their programs. This could be useful to e.g. decide to short-circuit a Spark stage depending on how things are going.

![counters](https://cloud.githubusercontent.com/assets/320616/3488815/6ee7bc34-0505-11e4-84ce-e36d9886e2cf.png)

Author: Patrick Wendell <pwendell@gmail.com>

Closes #1309 from pwendell/metrics and squashes the following commits:

8815308 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into HEAD
93fbe0f [Patrick Wendell] Other minor fixes
cc43f68 [Patrick Wendell] Updating unit tests
c991b1b [Patrick Wendell] Moving some code into the Accumulators class
9a9ba3c [Patrick Wendell] More merge fixes
c5ace9e [Patrick Wendell] More merge conflicts
1da15e3 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into metrics
9860c55 [Patrick Wendell] Potential solution to posting listener events
0bb0e33 [Patrick Wendell] Remove "display" variable and assume display = name.isDefined
0ec4ac7 [Patrick Wendell] Java API's
e95bf69 [Patrick Wendell] Stash
be97261 [Patrick Wendell] Style fix
8407308 [Patrick Wendell] Removing examples in Hadoop and RDD class
64d405f [Patrick Wendell] Adding missing file
5d8b156 [Patrick Wendell] Changes based on Kay's review.
9f18bad [Patrick Wendell] Minor style changes and tests
7a63abc [Patrick Wendell] Adding Json serialization and responding to Reynold's feedback
ad85076 [Patrick Wendell] Example of using named accumulators for custom RDD metrics.
0b72660 [Patrick Wendell] Initial WIP example of supporing globally named accumulators.
2014-08-05 13:08:23 -07:00
Thomas Graves 1c5555a23d SPARK-1890 and SPARK-1891- add admin and modify acls
It was easier to combine these 2 jira since they touch many of the same places.  This pr adds the following:

- adds modify acls
- adds admin acls (list of admins/users that get added to both view and modify acls)
- modify Kill button on UI to take modify acls into account
- changes config name of spark.ui.acls.enable to spark.acls.enable since I choose poorly in original name. We keep backwards compatibility so people can still use spark.ui.acls.enable. The acls should apply to any web ui as well as any CLI interfaces.
- send view and modify acls information on to YARN so that YARN interfaces can use (yarn cli for killing applications for example).

Author: Thomas Graves <tgraves@apache.org>

Closes #1196 from tgravescs/SPARK-1890 and squashes the following commits:

8292eb1 [Thomas Graves] review comments
b92ec89 [Thomas Graves] remove unneeded variable from applistener
4c765f4 [Thomas Graves] Add in admin acls
72eb0ac [Thomas Graves] Add modify acls
2014-08-05 12:52:52 -05:00
Reynold Xin 184048f80b [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
Author: Reynold Xin <rxin@apache.org>

Closes #1780 from rxin/kryo-init-size and squashes the following commits:

551b935 [Reynold Xin] [SPARK-2856] Decrease initial buffer size for Kryo to 64KB.
2014-08-05 01:30:46 -07:00
wangfei 9862c614c0 [SPARK-1779] Throw an exception if memory fractions are not between 0 and 1
Author: wangfei <scnbwf@yeah.net>
Author: wangfei <wangfei1@huawei.com>

Closes #714 from scwf/memoryFraction and squashes the following commits:

6e385b9 [wangfei] Update SparkConf.scala
da6ee59 [wangfei] add configs
829a195 [wangfei] add indent
717c0ca [wangfei] updated to make more concise
fc45476 [wangfei] validate memoryfraction in sparkconf
2e79b3d [wangfei] && => ||
43621bd [wangfei] && => ||
cf38bcf [wangfei] throw IllegalArgumentException
14d18ac [wangfei] throw IllegalArgumentException
dff1f0f [wangfei] Update BlockManager.scala
764965f [wangfei] Update ExternalAppendOnlyMap.scala
a59d76b [wangfei] Throw exception when memoryFracton is out of range
7b899c2 [wangfei] 【SPARK-1779】
2014-08-05 00:51:07 -07:00
Andrew Or a646a365e3 [SPARK-2857] Correct properties to set Master / Worker ports
`master.ui.port` and `worker.ui.port` were never picked up by SparkConf, simply because they are not prefixed with "spark." Unfortunately, this is also currently the documented way of setting these values.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1779 from andrewor14/master-worker-port and squashes the following commits:

8475e95 [Andrew Or] Update docs to reflect changes in configs
4db3d5d [Andrew Or] Stop using configs that don't actually work
2014-08-05 00:39:07 -07:00
Matei Zaharia 4fde28c206 SPARK-2711. Create a ShuffleMemoryManager to track memory for all spilling collections
This tracks memory properly if there are multiple spilling collections in the same task (which was a problem before), and also implements an algorithm that lets each thread grow up to 1 / 2N of the memory pool (where N is the number of threads) before spilling, which avoids an inefficiency with small spills we had before (some threads would spill many times at 0-1 MB because the pool was allocated elsewhere).

Author: Matei Zaharia <matei@databricks.com>

Closes #1707 from mateiz/spark-2711 and squashes the following commits:

debf75b [Matei Zaharia] Review comments
24f28f3 [Matei Zaharia] Small rename
c8f3a8b [Matei Zaharia] Update ShuffleMemoryManager to be able to partially grant requests
315e3a5 [Matei Zaharia] Some review comments
b810120 [Matei Zaharia] Create central manager to track memory for all spilling collections
2014-08-04 23:41:03 -07:00
Matei Zaharia 066765d60d SPARK-2685. Update ExternalAppendOnlyMap to avoid buffer.remove()
Replaces this with an O(1) operation that does not have to shift over
the whole tail of the array into the gap produced by the element removed.

Author: Matei Zaharia <matei@databricks.com>

Closes #1773 from mateiz/SPARK-2685 and squashes the following commits:

1ea028a [Matei Zaharia] Update comments in StreamBuffer and EAOM, and reuse ArrayBuffers
eb1abfd [Matei Zaharia] Update ExternalAppendOnlyMap to avoid buffer.remove()
2014-08-04 23:27:53 -07:00
Reynold Xin 05bf4e4aff [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext
Author: Reynold Xin <rxin@apache.org>

Closes #1772 from rxin/accumulator-dagscheduler and squashes the following commits:

6a58520 [Reynold Xin] [SPARK-2323] Exception in accumulator update should not crash DAGScheduler & SparkContext.
2014-08-04 20:39:18 -07:00
Matei Zaharia 8e7d5ba1a2 SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter
All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge not 1.1. This particular set of changes is to make sure that we read exactly the right range of bytes from each spill file in EAOM: some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization) and that would confuse the previous code into reading it as part of the next batch. There are also improvements to cleanup to make sure files are closed.

In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.

Author: Matei Zaharia <matei@databricks.com>

Closes #1722 from mateiz/spark-2792 and squashes the following commits:

5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
18fe865 [Matei Zaharia] Update docs on objectStreamReset
576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
0374217 [Matei Zaharia] Remove super paranoid code to close file handles
bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
2014-08-04 12:59:18 -07:00
Davies Liu 55349f9fe8 [SPARK-1740] [PySpark] kill the python worker
Kill only the python worker related to cancelled tasks.

The daemon will start a background thread to monitor all the opened sockets for all workers. If the socket is closed by JVM, this thread will kill the worker.

When an task is cancelled, the socket to worker will be closed, then the worker will be killed by deamon.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1643 from davies/kill and squashes the following commits:

8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too heavy
46ca150 [Davies Liu] address comment
acd751c [Davies Liu] kill the worker when task is canceled
2014-08-03 15:52:00 -07:00
Andrew Or 3dc55fdf45 [Minor] Fixes on top of #1679
Minor fixes on top of #1679.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1736 from andrewor14/amend-#1679 and squashes the following commits:

3b46f5e [Andrew Or] Minor fixes
2014-08-02 22:00:46 -07:00
GuoQiang Li 4c477117bb SPARK-2804: Remove scalalogging-slf4j dependency
This also Closes #1701.

Author: GuoQiang Li <witgo@qq.com>

Closes #1208 from witgo/SPARK-1470 and squashes the following commits:

422646b [GuoQiang Li] Remove scalalogging-slf4j dependency
2014-08-02 13:59:58 -07:00
Andrew Or e09e18b312 [HOTFIX] Do not throw NPE if spark.test.home is not set
`spark.test.home` was introduced in #1734. This is fine for SBT but is failing maven tests. Either way it shouldn't throw an NPE.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1739 from andrewor14/fix-spark-test-home and squashes the following commits:

ce2624c [Andrew Or] Do not throw NPE if spark.test.home is not set
2014-08-02 12:12:56 -07:00
Andrew Or 148af6082c [SPARK-2454] Do not ship spark home to Workers
When standalone Workers launch executors, they inherit the Spark home set by the driver. This means if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. bin/compute-classpath.sh) that do not exist locally and fail. This is a common scenario if the driver is launched from outside of the cluster.

The solution is to simply not pass the driver's Spark home to the Workers. This PR further makes an attempt to avoid overloading the usages of `spark.home`, which is now only used for setting executor Spark home on Mesos and in python.

This is based on top of #1392 and originally reported by YanTangZhai. Tested on standalone cluster.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1734 from andrewor14/spark-home-reprise and squashes the following commits:

f71f391 [Andrew Or] Revert changes in python
1c2532c [Andrew Or] Merge branch 'master' of github.com:apache/spark into spark-home-reprise
188fc5d [Andrew Or] Avoid using spark.home where possible
09272b7 [Andrew Or] Always use Worker's working directory as spark home
2014-08-02 00:45:38 -07:00
Andrew Or d934801d53 [SPARK-2316] Avoid O(blocks) operations in listeners
The existing code in `StorageUtils` is not the most efficient. Every time we want to update an `RDDInfo` we end up iterating through all blocks on all block managers just to discard most of them. The symptoms manifest themselves in the bountiful UI bugs observed in the wild. Many of these bugs are caused by the slow consumption of events in `LiveListenerBus`, which frequently leads to the event queue overflowing and `SparkListenerEvent`s being dropped on the floor. The changes made in this PR avoid this by first filtering out only the blocks relevant to us before computing storage information from them.

It's worth a mention that this corner of the Spark code is also not very well-tested at all. The bulk of the changes in this PR (more than 60%) is actually test cases for the various logic in `StorageUtils.scala` as well as `StorageTab.scala`. These will eventually be extended to cover the various listeners that constitute the `SparkUI`.

Author: Andrew Or <andrewor14@gmail.com>

Closes #1679 from andrewor14/fix-drop-events and squashes the following commits:

f80c1fa [Andrew Or] Rewrite fold and reduceOption as sum
e132d69 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
14fa1c3 [Andrew Or] Simplify some code + update a few comments
a91be46 [Andrew Or] Make ExecutorsPage blazingly fast
bf6f09b [Andrew Or] Minor changes
8981de1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
af19bc0 [Andrew Or] *UsedByRDD -> *UsedByRdd (minor)
6970bc8 [Andrew Or] Add extensive tests for StorageListener and the new code in StorageUtils
e080b9e [Andrew Or] Reduce run time of StorageUtils.updateRddInfo to near constant
2c3ef6a [Andrew Or] Actually filter out only the relevant RDDs
6fef86a [Andrew Or] Add extensive tests for new code in StorageStatus
b66b6b0 [Andrew Or] Use more efficient underlying data structures for blocks
6a7b7c0 [Andrew Or] Avoid chained operations on TraversableLike
a9ec384 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
b12fcd7 [Andrew Or] Fix tests + simplify sc.getRDDStorageInfo
da8e322 [Andrew Or] Merge branch 'master' of github.com:apache/spark into fix-drop-events
8e91921 [Andrew Or] Iterate through a filtered set of blocks when updating RDDInfo
7b2c4aa [Andrew Or] Rewrite blockLocationsFromStorageStatus + clean up method signatures
41fa50d [Andrew Or] Add a legacy constructor for StorageStatus
53af15d [Andrew Or] Refactor StorageStatus + add a bunch of tests
2014-08-01 23:56:24 -07:00
Patrick Wendell dab37966b0 Revert "[SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api"
This reverts commit adc8303294.
2014-08-01 23:55:30 -07:00
GuoQiang Li adc8303294 [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the directly sfl4j api
Author: GuoQiang Li <witgo@qq.com>

Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:

66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the directly sfl4j api.
2014-08-01 23:55:11 -07:00
Josh Rosen e8e0fd691a [SPARK-2764] Simplify daemon.py process structure
Curently, daemon.py forks a pool of numProcessors subprocesses, and those processes fork themselves again to create the actual Python worker processes that handle data.

I think that this extra layer of indirection is unnecessary and adds a lot of complexity.  This commit attempts to remove this middle layer of subprocesses by launching the workers directly from daemon.py.

See https://github.com/mesos/spark/pull/563 for the original PR that added daemon.py, where I raise some issues with the current design.

Author: Josh Rosen <joshrosen@apache.org>

Closes #1680 from JoshRosen/pyspark-daemon and squashes the following commits:

5abbcb9 [Josh Rosen] Replace magic number: 4 -> EINTR
5495dff [Josh Rosen] Throw IllegalStateException if worker launch fails.
b79254d [Josh Rosen] Detect failed fork() calls; improve error logging.
282c2c4 [Josh Rosen] Remove daemon.py exit logging, since it caused problems:
8554536 [Josh Rosen] Fix daemon’s shutdown(); log shutdown reason.
4e0fab8 [Josh Rosen] Remove shared-memory exit_flag; don't die on worker death.
e9892b4 [Josh Rosen] [WIP] [SPARK-2764] Simplify daemon.py process structure.
2014-08-01 19:38:21 -07:00
Albert Chu 0da07da53e [SPARK-2116] Load spark-defaults.conf from SPARK_CONF_DIR if set
If SPARK_CONF_DIR environment variable is set, search it for spark-defaults.conf.

Author: Albert Chu <chu11@llnl.gov>

Closes #1059 from chu11/SPARK-2116 and squashes the following commits:

9f3ac94 [Albert Chu] SPARK-2116: If SPARK_CONF_DIR environment variable is set, search it for spark-defaults.conf.
2014-08-01 19:00:46 -07:00
Davies Liu 880eabec37 [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD
Convert Row in JavaSchemaRDD into Array[Any] and unpickle them as tuple in Python, then convert them into namedtuple, so use can access fields just like attributes.

This will let nested structure can be accessed as object, also it will reduce the size of serialized data and better performance.

root
 |-- field1: integer (nullable = true)
 |-- field2: string (nullable = true)
 |-- field3: struct (nullable = true)
 |    |-- field4: integer (nullable = true)
 |    |-- field5: array (nullable = true)
 |    |    |-- element: integer (containsNull = false)
 |-- field6: array (nullable = true)
 |    |-- element: struct (containsNull = false)
 |    |    |-- field7: string (nullable = true)

Then we can access them by row.field3.field5[0]  or row.field6[5].field7

It also will infer the schema in Python, convert Row/dict/namedtuple/objects into tuple before serialization, then call applySchema in JVM. During inferSchema(), the top level of dict in row will be StructType, but any nested dictionary will be MapType.

You can use pyspark.sql.Row to convert unnamed structure into Row object, make the RDD can be inferable. Such as:

ctx.inferSchema(rdd.map(lambda x: Row(a=x[0], b=x[1]))

Or you could use Row to create a class just like namedtuple, for example:

Person = Row("name", "age")
ctx.inferSchema(rdd.map(lambda x: Person(*x)))

Also, you can call applySchema to apply an schema to a RDD of tuple/list and turn it into a SchemaRDD. The `schema` should be StructType, see the API docs for details.

schema = StructType([StructField("name, StringType, True),
                                    StructType("age", IntegerType, True)])
ctx.applySchema(rdd, schema)

PS: In order to use namedtuple to inferSchema, you should make namedtuple picklable.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1598 from davies/nested and squashes the following commits:

f1d15b6 [Davies Liu] verify schema with the first few rows
8852aaf [Davies Liu] check type of schema
abe9e6e [Davies Liu] address comments
61b2292 [Davies Liu] add @deprecated to pythonToJavaMap
1e5b801 [Davies Liu] improve cache of classes
51aa135 [Davies Liu] use Row to infer schema
e9c0d5c [Davies Liu] remove string typed schema
353a3f2 [Davies Liu] fix code style
63de8f8 [Davies Liu] fix typo
c79ca67 [Davies Liu] fix serialization of nested data
6b258b5 [Davies Liu] fix pep8
9d8447c [Davies Liu] apply schema provided by string of names
f5df97f [Davies Liu] refactor, address comments
9d9af55 [Davies Liu] use arrry to applySchema and infer schema in Python
84679b3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into nested
0eaaf56 [Davies Liu] fix doc tests
b3559b4 [Davies Liu] use generated Row instead of namedtuple
c4ddc30 [Davies Liu] fix conflict between name of fields and variables
7f6f251 [Davies Liu] address all comments
d69d397 [Davies Liu] refactor
2cc2d45 [Davies Liu] refactor
182fb46 [Davies Liu] refactor
bc6e9e1 [Davies Liu] switch to new Schema API
547bf3e [Davies Liu] Merge branch 'master' into nested
a435b5a [Davies Liu] add docs and code refactor
2c8debc [Davies Liu] Merge branch 'master' into nested
644665a [Davies Liu] use tuple and namedtuple for schemardd
2014-08-01 18:47:41 -07:00
Aaron Davidson 78f2af5822 SPARK-2791: Fix committing, reverting and state tracking in shuffle file consolidation
All changes from this PR are by mridulm and are drawn from his work in #1609. This patch is intended to fix all major issues related to shuffle file consolidation that mridulm found, while minimizing changes to the code, with the hope that it may be more easily merged into 1.1.

This patch is **not** intended as a replacement for #1609, which provides many additional benefits, including fixes to ExternalAppendOnlyMap, improvements to DiskBlockObjectWriter's API, and several new unit tests.

If it is feasible to merge #1609 for the 1.1 deadline, that is a preferable option.

Author: Aaron Davidson <aaron@databricks.com>

Closes #1678 from aarondav/consol and squashes the following commits:

53b3f6d [Aaron Davidson] Correct behavior when writing unopened file
701d045 [Aaron Davidson] Rebase with sort-based shuffle
9160149 [Aaron Davidson] SPARK-2532: Minimal shuffle consolidation fixes
2014-08-01 13:57:19 -07:00
zsxwing f5d9bea20e SPARK-1612: Fix potential resource leaks
JIRA: https://issues.apache.org/jira/browse/SPARK-1612

Move the "close" statements into a "finally" block.

Author: zsxwing <zsxwing@gmail.com>

Closes #535 from zsxwing/SPARK-1612 and squashes the following commits:

ae52f50 [zsxwing] Update to follow the code style
549ba13 [zsxwing] SPARK-1612: Fix potential resource leaks
2014-08-01 13:25:04 -07:00
Liang-Chi Hsieh baf9ce1a4e [SPARK-2490] Change recursive visiting on RDD dependencies to iterative approach
When performing some transformations on RDDs after many iterations, the dependencies of RDDs could be very long. It can easily cause StackOverflowError when recursively visiting these dependencies in Spark core. For example:

    var rdd = sc.makeRDD(Array(1))
    for (i <- 1 to 1000) {
      rdd = rdd.coalesce(1).cache()
      rdd.collect()
    }

This PR changes recursive visiting on rdd's dependencies to iterative approach to avoid StackOverflowError.

In addition to the recursive visiting, since the Java serializer has a known [bug](http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4152790) that causes StackOverflowError too when serializing/deserializing a large graph of objects. So applying this PR only solves part of the problem. Using KryoSerializer to replace Java serializer might be helpful. However, since KryoSerializer is not supported for `spark.closure.serializer` now, I can not test if KryoSerializer can solve Java serializer's problem completely.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #1418 from viirya/remove_recursive_visit and squashes the following commits:

6b2c615 [Liang-Chi Hsieh] change function name; comply with code style.
5f072a7 [Liang-Chi Hsieh] add comments to explain Stack usage.
8742dbb [Liang-Chi Hsieh] comply with code style.
900538b [Liang-Chi Hsieh] change recursive visiting on rdd's dependencies to iterative approach to avoid stackoverflowerror.
2014-08-01 12:12:30 -07:00
Aaron Staple eb5bdcaf6c [SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited partitions.
getPreferredLocs traverses a dependency graph of partitions using depth first search.  Given a complex dependency graph, the old implementation may explore a set of paths in the graph that is exponential in the number of nodes.  By maintaining a set of visited nodes the new implementation avoids revisiting nodes, preventing exponential blowup.

Some comment and whitespace cleanups are also included.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #1362 from staple/SPARK-695 and squashes the following commits:

ecea0f3 [Aaron Staple] address review comments
751c661 [Aaron Staple] [SPARK-695] Add a unit test.
5adf326 [Aaron Staple] Replace getPreferredLocsInternal's HashMap argument with a simpler HashSet.
58e37d0 [Aaron Staple] Replace comment documenting NarrowDependency.
6751ced [Aaron Staple] Revert "Remove unused variable."
04c7097 [Aaron Staple] Fix indentation.
0030884 [Aaron Staple] Remove unused variable.
33f67c6 [Aaron Staple] Clarify comment.
4e42b46 [Aaron Staple] Remove apparently incorrect comment describing NarrowDependency.
65c2d3d [Aaron Staple] [SPARK-695] In DAGScheduler's getPreferredLocs, track set of visited partitions.
2014-08-01 12:04:04 -07:00