Commit graph

5435 commits

Author SHA1 Message Date
Henry Saputra 91a563608e Merge branch 'master' into remove_simpleredundantreturn_scala 2014-01-12 10:34:13 -08:00
Henry Saputra 93a65e5fde Remove simple redundant return statement for Scala methods/functions:
-) Only change simple return statements at the end of method
-) Ignore the complex if-else check
-) Ignore the ones inside synchronized
2014-01-12 10:30:04 -08:00
Reynold Xin 288a878999 Merge pull request #389 from rxin/clone-writables
Minor update for clone writables and more documentation.
2014-01-11 21:53:19 -08:00
Reynold Xin dbc11df411 Merge pull request #388 from pwendell/master
Fix UI bug introduced in #244.

The 'duration' field was incorrectly renamed to 'task time' in the table that
lists stages.
2014-01-11 18:07:13 -08:00
Reynold Xin 362cda18bc Renamed cloneKeyValues to cloneRecords; updated docs. 2014-01-11 18:01:29 -08:00
Patrick Wendell 409866b351 Merge pull request #393 from pwendell/revert-381
Revert PR 381

This PR missed a bunch of test cases that require "spark.cleaner.ttl". I think it is what is causing test failures on Jenkins right now (though it's a bit hard to tell because the DNS for cs.berkeley.edu is down).

I'm submitting this to see if it fixes jeknins. I did try just patching various tests but it was taking a really long time because there are a bunch of them, so for now I'm just seeing if a revert works.
2014-01-11 17:12:06 -08:00
Patrick Wendell 07b952e1d1 Revert "Fix default TTL for metadata cleaner"
This reverts commit 669ba4caa9.
2014-01-11 16:07:10 -08:00
Patrick Wendell 22d4d62420 Revert "Fix one unit test that was not setting spark.cleaner.ttl"
This reverts commit 942c80b34c.
2014-01-11 16:07:03 -08:00
Reynold Xin 6510f04e4d Merge pull request #387 from jerryshao/conf-fix
Fix configure didn't work small problem in ALS
2014-01-11 12:48:26 -08:00
Reynold Xin b0fbfccadc Minor update for clone writables and more documentation. 2014-01-11 12:35:10 -08:00
Reynold Xin ee6e7f9b8c Merge pull request #359 from ScrapCodes/clone-writables
We clone hadoop key and values by default and reuse objects if asked to.

 We try to clone for most common types of writables and we call WritableUtils.clone otherwise intention is to optimize, for example for NullWritable there is no need and for Long, int and String creating a new object with value set would be faster than doing copy on object hopefully.

There is another way to do this PR where we ask for both key and values whether to clone them or not, but could not think of a use case for it except either of them is actually a NullWritable for which I have already worked around. So thought that would be unnecessary.
2014-01-11 12:07:55 -08:00
Patrick Wendell b313e15616 Fix UI bug introduced in #244.
The 'duration' field was incorrectly renamed to 'task time' in the table that
lists stages.
2014-01-11 10:52:57 -08:00
Patrick Wendell 4216178d5e Merge pull request #373 from jerryshao/kafka-upgrade
Upgrade Kafka dependecy to 0.8.0 release version
2014-01-11 09:46:48 -08:00
jerryshao cbfbc01938 Fix configure didn't work small problem in ALS 2014-01-11 16:22:45 +08:00
Reynold Xin 92ad18b00e Merge pull request #376 from prabeesh/master
Change clientId to random clientId

The client identifier should be unique across all clients connecting to the same server. A convenience method is provided to generate a random client id that should satisfy this criteria - generateClientId(). Returns a randomly generated client identifier based on the current user's login name and the system time. As the client identifier is used by the server to identify a client when it reconnects, the client must use the same identifier between connections if durable subscriptions are to be used.
2014-01-10 23:25:15 -08:00
Reynold Xin 0b5ce7af17 Merge pull request #386 from pwendell/typo-fix
Small typo fix
2014-01-10 23:23:21 -08:00
Matei Zaharia 1d7bef0c91 Merge pull request #381 from mateiz/default-ttl
Fix default TTL for metadata cleaner

It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
2014-01-10 18:53:03 -08:00
Patrick Wendell 44d6a8e3d8 Merge pull request #382 from RongGu/master
Fix a type error in comment lines

Fix a type error in comment lines
2014-01-10 17:51:50 -08:00
Patrick Wendell 08370a52b8 Small typo fix 2014-01-10 17:47:15 -08:00
Patrick Wendell 88faa30a42 Merge pull request #385 from shivaram/add-i2-instances
Add i2 instance types to Spark EC2.

Using data from http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/ and http://www.ec2instances.info/
2014-01-10 17:14:22 -08:00
Matei Zaharia 942c80b34c Fix one unit test that was not setting spark.cleaner.ttl 2014-01-10 16:32:36 -08:00
Patrick Wendell f26553102c Merge pull request #383 from tdas/driver-test
API for automatic driver recovery for streaming programs and other bug fixes

1. Added Scala and Java API for automatically loading checkpoint if it exists in the provided checkpoint directory.

  Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext
  Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)`, return a JavaStreamingContext

  See the RecoverableNetworkWordCount below as an example of how to use it.

2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. Also, it ensure that spark.driver.* and spark.hostPort is cleared from SparkConf before being written to checkpoint.

3. Fixed bug in cleaning up of checkpointed RDDs created by DStream. Specifically, this fix ensures that checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery.

4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clear records were last accessed before a threshold timestamp).

5. Added caching for file modification time in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared.

This PR is not entirely final as I may make some minor additions - a Java examples, and adding StreamingContext.getOrCreate to unit test.

Edit: Java example to be added later, unit test added.
2014-01-10 16:25:44 -08:00
Patrick Wendell d37408f39c Merge pull request #377 from andrewor14/master
External Sorting for Aggregator and CoGroupedRDDs (Revisited)

(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)

The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.

The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.

Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
2014-01-10 16:25:01 -08:00
Tathagata Das 4f39e79c23 Merge remote-tracking branch 'apache/master' into driver-test
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
2014-01-10 15:47:01 -08:00
Andrew Or 2e393cd5fd Update documentation for externalSorting 2014-01-10 15:45:38 -08:00
Tathagata Das 82f07deeda Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate. 2014-01-10 15:37:05 -08:00
Reynold Xin 0eaf01c5ed Merge pull request #369 from pillis/master
SPARK-961 Add a Vector.random() method

Added method and testcases
2014-01-10 15:32:19 -08:00
Andrew Or e4c51d2113 Address Patrick's and Reynold's comments
Aside from trivial formatting changes, use nulls instead of Options for
DiskMapIterator, and add documentation for spark.shuffle.externalSorting
and spark.shuffle.memoryFraction.

Also, set spark.shuffle.memoryFraction to 0.3, and spark.storage.memoryFraction = 0.6.
2014-01-10 15:09:51 -08:00
RongGu 94776f753f fix a type error in comment lines 2014-01-11 05:43:56 +08:00
Thomas Graves 7cef8435d7 Merge pull request #371 from tgravescs/yarn_client_addjar_misc_fixes
Yarn client addjar and misc fixes

Fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on yarn in hadoop 2.X, add documentation, change heartbeat interval to be same code as the yarn-standalone so it doesn't take so long to get containers and exit.
2014-01-10 15:34:15 -06:00
Patrick Wendell 7b58f116e5 Merge pull request #384 from pwendell/debug-logs
Make DEBUG-level logs consummable.

Removes two things that caused issues with the debug logs:

(a) Internal polling in the DAGScheduler was polluting the logs.
(b) The Scala REPL logs were really noisy.
2014-01-10 12:47:46 -08:00
Shivaram Venkataraman 7c4e6e1bf1 Add i2 instance types to Spark EC2. 2014-01-10 12:44:55 -08:00
Tathagata Das e4bb845238 Updated docs based on Patrick's comments in PR 383. 2014-01-10 12:17:09 -08:00
Patrick Wendell e9ed2d9e82 Make DEBUG-level logs consummable.
Removes two things that caused issues with the debug logs:

(a) Internal polling in the DAGScheduler was polluting the logs.
(b) The Scala REPL logs were really noisy.
2014-01-10 10:33:24 -08:00
Tathagata Das 2213a5a47f Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test 2014-01-10 05:06:22 -08:00
Tathagata Das 740730a179 Fixed conf/slaves and updated docs. 2014-01-10 05:06:15 -08:00
Tathagata Das 4f609f7901 Removed spark.hostPort and other setting from SparkConf before saving to checkpoint. 2014-01-10 12:58:07 +00:00
Tathagata Das d7ec73ac76 Merge branch 'driver-test' of github.com:tdas/incubator-spark into driver-test 2014-01-10 11:44:17 +00:00
Tathagata Das 9d3d9c8251 Refactored graph checkpoint file reading and writing code to make it cleaner and easily debuggable. 2014-01-10 11:44:02 +00:00
Matei Zaharia 669ba4caa9 Fix default TTL for metadata cleaner
It seems to have been set to 3500 in a previous commit for debugging,
but it should be off by default
2014-01-10 00:21:36 -08:00
Pillis 8d021b42bc SPARK-961. Add a Vector.random() method - update 1 2014-01-10 00:07:36 -08:00
Matei Zaharia 0ebc97305a Merge pull request #375 from mateiz/option-fix
Fix bug added when we changed AppDescription.maxCores to an Option

The Scala compiler warned about this -- we were comparing an Option against an integer now.
2014-01-09 23:58:49 -08:00
Patrick Wendell dd03cea02a Merge pull request #378 from pwendell/consolidate_on
Enable shuffle consolidation by default.

Bump this to being enabled for 0.9.0.
2014-01-09 23:38:03 -08:00
Patrick Wendell 460f655cc6 Enable shuffle consolidation by default.
Bump this to being enabled for 0.9.0.
2014-01-09 22:42:50 -08:00
Patrick Wendell 997c830e0b Merge pull request #363 from pwendell/streaming-logs
Set default logging to WARN for Spark streaming examples.

This programatically sets the log level to WARN by default for streaming
tests. If the user has already specified a log4j.properties file,
the user's file will take precedence over this default.
2014-01-09 22:22:20 -08:00
Andrew Or 372a533a6c Fix wonky imports from merge 2014-01-09 21:47:49 -08:00
Andrew Or aa5002bb96 Defensively allocate memory from global pool
This is an alternative to the existing approach, which evenly distributes the
collective shuffle memory among all running tasks. In the new approach, each
thread requests a chunk of memory whenever its map is about to multiplicatively
grow. If there is sufficient memory in the global pool, the thread allocates it
and grows its map. Otherwise, it spills.

A danger with the previous approach is that a new task may quickly fill up its
map before old tasks finish spilling, potentially causing an OOM. This approach
prevents this scenario as it favors existing tasks over new tasks; any thread
that may step over the boundary of other threads defensively backs off and
starts spilling.

Testing through spark-perf reveals: (1) When no spills have occured, the
performance of external sorting using this memory management approach is
essentially the same as without external sorting. (2) When one or more spills
have occured, the performance of external sorting is a small multiple (3x) worse
2014-01-09 21:43:58 -08:00
Andrew Or d76e1f90a8 Merge github.com:apache/incubator-spark
Conflicts:
	core/src/main/scala/org/apache/spark/SparkEnv.scala
	streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
2014-01-09 21:38:48 -08:00
Patrick Wendell 7b748b83a1 Minor clean-up 2014-01-09 20:42:48 -08:00
Patrick Wendell 300eaa994c Merge pull request #353 from pwendell/ipython-simplify
Simplify and fix pyspark script.

This patch removes compatibility for IPython < 1.0 but fixes the launch
script and makes it much simpler.

I tested this using the three commands in the PySpark documentation page:

1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark

There are two changes:
- We rely on PYTHONSTARTUP env var to start PySpark
- Removed the quotes around $IPYTHON_OPTS... having quotes
  gloms them together as a single argument passed to `exec` which
  seemed to cause ipython to fail (it instead expects them as
  multiple arguments).
2014-01-09 20:29:51 -08:00