Commit graph

6166 commits

Author SHA1 Message Date
Ankur Dave 362b9422e4 Soften wording about GraphX superseding Bagel 2014-01-10 23:48:32 -08:00
Ankur Dave 2d7e8d8c48 Add GC note to GraphLab 2014-01-10 23:46:02 -08:00
Reynold Xin 92ad18b00e Merge pull request #376 from prabeesh/master
Change clientId to random clientId

The client identifier should be unique across all clients connecting to the same server. A convenience method, generateClientId(), is provided to generate a random client id that should satisfy this criterion. It returns a randomly generated client identifier based on the current user's login name and the system time. As the client identifier is used by the server to identify a client when it reconnects, the client must use the same identifier between connections if durable subscriptions are to be used.
2014-01-10 23:25:15 -08:00
Reynold Xin 0b5ce7af17 Merge pull request #386 from pwendell/typo-fix
Small typo fix
2014-01-10 23:23:21 -08:00
Andrew Or bb8098f203 Add number of bytes spilled to Web UI 2014-01-10 21:40:55 -08:00
Ankur Dave a696be1e01 Finish d1d2b6d9b6 2014-01-10 21:18:34 -08:00
Ankur Dave d1d2b6d9b6 Remove blank lines added to Spark core 2014-01-10 21:17:32 -08:00
Matei Zaharia 1d7bef0c91 Merge pull request #381 from mateiz/default-ttl
Fix default TTL for metadata cleaner

It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
2014-01-10 18:53:03 -08:00
Ankur Dave c4fb6a87d3 Fix scaladoc warnings 2014-01-10 18:36:42 -08:00
Andrew Or e6447152b3 Induce spilling in ExternalAppendOnlyMapSuite 2014-01-10 18:33:48 -08:00
Ankur Dave 0ca18b8b07 Revert GraphX changes to SparkILoopInit
The changes were to support a custom banner in spark-shell for use by
graphx-shell, but once GraphX is merged into Spark, a separate shell
will be unnecessary.
2014-01-10 18:05:11 -08:00
Ankur Dave 41d6586e8e Revert changes to Spark's (PrimitiveKey)OpenHashMap; copy PKOHM to graphx 2014-01-10 18:00:54 -08:00
Patrick Wendell 44d6a8e3d8 Merge pull request #382 from RongGu/master
Fix a type error in comment lines
2014-01-10 17:51:50 -08:00
Patrick Wendell 08370a52b8 Small typo fix 2014-01-10 17:47:15 -08:00
Patrick Wendell 88faa30a42 Merge pull request #385 from shivaram/add-i2-instances
Add i2 instance types to Spark EC2.

Using data from http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/ and http://www.ec2instances.info/
2014-01-10 17:14:22 -08:00
Matei Zaharia 942c80b34c Fix one unit test that was not setting spark.cleaner.ttl 2014-01-10 16:32:36 -08:00
Patrick Wendell f26553102c Merge pull request #383 from tdas/driver-test
API for automatic driver recovery for streaming programs and other bug fixes

1. Added Scala and Java API for automatically loading checkpoint if it exists in the provided checkpoint directory.

  Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext
  Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext

  See the RecoverableNetworkWordCount below as an example of how to use it.

2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. Also, it ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to checkpoint.

3. Fixed bug in cleaning up of checkpointed RDDs created by DStream. Specifically, this fix ensures that checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery.

4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clearing records that were last accessed before a threshold timestamp).

5. Added caching for file modification time in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared.
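The access-time behavior described in item 4 can be illustrated with a minimal Python sketch. This is not Spark's TimeStampedHashMap; the class name `TimeStampedMap`, the `update_on_get` flag, the injectable `clock`, and `clear_older_than` are all hypothetical names chosen for illustration.

```python
import time


class TimeStampedMap:
    """Toy map that records a timestamp per entry (not Spark's implementation)."""

    def __init__(self, update_on_get=False, clock=time.time):
        self.update_on_get = update_on_get  # refresh the timestamp on reads if True
        self.clock = clock                  # injectable clock (assumption, eases testing)
        self._data = {}                     # key -> (value, timestamp)

    def put(self, key, value):
        # Writes always stamp the entry with the current time.
        self._data[key] = (value, self.clock())

    def get(self, key, default=None):
        if key not in self._data:
            return default
        value, _ = self._data[key]
        if self.update_on_get:
            # Optional: treat a read as an access and refresh the timestamp.
            self._data[key] = (value, self.clock())
        return value

    def clear_older_than(self, threshold):
        # Drop entries whose timestamp is strictly before the threshold.
        stale = [k for k, (_, ts) in self._data.items() if ts < threshold]
        for k in stale:
            del self._data[k]

    def __len__(self):
        return len(self._data)
```

With `update_on_get=True`, an entry that is read regularly survives `clear_older_than`, while an entry that was written once and never read again is dropped once the threshold passes its write time.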

This PR is not entirely final as I may make some minor additions: a Java example, and adding StreamingContext.getOrCreate to the unit tests.

Edit: Java example to be added later, unit test added.
2014-01-10 16:25:44 -08:00
Patrick Wendell d37408f39c Merge pull request #377 from andrewor14/master
External Sorting for Aggregator and CoGroupedRDDs (Revisited)

(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)

The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.

The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.

Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
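The spill-and-merge idea described above can be sketched in a few lines of Python. This is a toy analogue, not the ExternalAppendOnlyMap code: the class name `SpillingMap`, the entry-count threshold `max_in_memory` (the real map works from estimated memory usage), and the use of pickle and temporary files are all assumptions made for illustration.

```python
import heapq
import pickle
import tempfile
from itertools import groupby
from operator import itemgetter


class SpillingMap:
    """Toy spill-to-disk aggregation map (illustrative only, not Spark's code)."""

    def __init__(self, combine, max_in_memory=1000):
        self.combine = combine              # merges two values for the same key
        self.max_in_memory = max_in_memory  # hypothetical threshold, in entries
        self.current = {}                   # in-memory buffer
        self.spills = []                    # temp files, each holding one sorted run

    def insert(self, key, value):
        if key in self.current:
            self.current[key] = self.combine(self.current[key], value)
        else:
            self.current[key] = value
        if len(self.current) > self.max_in_memory:
            self._spill()

    def _spill(self):
        # Sort the buffer by key and write it out as one sorted run.
        f = tempfile.TemporaryFile()
        for pair in sorted(self.current.items(), key=itemgetter(0)):
            pickle.dump(pair, f)
        self.spills.append(f)
        self.current = {}

    @staticmethod
    def _read_run(f):
        # Stream (key, value) pairs back from one spilled run.
        f.seek(0)
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                return

    def items(self):
        # Merge all sorted runs plus the in-memory buffer, combining equal keys.
        runs = [self._read_run(f) for f in self.spills]
        runs.append(iter(sorted(self.current.items(), key=itemgetter(0))))
        merged = heapq.merge(*runs, key=itemgetter(0))
        for key, group in groupby(merged, key=itemgetter(0)):
            values = [v for _, v in group]
            result = values[0]
            for v in values[1:]:
                result = self.combine(result, v)
            yield key, result
```

When nothing is spilled, `items()` simply iterates over the sorted in-memory buffer, which mirrors the point that the external map adds little overhead when the threshold is never reached.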
2014-01-10 16:25:01 -08:00
Ankur Dave 85a6645d31 Add doc for Algorithms 2014-01-10 16:08:58 -08:00
Ankur Dave 04c20e7f4f Minor cleanup to docs 2014-01-10 15:58:30 -08:00
Ankur Dave 1788729273 Move VertexIdToIndexMap into impl 2014-01-10 15:58:18 -08:00
Ankur Dave 57d7487d3d Improve docs for VertexRDD 2014-01-10 15:48:20 -08:00
Tathagata Das 4f39e79c23 Merge remote-tracking branch 'apache/master' into driver-test
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
2014-01-10 15:47:01 -08:00
Andrew Or 2e393cd5fd Update documentation for externalSorting 2014-01-10 15:45:38 -08:00
Tathagata Das 82f07deeda Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate. 2014-01-10 15:37:05 -08:00
Reynold Xin 0eaf01c5ed Merge pull request #369 from pillis/master
SPARK-961 Add a Vector.random() method

Added method and testcases
2014-01-10 15:32:19 -08:00
Ankur Dave 11dd35c28b Clean up GraphGenerators 2014-01-10 15:23:32 -08:00
Ankur Dave 9e48af6dba Remove unused HashUtils class 2014-01-10 15:22:57 -08:00
Ankur Dave b437ed62a8 graph -> graphx in pom.xml 2014-01-10 15:22:31 -08:00
Andrew Or e4c51d2113 Address Patrick's and Reynold's comments
Aside from trivial formatting changes, use nulls instead of Options for
DiskMapIterator, and add documentation for spark.shuffle.externalSorting
and spark.shuffle.memoryFraction.

Also, set spark.shuffle.memoryFraction to 0.3 and spark.storage.memoryFraction to 0.6.
2014-01-10 15:09:51 -08:00
RongGu 94776f753f fix a type error in comment lines 2014-01-11 05:43:56 +08:00
Thomas Graves 7cef8435d7 Merge pull request #371 from tgravescs/yarn_client_addjar_misc_fixes
Yarn client addjar and misc fixes

Fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on YARN in Hadoop 2.x, add documentation, and change the heartbeat interval to use the same code as yarn-standalone mode so that it doesn't take so long to get containers and exit.
2014-01-10 15:34:15 -06:00
Ankur Dave 7bda997785 Improve docs for PartitionStrategy 2014-01-10 13:00:28 -08:00
Patrick Wendell 7b58f116e5 Merge pull request #384 from pwendell/debug-logs
Make DEBUG-level logs consumable.

Removes two things that caused issues with the debug logs:

(a) Internal polling in the DAGScheduler was polluting the logs.
(b) The Scala REPL logs were really noisy.
2014-01-10 12:47:46 -08:00
Ankur Dave eb4b46f8d1 Improve docs for GraphOps 2014-01-10 12:46:00 -08:00
Shivaram Venkataraman 7c4e6e1bf1 Add i2 instance types to Spark EC2. 2014-01-10 12:44:55 -08:00
Ankur Dave 9454fa1f6c Remove duplicate method in GraphLoader and improve docs 2014-01-10 12:37:20 -08:00
Ankur Dave 37611e57f6 Improve docs for EdgeRDD, EdgeTriplet, and GraphLab 2014-01-10 12:37:03 -08:00
Ankur Dave eee9bc0958 Remove commented-out perf files 2014-01-10 12:36:15 -08:00
Ankur Dave c39ec3017f Remove some commented code 2014-01-10 12:17:17 -08:00
Tathagata Das e4bb845238 Updated docs based on Patrick's comments in PR 383. 2014-01-10 12:17:09 -08:00
Ankur Dave 5fcd2a61b4 Finish cleaning up Graph docs 2014-01-10 12:17:04 -08:00
Ankur Dave 4c114a7556 Start cleaning up Scaladocs in Graph and EdgeRDD 2014-01-10 11:37:54 -08:00
Ankur Dave 3eb83191cb Generate GraphX docs 2014-01-10 11:37:28 -08:00
Ankur Dave 6bd9a78e78 Add back Bagel links to docs, but mark them superseded 2014-01-10 11:37:10 -08:00
Ankur Dave cfc10c74a3 Remove EdgeTriplet.{src,dst}Stale, which were unused 2014-01-10 10:43:23 -08:00
Ankur Dave bf50e8c6cd Remove commented code from Analytics 2014-01-10 10:37:04 -08:00
Ankur Dave 1b2aad918c Update graphx/pom.xml to mirror mllib/pom.xml 2014-01-10 10:34:40 -08:00
Patrick Wendell e9ed2d9e82 Make DEBUG-level logs consumable.
Removes two things that caused issues with the debug logs:

(a) Internal polling in the DAGScheduler was polluting the logs.
(b) The Scala REPL logs were really noisy.
2014-01-10 10:33:24 -08:00
Ankur Dave 23d2995116 Merge pull request #1 from jegonzal/graphx
ProgrammingGuide
2014-01-10 10:20:02 -08:00