Commit graph

6326 commits

Author SHA1 Message Date
Ankur Dave 02771aa087 Make EdgeDirection val instead of case object for Java compat. 2014-01-11 13:15:46 -08:00
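For context on the Java-compat issue above: Scala case objects surface in Java only through synthetic MODULE$ fields (roughly EdgeDirection$In$.MODULE$), while plain vals on a companion object get static forwarders that Java can call directly. A minimal sketch of the pattern, not the actual GraphX definition:

```scala
// Illustrative sketch of the val-instead-of-case-object pattern;
// the real EdgeDirection in GraphX may differ.
class EdgeDirection private (private val name: String) extends Serializable {
  override def toString: String = "EdgeDirection." + name
}

object EdgeDirection {
  // Plain vals: reachable from Java as EdgeDirection.In(), etc.,
  // via the static forwarders Scala generates on the companion class.
  val In: EdgeDirection = new EdgeDirection("In")
  val Out: EdgeDirection = new EdgeDirection("Out")
  val Both: EdgeDirection = new EdgeDirection("Both")
}
```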
Reynold Xin 6510f04e4d Merge pull request #387 from jerryshao/conf-fix
Fix a small problem where configuration didn't work in ALS
2014-01-11 12:48:26 -08:00
Ankur Dave 574c0d28c2 Use SparkConf in GraphX tests (via LocalSparkContext) 2014-01-11 12:39:30 -08:00
Ankur Dave 55101f5821 One-line Scaladoc comments in Edge and EdgeDirection 2014-01-11 12:35:41 -08:00
Reynold Xin b0fbfccadc Minor update for clone writables and more documentation. 2014-01-11 12:35:10 -08:00
Ankur Dave 64f73f73a0 Fix indent and use SparkConf in Analytics 2014-01-11 12:33:06 -08:00
Reynold Xin ee6e7f9b8c Merge pull request #359 from ScrapCodes/clone-writables
We clone Hadoop keys and values by default and reuse objects if asked to.

We try to clone the most common types of writables ourselves and call WritableUtils.clone otherwise. The intention is to optimize: for NullWritable, for example, no clone is needed, and for Long, Int, and String, creating a new object with the value set should be faster than a generic copy of the object (see the sketch following this entry).

There is another way to do this PR, where we ask separately whether to clone keys and whether to clone values, but I could not think of a use case for it except when one of them is a NullWritable, which I have already worked around. So that seemed unnecessary.
2014-01-11 12:07:55 -08:00
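A minimal sketch of the per-type cloning strategy described above, assuming a helper name of our own (cloneWritable); the actual Spark code may be structured differently:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, NullWritable, Text, Writable, WritableUtils}

// Hypothetical helper, not the actual Spark implementation: specialize
// the common writables and fall back to WritableUtils.clone otherwise.
def cloneWritable[T <: Writable](value: T, conf: Configuration): T = (value match {
  case w: NullWritable => w                       // singleton; nothing to clone
  case w: LongWritable => new LongWritable(w.get) // cheaper than a generic copy
  case w: Text         => new Text(w)
  case w               => WritableUtils.clone(w, conf) // generic fallback
}).asInstanceOf[T]
```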
Ankur Dave 732333d78e Remove GraphLab 2014-01-11 11:49:35 -08:00
Ankur Dave 0b5c49ebad Make nullValue and VertexSet package-private 2014-01-11 11:49:35 -08:00
Joseph E. Gonzalez fac44bbe2c Finished documenting structural operators and starting join operators. 2014-01-11 11:28:01 -08:00
Patrick Wendell b313e15616 Fix UI bug introduced in #244.
The 'duration' field was incorrectly renamed to 'task time' in the table that
lists stages.
2014-01-11 10:52:57 -08:00
Patrick Wendell 4216178d5e Merge pull request #373 from jerryshao/kafka-upgrade
Upgrade Kafka dependency to 0.8.0 release version
2014-01-11 09:46:48 -08:00
Joseph E. Gonzalez 1f45e4e572 starting structural operator discussion. 2014-01-11 09:27:00 -08:00
Ankur Dave feaa078022 algorithms -> lib 2014-01-11 00:30:10 -08:00
jerryshao cbfbc01938 Fix a small problem where configuration didn't work in ALS 2014-01-11 16:22:45 +08:00
Joseph E. Gonzalez 56a245c6bc Addressing comment about Graph Processing in docs. 2014-01-11 00:21:17 -08:00
Ankur Dave 4f7ddf40fc Optimize Edge.lexicographicOrdering 2014-01-11 00:15:01 -08:00
Joseph E. Gonzalez 0c9d39bbaa More organizational changes and dropping the benchmark plot. 2014-01-11 00:09:08 -08:00
Ankur Dave 34496d6a9f Move Analytics to algorithms and fix doc 2014-01-11 00:08:36 -08:00
Joseph E. Gonzalez b8a44f12a5 More edits. 2014-01-10 23:52:24 -08:00
Ankur Dave 362b9422e4 Soften wording about GraphX superseding Bagel 2014-01-10 23:48:32 -08:00
Ankur Dave 2d7e8d8c48 Add GC note to GraphLab 2014-01-10 23:46:02 -08:00
Reynold Xin 92ad18b00e Merge pull request #376 from prabeesh/master
Change clientId to random clientId

The client identifier should be unique across all clients connecting to the same server. A convenience method, generateClientId(), is provided to generate a random client ID that should satisfy this criterion: it returns a randomly generated identifier based on the current user's login name and the system time. Because the server uses the client identifier to recognize a client when it reconnects, the client must use the same identifier between connections if durable subscriptions are to be used (a short sketch follows this entry).
2014-01-10 23:25:15 -08:00
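A short sketch of the convenience method described above, using the Eclipse Paho MQTT client; the broker URL is a placeholder:

```scala
import org.eclipse.paho.client.mqttv3.MqttClient
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

// generateClientId() builds an identifier from the current user's login
// name and the system time, so concurrent clients should not collide.
val clientId = MqttClient.generateClientId()
val client = new MqttClient("tcp://localhost:1883", clientId, new MemoryPersistence())
```

Note the trade-off the description mentions: a random ID avoids collisions, but durable subscriptions require reusing the same ID across reconnects.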
Reynold Xin 0b5ce7af17 Merge pull request #386 from pwendell/typo-fix
Small typo fix
2014-01-10 23:23:21 -08:00
Andrew Or bb8098f203 Add number of bytes spilled to Web UI 2014-01-10 21:40:55 -08:00
Reza Zadeh 1afdeaeb2f add dimension parameters to example 2014-01-10 21:30:54 -08:00
Ankur Dave a696be1e01 Finish d1d2b6d9b6 2014-01-10 21:18:34 -08:00
Ankur Dave d1d2b6d9b6 Remove blank lines added to Spark core 2014-01-10 21:17:32 -08:00
Matei Zaharia 1d7bef0c91 Merge pull request #381 from mateiz/default-ttl
Fix default TTL for metadata cleaner

It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
2014-01-10 18:53:03 -08:00
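For reference, spark.cleaner.ttl is an ordinary configuration property; a minimal sketch of turning it on explicitly (per the fix above, it is off unless set):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// spark.cleaner.ttl is in seconds; when set, the metadata cleaner
// periodically forgets state older than the TTL (useful for
// long-running jobs, but off by default).
val conf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("ttl-example")
  .set("spark.cleaner.ttl", "3600")
val sc = new SparkContext(conf)
```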
Ankur Dave c4fb6a87d3 Fix scaladoc warnings 2014-01-10 18:36:42 -08:00
Andrew Or e6447152b3 Induce spilling in ExternalAppendOnlyMapSuite 2014-01-10 18:33:48 -08:00
Ankur Dave 0ca18b8b07 Revert GraphX changes to SparkILoopInit
The changes were to support a custom banner in spark-shell for use by
graphx-shell, but once GraphX is merged into Spark, a separate shell
will be unnecessary.
2014-01-10 18:05:11 -08:00
Ankur Dave 41d6586e8e Revert changes to Spark's (PrimitiveKey)OpenHashMap; copy PKOHM to graphx 2014-01-10 18:00:54 -08:00
Patrick Wendell 44d6a8e3d8 Merge pull request #382 from RongGu/master
Fix a typo in comment lines
2014-01-10 17:51:50 -08:00
Patrick Wendell 08370a52b8 Small typo fix 2014-01-10 17:47:15 -08:00
Patrick Wendell 88faa30a42 Merge pull request #385 from shivaram/add-i2-instances
Add i2 instance types to Spark EC2.

Using data from http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/ and http://www.ec2instances.info/
2014-01-10 17:14:22 -08:00
Matei Zaharia 942c80b34c Fix one unit test that was not setting spark.cleaner.ttl 2014-01-10 16:32:36 -08:00
Patrick Wendell f26553102c Merge pull request #383 from tdas/driver-test
API for automatic driver recovery for streaming programs and other bug fixes

1. Added Scala and Java API for automatically loading checkpoint if it exists in the provided checkpoint directory.

  Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext
  Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)`, returns a JavaStreamingContext

  See the RecoverableNetworkWordCount below as an example of how to use it (a minimal Scala sketch also follows this entry).

2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. Also, it ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to the checkpoint.

3. Fixed bug in cleaning up of checkpointed RDDs created by DStream. Specifically, this fix ensures that checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery.

4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clearing records that were last accessed before a threshold timestamp).

5. Added caching for file modification time in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared.

This PR is not entirely final, as I may make some minor additions: a Java example, and adding StreamingContext.getOrCreate to a unit test.

Edit: Java example to be added later, unit test added.
2014-01-10 16:25:44 -08:00
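A hedged sketch of the Scala API from point 1, modeled on the RecoverableNetworkWordCount pattern; the checkpoint directory and batch interval are placeholders:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "/tmp/checkpoint"  // placeholder path

// Builds a fresh context; only invoked when no checkpoint exists.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("recoverable-example")
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint(checkpointDir)
  // ... define the DStream operations here ...
  ssc
}

// Recovers the context from the checkpoint if present, else creates it.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()
```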
Patrick Wendell d37408f39c Merge pull request #377 from andrewor14/master
External Sorting for Aggregator and CoGroupedRDDs (Revisited)

(This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)

The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.

The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.

Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
2014-01-10 16:25:01 -08:00
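A toy sketch of the spill-on-threshold idea described above; Spark's real ExternalAppendOnlyMap writes sorted runs to disk files and merge-reads them, whereas this keeps everything in memory purely for illustration:

```scala
import scala.collection.mutable

// Toy model: buffer insertions in memory, "spill" a sorted run whenever
// the buffer hits the threshold, and merge all runs back in key order.
class SpillableMap[K, V](maxInMemory: Int)(implicit ord: Ordering[K]) {
  private val current = mutable.Map.empty[K, V]
  private val runs = mutable.ArrayBuffer.empty[Seq[(K, V)]]

  def update(key: K, value: V): Unit = {
    current(key) = value
    if (current.size >= maxInMemory) {
      runs += current.toSeq.sortBy(_._1)  // sort before spilling
      current.clear()
    }
  }

  def sortedIterator: Iterator[(K, V)] =
    (runs.flatten ++ current.toSeq).sortBy(_._1).iterator
}
```

Under normal loads the threshold is never hit and the structure behaves like a plain map, mirroring the "little overhead" claim above.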
Ankur Dave 85a6645d31 Add doc for Algorithms 2014-01-10 16:08:58 -08:00
Ankur Dave 04c20e7f4f Minor cleanup to docs 2014-01-10 15:58:30 -08:00
Ankur Dave 1788729273 Move VertexIdToIndexMap into impl 2014-01-10 15:58:18 -08:00
Ankur Dave 57d7487d3d Improve docs for VertexRDD 2014-01-10 15:48:20 -08:00
Tathagata Das 4f39e79c23 Merge remote-tracking branch 'apache/master' into driver-test
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
2014-01-10 15:47:01 -08:00
Andrew Or 2e393cd5fd Update documentation for externalSorting 2014-01-10 15:45:38 -08:00
Tathagata Das 82f07deeda Modified streaming.FailureSuite tests to test StreamingContext.getOrCreate. 2014-01-10 15:37:05 -08:00
Reynold Xin 0eaf01c5ed Merge pull request #369 from pillis/master
SPARK-961 Add a Vector.random() method

Added the method and test cases
2014-01-10 15:32:19 -08:00
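The merge message above is terse; as a hypothetical illustration only (the actual SPARK-961 signature may differ), a random-vector factory could look like:

```scala
import scala.util.Random

// Hypothetical sketch, not the SPARK-961 implementation: fill a dense
// vector of the given length with uniform values in [0, 1).
def randomVector(length: Int, rng: Random = new Random()): Array[Double] =
  Array.fill(length)(rng.nextDouble())

// e.g. randomVector(5) yields a 5-element array of random doubles
```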
Ankur Dave 11dd35c28b Clean up GraphGenerators 2014-01-10 15:23:32 -08:00
Ankur Dave 9e48af6dba Remove unused HashUtils class 2014-01-10 15:22:57 -08:00
Ankur Dave b437ed62a8 graph -> graphx in pom.xml 2014-01-10 15:22:31 -08:00