Commit graph

6336 commits

Author SHA1 Message Date
Ankur Dave 59e4384e19 Fix Pregel SSSP example in programming guide 2014-01-13 21:02:38 -08:00
Ankur Dave c6023bee60 Fix infinite loop in GraphGenerators.generateRandomEdges
The loop occurred when numEdges < numVertices. This commit fixes it by
allowing generateRandomEdges to generate a multigraph.
2014-01-13 21:02:37 -08:00
Ankur Dave 84d6af8021 Make Graph{,Impl,Ops} serializable to work around capture 2014-01-13 21:02:37 -08:00
Ankur Dave d4d9ece1af Remove Graph.statistics and GraphImpl.printLineage 2014-01-13 21:02:37 -08:00
Andrew Or 839934140f Wording changes per Patrick 2014-01-13 20:51:38 -08:00
Matei Zaharia cc93c2abb1 Disable MLlib tests for now while Jenkins is still on Python 2.6 2014-01-13 20:46:46 -08:00
Patrick Wendell b07bc02a00 Merge pull request #412 from harveyfeng/master
Add default value for HadoopRDD's `cloneRecords` constructor arg

Small mend to https://github.com/apache/incubator-spark/pull/359/files#diff-1 for backwards compatibility
2014-01-13 20:45:22 -08:00
Reynold Xin 33022d6656 Adjusted visibility of various components. 2014-01-13 19:58:53 -08:00
Patrick Wendell a2fee38ee0 Merge pull request #411 from tdas/filestream-fix
Improved logic of finding new files in FileInputDStream

Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T sec) at time T + 1 sec, then fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds. That is, even if a file is reported at T + <batch time> sec, the file stream should be able to catch it.

The new logic, at a high level, is as follows. It keeps track of the new files it found in the previous interval and the mod time of the oldest of those files (let's call it X). Then, in the current interval, it ignores files that were seen in the previous interval and files whose mod time is older than X. So if HDFS reports a new file in the current interval whose mod time falls in the previous interval, that file will be considered. However, if its mod time is earlier than the previous interval (that is, earlier than X), it will be ignored. This is the current limitation, and a future version should improve this behavior further.

Also reduced line lengths in DStream to <=100 chars.
2014-01-13 19:45:26 -08:00
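
The file-selection rule described in the FileInputDStream commit above (a2fee38ee0) can be sketched as follows. This is a minimal, language-neutral illustration in Python, not the Spark implementation: the class name `NewFileTracker`, the `select` method, and the dictionary-based listing are all hypothetical. It also assumes that the first interval accepts every file, and that the remembered state is left unchanged when an interval finds no new files; the commit message does not specify either case.

```python
from dataclasses import dataclass, field


@dataclass
class NewFileTracker:
    """Hypothetical sketch of the 'previous interval' bookkeeping."""

    # Files selected in the previous interval.
    prev_files: set = field(default_factory=set)
    # Mod time of the oldest file selected in the previous interval (X).
    # Assumption: -inf before the first interval, so everything is accepted.
    prev_oldest_mod_time: float = float("-inf")

    def select(self, listing):
        """listing maps path -> mod time as reported in the current interval."""
        selected = {
            path: mod_time
            for path, mod_time in listing.items()
            # Ignore files already seen in the previous interval ...
            if path not in self.prev_files
            # ... and files whose mod time is older than X.
            and mod_time >= self.prev_oldest_mod_time
        }
        if selected:
            # Assumption: state only advances when something was selected.
            self.prev_files = set(selected)
            self.prev_oldest_mod_time = min(selected.values())
        return sorted(selected)
```

With this rule, a file that is reported late but whose mod time is at or after X is still picked up, while a file whose mod time is before X is skipped, which is the limitation the commit message mentions.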
Harvey 9e84e70509 Add default value for HadoopRDD's cloneRecords constructor arg, to maintain backwards compatibility. 2014-01-13 19:43:40 -08:00
Joseph E. Gonzalez ee8931d2c6 Finished documenting vertexrdd. 2014-01-13 19:30:35 -08:00
Patrick Wendell d4cd5debf4 Fix for Kryo Serializer 2014-01-13 19:03:59 -08:00
Reynold Xin 0fbc0b0561 Merge branch 'graphx' of github.com:ankurdave/incubator-spark into graphx 2014-01-13 18:51:22 -08:00
Reynold Xin 0b18bfba1a Updated doc for PageRank. 2014-01-13 18:51:04 -08:00
Reynold Xin 9317286b72 More cleanup. 2014-01-13 18:45:35 -08:00
Reynold Xin 8e5c732430 Moved SVDPlusPlusConf into SVDPlusPlus object itself. 2014-01-13 18:45:20 -08:00
Raymond Liu 4c22c55ad6 Address comments to fix code formats 2014-01-14 10:41:42 +08:00
Joseph E. Gonzalez 552de5d42e Finished second pass on pregel docs. 2014-01-13 18:40:43 -08:00
Joseph E. Gonzalez 622b7f7d39 Minor changes in graphx programming guide. 2014-01-13 18:40:43 -08:00
Raymond Liu 161ab93989 Yarn workerRunnable refactor 2014-01-14 10:36:00 +08:00
Raymond Liu 79a5ba3497 Yarn Client refactor 2014-01-14 10:33:48 +08:00
Reynold Xin 1dce9ce446 Moved PartitionStrategy's into an object. 2014-01-13 18:32:04 -08:00
Reynold Xin ae06d2c22f Updated GraphGenerator. 2014-01-13 18:31:49 -08:00
Reynold Xin 87f335db78 Made more things private. 2014-01-13 18:30:26 -08:00
Reynold Xin a4e12af7aa Merge branch 'graphx' of github.com:ankurdave/incubator-spark into graphx
Conflicts:
	graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
2014-01-13 17:42:59 -08:00
Reynold Xin 02a8f54bfa Miscel doc update. 2014-01-13 17:40:36 -08:00
Tathagata Das 1233b3de01 Merge remote-tracking branch 'apache/master' into filestream-fix 2014-01-13 17:29:19 -08:00
Joseph E. Gonzalez cfe4a29dcb Improvements in example code for the programming guide as well as adding serialization support for GraphImpl to address issues with failed closure capture. 2014-01-13 17:18:31 -08:00
Ankur Dave ae4b75d94a Add EdgeDirection.Either and use it to fix CC bug
The bug was due to a misunderstanding of the activeSetOpt parameter to
Graph.mapReduceTriplets. Passing EdgeDirection.Both causes
mapReduceTriplets to run only on edges with *both* vertices in the
active set. This commit adds EdgeDirection.Either, which causes
mapReduceTriplets to run on edges with *either* vertex in the active
set. This is what connected components needed.
2014-01-13 17:03:03 -08:00
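
The difference between the two active-set modes described in the commit above (ae4b75d94a) can be illustrated with a small sketch. This is plain Python, not the GraphX API: the function `active_edges` and the string mode names are hypothetical stand-ins for the edge filtering that the commit message attributes to `EdgeDirection.Both` and `EdgeDirection.Either`.

```python
def active_edges(edges, active, mode):
    """Return the edges a triplet computation would visit.

    edges  -- list of (src, dst) vertex-id pairs
    active -- set of vertex ids in the active set
    mode   -- "both" or "either" (hypothetical names for this sketch)
    """
    if mode == "both":
        # Only edges whose two endpoints are both active.
        return [(s, d) for s, d in edges if s in active and d in active]
    if mode == "either":
        # Edges with at least one active endpoint.
        return [(s, d) for s, d in edges if s in active or d in active]
    raise ValueError(f"unknown mode: {mode}")


edges = [(1, 2), (2, 3), (3, 4)]
# Suppose only vertex 2 changed its component label in the last round.
print(active_edges(edges, {2}, "both"))    # []
print(active_edges(edges, {2}, "either"))  # [(1, 2), (2, 3)]
```

In the "both" case the changed vertex has no way to send its new label to inactive neighbors, which matches the connected-components bug the commit describes.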
Ankur Dave 1bd5cefcae Remove aggregateNeighbors 2014-01-13 17:03:03 -08:00
Tathagata Das c0bb38e8aa Improved file input stream further. 2014-01-13 16:54:52 -08:00
Reynold Xin dc041cd3b6 Merge branch 'scaladoc1' of github.com:rxin/incubator-spark into graphx 2014-01-13 16:25:21 -08:00
Reynold Xin 01c0d72b32 Merge pull request #410 from rxin/scaladoc1
Updated JavaStreamingContext to make scaladoc compile.

`sbt/sbt doc` used to fail. This fixed it.
2014-01-13 16:24:30 -08:00
Reynold Xin e2d25d2dfe Merge branch 'master' into graphx 2014-01-13 16:21:26 -08:00
Reynold Xin 30328c347b Updated JavaStreamingContext to make scaladoc compile.
`sbt/sbt doc` used to fail. This fixed it.
2014-01-13 15:58:39 -08:00
Ankur Dave 8038da2328 Merge pull request #2 from jegonzal/GraphXCCIssue
Improving documentation and identifying potential bug in CC calculation.
2014-01-13 14:59:30 -08:00
Tathagata Das 27311b1332 Added unpersisting and modified testsuite to better test out metadata cleaning. 2014-01-13 14:57:07 -08:00
Ankur Dave 97cd27e31b Add graph loader links to doc 2014-01-13 14:54:48 -08:00
Ankur Dave 15ca89b11e Fix mapReduceTriplets links in doc 2014-01-13 14:54:33 -08:00
Joseph E. Gonzalez 80e4d98dc6 Improving documentation and identifying potential bug in CC calculation. 2014-01-13 13:40:16 -08:00
Patrick Wendell c3816de504 Changing option wording per discussion with Andrew 2014-01-13 13:25:06 -08:00
Ankur Dave 9fe88627b5 Improve EdgeRDD scaladoc 2014-01-13 13:16:41 -08:00
Ankur Dave ea69cff711 Further improve VertexRDD scaladocs 2014-01-13 12:52:52 -08:00
Patrick Wendell 5d61e051c2 Improvements to external sorting
1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 12:21:39 -08:00
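
Item 2 of the external-sorting commit above (5d61e051c2) mentions batching the serialization. The general idea can be sketched as below; this is an assumption-laden illustration using Python's `pickle`, not Spark's serializer, and the function names and `batch_size` parameter are hypothetical. Each batch is written as an independent serialized unit, so the reader only ever materializes one batch at a time.

```python
import io
import pickle
from itertools import islice


def write_batched(records, out, batch_size):
    """Serialize records to the binary stream `out`, one pickle per batch."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        pickle.dump(batch, out)


def read_batched(inp):
    """Yield records back, loading a single batch into memory at a time."""
    while True:
        try:
            batch = pickle.load(inp)
        except EOFError:
            # End of stream: no more batches to read.
            return
        yield from batch


buf = io.BytesIO()
write_batched(range(10), buf, batch_size=3)
buf.seek(0)
print(list(read_batched(buf)))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```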
Patrick Wendell b93f9d42f2 Merge pull request #400 from tdas/dstream-move
Moved DStream and PairDStream to org.apache.spark.streaming.dstream

Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a little long, but I think it's better to keep it consistent with Spark's structure.

Also fixed persistence of windowed DStream. The RDDs generated by a windowed DStream are essentially unions of underlying RDDs, and persisting these union RDDs would store numerous copies of the underlying data. Instead, setting the persistence level on the windowed DStream now sets the persistence level of the underlying DStream.
2014-01-13 12:18:05 -08:00
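
The persistence change described in the commit above (b93f9d42f2) amounts to delegating `persist` from the windowed stream to its parent. A minimal sketch of that delegation follows; the classes `Stream` and `WindowedStream` and the string storage level are hypothetical simplifications in Python, not Spark's actual classes.

```python
class Stream:
    """Hypothetical stand-in for a stream with a storage level."""

    def __init__(self):
        self.storage_level = None

    def persist(self, level):
        self.storage_level = level
        return self


class WindowedStream(Stream):
    """Hypothetical windowed view over a parent stream."""

    def __init__(self, parent):
        super().__init__()
        self.parent = parent

    def persist(self, level):
        # Persist the underlying stream instead of the windowed unions,
        # so overlapping windows do not store repeated copies of the data.
        self.parent.persist(level)
        return self


base = Stream()
windowed = WindowedStream(base).persist("MEMORY_ONLY")
print(base.storage_level)      # MEMORY_ONLY
print(windowed.storage_level)  # None
```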
Ankur Dave 8ca9773974 Add LiveJournalPageRank example 2014-01-13 12:17:58 -08:00
Saurabh Rawat e922973373 Modifications as suggested in PR feedback-
- mapPartitions, foreachPartition moved to JavaRDDLike
- call scala rdd's setGenerator instead of setting directly in JavaRDD
2014-01-13 23:40:04 +05:30
eklavya fa42951e3b Remove default param from mapPartitions 2014-01-13 18:13:22 +05:30
eklavya 8fe562c0fa Remove classtag from mapPartitions. 2014-01-13 18:09:58 +05:30
eklavya 6a65feebc7 Added foreachPartition method to JavaRDD. 2014-01-13 17:56:47 +05:30