Commit graph

465 commits

Author SHA1 Message Date
Tathagata Das f8e239e058 Merge remote-tracking branch 'apache/master' into filestream-fix
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
2014-01-13 23:57:27 -08:00
Tathagata Das 4e497db8f3 Removed StreamingContext.registerInputStream and registerOutputStream - they were unnecessary since InputDStream now registers itself. Also made DStream.register() private[streaming] - no need to expose this confusing function. Updated a lot of documentation. 2014-01-13 23:23:46 -08:00
Patrick Wendell fdaabdc673 Merge pull request #380 from mateiz/py-bayes
Add Naive Bayes to Python MLlib, and some API fixes

- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-13 23:08:26 -08:00
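A minimal sketch of the builder-style Scala API described in the Naive Bayes entry above. The package paths, the `setLambda`/`run` builder calls, and the array-based feature representation are assumptions that may differ across Spark versions; treat this as an illustration rather than the exact API.

```scala
// Hedged sketch: builder-style NaiveBayes usage, per the commit description above.
// Package paths and feature types are assumptions and may vary by Spark version.
import org.apache.spark.SparkContext
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint

object NaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "nb-sketch")
    // Toy training set: a label followed by a feature array.
    val training = sc.parallelize(Seq(
      LabeledPoint(0.0, Array(1.0, 0.0)),
      LabeledPoint(1.0, Array(0.0, 1.0))
    ))
    // Builder pattern: the smoothing parameter is set explicitly rather than
    // defaulted inside train().
    val model = new NaiveBayes().setLambda(1.0).run(training)
    println(model.predict(Array(0.0, 1.0)))
    sc.stop()
  }
}
```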
Patrick Wendell 4a805aff5e Merge pull request #367 from ankurdave/graphx
GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API that leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph-structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
- [x] Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
- [x] Mention future Bagel support in docs
- [ ] Section on caching/uncaching in docs: As with Spark, cache anything that is used more than once. In an iterative algorithm, cache and force (i.e., materialize) something every iteration, then uncache whatever depended on the newly materialized RDD but won't be referenced again (see the sketch after this entry).
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
2014-01-13 22:58:38 -08:00
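The caching/uncaching item in the checklist above can be made concrete with a short sketch. This uses plain RDDs rather than GraphX operators, and the update formula is purely illustrative.

```scala
// Hedged sketch of the per-iteration cache/force/uncache discipline described
// in the task list above, using plain RDDs (not GraphX code).
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object IterativeCachingSketch {
  // Stand-in for one iteration of an iterative algorithm (illustrative only).
  def step(ranks: RDD[(Long, Double)]): RDD[(Long, Double)] =
    ranks.map { case (id, r) => (id, 0.15 + 0.85 * r) }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "caching-sketch")
    var ranks = sc.parallelize((1L to 100L).map(id => (id, 1.0))).cache()
    ranks.count()              // force materialization of the initial state
    for (_ <- 1 to 10) {
      val prev = ranks
      ranks = step(prev).cache()
      ranks.count()            // materialize the new iteration...
      prev.unpersist(false)    // ...then uncache what it depended on
    }
    println(ranks.take(5).mkString(", "))
    sc.stop()
  }
}
```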
Patrick Wendell 945fe7a37e Merge pull request #408 from pwendell/external-serializers
Improvements to external sorting

1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 22:56:12 -08:00
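Item 2 above (batching the serialization so the read side never has to buffer an entire spill) is sketched below. The object name, batch size, and use of plain Java serialization are illustrative assumptions, not Spark's actual spill-file code.

```scala
// Hedged sketch of batched serialization: write records in fixed-size batches,
// each with its own serialization stream, so a reader can deserialize one batch
// at a time instead of holding the whole file in memory.
import java.io.{BufferedOutputStream, File, FileOutputStream, ObjectOutputStream}

object BatchedSpillSketch {
  val batchSize = 10000

  def writeBatches(records: Iterator[(String, Int)], file: File): Unit = {
    val out = new BufferedOutputStream(new FileOutputStream(file))
    try {
      records.grouped(batchSize).foreach { batch =>
        val oos = new ObjectOutputStream(out) // fresh stream per batch
        oos.writeInt(batch.size)              // batch header for the reader
        batch.foreach(oos.writeObject)
        oos.flush()
      }
    } finally {
      out.close()
    }
  }
}
```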
Joseph E. Gonzalez 4bafc4f41f adding documentation about EdgeRDD 2014-01-13 22:55:54 -08:00
Ankur Dave af645be5b8 Fix all code examples in guide 2014-01-13 22:29:45 -08:00
Ankur Dave 2cd9358ccf Finish 6f6f8c928c 2014-01-13 22:29:23 -08:00
Ankur Dave 6f6f8c928c Wrap methods in the appropriate class/object declaration 2014-01-13 21:55:35 -08:00
Ankur Dave 67795dbbfb Write Graph Builders section in guide 2014-01-13 21:45:11 -08:00
Ankur Dave e14a14bcde Remove K-Core and LDA sections from guide; they are unimplemented 2014-01-13 21:12:58 -08:00
Ankur Dave 59e4384e19 Fix Pregel SSSP example in programming guide 2014-01-13 21:02:38 -08:00
Joseph E. Gonzalez ee8931d2c6 Finished documenting VertexRDD. 2014-01-13 19:30:35 -08:00
Joseph E. Gonzalez 552de5d42e Finished second pass on pregel docs. 2014-01-13 18:40:43 -08:00
Joseph E. Gonzalez 622b7f7d39 Minor changes in graphx programming guide. 2014-01-13 18:40:43 -08:00
Joseph E. Gonzalez cfe4a29dcb Improvements in example code for the programming guide as well as adding serialization support for GraphImpl to address issues with failed closure capture. 2014-01-13 17:18:31 -08:00
Ankur Dave 1bd5cefcae Remove aggregateNeighbors 2014-01-13 17:03:03 -08:00
Reynold Xin e2d25d2dfe Merge branch 'master' into graphx 2014-01-13 16:21:26 -08:00
Ankur Dave 8038da2328 Merge pull request #2 from jegonzal/GraphXCCIssue
Improving documentation and identifying potential bug in CC calculation.
2014-01-13 14:59:30 -08:00
Ankur Dave 97cd27e31b Add graph loader links to doc 2014-01-13 14:54:48 -08:00
Ankur Dave 15ca89b11e Fix mapReduceTriplets links in doc 2014-01-13 14:54:33 -08:00
Joseph E. Gonzalez 80e4d98dc6 Improving documentation and identifying potential bug in CC calculation. 2014-01-13 13:40:16 -08:00
Patrick Wendell c3816de504 Changing option wording per discussion with Andrew 2014-01-13 13:25:06 -08:00
Patrick Wendell 5d61e051c2 Improvements to external sorting
1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.
2014-01-13 12:21:39 -08:00
Patrick Wendell b93f9d42f2 Merge pull request #400 from tdas/dstream-move
Moved DStream and PairDStream to org.apache.spark.streaming.dstream

Similar to the package location of `org.apache.spark.rdd.RDD`, `DStream` has been moved from `org.apache.spark.streaming.DStream` to `org.apache.spark.streaming.dstream.DStream`. I know that the package name is a little long, but I think it's better to keep it consistent with Spark's structure.

Also fixed persistence of windowed DStreams. The RDDs generated by a windowed DStream are essentially unions of the underlying RDDs, and persisting these union RDDs would store numerous copies of the underlying data. Instead, setting the persistence level on the windowed DStream now sets the persistence level of the underlying DStream.
2014-01-13 12:18:05 -08:00
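A minimal sketch of the relocated import path and the windowed-persistence behaviour described above; the host, port, and durations are placeholders.

```scala
// Hedged sketch: DStream now lives in org.apache.spark.streaming.dstream, and
// persisting a windowed stream sets the persistence level of the underlying
// stream instead of storing duplicate union RDDs.
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

object WindowPersistSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "window-sketch", Seconds(1))
    val lines: DStream[String] = ssc.socketTextStream("localhost", 9999)
    // Persist the windowed stream; per the description above, this caches the
    // underlying stream's RDDs once rather than copying them into union RDDs.
    val windowed = lines.window(Seconds(30), Seconds(10))
      .persist(StorageLevel.MEMORY_ONLY_SER)
    windowed.count().print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```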
Joseph E. Gonzalez 66c9d0092a Tested and corrected all examples up to mask in the graphx-programming-guide. 2014-01-12 22:11:13 -08:00
Ankur Dave 1efe78a101 Use GraphLoader for algorithms examples in doc 2014-01-12 22:03:03 -08:00
Tathagata Das 777c181d2f Merge remote-tracking branch 'apache/master' into dstream-move
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
2014-01-12 21:59:51 -08:00
Ankur Dave d691e9f47e Move algorithms to GraphOps 2014-01-12 21:47:16 -08:00
Ankur Dave 20c509b805 Add TriangleCount example 2014-01-12 21:41:32 -08:00
Patrick Wendell 0b96d85c20 Merge pull request #399 from pwendell/consolidate-off
Disable shuffle file consolidation by default

After running various performance tests for the 0.9 release, this still seems to have performance issues even on XFS. So let's keep it off by default for 0.9; users can experiment with it depending on their disk configurations.
2014-01-12 21:31:43 -08:00
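For users who do want to experiment with consolidation, a sketch of opting back in via `SparkConf`. The property name shown matches Spark releases of this era, but verify it against your version's configuration docs.

```scala
// Hedged sketch: re-enabling shuffle file consolidation for experimentation.
// The property name is believed correct for this era of Spark; check your docs.
import org.apache.spark.{SparkConf, SparkContext}

object ConsolidationOptIn {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[4]")
      .setAppName("consolidation-experiment")
      .set("spark.shuffle.consolidateFiles", "true") // off by default per the PR above
    val sc = new SparkContext(conf)
    // ...run a shuffle-heavy job here and compare disk/IO behavior...
    sc.stop()
  }
}
```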
Joseph E. Gonzalez c787ff5640 Documenting Pregel API 2014-01-12 20:49:52 -08:00
Patrick Wendell 2802cc80bc Disable shuffle file consolidation by default 2014-01-12 19:16:43 -08:00
Matei Zaharia 54d3486ee9 Fix Scala version in docs (it was printed as 2.1) 2014-01-12 17:49:59 -08:00
Patrick Wendell f4d77f8cb8 Rename DStream.foreach to DStream.foreachRDD
`foreachRDD` makes it clear that the granularity of this operator is per-RDD.
As it stands, `foreach` is inconsistent with `map`, `filter`, and the other
DStream operators which get pushed down to individual records within each RDD.
2014-01-12 17:21:00 -08:00
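A brief sketch contrasting the per-RDD granularity of `foreachRDD` with the per-record operators mentioned above; the host and port are placeholders.

```scala
// Hedged sketch: foreachRDD runs once per batch RDD, while map/filter/etc.
// apply to individual records within each RDD.
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ForeachRDDSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "foreachrdd-sketch", Seconds(1))
    val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
    words.foreachRDD { rdd =>
      // Whole-RDD granularity: e.g., summarize or export the batch in one shot.
      println(s"batch size: ${rdd.count()}")
    }
    ssc.start()
    ssc.awaitTermination()
  }
}
```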
Ankur Dave 7a4bb863c7 Add connected components example to doc 2014-01-12 16:58:18 -08:00
Ankur Dave 5e35d39e0f Add PageRank example and data 2014-01-12 13:10:53 -08:00
Tathagata Das 448aef6790 Moved DStream, DStreamCheckpointData and PairDStream from org.apache.spark.streaming to org.apache.spark.streaming.dstream. 2014-01-12 11:31:54 -08:00
Ankur Dave f096f4eaf1 Link methods in programming guide; document VertexID 2014-01-12 10:55:29 -08:00
Matei Zaharia 224f1a754a Update Python required version to 2.7, and mention MLlib support 2014-01-12 00:15:34 -08:00
Matei Zaharia 4c28a2bad8 Update some Python MLlib parameters to use camelCase, and tweak docs
We've used camelCase in other Spark methods, so it felt reasonable to
keep using it here and make the code match Scala/Java as much as
possible. Note that parameter names matter in Python because callers can
pass optional parameters by name.
2014-01-11 22:30:48 -08:00
Matei Zaharia 9a0dfdf868 Add Naive Bayes to Python MLlib, and some API fixes
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-11 22:30:48 -08:00
Joseph E. Gonzalez cf57b1b055 Correcting typos in documentation. 2014-01-11 17:13:10 -08:00
Joseph E. Gonzalez 64c4593586 Finished documenting join operators and revised some of the initial presentation. 2014-01-11 13:48:35 -08:00
Ankur Dave 732333d78e Remove GraphLab 2014-01-11 11:49:35 -08:00
Joseph E. Gonzalez fac44bbe2c Finished documenting structural operators and starting join operators. 2014-01-11 11:28:01 -08:00
Joseph E. Gonzalez 1f45e4e572 starting structural operator discussion. 2014-01-11 09:27:00 -08:00
Joseph E. Gonzalez 56a245c6bc Addressing comment about Graph Processing in docs. 2014-01-11 00:21:17 -08:00
Joseph E. Gonzalez 0c9d39bbaa More organizational changes and dropping the benchmark plot. 2014-01-11 00:09:08 -08:00
Joseph E. Gonzalez b8a44f12a5 More edits. 2014-01-10 23:52:24 -08:00