@@ -67,7 +67,7 @@
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph-Parallel Spark)</a></li>
</ul>
</li>

@@ -79,7 +79,7 @@
<li class="divider"></li>
<li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
<li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
<li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
<li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Parallel Spark)</a></li>
</ul>
</li>
@@ -1,63 +1,141 @@
---
layout: global
title: "GraphX: Unifying Graphs and Tables"
title: GraphX Programming Guide
---

* This will become a table of contents (this text will be scraped).
{:toc}

GraphX extends the distributed fault-tolerant collections API and
interactive console of [Spark](http://spark.incubator.apache.org) with
a new graph API which leverages recent advances in graph systems
(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
interactively build, transform, and reason about graph-structured data
at scale.

## Motivation

From social networks and targeted advertising to protein modeling and
astrophysics, big graphs capture the structure in data and are central
to recent advances in machine learning and data mining. Directly
applying existing *data-parallel* tools (e.g.,
[Hadoop](http://hadoop.apache.org) and
[Spark](http://spark.incubator.apache.org)) to graph computation tasks
can be cumbersome and inefficient. The need for intuitive, scalable
tools for graph computation has led to the development of new
*graph-parallel* systems (e.g.,
[Pregel](http://giraph.apache.org) and
[GraphLab](http://graphlab.org)) which are designed to efficiently
execute graph algorithms. Unfortunately, these systems do not address
the challenges of graph construction and transformation, and they provide
limited fault tolerance and support for interactive analysis.

{:.pagination-centered}
![Data-parallel vs. graph-parallel]({{ site.url }}/img/data_parallel_vs_graph_parallel.png)

## Solution

The GraphX project combines the advantages of both data-parallel and
graph-parallel systems by efficiently expressing graph computation
within the [Spark](http://spark.incubator.apache.org) framework. We
leverage new ideas in distributed graph representation to efficiently
distribute graphs as tabular data structures. Similarly, we leverage
advances in data-flow systems to exploit in-memory computation and
fault tolerance. We provide powerful new operations to simplify graph
construction and transformation. Using these primitives we implement
the PowerGraph and Pregel abstractions in less than 20 lines of code.
Finally, by exploiting the Scala foundation of Spark, we enable users
to interactively load, transform, and compute on massive graphs.

<p align="center">
  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
</p>

<p style="text-align: center;">
  <img src="img/graphx_logo.png"
       title="GraphX Logo"
       alt="GraphX"
       width="65%" />
</p>

## Examples

# Overview

GraphX is the new (alpha) Spark API for graphs and graph-parallel
computation. At a high level, GraphX extends the Spark
[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
introducing the [Resilient Distributed Property Graph (RDG)](#property_graph):
a directed graph with properties attached to each vertex and edge.
To support graph computation, GraphX exposes a set of functions
(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
APIs. In addition, GraphX includes a growing collection of graph
[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
graph analytics tasks.
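
As a hedged sketch of what constructing such a property graph might look like (the `Graph` and `Edge` names, the `org.apache.spark.graphx` package path, and the two-RDD constructor are assumptions about this alpha API, so exact signatures may differ):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._ // assumed package for Graph and Edge

// Hypothetical sketch. Each vertex carries a (name, role) property;
// each edge carries a relationship label.
val sc = new SparkContext("local", "graphx-sketch")

val users: RDD[(Long, (String, String))] =
  sc.parallelize(Seq(
    (3L, ("rxin", "student")),
    (7L, ("jgonzal", "postdoc"))))

val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(3L, 7L, "collab")))

// Combine the vertex and edge collections into a single property graph.
val graph: Graph[(String, String), String] = Graph(users, relationships)
```

The key idea is that both the vertex and edge collections remain ordinary Spark RDDs, so the same data can be viewed as tables or as a graph without duplication.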

## Background on Graph-Parallel Computation

From social networks to language modeling, the growing scale and importance of
graph data has driven the development of numerous new *graph-parallel* systems
(e.g., [Giraph](http://giraph.apache.org) and
[GraphLab](http://graphlab.org)). By restricting the types of computation that can be
expressed and introducing new techniques to partition and distribute graphs,
these systems can efficiently execute sophisticated graph algorithms orders of
magnitude faster than more general *data-parallel* systems.

<p style="text-align: center;">
  <img src="img/data_parallel_vs_graph_parallel.png"
       title="Data-Parallel vs. Graph-Parallel"
       alt="Data-Parallel vs. Graph-Parallel"
       width="50%" />
</p>

However, the same restrictions that enable these substantial performance gains
also make it difficult to express many of the important stages in a typical
graph-analytics pipeline: constructing the graph, modifying its structure, or
expressing computation that spans multiple graphs. As a consequence, existing
graph analytics pipelines compose graph-parallel and data-parallel systems,
leading to extensive data movement and duplication and a complicated
programming model.

<p style="text-align: center;">
  <img src="img/graph_analytics_pipeline.png"
       title="Graph Analytics Pipeline"
       alt="Graph Analytics Pipeline"
       width="50%" />
</p>

The goal of the GraphX project is to unify graph-parallel and data-parallel
computation in one system with a single composable API. This goal is achieved
through an API that enables users to view data both as a graph and as
collections (i.e., RDDs) without data movement or duplication, and by
incorporating advances in graph-parallel systems to optimize the execution of
operations on the graph view. In preliminary experiments we find that the GraphX
system is able to achieve performance comparable to state-of-the-art
graph-parallel systems while easily expressing entire analytics pipelines.

<p style="text-align: center;">
  <img src="img/graphx_performance_comparison.png"
       title="GraphX Performance Comparison"
       alt="GraphX Performance Comparison"
       width="50%" />
</p>

## GraphX Replaces the Spark Bagel API

Prior to the release of GraphX, graph computation in Spark was expressed using
Bagel, an implementation of the Pregel API. GraphX improves upon Bagel by exposing
a richer property graph API, a more streamlined version of the Pregel abstraction,
and system optimizations to improve performance and reduce memory
overhead. While we plan to eventually deprecate Bagel, we will continue to
support the API and the [Bagel programming guide](bagel-programming-guide.html). However,
we encourage Bagel users to explore the new GraphX API and comment on issues that may
complicate the transition from Bagel.

# The Property Graph
<a name="property_graph"></a>

<p style="text-align: center;">
  <img src="img/edge_cut_vs_vertex_cut.png"
       title="Edge Cut vs. Vertex Cut"
       alt="Edge Cut vs. Vertex Cut"
       width="50%" />
</p>

<p style="text-align: center;">
  <img src="img/property_graph.png"
       title="The Property Graph"
       alt="The Property Graph"
       width="50%" />
</p>

<p style="text-align: center;">
  <img src="img/vertex_routing_edge_tables.png"
       title="RDD Graph Representation"
       alt="RDD Graph Representation"
       width="50%" />
</p>

# Graph Operators

## Map Reduce Triplets (mapReduceTriplets)
<a name="mrTriplets"></a>

# Graph Algorithms
<a name="graph_algorithms"></a>

# Graph Builders
<a name="graph_builders"></a>

<p style="text-align: center;">
  <img src="img/tables_and_graphs.png"
       title="Tables and Graphs"
       alt="Tables and Graphs"
       width="50%" />
</p>

# Examples

Suppose I want to build a graph from some text files, restrict the graph
to important relationships and users, run PageRank on the subgraph, and
then finally return attributes associated with the top users. I can do
all of this in just a few lines with GraphX:

{% highlight scala %}
// Connect to the Spark cluster
val sc = new SparkContext("spark://master.amplab.org", "research")

@@ -89,108 +167,5 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){

println(userInfoWithPageRank.top(5))
{% endhighlight %}

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the project webpage at
<http://spark.incubator.apache.org/documentation.html>. This README
file only contains basic setup instructions.

## Building

Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
project is built using Simple Build Tool (SBT), which is packaged with
it. To build Spark and its example programs, run:

    sbt/sbt assembly

Once you've built Spark, the easiest way to start using it is the
shell:

    ./spark-shell

Or, for the Python API, the Python shell (`./pyspark`).

Spark also comes with several sample programs in the `examples`
directory. To run one of them, use `./run-example <class>
<params>`. For example:

    ./run-example org.apache.spark.examples.SparkLR local[2]

will run the Logistic Regression example locally on 2 CPUs.

Each of the example programs prints usage help if no params are given.

All of the Spark samples take a `<master>` parameter that is the
cluster URL to connect to. This can be a mesos:// or spark:// URL, or
"local" to run locally with one thread, or "local[N]" to run locally
with N threads.

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other
Hadoop-supported storage systems. Because the protocols have changed
in different versions of Hadoop, you must build Spark against the same
version that your cluster runs. You can change the version by setting
the `SPARK_HADOOP_VERSION` environment variable when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

    # Apache Hadoop 1.2.1
    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v1
    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

    # Apache Hadoop 2.0.5-alpha
    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v2
    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

For convenience, these variables may also be set through the
`conf/spark-env.sh` file described below.

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
using Hadoop 1.2.1 and build your application using SBT, add this entry to
`libraryDependencies`:

    "org.apache.hadoop" % "hadoop-client" % "1.2.1"

If your project is built with Maven, add this to your POM file's
`<dependencies>` section:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>1.2.1</version>
    </dependency>

## Configuration

Please refer to the [Configuration
guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.

## Contributing to GraphX

Contributions via GitHub pull requests are gladly accepted from their
original author. Along with any pull requests, please state that the
contribution is your original work and that you license the work to
the project under the project's open source license. Whether or not
you state this explicitly, by submitting any copyrighted material via
pull request, email, or other means you agree to license the material
under the project's open source license and warrant that you have the
legal authority to do so.

Binary files changed:

- (modified image; before 194 KiB, after 423 KiB)
- docs/img/edge_cut_vs_vertex_cut.png (new file; 78 KiB)
- docs/img/graph_analytics_pipeline.png (new file; 417 KiB)
- docs/img/graphx_figures.pptx (new file)
- docs/img/graphx_logo.png (new file; 39 KiB)
- docs/img/graphx_performance_comparison.png (new file; 162 KiB)
- docs/img/property_graph.png (new file; 77 KiB)
- (modified image; before 67 KiB, after 162 KiB)
- docs/img/vertex_routing_edge_tables.png (new file; 557 KiB)

@@ -5,7 +5,7 @@ title: Spark Overview

Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [Bagel](bagel-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

# Downloading

@@ -77,7 +77,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui

* [Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Streaming](streaming-programming-guide.html): using the alpha release of Spark Streaming
* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
* [GraphX (Graphs on Spark)](graphx-programming-guide.html): simple graph processing model

**API Docs:**

@@ -85,7 +85,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui

* [Spark for Python (Epydoc)](api/pyspark/index.html)
* [Spark Streaming for Java/Scala (Scaladoc)](api/streaming/index.html)
* [MLlib (Machine Learning) for Java/Scala (Scaladoc)](api/mllib/index.html)
* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html)
* [GraphX (Graphs on Spark) for Scala (Scaladoc)](api/graphx/index.html)

**Deployment guides:**