More edits.

This commit is contained in:
Joseph E. Gonzalez 2014-01-10 23:52:04 -08:00
parent 362b9422e4
commit b8a44f12a5
4 changed files with 215 additions and 16 deletions


@ -11,6 +11,7 @@ title: GraphX Programming Guide
title="GraphX Logo"
alt="GraphX"
width="65%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
# Overview
@ -42,6 +43,7 @@ magnitude faster than more general *data-parallel* systems.
title="Data-Parallel vs. Graph-Parallel"
alt="Data-Parallel vs. Graph-Parallel"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
However, the same restrictions that enable these substantial performance gains
@ -56,14 +58,15 @@ movement and duplication and a complicated programming model.
title="Graph Analytics Pipeline"
alt="Graph Analytics Pipeline"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
The goal of the GraphX project is to unify graph-parallel and data-parallel
computation in one system with a single composable API. The GraphX API
enables users to view data both as a graph and as collections (i.e., RDDs)
without data movement or duplication. By incorporating recent advances in
graph-parallel systems, GraphX is able to optimize the execution of graph
operations. In preliminary experiments we find that the GraphX system is able
to achieve performance comparable to state-of-the-art graph-parallel systems
while easily expressing entire analytics pipelines.
@ -72,6 +75,7 @@ graph-parallel systems while easily expressing the entire analytics pipelines.
title="GraphX Performance Comparison"
alt="GraphX Performance Comparison"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
## GraphX Replaces the Spark Bagel API
@ -86,21 +90,219 @@ support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
Bagel users to explore the new GraphX API and comment on issues that may
complicate the transition from Bagel.
# Getting Started
To get started you first need to import Spark and GraphX into your project. This can be done as
follows:
{% highlight scala %}
import org.apache.spark._
import org.apache.spark.graphx._
// Several of the examples below also use the RDD type directly
import org.apache.spark.rdd.RDD
{% endhighlight %}
If you are not using the Spark shell, you will also need a `SparkContext`.
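As a minimal sketch (the master URL and application name below are placeholders, not prescribed values), a local-mode `SparkContext` can be created as follows:
{% highlight scala %}
// A minimal local-mode SparkContext; adapt the master URL and
// application name to your own deployment.
val sc = new SparkContext("local[4]", "GraphX Example")
{% endhighlight %}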
# The Property Graph
<a name="property_graph"></a>
<p style="text-align: center;">
<img src="img/edge_cut_vs_vertex_cut.png"
title="Edge Cut vs. Vertex Cut"
alt="Edge Cut vs. Vertex Cut"
width="50%" />
</p>
The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed graph with
user defined objects attached to each vertex and edge. Like RDDs, property graphs are immutable,
distributed, and fault-tolerant. Vertices are keyed by their vertex identifier (`VertexId`) which is
a unique 64-bit long. Similarly, edges have corresponding source and destination vertex identifiers.
Unlike other systems, GraphX does not impose any ordering or constraints on the vertex identifiers.
The property graph is parameterized over the vertex (`VD`) and edge (`ED`) types. These are the
types of the objects associated with each vertex and edge respectively. In some cases it may be
desirable to have vertices with different property types in the same graph; this can be
accomplished through inheritance.
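For example, one might model users and products as distinct vertex types that share a common base class. The class names in this sketch are purely illustrative and not part of the GraphX API:
{% highlight scala %}
class VertexProperty()
case class UserProperty(name: String) extends VertexProperty
case class ProductProperty(name: String, price: Double) extends VertexProperty
// The graph might then have the type:
var graph: Graph[VertexProperty, String] = null
{% endhighlight %}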
> GraphX optimizes the representation of `VD` and `ED` when they are plain old data types (e.g.,
> int, double, etc.), reducing the memory overhead of the graph representation.
Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
properties for each vertex and edge:
{% highlight scala %}
class Graph[VD: ClassTag, ED: ClassTag] {
  val vertices: RDD[(VertexId, VD)]
  val edges: RDD[Edge[ED]]
  // ...
}
{% endhighlight %}
> Note that the vertices and edges of the graph are actually of type `VertexRDD[VD]` and
> `EdgeRDD[ED]` respectively. These types extend `RDD[(VertexId, VD)]` and `RDD[Edge[ED]]` and
> provide additional functionality and internal optimizations.
For example, we might construct a property graph consisting of various collaborators on the GraphX
project. The vertex property contains the username and occupation and the edge property contains
a string describing the relationship between the users.
<p style="text-align: center;">
<img src="img/property_graph.png"
title="The Property Graph"
alt="The Property Graph"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
The resulting graph would have the type signature:
{% highlight scala %}
val userGraph: Graph[(String, String), String]
{% endhighlight %}
There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
generators; these are discussed in more detail in the section on
[graph builders](#graph_builders). Probably the most general method is to use the
[`Graph` singleton object](api/graphx/index.html#org.apache.spark.graphx.Graph$).
For example, the following code constructs a graph from a collection of RDDs:
{% highlight scala %}
// Assume the SparkContext has already been constructed
val sc: SparkContext
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
// Define a default user in case there are relationships with missing users
val defaultUser = ("John Doe", "Missing")
// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)
{% endhighlight %}
In the above example we make use of the [`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge)
case class. Edges have a `srcId` and a `dstId` corresponding to the source and destination vertex
identifiers. In addition, the `Edge` class has an `attr` member which stores the edge
property.
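For instance, a single (hypothetical) edge could be constructed and inspected as follows:
{% highlight scala %}
// A hypothetical edge from vertex 3 to vertex 7 carrying the property "collab"
val e = Edge(3L, 7L, "collab")
// e.srcId == 3L, e.dstId == 7L, e.attr == "collab"
{% endhighlight %}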
We can deconstruct a graph into the respective vertex and edge views by using the `graph.vertices`
and `graph.edges` members respectively.
{% highlight scala %}
val graph: Graph[(String, String), String] // Constructed from above
// Count all users which are postdocs
graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count
// Count all the edges where src > dst
graph.edges.filter(e => e.srcId > e.dstId).count
{% endhighlight %}
> Note that `graph.vertices` returns an `RDD[(VertexId, (String, String))]` and so we must use the
> Scala `case` expression to deconstruct the tuple. On the other hand, `graph.edges` returns an `RDD`
> containing `Edge[String]` objects. We could have also used the case class type constructor as
> in the following:
> {% highlight scala %}
graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count
{% endhighlight %}
In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
The triplet view logically joins the vertex and edge properties yielding an
`RDD[EdgeTriplet[VD, ED]]` consisting of instances of the
[`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class.
This *join* can be expressed in the following SQL expression:
{% highlight sql %}
SELECT src.id, dst.id, src.attr, e.attr, dst.attr
FROM edges AS e LEFT JOIN vertices AS src ON e.srcId = src.id
                LEFT JOIN vertices AS dst ON e.dstId = dst.id
{% endhighlight %}
or graphically as:
<p style="text-align: center;">
<img src="img/triplet.png"
title="Edge Triplet"
alt="Edge Triplet"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
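Conceptually, the same join can be sketched directly on the vertex and edge RDDs. The snippet below is an illustration only: GraphX materializes the triplet view internally using its optimized representation rather than these generic RDD joins, and the pair-RDD `join` operator assumes `import org.apache.spark.SparkContext._` is in scope.
{% highlight scala %}
// Attach the source attribute, then the destination attribute, to every edge
val manualTriplets: RDD[(VertexId, VertexId, (String, String), String, (String, String))] =
  graph.edges.map(e => (e.srcId, e))
    .join(graph.vertices)
    .map { case (_, (e, srcAttr)) => (e.dstId, (e, srcAttr)) }
    .join(graph.vertices)
    .map { case (_, ((e, srcAttr), dstAttr)) =>
      (e.srcId, e.dstId, srcAttr, e.attr, dstAttr)
    }
{% endhighlight %}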
The [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class extends the
[`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge) class by adding the `srcAttr` and
`dstAttr` members which contain the source and destination properties respectively. We can use the
triplet view of a graph to render a collection of strings describing relationships between users.
{% highlight scala %}
val graph: Graph[(String, String), String] // Constructed from above
// Use the triplets view to create an RDD of facts.
val facts: RDD[String] =
  graph.triplets.map(et => et.srcAttr._1 + " is the " + et.attr + " of " + et.dstAttr._1)
{% endhighlight %}
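These facts could then be collected and printed; this usage sketch assumes the graph is small enough to bring back to the driver:
{% highlight scala %}
// Collect the facts to the driver and print them
facts.collect.foreach(println(_))
{% endhighlight %}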
# Graph Operators
Just as RDDs have basic operations like `map`, `filter`, and `reduceByKey`, property graphs also
have a collection of basic operators that take user defined functions and produce new graphs with
transformed properties and structure.
## Property Operators
In direct analogy to the RDD `map` operator, the property
graph provides the following operators:
{% highlight scala %}
class Graph[VD, ED] {
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}
{% endhighlight %}
Each of these operators yields a new graph with the vertex or edge properties modified by the user
defined `map` function.
> Note that in all cases the graph structure is unaffected. This is a key feature of these
> operators which allows the resulting graph to reuse the structural indices and the unaffected
> properties of the original graph.
> While `graph.mapVertices(mapUdf)` is logically equivalent to the code below, the code below
> does not preserve the structural indices and so would not benefit from the substantial system
> optimizations in GraphX.
> {% highlight scala %}
val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr))}
val newGraph = Graph(newVertices, graph.edges)
{% endhighlight %}
These operators are often used to initialize the graph for a particular computation or to project
away unnecessary properties. For example, given a graph with the out-degrees as the vertex
properties (we describe how to construct such a graph later), we initialize the graph for PageRank:
{% highlight scala %}
// Given a graph where the vertex property is the out-degree
val inputGraph: Graph[Int, String]
// Construct a graph where each edge contains the weight
// and each vertex is the initial PageRank
val outputGraph: Graph[Double, Double] =
  inputGraph.mapTriplets(et => 1.0 / et.srcAttr).mapVertices((id, attr) => 1.0)
{% endhighlight %}
## Structural Operators
<a name="structural_operators"></a>
## Map Reduce Triplets (mapReduceTriplets)
<a name="mrTriplets"></a>
# Graph Builders
<a name="graph_builders"></a>
# Optimized Representation
The property graph is internally represented as a collection of RDDs.
<p style="text-align: center;">
<img src="img/edge_cut_vs_vertex_cut.png"
title="Edge Cut vs. Vertex Cut"
alt="Edge Cut vs. Vertex Cut"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
<p style="text-align: center;">
@ -108,25 +310,22 @@ complicate the transition from Bagel.
title="RDD Graph Representation"
alt="RDD Graph Representation"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
# Graph Operators
## Map Reduce Triplets (mapReduceTriplets)
<a name="mrTriplets"></a>
# Graph Algorithms
<a name="graph_algorithms"></a>
# Graph Builders
<a name="graph_builders"></a>
<p style="text-align: center;">
<img src="img/tables_and_graphs.png"
title="Tables and Graphs"
alt="Tables and Graphs"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
# Examples

Binary image files changed (not shown); docs/img/triplet.png added.