More edits.
This commit is contained in:
parent
362b9422e4
commit
b8a44f12a5
|
@ -11,6 +11,7 @@ title: GraphX Programming Guide
|
|||
title="GraphX Logo"
|
||||
alt="GraphX"
|
||||
width="65%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
# Overview
|
||||
|
@ -42,6 +43,7 @@ magnitude faster than more general *data-parallel* systems.
|
|||
title="Data-Parallel vs. Graph-Parallel"
|
||||
alt="Data-Parallel vs. Graph-Parallel"
|
||||
width="50%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
However, the same restrictions that enable these substantial performance gains
|
||||
|
@ -56,14 +58,15 @@ movement and duplication and a complicated programming model.
|
|||
title="Graph Analytics Pipeline"
|
||||
alt="Graph Analytics Pipeline"
|
||||
width="50%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
The goal of the GraphX project is to unify graph-parallel and data-parallel
|
||||
computation in one system with a single composable API. This goal is achieved
|
||||
through an API that enables users to view data both as a graph and as
|
||||
collections (i.e., RDDs) without data movement or duplication and by
|
||||
incorporating advances in graph-parallel systems to optimize the execution of
|
||||
operations on the graph view. In preliminary experiments we find that the GraphX
|
||||
computation in one system with a single composable API. The GraphX API
|
||||
enables users to view data both as a graph and as
|
||||
collection (i.e., RDDs) without data movement or duplication. By
|
||||
incorporating recent advances in graph-parallel systems, GraphX is able to optimize
|
||||
the execution of graph operations. In preliminary experiments we find that the GraphX
|
||||
system is able to achieve performance comparable to state-of-the-art
|
||||
graph-parallel systems while easily expressing the entire analytics pipelines.
|
||||
|
||||
|
@ -72,6 +75,7 @@ graph-parallel systems while easily expressing the entire analytics pipelines.
|
|||
title="GraphX Performance Comparison"
|
||||
alt="GraphX Performance Comparison"
|
||||
width="50%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
## GraphX Replaces the Spark Bagel API
|
||||
|
@ -86,21 +90,219 @@ support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
|
|||
Bagel users to explore the new GraphX API and comment on issues that may
|
||||
complicate the transition from Bagel.
|
||||
|
||||
# Getting Started
|
||||
|
||||
To get started you first need to import Spark and GraphX into your project. This can be done by
|
||||
importing the following:
|
||||
|
||||
{% highlight scala %}
|
||||
import org.apache.spark._
|
||||
import org.apache.spark.graphx._
|
||||
{% endhighlight %}
|
||||
|
||||
If you are not using the Spark shell you will also need a Spark context.
|
||||
|
||||
# The Property Graph
|
||||
<a name="property_graph"></a>
|
||||
|
||||
<p style="text-align: center;">
|
||||
<img src="img/edge_cut_vs_vertex_cut.png"
|
||||
title="Edge Cut vs. Vertex Cut"
|
||||
alt="Edge Cut vs. Vertex Cut"
|
||||
width="50%" />
|
||||
</p>
|
||||
The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed graph with
|
||||
user defined objects attached to each vertex and edge. Like RDDs, property graphs are immutable,
|
||||
distributed, and fault-tolerant. Vertices are keyed by their vertex identifier (`VertexId`) which is
|
||||
a unique 64-bit long. Similarly, edges have corresponding source and destination vertex identifiers.
|
||||
Unlike other systems, GraphX does not impose any ordering or constraints on the vertex identifiers.
|
||||
|
||||
The property graph is parameterized over the vertex `VD` and edge `ED` types. These are the types
|
||||
of the objects associated with each vertex and edge respectively. In some cases it can be desirable
|
||||
to have vertices of different types. However, this can be accomplished through inheritance.
|
||||
|
||||
> GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g.,
|
||||
> int, double, etc...) reducing the memory overhead of the graph representation.
|
||||
|
||||
Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
|
||||
properties for each vertex and edge:
|
||||
|
||||
{% highlight scala %}
|
||||
class Graph[VD: ClassTag, ED: ClassTag] {
|
||||
val vertices: RDD[(VertexId, VD)]
|
||||
val edges: RDD[Edge[ED]]
|
||||
// ...
|
||||
}
|
||||
{% endhighlight %}
|
||||
> Note that the vertices and edges of the graph are actually of type `VertexRDD[VD]` and
|
||||
> `EdgeRDD[ED]` respectively. These types extend and are optimized versions of `RDD[(VertexId, VD)]`
|
||||
> and `RDD[Edge[ED]]`.
|
||||
|
||||
For example, we might construct a property graph consisting of various collaborators on the GraphX
|
||||
project. The vertex property contains the username and occupation and the edge property contains
|
||||
a string describing the relationships between the users.
|
||||
|
||||
<p style="text-align: center;">
|
||||
<img src="img/property_graph.png"
|
||||
title="The Property Graph"
|
||||
alt="The Property Graph"
|
||||
width="50%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
The resulting graph would have the type signature:
|
||||
|
||||
{% highlight scala %}
|
||||
val userGraph: Graph[(String, String), String]
|
||||
{% endhighlight %}
|
||||
|
||||
There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
|
||||
generators and these are discussed in more detail in the section on
|
||||
[graph builders](#graph_builders). Probably the most general method is to use the
|
||||
[graph singleton](api/graphx/index.html#org.apache.spark.graphx.Graph$).
|
||||
For example the following code constructs a graph from a collection of RDDs:
|
||||
|
||||
{% highlight scala %}
|
||||
// Assume the SparkContext has already been constructed
|
||||
val sc: SparkContext
|
||||
// Create an RDD for the vertices
|
||||
val users: RDD[(VertexId, (String, String))] =
|
||||
sc.parallelize(Array((3, ("rxin", "student")), (7, ("jgonzal", "postdoc")),
|
||||
(5, ("franklin", "prof")), (2, ("istoica", "prof"))))
|
||||
// Create an RDD for edges
|
||||
val relationships: RDD[Edge[String]] =
|
||||
sc.parallelize(Array(Edge(3, 7, "collab"), Edge(5, 3, "advisor"),
|
||||
Edge(2, 5, "colleague"), Edge(5, 7, "pi"))
|
||||
// Define a default user in case there are relationship with missing user
|
||||
val defaultUser = ("John Doe", "Missing")
|
||||
// Build the initial Graph
|
||||
val graph = Graph(users, relationships, defaultUser)
|
||||
{% endhighlight %}
|
||||
|
||||
In the above example we make use of the [`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge)
|
||||
case class. Edges have a `srcId` and a `dstId` corresponding to the source and destination vertex
|
||||
identifiers. In addition, the `Edge` class contains the `attr` member which contains the edge
|
||||
property.
|
||||
|
||||
We can deconstruct a graph into the respective vertex and edge views by using the `graph.vertices`
|
||||
and `graph.edges` members respectively.
|
||||
|
||||
{% highlight scala %}
|
||||
val graph: Graph[(String, String), String] // Constructed from above
|
||||
// Count all users which are postdocs
|
||||
graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count
|
||||
// Count all the edges where src > dst
|
||||
graph.edges.filter(e => e.srcId > e.dstId).count
|
||||
{% endhighlight %}
|
||||
|
||||
> Note that `graph.vertices` returns an `RDD[(VertexId, (String, String))]` and so we must use the
|
||||
> scala `case` expression to deconstruct the tuple. Alternatively, `graph.edges` returns an `RDD`
|
||||
> containing `Edge[String]` objects. We could have also used the case class type constructor as
|
||||
> in the following:
|
||||
> {% highlight scala %}
|
||||
graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count
|
||||
{% endhighlight %}
|
||||
|
||||
In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
|
||||
The triplet view logically joins the vertex and edge properties yielding an `RDD[EdgeTriplet[VD,
|
||||
ED]]` consisting of [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet).
|
||||
This *join* can be expressed in the following SQL expression:
|
||||
|
||||
{% highlight sql %}
|
||||
SELECT src.id, dst.id, src.attr, e.attr, dst.attr
|
||||
FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst
|
||||
ON e.srcId = src.Id AND e.dstId = dst.Id
|
||||
{% endhighlight %}
|
||||
|
||||
or graphically as:
|
||||
|
||||
<p style="text-align: center;">
|
||||
<img src="img/triplet.png"
|
||||
title="Edge Triplet"
|
||||
alt="Edge Triplet"
|
||||
width="50%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
The [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class extends the
|
||||
[`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge) class by adding the `srcAttr` and
|
||||
`dstAttr` members which contain the source and destination properties respectively. We can use the
|
||||
triplet view of a graph to render a collection of strings describing relationships between users.
|
||||
|
||||
{% highlight scala %}
|
||||
val graph: Graph[(String, String), String] // Constructed from above
|
||||
// Use the triplets view to create an RDD of facts.
|
||||
val facts: RDD[String] =
|
||||
graph.triplets.map(et => et.srcAttr._1 + " is the " + et.attr + " of " et.dstAttr)
|
||||
{% endhighlight %}
|
||||
|
||||
# Graph Operators
|
||||
|
||||
Just as RDDs have basic operations like `map`, `filter`, and `reduceByKey`, property graphs also
|
||||
have a collection of basic operators that take user defined function and produce new graphs with
|
||||
transformed properties and structure.
|
||||
|
||||
## Property Operators
|
||||
|
||||
In direct analogy to the RDD `map` operator, the property
|
||||
graph contains the following:
|
||||
|
||||
{% highlight scala %}
|
||||
class Graph[VD, ED] {
|
||||
def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
|
||||
def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
|
||||
def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
|
||||
}
|
||||
{% endhighlight %}
|
||||
|
||||
Each of these operators yields a new graph with the vertex or edge properties modified by the user
|
||||
defined `map` function.
|
||||
|
||||
> Note that in all cases the graph structure is unaffected. This is a key feature of these
|
||||
> operators which allows the resulting graph to reuse the structural indicies and the unaffected
|
||||
> properties of the original graph.
|
||||
> While `graph.mapVertices(mapUDF)` is logically equivalent to the following, the following
|
||||
> does not preserve the structural indicies and would not benefit from the substantial system
|
||||
> optimizations in GraphX.
|
||||
> {% highlight scala %}
|
||||
val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr))}
|
||||
val newGraph = Graph(newVertices, graph.edges)
|
||||
{% endhighlight %}
|
||||
|
||||
These operators are often used to initialize the graph for a particular computation or project away
|
||||
unnecessary properties. For example, given a graph with the out-degrees as the vertex properties
|
||||
(we describe how to construct such a graph later) we initialize for PageRank:
|
||||
|
||||
{% highlight scala %}
|
||||
// Given a graph where the vertex property is the out-degree
|
||||
val inputGraph: Graph[Int, String]
|
||||
// Construct a graph where each edge contains the weight
|
||||
// and each vertex is the initial PageRank
|
||||
val outputGraph: Graph[Double, Double] =
|
||||
inputGraph.mapTriplets(et => 1.0/et.srcAttr).mapVertices(v => 1.0)
|
||||
{% endhighlight %}
|
||||
|
||||
## Structural Operators
|
||||
<a name="structural_operators"></a>
|
||||
|
||||
|
||||
## Map Reduce Triplets (mapReduceTriplets)
|
||||
<a name="mrTriplets"></a>
|
||||
|
||||
|
||||
# Graph Builders
|
||||
<a name="graph_builders"></a>
|
||||
|
||||
|
||||
{% highlight scala %}
|
||||
val userGraph: Graph[(String, String), String]
|
||||
{% endhighlight %}
|
||||
|
||||
|
||||
# Optimized Representation
|
||||
|
||||
The Property Graph is internally represented as a collection of RDDs
|
||||
|
||||
<p style="text-align: center;">
|
||||
<img src="img/edge_cut_vs_vertex_cut.png"
|
||||
title="Edge Cut vs. Vertex Cut"
|
||||
alt="Edge Cut vs. Vertex Cut"
|
||||
width="50%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
<p style="text-align: center;">
|
||||
|
@ -108,25 +310,22 @@ complicate the transition from Bagel.
|
|||
title="RDD Graph Representation"
|
||||
alt="RDD Graph Representation"
|
||||
width="50%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
|
||||
# Graph Operators
|
||||
|
||||
## Map Reduce Triplets (mapReduceTriplets)
|
||||
<a name="mrTriplets"></a>
|
||||
|
||||
# Graph Algorithms
|
||||
<a name="graph_algorithms"></a>
|
||||
|
||||
# Graph Builders
|
||||
<a name="graph_builders"></a>
|
||||
|
||||
<p style="text-align: center;">
|
||||
<img src="img/tables_and_graphs.png"
|
||||
title="Tables and Graphs"
|
||||
alt="Tables and Graphs"
|
||||
width="50%" />
|
||||
<!-- Images are downsized intentionally to improve quality on retina displays -->
|
||||
</p>
|
||||
|
||||
# Examples
|
||||
|
|
Binary file not shown.
Binary file not shown.
Before Width: | Height: | Size: 77 KiB After Width: | Height: | Size: 220 KiB |
BIN
docs/img/triplet.png
Normal file
BIN
docs/img/triplet.png
Normal file
Binary file not shown.
After Width: | Height: | Size: 31 KiB |
Loading…
Reference in a new issue