@@ -67,7 +67,7 @@
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph-Parallel Spark)</a></li>
</ul>
</li>

@@ -79,7 +79,7 @@
<li class="divider"></li>
<li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
<li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
<li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
<li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Parallel Spark)</a></li>
</ul>
</li>
@@ -1,63 +1,141 @@
---
layout: global
title: "GraphX: Unifying Graphs and Tables"
title: GraphX Programming Guide
---

* This will become a table of contents (this text will be scraped).
{:toc}

GraphX extends the distributed fault-tolerant collections API and
interactive console of [Spark](http://spark.incubator.apache.org) with
a new graph API which leverages recent advances in graph systems
(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
interactively build, transform, and reason about graph-structured data
at scale.

## Motivation

From social networks and targeted advertising to protein modeling and
astrophysics, big graphs capture the structure in data and are central
to recent advances in machine learning and data mining. Directly
applying existing *data-parallel* tools (e.g.,
[Hadoop](http://hadoop.apache.org) and
[Spark](http://spark.incubator.apache.org)) to graph computation tasks
can be cumbersome and inefficient. The need for intuitive, scalable
tools for graph computation has led to the development of new
*graph-parallel* systems (e.g.,
[Pregel](http://giraph.apache.org) and
[GraphLab](http://graphlab.org)) which are designed to efficiently
execute graph algorithms. Unfortunately, these systems do not address
the challenges of graph construction and transformation, and they provide
limited fault tolerance and support for interactive analysis.

{:.pagination-centered}
![Data-parallel vs. graph-parallel]({{ site.url }}/img/data_parallel_vs_graph_parallel.png)

## Solution

The GraphX project combines the advantages of both data-parallel and
graph-parallel systems by efficiently expressing graph computation
within the [Spark](http://spark.incubator.apache.org) framework. We
leverage new ideas in distributed graph representation to efficiently
distribute graphs as tabular data structures. Similarly, we leverage
advances in data-flow systems to exploit in-memory computation and
fault tolerance. We provide powerful new operations to simplify graph
construction and transformation. Using these primitives we implement
the PowerGraph and Pregel abstractions in less than 20 lines of code.
Finally, by exploiting the Scala foundation of Spark, we enable users
to interactively load, transform, and compute on massive graphs.

<p align="center">
  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
</p>

<p style="text-align: center;">
  <img src="img/graphx_logo.png"
       title="GraphX Logo"
       alt="GraphX"
       width="65%" />
</p>

## Examples

# Overview

GraphX is the new (alpha) Spark API for graphs and graph-parallel
computation. At a high level, GraphX extends the Spark
[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
introducing the [Resilient Distributed Property Graph (RDG)](#property_graph):
a directed graph with properties attached to each vertex and edge.
To support graph computation, GraphX exposes a set of functions
(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
APIs. In addition, GraphX includes a growing collection of graph
[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
graph analytics tasks.
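
As a hedged sketch of what constructing such a property graph might look like (the `Graph` and `Edge` names, the `org.apache.spark.graphx` package path, and the two-RDD constructor are assumptions about this alpha API, so exact signatures may differ):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._ // assumed package for Graph and Edge

// Hypothetical sketch. Each vertex carries a (name, role) property;
// each edge carries a relationship label.
val sc = new SparkContext("local", "graphx-sketch")

val users: RDD[(Long, (String, String))] =
  sc.parallelize(Seq(
    (3L, ("rxin", "student")),
    (7L, ("jgonzal", "postdoc"))))

val relationships: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(3L, 7L, "collab")))

// Combine the vertex and edge collections into a single property graph.
val graph: Graph[(String, String), String] = Graph(users, relationships)
```

The key idea is that both the vertex and edge collections remain ordinary Spark RDDs, so the same data can be viewed as tables or as a graph without duplication.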

## Background on Graph-Parallel Computation

From social networks to language modeling, the growing scale and importance of
graph data has driven the development of numerous new *graph-parallel* systems
(e.g., [Giraph](http://giraph.apache.org) and
[GraphLab](http://graphlab.org)). By restricting the types of computation that can be
expressed and introducing new techniques to partition and distribute graphs,
these systems can efficiently execute sophisticated graph algorithms orders of
magnitude faster than more general *data-parallel* systems.

<p style="text-align: center;">
  <img src="img/data_parallel_vs_graph_parallel.png"
       title="Data-Parallel vs. Graph-Parallel"
       alt="Data-Parallel vs. Graph-Parallel"
       width="50%" />
</p>

However, the same restrictions that enable these substantial performance gains
also make it difficult to express many of the important stages in a typical
graph-analytics pipeline: constructing the graph, modifying its structure, or
expressing computation that spans multiple graphs. As a consequence, existing
graph analytics pipelines compose graph-parallel and data-parallel systems,
leading to extensive data movement and duplication and a complicated
programming model.

<p style="text-align: center;">
  <img src="img/graph_analytics_pipeline.png"
       title="Graph Analytics Pipeline"
       alt="Graph Analytics Pipeline"
       width="50%" />
</p>

The goal of the GraphX project is to unify graph-parallel and data-parallel
computation in one system with a single composable API. This goal is achieved
through an API that enables users to view data both as a graph and as
collections (i.e., RDDs) without data movement or duplication, and by
incorporating advances in graph-parallel systems to optimize the execution of
operations on the graph view. In preliminary experiments we find that the GraphX
system is able to achieve performance comparable to state-of-the-art
graph-parallel systems while easily expressing entire analytics pipelines.

<p style="text-align: center;">
  <img src="img/graphx_performance_comparison.png"
       title="GraphX Performance Comparison"
       alt="GraphX Performance Comparison"
       width="50%" />
</p>

## GraphX Replaces the Spark Bagel API

Prior to the release of GraphX, graph computation in Spark was expressed using
Bagel, an implementation of the Pregel API. GraphX improves upon Bagel by exposing
a richer property graph API, a more streamlined version of the Pregel abstraction,
and system optimizations to improve performance and reduce memory
overhead. While we plan to eventually deprecate Bagel, we will continue to
support the API and the [Bagel programming guide](bagel-programming-guide.html). However,
we encourage Bagel users to explore the new GraphX API and comment on issues that may
complicate the transition from Bagel.

# The Property Graph
<a name="property_graph"></a>

<p style="text-align: center;">
  <img src="img/edge_cut_vs_vertex_cut.png"
       title="Edge Cut vs. Vertex Cut"
       alt="Edge Cut vs. Vertex Cut"
       width="50%" />
</p>

<p style="text-align: center;">
  <img src="img/property_graph.png"
       title="The Property Graph"
       alt="The Property Graph"
       width="50%" />
</p>

<p style="text-align: center;">
  <img src="img/vertex_routing_edge_tables.png"
       title="RDD Graph Representation"
       alt="RDD Graph Representation"
       width="50%" />
</p>

# Graph Operators

## Map Reduce Triplets (mapReduceTriplets)
<a name="mrTriplets"></a>

# Graph Algorithms
<a name="graph_algorithms"></a>

# Graph Builders
<a name="graph_builders"></a>

<p style="text-align: center;">
  <img src="img/tables_and_graphs.png"
       title="Tables and Graphs"
       alt="Tables and Graphs"
       width="50%" />
</p>

# Examples

Suppose I want to build a graph from some text files, restrict the graph
to important relationships and users, run PageRank on the subgraph, and
then finally return attributes associated with the top users. I can do
all of this in just a few lines with GraphX:

{% highlight scala %}
// Connect to the Spark cluster
val sc = new SparkContext("spark://master.amplab.org", "research")

@@ -89,108 +167,5 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){

println(userInfoWithPageRank.top(5))
{% endhighlight %}

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the project webpage at
<http://spark.incubator.apache.org/documentation.html>. This README
file only contains basic setup instructions.

## Building

Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
project is built using Simple Build Tool (SBT), which is packaged with
it. To build Spark and its example programs, run:

    sbt/sbt assembly

Once you've built Spark, the easiest way to start using it is the
shell:

    ./spark-shell

Or, for the Python API, the Python shell (`./pyspark`).

Spark also comes with several sample programs in the `examples`
directory. To run one of them, use `./run-example <class>
<params>`. For example:

    ./run-example org.apache.spark.examples.SparkLR local[2]

will run the Logistic Regression example locally on 2 CPUs.

Each of the example programs prints usage help if no params are given.

All of the Spark samples take a `<master>` parameter that is the
cluster URL to connect to. This can be a mesos:// or spark:// URL, or
"local" to run locally with one thread, or "local[N]" to run locally
with N threads.

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other
Hadoop-supported storage systems. Because the protocols have changed
in different versions of Hadoop, you must build Spark against the same
version that your cluster runs. You can change the version by setting
the `SPARK_HADOOP_VERSION` environment variable when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

    # Apache Hadoop 1.2.1
    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v1
    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

    # Apache Hadoop 2.0.5-alpha
    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v2
    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

For convenience, these variables may also be set through the
`conf/spark-env.sh` file described below.

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
using Hadoop 1.2.1 and build your application using SBT, add this entry to
`libraryDependencies`:

    "org.apache.hadoop" % "hadoop-client" % "1.2.1"

If your project is built with Maven, add this to your POM file's
`<dependencies>` section:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>1.2.1</version>
    </dependency>

## Configuration

Please refer to the [Configuration
guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.

## Contributing to GraphX

Contributions via GitHub pull requests are gladly accepted from their
original author. Along with any pull requests, please state that the
contribution is your original work and that you license the work to
the project under the project's open source license. Whether or not
you state this explicitly, by submitting any copyrighted material via
pull request, email, or other means you agree to license the material
under the project's open source license and warrant that you have the
legal authority to do so.

Binary files changed:

- (modified image; before 194 KiB, after 423 KiB)
- docs/img/edge_cut_vs_vertex_cut.png (new file; 78 KiB)
- docs/img/graph_analytics_pipeline.png (new file; 417 KiB)
- docs/img/graphx_figures.pptx (new file)
- docs/img/graphx_logo.png (new file; 39 KiB)
- docs/img/graphx_performance_comparison.png (new file; 162 KiB)
- docs/img/property_graph.png (new file; 77 KiB)
- (modified image; before 67 KiB, after 162 KiB)
- docs/img/vertex_routing_edge_tables.png (new file; 557 KiB)

@@ -5,7 +5,7 @@ title: Spark Overview

Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [Bagel](bagel-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

# Downloading

@@ -77,7 +77,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui

* [Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Streaming](streaming-programming-guide.html): using the alpha release of Spark Streaming
* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
* [GraphX (Graphs on Spark)](graphx-programming-guide.html): simple graph processing model

**API Docs:**

@@ -85,7 +85,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui

* [Spark for Python (Epydoc)](api/pyspark/index.html)
* [Spark Streaming for Java/Scala (Scaladoc)](api/streaming/index.html)
* [MLlib (Machine Learning) for Java/Scala (Scaladoc)](api/mllib/index.html)
* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html)
* [GraphX (Graphs on Spark) for Scala (Scaladoc)](api/graphx/index.html)

**Deployment guides:**