# GraphX: Unifying Graphs and Tables

GraphX extends the distributed fault-tolerant collections API and
interactive console of [Spark](http://spark.incubator.apache.org) with
a new graph API which leverages recent advances in graph systems
(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
interactively build, transform, and reason about graph-structured data
at scale.

## Motivation

From social networks and targeted advertising to protein modeling and
astrophysics, big graphs capture the structure in data and are central
to the recent advances in machine learning and data mining. Directly
applying existing *data-parallel* tools (e.g.,
[Hadoop](http://hadoop.apache.org) and
[Spark](http://spark.incubator.apache.org)) to graph computation tasks
can be cumbersome and inefficient. The need for intuitive, scalable
tools for graph computation has led to the development of new
*graph-parallel* systems (e.g., [Pregel](http://giraph.apache.org) and
[GraphLab](http://graphlab.org)) which are designed to efficiently
execute graph algorithms. Unfortunately, these systems do not address
the challenges of graph construction and transformation, and they
provide limited fault-tolerance and support for interactive analysis.

<p align="center">
  <img src="https://raw.github.com/amplab/graphx/master/docs/img/data_parallel_vs_graph_parallel.png" />
</p>

## Solution

The GraphX project combines the advantages of both data-parallel and
graph-parallel systems by efficiently expressing graph computation
within the [Spark](http://spark.incubator.apache.org) framework. We
leverage new ideas in distributed graph representation to efficiently
distribute graphs as tabular data structures. Similarly, we leverage
advances in data-flow systems to exploit in-memory computation and
fault-tolerance. We provide powerful new operations to simplify graph
construction and transformation. Using these primitives, we implement
the PowerGraph and Pregel abstractions in less than 20 lines of code.
Finally, by exploiting the Scala foundation of Spark, we enable users
to interactively load, transform, and compute on massive graphs.

<p align="center">
  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
</p>
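
The sketch below illustrates this unified view: a property graph is
assembled from two ordinary distributed tables, one of vertices and one
of edges, and both views remain available afterwards. This is a minimal
sketch using the `Graph`, `Edge`, and `VertexId` types from the GraphX
API; the data and names are illustrative.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sc = new SparkContext("local", "tables-and-graphs")

// Vertex table: (id, attribute) rows
val vertices: RDD[(VertexId, String)] =
  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))

// Edge table: (source id, destination id, attribute) rows
val edges: RDD[Edge[String]] =
  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(3L, 1L, "follows")))

// The graph view over the two tables
val graph: Graph[String, String] = Graph(vertices, edges)

// Tabular operations still apply to the underlying collections
println(graph.vertices.count() + " vertices, " + graph.edges.count() + " edges")
```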

## Examples

Suppose I want to build a graph from some text files, restrict the graph
to important relationships and users, run PageRank on the subgraph, and
finally return attributes associated with the top users. I can do all of
this in just a few lines with GraphX:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// Connect to the Spark cluster
val sc = new SparkContext("spark://master.amplab.org", "research")

// Load my user data and parse it into tuples of user id and attribute list
val users = sc.textFile("hdfs://user_attributes.tsv")
  .map(line => line.split("\t"))
  .map(parts => (parts.head.toLong, parts.tail))

// Parse the edge data, which is already in userId -> userId format
val followerGraph = Graph.textFile(sc, "hdfs://followers.tsv")

// Attach the user attributes
val graph = followerGraph.outerJoinVertices(users) {
  case (uid, deg, Some(attrList)) => attrList
  // Some users may not have attributes, so we set them as empty
  case (uid, deg, None) => Array.empty[String]
}

// Restrict the graph to users which have exactly two attributes
val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)

// Compute the PageRank
val pagerankGraph = Analytics.pagerank(subgraph)

// Get the attributes of the top pagerank users
val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
  case (uid, attrList, Some(pr)) => (pr, attrList)
  // Vertices missing from the PageRank result get a rank of zero
  case (uid, attrList, None) => (0.0, attrList)
}

println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))
```
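
The Solution section notes that abstractions like Pregel can be built
from these primitives in a few lines. As a hedged illustration of what
that buys you, here is single-source shortest paths written against a
Pregel-style operator with the signature
`Pregel(graph, initialMsg)(vprog, sendMsg, mergeMsg)`, reusing
`followerGraph` from the example above; the source vertex id is
illustrative.

```scala
// Treat every edge as distance 1 and start every vertex at infinity,
// except the (illustrative) source vertex
val sourceId: VertexId = 42L
val initialGraph = followerGraph
  .mapEdges(e => 1.0)
  .mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = Pregel(initialGraph, Double.PositiveInfinity)(
  // Vertex program: keep the smaller of the current and incoming distance
  (id, dist, newDist) => math.min(dist, newDist),
  // Send a message along an edge only if it offers a shorter path
  triplet =>
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else Iterator.empty,
  // Combine competing messages by taking the minimum
  (a, b) => math.min(a, b)
)

println(sssp.vertices.take(5).mkString("\n"))
```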

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the project webpage at
<http://spark.incubator.apache.org/documentation.html>. This README
file only contains basic setup instructions.

## Building

Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
project is built using Simple Build Tool (SBT), which is packaged with
it. To build Spark and its example programs, run:

    sbt/sbt assembly

Once you've built Spark, the easiest way to start using it is the
shell:

    ./spark-shell

Or, for the Python API, the Python shell (`./pyspark`).

Spark also comes with several sample programs in the `examples`
directory. To run one of them, use `./run-example <class>
<params>`. For example:

    ./run-example org.apache.spark.examples.SparkLR local[2]

will run the Logistic Regression example locally on 2 CPUs.

Each of the example programs prints usage help if no params are given.

All of the Spark samples take a `<master>` parameter that is the
cluster URL to connect to. This can be a `mesos://` or `spark://` URL,
`local` to run locally with one thread, or `local[N]` to run locally
with N threads.

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other
Hadoop-supported storage systems. Because the protocols have changed
in different versions of Hadoop, you must build Spark against the same
version that your cluster runs. You can change the version by setting
the `SPARK_HADOOP_VERSION` environment variable when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

    # Apache Hadoop 1.2.1
    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v1
    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

    # Apache Hadoop 2.0.5-alpha
    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v2
    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

For convenience, these variables may also be set through the
`conf/spark-env.sh` file described below.
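
For instance, to build against CDH 4.2.0 with YARN you might put the
following in `conf/spark-env.sh` (a minimal sketch; the values are
illustrative and should match your cluster):

    # conf/spark-env.sh
    SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0
    SPARK_YARN=true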

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
using Hadoop 1.2.1 and build your application using SBT, add this entry to
`libraryDependencies`:

    "org.apache.hadoop" % "hadoop-client" % "1.2.1"

If your project is built with Maven, add this to your POM file's
`<dependencies>` section:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>1.2.1</version>
    </dependency>

## Configuration

Please refer to the [Configuration
guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.

## Contributing to GraphX

Contributions via GitHub pull requests are gladly accepted from their
original author. Along with any pull requests, please state that the
contribution is your original work and that you license the work to
the project under the project's open source license. Whether or not
you state this explicitly, by submitting any copyrighted material via
pull request, email, or other means you agree to license the material
under the project's open source license and warrant that you have the
legal authority to do so.