2013-10-29 23:57:55 -04:00
|
|
|
# GraphX: Unifying Graph and Tables
|
2013-09-17 20:34:24 -04:00
|
|
|
|
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
GraphX extends the distributed fault-tolerant collections API and
|
|
|
|
interactive console of [Spark](http://spark.incubator.apache.org) with
|
|
|
|
a new graph API which leverages recent advances in graph systems
|
|
|
|
(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
|
|
|
|
interactively build, transform, and reason about graph structured data
|
|
|
|
at scale.
|
|
|
|
|
|
|
|
|
|
|
|
## Motivation
|
|
|
|
|
|
|
|
From social networks and targeted advertising to protein modeling and
|
|
|
|
astrophysics, big graphs capture the structure in data and are central
|
|
|
|
to the recent advances in machine learning and data mining. Directly
|
|
|
|
applying existing *data-parallel* tools (e.g.,
|
|
|
|
[Hadoop](http://hadoop.apache.org) and
|
|
|
|
[Spark](http://spark.incubator.apache.org)) to graph computation tasks
|
|
|
|
can be cumbersome and inefficient. The need for intuitive, scalable
|
|
|
|
tools for graph computation has lead to the development of new
|
|
|
|
*graph-parallel* systems (e.g.,
|
|
|
|
[Pregel](http://http://giraph.apache.org) and
|
|
|
|
[GraphLab](http://graphlab.org)) which are designed to efficiently
|
|
|
|
execute graph algorithms. Unfortunately, these systems do not address
|
|
|
|
the challenges of graph construction and transformation and provide
|
|
|
|
limited fault-tolerance and support for interactive analysis.
|
|
|
|
|
2013-10-30 00:06:29 -04:00
|
|
|
<p align="center">
|
|
|
|
<img src="https://raw.github.com/jegonzal/graphx/Documentation/docs/img/data_parallel_vs_graph_parallel.png" />
|
|
|
|
</p>
|
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
|
|
|
|
|
|
|
|
## Solution
|
|
|
|
|
|
|
|
The GraphX project combines the advantages of both data-parallel and
|
|
|
|
graph-parallel systems by efficiently expressing graph computation
|
|
|
|
within the [Spark](http://spark.incubator.apache.org) framework. We
|
|
|
|
leverage new ideas in distributed graph representation to efficiently
|
|
|
|
distribute graphs as tabular data-structures. Similarly, we leverage
|
|
|
|
advances in data-flow systems to exploit in-memory computation and
|
|
|
|
fault-tolerance. We provide powerful new operations to simplify graph
|
|
|
|
construction and transformation. Using these primitives we implement
|
|
|
|
the PowerGraph and Pregel abstractions in less than 20 lines of code.
|
|
|
|
Finally, by exploiting the Scala foundation of Spark, we enable users
|
|
|
|
to interactively load, transform, and compute on massive graphs.
|
2011-06-22 20:24:04 -04:00
|
|
|
|
2013-10-30 00:06:29 -04:00
|
|
|
<p align="center">
|
|
|
|
<img src="https://raw.github.com/jegonzal/graphx/Documentation/docs/img/tables_and_graphs.png" />
|
|
|
|
</p>
|
2011-06-22 20:24:04 -04:00
|
|
|
|
|
|
|
|
|
|
|
## Online Documentation
|
|
|
|
|
|
|
|
You can find the latest Spark documentation, including a programming
|
2013-10-29 23:57:55 -04:00
|
|
|
guide, on the project webpage at
|
|
|
|
<http://spark.incubator.apache.org/documentation.html>. This README
|
|
|
|
file only contains basic setup instructions.
|
2011-06-22 20:24:04 -04:00
|
|
|
|
|
|
|
|
|
|
|
## Building
|
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
|
|
|
|
project is built using Simple Build Tool (SBT), which is packaged with
|
|
|
|
it. To build Spark and its example programs, run:
|
2011-06-22 20:24:04 -04:00
|
|
|
|
2013-08-27 22:39:54 -04:00
|
|
|
sbt/sbt assembly
|
2011-06-22 20:24:04 -04:00
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
Once you've built Spark, the easiest way to start using it is the
|
|
|
|
shell:
|
2013-03-17 17:47:44 -04:00
|
|
|
|
2013-08-31 21:08:05 -04:00
|
|
|
./spark-shell
|
2011-06-22 20:24:04 -04:00
|
|
|
|
2013-08-31 21:08:05 -04:00
|
|
|
Or, for the Python API, the Python shell (`./pyspark`).
|
2011-06-22 20:24:04 -04:00
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
Spark also comes with several sample programs in the `examples`
|
|
|
|
directory. To run one of them, use `./run-example <class>
|
|
|
|
<params>`. For example:
|
2011-06-22 20:24:04 -04:00
|
|
|
|
2013-08-31 22:27:07 -04:00
|
|
|
./run-example org.apache.spark.examples.SparkLR local[2]
|
2011-06-22 20:24:04 -04:00
|
|
|
|
|
|
|
will run the Logistic Regression example locally on 2 CPUs.
|
|
|
|
|
|
|
|
Each of the example programs prints usage help if no params are given.
|
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
All of the Spark samples take a `<master>` parameter that is the
|
|
|
|
cluster URL to connect to. This can be a mesos:// or spark:// URL, or
|
|
|
|
"local" to run locally with one thread, or "local[N]" to run locally
|
|
|
|
with N threads.
|
2011-06-22 20:24:04 -04:00
|
|
|
|
|
|
|
|
2012-10-14 15:00:25 -04:00
|
|
|
## A Note About Hadoop Versions
|
2012-03-17 16:49:55 -04:00
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
Spark uses the Hadoop core library to talk to HDFS and other
|
|
|
|
Hadoop-supported storage systems. Because the protocols have changed
|
|
|
|
in different versions of Hadoop, you must build Spark against the same
|
|
|
|
version that your cluster runs. You can change the version by setting
|
|
|
|
the `SPARK_HADOOP_VERSION` environment when building Spark.
|
2013-08-21 17:51:56 -04:00
|
|
|
|
2013-08-21 20:12:03 -04:00
|
|
|
For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
|
2013-08-21 17:51:56 -04:00
|
|
|
versions without YARN, use:
|
|
|
|
|
|
|
|
# Apache Hadoop 1.2.1
|
2013-08-27 22:39:54 -04:00
|
|
|
$ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
|
2013-08-21 17:51:56 -04:00
|
|
|
|
|
|
|
# Cloudera CDH 4.2.0 with MapReduce v1
|
2013-08-27 22:39:54 -04:00
|
|
|
$ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
|
2013-08-21 17:51:56 -04:00
|
|
|
|
|
|
|
For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
|
2013-08-31 21:08:05 -04:00
|
|
|
with YARN, also set `SPARK_YARN=true`:
|
2013-08-21 17:51:56 -04:00
|
|
|
|
|
|
|
# Apache Hadoop 2.0.5-alpha
|
2013-08-31 21:08:05 -04:00
|
|
|
$ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
|
2013-08-21 17:51:56 -04:00
|
|
|
|
|
|
|
# Cloudera CDH 4.2.0 with MapReduce v2
|
2013-08-31 21:08:05 -04:00
|
|
|
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
|
2013-08-21 17:51:56 -04:00
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
For convenience, these variables may also be set through the
|
|
|
|
`conf/spark-env.sh` file described below.
|
2013-08-21 17:51:56 -04:00
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
When developing a Spark application, specify the Hadoop version by
|
|
|
|
adding the "hadoop-client" artifact to your project's
|
|
|
|
dependencies. For example, if you're using Hadoop 1.0.1 and build your
|
|
|
|
application using SBT, add this entry to `libraryDependencies`:
|
2013-08-21 17:51:56 -04:00
|
|
|
|
2013-08-22 00:15:00 -04:00
|
|
|
"org.apache.hadoop" % "hadoop-client" % "1.2.1"
|
2013-08-21 17:51:56 -04:00
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
If your project is built with Maven, add this to your POM file's
|
|
|
|
`<dependencies>` section:
|
2013-08-21 17:51:56 -04:00
|
|
|
|
|
|
|
<dependency>
|
|
|
|
<groupId>org.apache.hadoop</groupId>
|
|
|
|
<artifactId>hadoop-client</artifactId>
|
2013-08-31 21:08:05 -04:00
|
|
|
<version>1.2.1</version>
|
2013-08-21 17:51:56 -04:00
|
|
|
</dependency>
|
2012-03-17 16:49:55 -04:00
|
|
|
|
|
|
|
|
2011-06-22 20:24:04 -04:00
|
|
|
## Configuration
|
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
Please refer to the [Configuration
|
|
|
|
guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
|
2013-08-31 21:08:05 -04:00
|
|
|
in the online documentation for an overview on how to configure Spark.
|
2011-06-22 20:27:14 -04:00
|
|
|
|
2011-06-22 20:24:04 -04:00
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
## Contributing to GraphX
|
2011-06-22 20:27:14 -04:00
|
|
|
|
2013-10-29 23:57:55 -04:00
|
|
|
Contributions via GitHub pull requests are gladly accepted from their
|
|
|
|
original author. Along with any pull requests, please state that the
|
|
|
|
contribution is your original work and that you license the work to
|
|
|
|
the project under the project's open source license. Whether or not
|
|
|
|
you state this explicitly, by submitting any copyrighted material via
|
|
|
|
pull request, email, or other means you agree to license the material
|
|
|
|
under the project's open source license and warrant that you have the
|
|
|
|
legal authority to do so.
|
2013-09-17 20:34:24 -04:00
|
|
|
|