# GraphX: Unifying Graphs and Tables

GraphX extends the distributed fault-tolerant collections API and interactive console of [Spark](http://spark.incubator.apache.org) with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale.

## Motivation

From social networks and targeted advertising to protein modeling and astrophysics, big graphs capture the structure in data and are central to recent advances in machine learning and data mining. Directly applying existing *data-parallel* tools (e.g., [Hadoop](http://hadoop.apache.org) and [Spark](http://spark.incubator.apache.org)) to graph computation tasks can be cumbersome and inefficient. The need for intuitive, scalable tools for graph computation has led to the development of new *graph-parallel* systems (e.g., [Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)) which are designed to efficiently execute graph algorithms. Unfortunately, these systems do not address the challenges of graph construction and transformation, and they provide limited fault-tolerance and support for interactive analysis.
## Solution

The GraphX project combines the advantages of both data-parallel and graph-parallel systems by efficiently expressing graph computation within the [Spark](http://spark.incubator.apache.org) framework. We leverage new ideas in distributed graph representation to efficiently distribute graphs as tabular data structures. Similarly, we leverage advances in data-flow systems to exploit in-memory computation and fault-tolerance. We provide powerful new operations to simplify graph construction and transformation. Using these primitives we implement the PowerGraph and Pregel abstractions in less than 20 lines of code. Finally, by exploiting the Scala foundation of Spark, we enable users to interactively load, transform, and compute on massive graphs.
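To give a feel for the Pregel abstraction mentioned above, here is a minimal, single-machine sketch of the vertex-program / send-message / merge-message loop. This is plain Scala, not the GraphX API; the names `Edge`, `pregel`, `vprog`, `sendMsg`, and `mergeMsg` are illustrative assumptions, and GraphX's actual implementation distributes this loop over Spark RDDs.

```scala
// A toy, in-memory sketch of Pregel-style iteration (illustrative only).
// Each superstep: send messages along edges, merge messages per vertex,
// then update each vertex that received a message.
case class Edge(src: Int, dst: Int)

def pregel[V, M](
    vertices: Map[Int, V],
    edges: Seq[Edge],
    initialMsg: M,
    maxIter: Int)(
    vprog: (V, M) => V,                                // update a vertex with its merged message
    sendMsg: (Edge, Map[Int, V]) => Option[(Int, M)],  // message sent along an edge
    mergeMsg: (M, M) => M                              // combine messages bound for one vertex
): Map[Int, V] = {
  // Run the vertex program once on the initial message, as Pregel does.
  var state = vertices.map { case (id, v) => id -> vprog(v, initialMsg) }
  var iter = 0
  var msgs: Map[Int, M] = Map.empty
  do {
    // Gather and merge messages for each destination vertex.
    msgs = edges.flatMap(e => sendMsg(e, state))
      .groupBy(_._1)
      .map { case (id, ms) => id -> ms.map(_._2).reduce(mergeMsg) }
    // Apply the vertex program where a message arrived.
    state = state.map { case (id, v) =>
      id -> msgs.get(id).map(m => vprog(v, m)).getOrElse(v)
    }
    iter += 1
  } while (msgs.nonEmpty && iter < maxIter)
  state
}

// Example: propagate the maximum vertex value around a small cycle.
val verts = Map(1 -> 1, 2 -> 5, 3 -> 2)
val edges = Seq(Edge(1, 2), Edge(2, 3), Edge(3, 1))
val result = pregel(verts, edges, Int.MinValue, 10)(
  (v, m) => math.max(v, m),
  (e, s) => Some(e.dst -> s(e.src)),
  (a: Int, b: Int) => math.max(a, b)
)
// Every vertex converges to the maximum value in the graph, 5.
```

Expressed over Spark's distributed collections instead of a local `Map`, the same loop becomes a handful of joins and aggregations, which is why the abstraction fits in so few lines.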
## Examples

Suppose I want to build a graph from some text files, restrict the graph to important relationships and users, run PageRank on the subgraph, and then finally return attributes associated with the top users. I can do all of this in just a few lines with GraphX:

```scala
// Connect to the Spark cluster
val sc = new SparkContext("spark://master.amplab.org", "research")

// Load my user data and parse it into tuples of user id and attribute list
val users = sc.textFile("hdfs://user_attributes.tsv")
  .map(line => line.split('\t'))
  .map(parts => (parts.head, parts.tail))

// Parse the edge data, which is already in userId -> userId format
val followerGraph = Graph.textFile(sc, "hdfs://followers.tsv")

// Attach the user attributes
val graph = followerGraph.outerJoinVertices(users) {
  case (uid, deg, Some(attrList)) => attrList
  // Some users may not have attributes, so we set them as empty
  case (uid, deg, None) => Array.empty[String]
}

// Restrict the graph to users which have exactly two attributes
val subgraph = graph.subgraph((vid, attr) => attr.size == 2)

// Compute the PageRank
val pagerankGraph = Analytics.pagerank(subgraph)

// Get the attributes of the top pagerank users
val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
  case (uid, attrList, Some(pr)) => (pr, attrList)
  // Vertices absent from the pagerank result get a rank of zero
  case (uid, attrList, None) => (0.0, attrList)
}

// Print the five users with the highest PageRank
userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).foreach(println)
```

## Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project webpage.