[Doc][GraphX] Remove Motivation section and did some minor update.

2014-11-21 00:29:02 -08:00 · 2014-11-21 00:29:02 -08:00 · b97070ec78
parent 90a6a46bd1
commit b97070ec78
1 changed files with 7 additions and 70 deletions
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@ -57,77 +57,15 @@ title: GraphX Programming Guide
 # Overview
-GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high level,
+GraphX is a new component in Spark for graphs and graph-parallel computation. At a high level,
-GraphX extends the Spark [RDD](api/scala/index.html#org.apache.spark.rdd.RDD) by introducing the
+GraphX extends the Spark [RDD](api/scala/index.html#org.apache.spark.rdd.RDD) by introducing a
-[Resilient Distributed Property Graph](#property_graph): a directed multigraph with properties
+new [Graph](#property_graph) abstraction: a directed multigraph with properties
 attached to each vertex and edge.  To support graph computation, GraphX exposes a set of fundamental
 operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
-[aggregateMessages](#aggregateMessages)) as well as an optimized variant of the [Pregel](#pregel) API. In
+[aggregateMessages](#aggregateMessages)) as well as an optimized variant of the [Pregel](#pregel) API. In addition, GraphX includes a growing collection of graph [algorithms](#graph_algorithms) and
 addition, GraphX includes a growing collection of graph [algorithms](#graph_algorithms) and
 [builders](#graph_builders) to simplify graph analytics tasks.
 ## Motivation
 From social networks to language modeling, the growing scale and importance of
 graph data has driven the development of numerous new *graph-parallel* systems
 (e.g., [Giraph](http://giraph.apache.org) and
 [GraphLab](http://graphlab.org)).  By restricting the types of computation that can be
 expressed and introducing new techniques to partition and distribute graphs,
 these systems can efficiently execute sophisticated graph algorithms orders of
 magnitude faster than more general *data-parallel* systems.
 <p style="text-align: center;">
  <img src="img/data_parallel_vs_graph_parallel.png"
       title="Data-Parallel vs. Graph-Parallel"
       alt="Data-Parallel vs. Graph-Parallel"
       width="50%" />
  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 However, the same restrictions that enable these substantial performance gains also make it
 difficult to express many of the important stages in a typical graph-analytics pipeline:
 constructing the graph, modifying its structure, or expressing computation that spans multiple
 graphs.  Furthermore, how we look at data depends on our objectives and the same raw data may have
 many different table and graph views.
 <p style="text-align: center;">
  <img src="img/tables_and_graphs.png"
       title="Tables and Graphs"
       alt="Tables and Graphs"
       width="50%" />
  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 As a consequence, it is often necessary to be able to move between table and graph views.
 However, existing graph analytics pipelines must compose graph-parallel and data-
 parallel systems, leading to extensive data movement and duplication and a complicated programming
 model.
 <p style="text-align: center;">
  <img src="img/graph_analytics_pipeline.png"
       title="Graph Analytics Pipeline"
       alt="Graph Analytics Pipeline"
       width="50%" />
  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one
 system with a single composable API. The GraphX API enables users to view data both as a graph and
 as collections (i.e., RDDs) without data movement or duplication. By incorporating recent advances
 in graph-parallel systems, GraphX is able to optimize the execution of graph operations.
 <!-- ## GraphX Replaces the Spark Bagel API
 Prior to the release of GraphX, graph computation in Spark was expressed using Bagel, an
 implementation of Pregel.  GraphX improves upon Bagel by exposing a richer property graph API, a
 more streamlined version of the Pregel abstraction, and system optimizations to improve performance
 and reduce memory overhead.  While we plan to eventually deprecate Bagel, we will continue to
 support the [Bagel API](api/scala/index.html#org.apache.spark.bagel.package) and
 [Bagel programming guide](bagel-programming-guide.html). However, we encourage Bagel users to
 explore the new GraphX API and comment on issues that may complicate the transition from Bagel.
 -->
 ## Migrating from Spark 1.1
 GraphX in Spark {{site.SPARK_VERSION}} contains a few user facing API changes:
@ -174,7 +112,7 @@ identifiers.
 The property graph is parameterized over the vertex (`VD`) and edge (`ED`) types.  These
 are the types of the objects associated with each vertex and edge respectively.
-> GraphX optimizes the representation of vertex and edge types when they are plain old data types
+> GraphX optimizes the representation of vertex and edge types when they are primitive data types
 > (e.g., int, double, etc...) reducing the in memory footprint by storing them in specialized
 > arrays.
@ -791,14 +729,13 @@ Graphs are inherently recursive data structures as properties of vertices depend
 their neighbors which in turn depend on properties of *their* neighbors.  As a
 consequence many important graph algorithms iteratively recompute the properties of each vertex
 until a fixed-point condition is reached.  A range of graph-parallel abstractions have been proposed
-to express these iterative algorithms.  GraphX exposes a Pregel-like operator which is a fusion of
+to express these iterative algorithms.  GraphX exposes a variant of the Pregel API.
 the widely used Pregel and GraphLab abstractions.
 At a high level the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction
 *constrained to the topology of the graph*.  The Pregel operator executes in a series of super steps
 in which vertices receive the *sum* of their inbound messages from the previous super step, compute
 a new value for the vertex property, and then send messages to neighboring vertices in the next
-super step.  Unlike Pregel and instead more like GraphLab messages are computed in parallel as a
+super step.  Unlike Pregel, messages are computed in parallel as a
 function of the edge triplet and the message computation has access to both the source and
 destination vertex attributes.  Vertices that do not receive a message are skipped within a super
 step.  The Pregel operators terminates iteration and returns the final graph when there are no