Apache Spark - A unified analytics engine for large-scale data processing
Andrew Or 863ec0cb4d [SPARK-6943] [SPARK-6944] DAG visualization on SparkUI
This patch adds the functionality to display the RDD DAG on the SparkUI.

This DAG describes the relationships between
- an RDD and its dependencies,
- an RDD and its operation scopes, and
- an RDD's operation scopes and the stage / job hierarchy

An operation scope here refers to the existing public APIs that created the RDDs (e.g. `textFile`, `treeAggregate`). In the future, we can expand this to include higher level operations like SQL queries.

*Note: This blatantly stole a few lines of HTML and JavaScript from #5547 (thanks shroffpradyumn!)*

Here's what the job page looks like:
<img src="https://issues.apache.org/jira/secure/attachment/12730286/job-page.png" width="700px"/>
and the stage page:
<img src="https://issues.apache.org/jira/secure/attachment/12730287/stage-page.png" width="300px"/>

Author: Andrew Or <andrew@databricks.com>

Closes #5729 from andrewor14/viz2 and squashes the following commits:

666c03b [Andrew Or] Round corners of RDD boxes on stage page (minor)
01ba336 [Andrew Or] Change RDD cache color to red (minor)
6f9574a [Andrew Or] Add tests for RDDOperationScope
1c310e4 [Andrew Or] Wrap a few more RDD functions in an operation scope
3ffe566 [Andrew Or] Restore "null" as default for RDD name
5fdd89d [Andrew Or] children -> child (minor)
0d07a84 [Andrew Or] Fix python style
afb98e2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
0d7aa32 [Andrew Or] Fix python tests
3459ab2 [Andrew Or] Fix tests
832443c [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
429e9e1 [Andrew Or] Display cached RDDs on the viz
b1f0fd1 [Andrew Or] Rename OperatorScope -> RDDOperationScope
31aae06 [Andrew Or] Extract visualization logic from listener
83f9c58 [Andrew Or] Implement a programmatic representation of operator scopes
5a7faf4 [Andrew Or] Rename references to viz scopes to viz clusters
ee33d52 [Andrew Or] Separate HTML generating code from listener
f9830a2 [Andrew Or] Refactor + clean up + document JS visualization code
b80cc52 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
0706992 [Andrew Or] Add link from jobs to stages
deb48a0 [Andrew Or] Translate stage boxes taking into account the width
5c7ce16 [Andrew Or] Connect RDDs across stages + update style
ab91416 [Andrew Or] Introduce visualization to the Job Page
5f07e9c [Andrew Or] Remove more return statements from scopes
5e388ea [Andrew Or] Fix line too long
43de96e [Andrew Or] Add parent IDs to StageInfo
6e2cfea [Andrew Or] Remove all return statements in `withScope`
d19c4da [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
7ef957c [Andrew Or] Fix scala style
4310271 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz2
aa868a9 [Andrew Or] Ensure that HadoopRDD is actually serializable
c3bfcae [Andrew Or] Re-implement scopes using closures instead of annotations
52187fc [Andrew Or] Rat excludes
09d361e [Andrew Or] Add ID to node label (minor)
71281fa [Andrew Or] Embed the viz in the UI in a toggleable manner
8dd5af2 [Andrew Or] Fill in documentation + miscellaneous minor changes
fe7816f [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz
205f838 [Andrew Or] Reimplement rendering with dagre-d3 instead of viz.js
5e22946 [Andrew Or] Merge branch 'master' of github.com:apache/spark into viz
6a7cdca [Andrew Or] Move RDD scope util methods and logic to its own file
494d5c2 [Andrew Or] Revert a few unintended style changes
9fac6f3 [Andrew Or] Re-implement scopes through annotations instead
f22f337 [Andrew Or] First working implementation of visualization with vis.js
2184348 [Andrew Or] Translate RDD information to dot file
5143523 [Andrew Or] Expose the necessary information in RDDInfo
a9ed4f9 [Andrew Or] Add a few missing scopes to certain RDD methods
6b3403b [Andrew Or] Scope all RDD methods
2015-05-04 16:24:35 -07:00
assembly [SPARK-7168] [BUILD] Update plugin versions in Maven build and centralize versions 2015-04-28 07:48:34 -04:00
bagel [SPARK-6758]block the right jetty package in log 2015-04-09 17:44:08 -04:00
bin Limit help option regex 2015-05-01 19:26:55 +01:00
build SPARK-5856: In Maven build script, launch Zinc with more memory 2015-02-17 10:10:01 -08:00
conf [SPARK-2691] [MESOS] Support for Mesos DockerInfo 2015-05-01 18:41:22 -07:00
core [SPARK-6943] [SPARK-6944] DAG visualization on SparkUI 2015-05-04 16:24:35 -07:00
data/mllib [SPARK-5939][MLLib] make FPGrowth example app take parameters 2015-02-23 08:47:28 -08:00
dev HOTFIX: Disable buggy dependency checker 2015-04-30 22:39:58 -07:00
docker [SPARK-2691] [MESOS] Support for Mesos DockerInfo 2015-05-01 18:41:22 -07:00
docs [SPARK-7302] [DOCS] SPARK building documentation still mentions building for yarn 0.23 2015-05-03 21:22:31 +01:00
ec2 [SPARK-4897] [PySpark] Python 3 support 2015-04-16 16:20:57 -07:00
examples [SPARK-5956] [MLLIB] Pipeline components should be copyable. 2015-05-04 11:29:13 -07:00
external [SPARK-2808][Streaming][Kafka] update kafka to 0.8.2 2015-05-01 17:54:56 -07:00
extras [SPARK-6440][CORE]Handle IPv6 addresses properly when constructing URI 2015-04-13 12:55:25 +01:00
graphx [SPARK-5854] personalized page rank 2015-05-01 11:55:43 -07:00
launcher [SPARK-7031] [THRIFTSERVER] let thrift server take SPARK_DAEMON_MEMORY and SPARK_DAEMON_JAVA_OPTS 2015-05-03 00:47:47 +01:00
mllib [SPARK-5956] [MLLIB] Pipeline components should be copyable. 2015-05-04 11:29:13 -07:00
network [SPARK-6229] Add SASL encryption to network library. 2015-05-01 19:01:46 -07:00
project [Build] Enable MiMa checks for SQL 2015-04-30 16:23:01 -07:00
python [SPARK-7319][SQL] Improve the output from DataFrame.show() 2015-05-04 13:24:52 -07:00
R [SPARK-7319][SQL] Improve the output from DataFrame.show() 2015-05-04 13:24:52 -07:00
repl [SPARK-7092] Update spark scala version to 2.11.6 2015-04-25 18:07:34 -04:00
sbin [SPARK-5338] [MESOS] Add cluster mode support for Mesos 2015-04-28 13:33:57 -07:00
sbt Adde LICENSE Header to build/mvn, build/sbt and sbt/sbt 2014-12-29 10:48:53 -08:00
sql [SPARK-7319][SQL] Improve the output from DataFrame.show() 2015-05-04 13:24:52 -07:00
streaming [SPARK-7315] [STREAMING] [TEST] Fix flaky WALBackedBlockRDDSuite 2015-05-02 01:53:14 -07:00
tools [SPARK-4550] In sort-based shuffle, store map outputs in serialized form 2015-04-30 23:14:14 -07:00
unsafe [SPARK-7288] Suppress compiler warnings due to use of sun.misc.Unsafe; add facade in front of Unsafe; remove use of Unsafe.setMemory 2015-04-30 15:21:00 -07:00
yarn [SPARK-5342] [YARN] Allow long running Spark apps to run on secure YARN/HDFS 2015-05-01 15:32:09 -05:00
.gitattributes [SPARK-3870] EOL character enforcement 2014-10-31 12:39:52 -07:00
.gitignore [SPARK-5654] Integrate SparkR 2015-04-08 22:45:40 -07:00
.rat-excludes [SPARK-6943] [SPARK-6944] DAG visualization on SparkUI 2015-05-04 16:24:35 -07:00
CONTRIBUTING.md [SPARK-6889] [DOCS] CONTRIBUTING.md updates to accompany contribution doc updates 2015-04-21 22:34:31 -07:00
LICENSE [SPARK-1406] Mllib pmml model export 2015-04-29 23:21:21 -07:00
make-distribution.sh [SPARK-7302] [DOCS] SPARK building documentation still mentions building for yarn 0.23 2015-05-03 21:22:31 +01:00
NOTICE SPARK-1827. LICENSE and NOTICE files need a refresh to contain transitive dependency info 2014-05-14 09:38:33 -07:00
pom.xml [SPARK-7302] [DOCS] SPARK building documentation still mentions building for yarn 0.23 2015-05-03 21:22:31 +01:00
README.md [docs] [SPARK-6306] Readme points to dead link 2015-03-12 15:01:33 +00:00
scalastyle-config.xml [SPARK-6428] Turn on explicit type checking for public methods. 2015-04-03 01:25:02 -07:00
tox.ini [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() 2014-08-26 16:57:40 -07:00

Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides high-level APIs in Scala, Java, and Python, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming for stream processing.

http://spark.apache.org/

Online Documentation

You can find the latest Spark documentation, including a programming guide, on the project web page and project wiki. This README file only contains basic setup instructions.

Building Spark

Spark is built using Apache Maven. To build Spark and its example programs, run:

mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.) More detailed documentation is available from the project site, at "Building Spark".

Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

./bin/spark-shell

Try the following command, which should return 1000:

scala> sc.parallelize(1 to 1000).count()

Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

./bin/pyspark

And run the following command, which should also return 1000:

>>> sc.parallelize(range(1000)).count()

Example Programs

Spark also comes with several sample programs in the examples directory. To run one of them, use ./bin/run-example <class> [params]. For example:

./bin/run-example SparkPi

will run the Pi example locally.
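
For readers curious about what a Pi example of this kind computes, the following plain-Python sketch estimates Pi by random sampling, assuming the usual Monte Carlo approach. It does not use Spark at all; the function name `estimate_pi` and its parameters are hypothetical, and the bundled example may differ in its details.

```python
import random


def estimate_pi(num_samples, seed=0):
    """Estimate Pi by sampling points in the square [-1, 1] x [-1, 1].

    The fraction of points that fall inside the unit circle approaches
    Pi / 4, so multiplying that fraction by 4 approximates Pi.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same estimate
    inside = 0
    for _ in range(num_samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples


pi_estimate = estimate_pi(100_000)
print("Pi is roughly", pi_estimate)
```

Each sample is independent of the others, which is why this kind of computation is easy to distribute across a cluster.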

You can set the MASTER environment variable when running examples to submit them to a cluster. It can be a mesos:// or spark:// URL, "yarn-cluster" or "yarn-client" to run on YARN, "local" to run locally with one thread, or "local[N]" to run locally with N threads. You can also use an abbreviated class name if the class is in the examples package. For instance:

MASTER=spark://host:7077 ./bin/run-example SparkPi
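
The MASTER forms listed above can be summarized with a small Python sketch. The helper `describe_master` is hypothetical and is not part of Spark; it recognizes only the forms named in this README, and Spark itself may accept others.

```python
import re


def describe_master(master):
    """Describe a MASTER value, covering only the forms named in this README."""
    # spark://host:port or mesos://host:port -> a standalone or Mesos cluster
    if master.startswith("spark://") or master.startswith("mesos://"):
        return "cluster at " + master.split("://", 1)[1]
    # yarn-cluster / yarn-client -> run on YARN in the given mode
    if master in ("yarn-cluster", "yarn-client"):
        return "YARN (" + master.split("-", 1)[1] + " mode)"
    # local -> a single local thread
    if master == "local":
        return "local, 1 thread"
    # local[N] -> N local threads
    match = re.fullmatch(r"local\[(\d+)\]", master)
    if match:
        return "local, " + match.group(1) + " threads"
    raise ValueError("unrecognized MASTER value: " + master)


for value in ("spark://host:7077", "yarn-client", "local", "local[4]"):
    print(value, "->", describe_master(value))
```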

Many of the example programs print usage help if no params are given.

Running Tests

Testing first requires building Spark. Once Spark is built, tests can be run using:

./dev/run-tests

Please see the guidance on how to run all automated tests.

A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported storage systems. Because the protocols have changed in different versions of Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at "Specifying the Hadoop Version" for detailed guidance on building for a particular distribution of Hadoop, including building for particular Hive and Hive Thriftserver distributions. See also "Third Party Hadoop Distributions" for guidance on building a Spark application that works with a particular distribution.

Configuration

Please refer to the Configuration guide in the online documentation for an overview of how to configure Spark.