spark-instrumented-optimizer/docs
Andrew Or 79d07d6604 [SPARK-1132] Persisting Web UI through refactoring the SparkListener interface
The fleeting nature of the Spark Web UI has long been a problem reported by many users: The existing Web UI disappears as soon as the associated application terminates. This is because SparkUI is tightly coupled with SparkContext, and cannot be instantiated independently from it. To solve this, some state must be saved to persistent storage while the application is still running.

The approach taken by this PR involves persisting the UI state through SparkListenerEvents. This requires a major refactor of the SparkListener interface because existing events (1) maintain deep references, making de/serialization difficult, and (2) do not encode all the information displayed on the UI. In this design, each existing listener for the UI (e.g. ExecutorsListener) maintains state that can be fully constructed from SparkListenerEvents. This state is then supplied to the parent UI (e.g. ExecutorsUI), which renders the associated page(s) on demand.

This PR introduces two important classes: the **EventLoggingListener**, and the **ReplayListenerBus**. In a live application, SparkUI registers an EventLoggingListener with the SparkContext in addition to the existing listeners. Over the course of the application, this listener serializes and logs all events to persisted storage. Then, after the application has finished, the SparkUI can be revived by replaying all the logged events to the existing UI listeners through the ReplayListenerBus.

This feature is currently integrated with the Master Web UI, which optionally rebuilds a SparkUI from event logs as soon as the corresponding application finishes.

More details can be found in the commit messages, comments within the code, and the [design doc](https://spark-project.atlassian.net/secure/attachment/12900/PersistingSparkWebUI.pdf). Comments and feedback are most welcome.

Author: Andrew Or <andrewor14@gmail.com>
Author: andrewor14 <andrewor14@gmail.com>

Closes #42 from andrewor14/master and squashes the following commits:

e5f14fa [Andrew Or] Merge github.com:apache/spark
a1c5cd9 [Andrew Or] Merge github.com:apache/spark
b8ba817 [Andrew Or] Remove UI from map when removing application in Master
83af656 [Andrew Or] Scraps and pieces (no functionality change)
222adcd [Andrew Or] Merge github.com:apache/spark
124429f [Andrew Or] Clarify LiveListenerBus behavior + Add tests for new behavior
f80bd31 [Andrew Or] Simplify static handler and BlockManager status update logic
9e14f97 [Andrew Or] Moved around functionality + renamed classes per Patrick
6740e49 [Andrew Or] Fix comment nits
650eb12 [Andrew Or] Add unit tests + Fix bugs found through tests
45fd84c [Andrew Or] Remove now deprecated test
c5c2c8f [Andrew Or] Remove list of (TaskInfo, TaskMetrics) from StageInfo
3456090 [Andrew Or] Address Patrick's comments
bf80e3d [Andrew Or] Imports, comments, and code formatting, once again (minor)
ac69ec8 [Andrew Or] Fix test fail
d801d11 [Andrew Or] Merge github.com:apache/spark (major)
dc93915 [Andrew Or] Imports, comments, and code formatting (minor)
77ba283 [Andrew Or] Address Kay's and Patrick's comments
b6eaea7 [Andrew Or] Treating SparkUI as a handler of MasterUI
d59da5f [Andrew Or] Avoid logging all the blocks on each executor
d6e3b4a [Andrew Or] Merge github.com:apache/spark
ca258a4 [Andrew Or] Master UI - add support for reading compressed event logs
176e68e [Andrew Or] Fix deprecated message for JavaSparkContext (minor)
4f69c4a [Andrew Or] Master UI - Rebuild SparkUI on application finish
291b2be [Andrew Or] Correct directory in log message "INFO: Logging events to <dir>"
1ba3407 [Andrew Or] Add a few configurable options to event logging
e375431 [Andrew Or] Add new constructors for SparkUI
18b256d [Andrew Or] Refactor out event logging and replaying logic from UI
bb4c503 [Andrew Or] Use a more mnemonic path for logging
aef411c [Andrew Or] Fix bug: storage status was not reflected on UI in the local case
03eda0b [Andrew Or] Fix HDFS flush behavior
36b3e5d [Andrew Or] Add HDFS support for event logging
cceff2b [andrewor14] Fix 100 char format fail
2fee310 [Andrew Or] Address Patrick's comments
2981d61 [Andrew Or] Move SparkListenerBus out of DAGScheduler + Clean up
5d2cec1 [Andrew Or] JobLogger: ID -> Id
0503e4b [Andrew Or] Fix PySpark tests + remove sc.clearFiles/clearJars
4d2fb0c [Andrew Or] Fix format fail
faa113e [Andrew Or] General clean up
d47585f [Andrew Or] Clean up FileLogger
472fd8a [Andrew Or] Fix a couple of tests
996d7a2 [Andrew Or] Reflect RDD unpersist on UI
7b2f811 [Andrew Or] Guard against TaskMetrics NPE + Fix tests
d1f4285 [Andrew Or] Migrate from lift-json to json4s-jackson
28019ca [Andrew Or] Merge github.com:apache/spark
bbe3501 [Andrew Or] Embed storage status and RDD info in Task events
6631c02 [Andrew Or] More formatting changes, this time mainly for Json DSL
70e7e7a [Andrew Or] Formatting changes
e9e1c6d [Andrew Or] Move all JSON de/serialization logic to JsonProtocol
d646df6 [Andrew Or] Completely decouple SparkUI from SparkContext
6814da0 [Andrew Or] Explicitly register each UI listener rather than through some magic
64d2ce1 [Andrew Or] Fix BlockManagerUI bug by introducing new event
4273013 [Andrew Or] Add a gateway SparkListener to simplify event logging
904c729 [Andrew Or] Fix another major bug
5ac906d [Andrew Or] Mostly naming, formatting, and code style changes
3fd584e [Andrew Or] Fix two major bugs
f3fc13b [Andrew Or] General refactor
4dfcd22 [Andrew Or] Merge git://git.apache.org/incubator-spark into persist-ui
b3976b0 [Andrew Or] Add functionality of reconstructing a persisted UI from SparkContext
8add36b [Andrew Or] JobProgressUI: Add JSON functionality
d859efc [Andrew Or] BlockManagerUI: Add JSON functionality
c4cd480 [Andrew Or] Also deserialize new events
8a2ebe6 [Andrew Or] Fix bugs for EnvironmentUI and ExecutorsUI
de8a1cd [Andrew Or] Serialize events both to and from JSON (rather than just to)
bf0b2e9 [Andrew Or] ExecutorUI: Serialize events rather than arbitary executor information
bb222b9 [Andrew Or] ExecutorUI: render completely from JSON
dcbd312 [Andrew Or] Add JSON Serializability for all SparkListenerEvent's
10ed49d [Andrew Or] Merge github.com:apache/incubator-spark into persist-ui
8e09306 [Andrew Or] Use JSON for ExecutorsUI
e3ae35f [Andrew Or] Merge github.com:apache/incubator-spark
3ddeb7e [Andrew Or] Also privatize fields
090544a [Andrew Or] Privatize methods
13920c9 [Andrew Or] Update docs
bd5a1d7 [Andrew Or] Typo: phyiscal -> physical
287ef44 [Andrew Or] Avoid reading the entire batch into memory; also simplify streaming logic
3df7005 [Andrew Or] Merge branch 'master' of github.com:andrewor14/incubator-spark
a531d2e [Andrew Or] Relax assumptions on compressors and serializers when batching
164489d [Andrew Or] Relax assumptions on compressors and serializers when batching
2014-03-19 13:17:01 -07:00
_layouts Add Jekyll tag to isolate "production-only" doc components. 2014-03-02 18:19:01 -08:00
_plugins Add Jekyll tag to isolate "production-only" doc components. 2014-03-02 18:19:01 -08:00
css Merge pull request #552 from martinjaggi/master. Closes #552. 2014-02-08 11:39:13 -08:00
img Merge pull request #497 from tdas/docs-update 2014-01-28 21:51:05 -08:00
js SPARK-1135: fix broken anchors in docs 2014-02-26 11:20:16 -08:00
_config.yml Removed reference to incubation in Spark user docs. 2014-02-27 21:13:22 -08:00
api.md Soften wording about GraphX superseding Bagel 2014-01-10 23:48:32 -08:00
bagel-programming-guide.md Removed reference to incubation in Spark user docs. 2014-02-27 21:13:22 -08:00
building-with-maven.md SPARK-1064 2014-03-11 22:39:17 -07:00
cluster-overview.md SPARK-1183. Don't use "worker" to mean executor 2014-03-13 12:11:33 -07:00
configuration.md [SPARK-1132] Persisting Web UI through refactoring the SparkListener interface 2014-03-19 13:17:01 -07:00
contributing-to-spark.md Work in progress: 2013-09-08 00:29:11 -07:00
ec2-scripts.md fix persistent-hdfs 2013-11-01 17:47:37 -07:00
graphx-programming-guide.md SPARK-1183. Don't use "worker" to mean executor 2014-03-13 12:11:33 -07:00
hadoop-third-party-distributions.md Code review feedback 2014-01-05 22:05:30 -08:00
hardware-provisioning.md Change port from 3030 to 4040 2013-09-11 10:01:38 -07:00
index.md [Spark-1261] add instructions for running python examples to doc overview page 2014-03-17 17:35:51 -07:00
java-programming-guide.md [java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs 2014-03-03 22:31:30 -08:00
job-scheduling.md SPARK-1183. Don't use "worker" to mean executor 2014-03-13 12:11:33 -07:00
mllib-classification-regression.md SPARK-1183. Don't use "worker" to mean executor 2014-03-13 12:11:33 -07:00
mllib-clustering.md Merge pull request #552 from martinjaggi/master. Closes #552. 2014-02-08 11:39:13 -08:00
mllib-collaborative-filtering.md Merge pull request #552 from martinjaggi/master. Closes #552. 2014-02-08 11:39:13 -08:00
mllib-guide.md Merge pull request #552 from martinjaggi/master. Closes #552. 2014-02-08 11:39:13 -08:00
mllib-linear-algebra.md Merge pull request #552 from martinjaggi/master. Closes #552. 2014-02-08 11:39:13 -08:00
mllib-optimization.md Merge pull request #566 from martinjaggi/copy-MLlib-d. 2014-02-09 15:19:50 -08:00
monitoring.md SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues... 2014-03-11 11:16:59 -07:00
python-programming-guide.md SPARK-1183. Don't use "worker" to mean executor 2014-03-13 12:11:33 -07:00
quick-start.md [SPARK-1105] fix site scala version error in docs 2014-02-19 15:54:03 -08:00
README.md Add Jekyll tag to isolate "production-only" doc components. 2014-03-02 18:19:01 -08:00
running-on-mesos.md Updated docs for SparkConf and handled review comments 2013-12-30 22:17:28 -05:00
running-on-yarn.md SPARK-1183. Don't use "worker" to mean executor 2014-03-13 12:11:33 -07:00
scala-programming-guide.md Removed reference to incubation in Spark user docs. 2014-02-27 21:13:22 -08:00
security.md SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets 2014-03-06 18:27:50 -06:00
spark-debugger.md Removed reference to incubation in Spark user docs. 2014-02-27 21:13:22 -08:00
spark-standalone.md Revert "[SPARK-1150] fix repo location in create script" 2014-03-01 17:15:38 -08:00
streaming-custom-receivers.md Merge pull request #577 from hsaputra/fix_simple_streaming_doc. 2014-02-11 14:46:22 -08:00
streaming-programming-guide.md maintain arbitrary state data for each key 2014-03-09 22:42:12 -07:00
tuning.md SPARK-929: Fully deprecate usage of SPARK_MEM 2014-03-09 11:08:39 -07:00

Welcome to the Spark documentation!

This readme will walk you through navigating and building the Spark documentation, which is included here with the Spark source code. You can also find documentation specific to release versions of Spark at http://spark.apache.org/documentation.html.

Read on to learn more about viewing documentation in plain text (i.e., markdown) or building the documentation yourself. Why build it yourself? So that you have the docs that correspond to whichever version of Spark you currently have checked out of revision control.

Generating the Documentation HTML

We include the Spark documentation as part of the source (as opposed to using a hosted wiki, such as the GitHub wiki, as the definitive documentation) so that the documentation evolves along with the source code and is captured by revision control (currently git). This way the code automatically includes the version of the documentation that is relevant, regardless of which version or release you have checked out or downloaded.

In this directory you will find text files formatted using Markdown, with an ".md" suffix. You can read those text files directly if you want. Start with index.md.

The Markdown source can be compiled to HTML using the Jekyll tool. To use the jekyll command, you will need to have Jekyll installed. The easiest way to do this is via a Ruby Gem; see the Jekyll installation instructions. Compiling the site with Jekyll will create a directory called _site containing index.html as well as the rest of the compiled files.

You can modify the default Jekyll build as follows:

# Skip generating API docs (which takes a while)
$ SKIP_SCALADOC=1 jekyll build
# Serve content locally on port 4000
$ jekyll serve --watch
# Build the site with extra features used on the live page
$ PRODUCTION=1 jekyll build

Pygments

We also use pygments (http://pygments.org) for syntax highlighting in documentation markdown pages, so you will also need to install that (it requires Python) by running sudo easy_install Pygments.
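Before building, it can help to confirm that both prerequisites are reachable from your shell. The following is a minimal sketch, not part of the Spark build: `have_tool` is a hypothetical helper name, and it assumes that your Pygments install exposes its `pygmentize` command on the PATH.

```shell
#!/bin/sh
# Hypothetical helper (not part of Spark): succeeds if the named
# command can be found on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# jekyll builds the site; pygmentize is the command-line tool that
# ships with Pygments (assumed here to be on the PATH once installed).
for tool in jekyll pygmentize; do
    if have_tool "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done
```

If either tool is reported as missing, install it as described above before running jekyll.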

To mark a block of code in your markdown to be syntax highlighted by jekyll during the compile phase, use the following syntax:

{% highlight scala %}
// Your scala code goes here, you can replace scala with many other
// supported languages too.
{% endhighlight %}

API Docs (Scaladoc and Epydoc)

You can build just the Spark scaladoc by running sbt/sbt doc from the SPARK_PROJECT_ROOT directory.

Similarly, you can build just the PySpark epydoc by running epydoc --config epydoc.conf from the SPARK_PROJECT_ROOT/pyspark directory.

When you run jekyll in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run sbt/sbt doc before building the site, so if you haven't run it recently, it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using epydoc.

NOTE: To skip the step of building and copying over the Scala and Python API docs, run SKIP_API=1 jekyll build.
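
The `NAME=1 command` form used in this readme sets an environment variable for that single command only; it does not change your shell session. The sketch below illustrates just that shell behavior, with `sh -c` standing in for jekyll (an assumption made so the example runs without Jekyll installed).

```shell
#!/bin/sh
# Start from a clean state so the demonstration is predictable.
unset SKIP_API

# Without the prefix, the child process does not see the flag.
sh -c 'echo "SKIP_API=${SKIP_API:-unset}"'              # prints SKIP_API=unset

# With the prefix, the flag is visible to this one command only.
SKIP_API=1 sh -c 'echo "SKIP_API=${SKIP_API:-unset}"'   # prints SKIP_API=1

# The next command is unaffected, because the assignment was not exported.
sh -c 'echo "SKIP_API=${SKIP_API:-unset}"'              # prints SKIP_API=unset
```

To keep a flag set across several builds in the same session, export it instead (for example, `export SKIP_API=1`) and unset it when you are done.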