Merge pull request #497 from tdas/docs-update
Updated Spark Streaming Programming Guide
Here is the updated version of the Spark Streaming Programming Guide. This is still a work in progress, but the major changes are in place, so feedback is most welcome.
In general, I have tried to make the guide easier to understand even for readers who do not know much about Spark. The updated website is hosted here -
http://www.eecs.berkeley.edu/~tdas/spark_docs/streaming-programming-guide.html
The major changes are:
- The overview illustrates the use cases of Spark Streaming - various input sources and various output sinks
- An example right after the overview to quickly give an idea of what a Spark Streaming program looks like
- Made the Java API and examples first-class citizens alongside Scala by using tabs to show both Scala and Java examples (similar to the AMP Camp tutorial's code tabs)
- Highlighted the DStream operations updateStateByKey and transform because of their powerful nature
- Updated the driver node failure recovery text to highlight automatic recovery in Spark standalone mode
- Added information about linking to and using external input sources like Kafka and Flume
- In general, reorganized the sections to better separate the basic material from the more advanced sections like Tuning and Recovery
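To give a flavor of the two highlighted operations, a stateful word count using updateStateByKey might look roughly like this (a sketch only, not taken from the guide; it assumes an existing StreamingContext `ssc` with checkpointing enabled and a `DStream[String]` named `words`):

```scala
// Sketch: a running per-word count maintained across batches.
// Assumes org.apache.spark.streaming._ is imported, `ssc.checkpoint(...)`
// has been called, and `words` is a DStream[String].
val pairs = words.map(word => (word, 1))

// For each key, fold the counts from the current batch into the running total.
val updateFunc = (newValues: Seq[Int], runningCount: Option[Int]) =>
  Some(newValues.sum + runningCount.getOrElse(0))

val runningCounts = pairs.updateStateByKey[Int](updateFunc)
runningCounts.print()
```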
Todos:
- Links to the API docs of the external sources like Kafka, Flume, etc.
- Illustrate window operation with figure as well as example.
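Until the window figure lands, the operation can be sketched in code (a sketch, assuming a `DStream[(String, Int)]` named `pairs`; the durations are illustrative):

```scala
// Sketch: reduce over a sliding window - count word occurrences over the
// last 30 seconds of data, recomputed every 10 seconds.
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b, // associative reduce function
  Seconds(30),               // window duration
  Seconds(10))               // slide duration
```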
Author: Tathagata Das <tathagata.das1565@gmail.com>
== Merge branch commits ==
commit 18ff10556570b39d672beeb0a32075215cfcc944
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Tue Jan 28 21:49:30 2014 -0800
Fixed a lot of broken links.
commit 34a5a6008dac2e107624c7ff0db0824ee5bae45f
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Tue Jan 28 18:02:28 2014 -0800
Updated github url to use SPARK_GITHUB_URL variable.
commit f338a60ae8069e0a382d2cb170227e5757cc0b7a
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Mon Jan 27 22:42:42 2014 -0800
More updates based on Patrick and Harvey's comments.
commit 89a81ff25726bf6d26163e0dd938290a79582c0f
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Mon Jan 27 13:08:34 2014 -0800
Updated docs based on Patrick's PR comments.
commit d5b6196b532b5746e019b959a79ea0cc013a8fc3
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Sun Jan 26 20:15:58 2014 -0800
Added spark.streaming.unpersist config and info on StreamingListener interface.
commit e3dcb46ab83d7071f611d9b5008ba6bc16c9f951
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Sun Jan 26 18:41:12 2014 -0800
Fixed docs on StreamingContext.getOrCreate.
commit 6c29524639463f11eec721e4d17a9d7159f2944b
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Thu Jan 23 18:49:39 2014 -0800
Added example and figure for window operations, and links to Kafka and Flume API docs.
commit f06b964a51bb3b21cde2ff8bdea7d9785f6ce3a9
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Wed Jan 22 22:49:12 2014 -0800
Fixed missing endhighlight tag in the MLlib guide.
commit 036a7d46187ea3f2a0fb8349ef78f10d6c0b43a9
Merge: eab351d a1cd185
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Wed Jan 22 22:17:42 2014 -0800
Merge remote-tracking branch 'apache/master' into docs-update
commit eab351d05c0baef1d4b549e1581310087158d78d
Author: Tathagata Das <tathagata.das1565@gmail.com>
Date: Wed Jan 22 22:17:15 2014 -0800
Update Spark Streaming Programming Guide.
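One of the commits above fixes the docs on StreamingContext.getOrCreate; for reference, its typical usage is along these lines (a sketch under assumed names — the checkpoint path and app name are hypothetical, and the code needs a Spark 0.9-era deployment to run):

```scala
// Sketch: a driver-fault-tolerant StreamingContext via getOrCreate.
// On first start the creating function runs; on restart after failure,
// the context is reconstructed from the checkpoint data instead.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("RecoverableApp") // hypothetical name
  val ssc = new StreamingContext(conf, Seconds(1))
  ssc.checkpoint("/tmp/checkpoint") // hypothetical checkpoint directory
  // ... set up input DStreams and transformations here ...
  ssc
}

val ssc = StreamingContext.getOrCreate("/tmp/checkpoint", createContext _)
ssc.start()
ssc.awaitTermination()
```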
@@ -8,3 +8,4 @@ SPARK_VERSION_SHORT: 0.9.0
 SCALA_VERSION: "2.10"
 MESOS_VERSION: 0.13.0
 SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
+SPARK_GITHUB_URL: https://github.com/apache/incubator-spark

@@ -82,6 +82,17 @@
 <li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
 <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
 <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph Processing)</a></li>
+<li class="divider"></li>
+<li class="dropdown-submenu">
+  <a tabindex="-1" href="#">External Data Sources</a>
+  <ul class="dropdown-menu">
+    <li><a href="api/external/kafka/index.html#org.apache.spark.streaming.kafka.KafkaUtils$">Kafka</a></li>
+    <li><a href="api/external/flume/index.html#org.apache.spark.streaming.flume.FlumeUtils$">Flume</a></li>
+    <li><a href="api/external/twitter/index.html#org.apache.spark.streaming.twitter.TwitterUtils$">Twitter</a></li>
+    <li><a href="api/external/zeromq/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$">ZeroMQ</a></li>
+    <li><a href="api/external/mqtt/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$">MQTT</a></li>
+  </ul>
+</li>
 </ul>
 </li>

@@ -20,7 +20,10 @@ include FileUtils

 if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1')
   # Build Scaladoc for Java/Scala
-  projects = ["core", "examples", "repl", "bagel", "graphx", "streaming", "mllib"]
+  core_projects = ["core", "examples", "repl", "bagel", "graphx", "streaming", "mllib"]
+  external_projects = ["flume", "kafka", "mqtt", "twitter", "zeromq"]
+
+  projects = core_projects + external_projects.map { |project_name| "external/" + project_name }

   puts "Moving to project root and building scaladoc."
   curr_dir = pwd

@@ -362,7 +362,16 @@ Apart from these, the following properties are also available, and may be useful
   <td>spark.streaming.blockInterval</td>
   <td>200</td>
   <td>
-    Duration (milliseconds) of how long to batch new objects coming from network receivers.
+    Duration (milliseconds) of how long to batch new objects coming from network receivers used
+    in Spark Streaming.
   </td>
 </tr>
+<tr>
+  <td>spark.streaming.unpersist</td>
+  <td>false</td>
+  <td>
+    Force RDDs generated and persisted by Spark Streaming to be automatically unpersisted from
+    Spark's memory. Setting this to true is likely to reduce Spark's RDD memory usage.
+  </td>
+</tr>
 <tr>

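The two streaming properties in the hunk above are normally set programmatically; a minimal sketch, assuming Spark 0.9-era APIs and an illustrative app name:

```scala
// Sketch: setting the streaming properties documented above via SparkConf.
val conf = new SparkConf()
  .setAppName("StreamingApp")                   // hypothetical name
  .set("spark.streaming.unpersist", "true")     // aggressively free old RDDs
  .set("spark.streaming.blockInterval", "100")  // batch received data every 100 ms
val ssc = new StreamingContext(conf, Seconds(1))
```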
@@ -87,20 +87,54 @@ a:hover code {
   max-width: 914px;
 }

+.dropdown-menu {
+  /* Remove the default 2px top margin which causes a small
+     gap between the hover trigger area and the popup menu */
+  margin-top: 0;
+  /* Avoid too much whitespace at the right for shorter menu items */
+  min-width: 50px;
+}
+
 /**
  * Make dropdown menus in nav bars show on hover instead of click
  * using solution at http://stackoverflow.com/questions/8878033/how-
  * to-make-twitter-bootstrap-menu-dropdown-on-hover-rather-than-click
  **/
-.dropdown-menu {
-  /* Remove the default 2px top margin which causes a small
-     gap between the hover trigger area and the popup menu */
-  margin-top: 0;
-}
 ul.nav li.dropdown:hover ul.dropdown-menu{
   display: block;
 }

 a.menu:after, .dropdown-toggle:after {
   content: none;
 }

+/** Make the submenus open on hover on the parent menu item */
+ul.nav li.dropdown ul.dropdown-menu li.dropdown-submenu:hover ul.dropdown-menu {
+  display: block;
+}
+
+/** Make the submenus be invisible until the parent menu item is hovered upon */
+ul.nav li.dropdown ul.dropdown-menu li.dropdown-submenu ul.dropdown-menu {
+  display: none;
+}
+
+/**
+ * Made the navigation bar buttons not grey out when clicked.
+ * Essentially making nav bar buttons not react to clicks, only hover events.
+ */
+.navbar .nav li.dropdown.open > .dropdown-toggle {
+  background-color: transparent;
+}
+
+/**
+ * Made the active tab caption blue. Otherwise the active tab is black, and inactive tab is blue.
+ * That looks weird. Changed the colors to active - blue, inactive - black, and
+ * no color change on hover.
+ */
+.nav-tabs > .active > a, .nav-tabs > .active > a:hover {
+  color: #08c;
+}
+
+.nav-tabs > li > a, .nav-tabs > li > a:hover {
+  color: #333;
+}

New binary files:
  docs/img/java-sm.png (670 B)
  docs/img/python-sm.png (1.4 KiB)
  docs/img/scala-sm.png (2.2 KiB)
  docs/img/streaming-arch.png (77 KiB)
  docs/img/streaming-dstream-ops.png (47 KiB)
  docs/img/streaming-dstream-window.png (40 KiB)
  docs/img/streaming-dstream.png (26 KiB)
  docs/img/streaming-figures.pptx
  docs/img/streaming-flow.png (31 KiB)

@@ -75,7 +75,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
 * [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
 * [Java Programming Guide](java-programming-guide.html): using Spark from Java
 * [Python Programming Guide](python-programming-guide.html): using Spark from Python
-* [Spark Streaming](streaming-programming-guide.html): using the alpha release of Spark Streaming
+* [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
 * [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
 * [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
 * [GraphX (Graphs on Spark)](graphx-programming-guide.html): Spark's new API for graphs

docs/js/main.js (106 lines changed)
@@ -1 +1,107 @@
+// From docs.scala-lang.org
+function styleCode() {
+  if (typeof disableStyleCode != "undefined") {
+    return;
+  }
+  $(".codetabs pre code").parent().each(function() {
+    if (!$(this).hasClass("prettyprint")) {
+      var lang = $(this).parent().data("lang");
+      if (lang == "python") {
+        lang = "py"
+      }
+      if (lang == "bash") {
+        lang = "bsh"
+      }
+      $(this).addClass("prettyprint lang-"+lang+" linenums");
+    }
+  });
+  console.log("runningPrettyPrint()")
+  prettyPrint();
+}
+
+
+function codeTabs() {
+  var counter = 0;
+  var langImages = {
+    "scala": "img/scala-sm.png",
+    "python": "img/python-sm.png",
+    "java": "img/java-sm.png"
+  };
+  $("div.codetabs").each(function() {
+    $(this).addClass("tab-content");
+
+    // Insert the tab bar
+    var tabBar = $('<ul class="nav nav-tabs" data-tabs="tabs"></ul>');
+    $(this).before(tabBar);
+
+    // Add each code sample to the tab bar:
+    var codeSamples = $(this).children("div");
+    codeSamples.each(function() {
+      $(this).addClass("tab-pane");
+      var lang = $(this).data("lang");
+      var image = $(this).data("image");
+      var notabs = $(this).data("notabs");
+      var capitalizedLang = lang.substr(0, 1).toUpperCase() + lang.substr(1);
+      var id = "tab_" + lang + "_" + counter;
+      $(this).attr("id", id);
+      if (image != null && langImages[lang]) {
+        var buttonLabel = "<img src='" +langImages[lang] + "' alt='" + capitalizedLang + "' />";
+      } else if (notabs == null) {
+        var buttonLabel = "<b>" + capitalizedLang + "</b>";
+      } else {
+        var buttonLabel = ""
+      }
+      tabBar.append(
+        '<li><a class="tab_' + lang + '" href="#' + id + '">' + buttonLabel + '</a></li>'
+      );
+    });
+
+    codeSamples.first().addClass("active");
+    tabBar.children("li").first().addClass("active");
+    counter++;
+  });
+  $("ul.nav-tabs a").click(function (e) {
+    // Toggling a tab should switch all tabs corresponding to the same language
+    // while retaining the scroll position
+    e.preventDefault();
+    var scrollOffset = $(this).offset().top - $(document).scrollTop();
+    $("." + $(this).attr('class')).tab('show');
+    $(document).scrollTop($(this).offset().top - scrollOffset);
+  });
+}
+
+function makeCollapsable(elt, accordionClass, accordionBodyId, title) {
+  $(elt).addClass("accordion-inner");
+  $(elt).wrap('<div class="accordion ' + accordionClass + '"></div>')
+  $(elt).wrap('<div class="accordion-group"></div>')
+  $(elt).wrap('<div id="' + accordionBodyId + '" class="accordion-body collapse"></div>')
+  $(elt).parent().before(
+    '<div class="accordion-heading">' +
+      '<a class="accordion-toggle" data-toggle="collapse" href="#' + accordionBodyId + '">' +
+        title +
+      '</a>' +
+    '</div>'
+  );
+}
+
+function viewSolution() {
+  var counter = 0
+  $("div.solution").each(function() {
+    var id = "solution_" + counter
+    makeCollapsable(this, "", id,
+      '<i class="icon-ok-sign" style="text-decoration: none; color: #0088cc">' +
+      '</i>' + "View Solution");
+    counter++;
+  });
+}
+
+
+$(document).ready(function() {
+  codeTabs();
+  viewSolution();
+  $('#chapter-toc').toc({exclude: '', context: '.container'});
+  $('#chapter-toc').prepend('<p class="chapter-toc-header">In This Chapter</p>');
+  makeCollapsable($('#global-toc'), "", "global-toc", "Show Table of Contents");
+  //styleCode();
+});

@@ -489,3 +489,4 @@ val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), k)
 val s = decomposed.S.data

 println("singular values = " + s.toArray.mkString)
+{% endhighlight %}
@@ -151,7 +151,7 @@ You can also pass an option `-c <numCores>` to control the number of cores that
 You may also run your application entirely inside of the cluster by submitting your application driver using the submission client. The syntax for submitting applications is as follows:

-    ./spark-class org.apache.spark.deploy.Client launch
+    ./bin/spark-class org.apache.spark.deploy.Client launch
        [client-options] \
        <cluster-url> <application-jar-url> <main-class> \
        [application-options]

@@ -176,7 +176,7 @@ Once you submit a driver program, it will appear in the cluster management UI at
 be assigned an identifier. If you'd like to prematurely terminate the program, you can do so using
 the same client:

-    ./spark-class org.apache.spark.deploy.client.DriverClient kill <driverId>
+    ./bin/spark-class org.apache.spark.deploy.Client kill <driverId>

 # Resource Scheduling