Adds Liquid variables to the docs templating system so that they can be used throughout the docs: SPARK_VERSION, SCALA_VERSION, and MESOS_VERSION.

To use them, write e.g. {{site.SPARK_VERSION}}.

Also removes uses of {{HOME_PATH}}, which were being resolved to "" by the templating system anyway.
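
As a concrete illustration (the sentence below is illustrative, not a line from any file in this commit), a docs page can now interpolate the configured versions directly:

    This documentation is for Spark {{site.SPARK_VERSION}}, which is built against Scala {{site.SCALA_VERSION}}.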
Andy Konwinski 2012-10-08 10:13:26 -07:00
parent efc5423210
commit 45d03231d0
14 changed files with 95 additions and 89 deletions

View file

@ -1,2 +1,8 @@
pygments: true pygments: true
markdown: kramdown markdown: kramdown
# These allow the documentation to be updated with new releases
# of Spark, Scala, and Mesos.
SPARK_VERSION: 0.6.0
SCALA_VERSION: 2.9.2
MESOS_VERSION: 0.9.0
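
Jekyll exposes every key defined in _config.yml under the site namespace, which is what the layout change below relies on. As a hypothetical further use (not part of this diff), a page could pin version-specific names to these values:

    Download mesos-{{site.MESOS_VERSION}}-incubating.tar.gz and build Spark {{site.SPARK_VERSION}} against Scala {{site.SCALA_VERSION}}.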

View file

@ -6,10 +6,10 @@
<head> <head>
<meta charset="utf-8"> <meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"> <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>{{ page.title }} - Spark 0.6.0 Documentation</title> <title>{{ page.title }} - Spark {{site.SPARK_VERSION}} Documentation</title>
<meta name="description" content=""> <meta name="description" content="">
<link rel="stylesheet" href="{{HOME_PATH}}css/bootstrap.min.css"> <link rel="stylesheet" href="css/bootstrap.min.css">
<style> <style>
body { body {
padding-top: 60px; padding-top: 60px;
@ -17,12 +17,12 @@
} }
</style> </style>
<meta name="viewport" content="width=device-width"> <meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="{{HOME_PATH}}css/bootstrap-responsive.min.css"> <link rel="stylesheet" href="css/bootstrap-responsive.min.css">
<link rel="stylesheet" href="{{HOME_PATH}}css/main.css"> <link rel="stylesheet" href="css/main.css">
<script src="{{HOME_PATH}}js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script> <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
<link rel="stylesheet" href="{{HOME_PATH}}css/pygments-default.css"> <link rel="stylesheet" href="css/pygments-default.css">
</head> </head>
<body> <body>
<!--[if lt IE 7]> <!--[if lt IE 7]>
@ -34,17 +34,17 @@
<div class="navbar navbar-fixed-top" id="topbar"> <div class="navbar navbar-fixed-top" id="topbar">
<div class="navbar-inner"> <div class="navbar-inner">
<div class="container"> <div class="container">
<a class="brand" href="{{HOME_PATH}}index.html"></a> <a class="brand" href="index.html"></a>
<ul class="nav"> <ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li somehow.--> <!--TODO(andyk): Add class="active" attribute to li somehow.-->
<li><a href="{{HOME_PATH}}index.html">Overview</a></li> <li><a href="index.html">Overview</a></li>
<li class="dropdown"> <li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu"> <ul class="dropdown-menu">
<li><a href="{{HOME_PATH}}quick-start.html">Quick Start</a></li> <li><a href="quick-start.html">Quick Start</a></li>
<li><a href="{{HOME_PATH}}scala-programming-guide.html">Scala</a></li> <li><a href="scala-programming-guide.html">Scala</a></li>
<li><a href="{{HOME_PATH}}java-programming-guide.html">Java</a></li> <li><a href="java-programming-guide.html">Java</a></li>
</ul> </ul>
</li> </li>
@ -53,15 +53,15 @@
<li class="dropdown"> <li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a> <a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu"> <ul class="dropdown-menu">
<li><a href="{{HOME_PATH}}ec2-scripts.html">Amazon EC2</a></li> <li><a href="ec2-scripts.html">Amazon EC2</a></li>
<li><a href="{{HOME_PATH}}spark-standalone.html">Standalone Mode</a></li> <li><a href="spark-standalone.html">Standalone Mode</a></li>
<li><a href="{{HOME_PATH}}running-on-mesos.html">Mesos</a></li> <li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="{{HOME_PATH}}running-on-yarn.html">YARN</a></li> <li><a href="running-on-yarn.html">YARN</a></li>
</ul> </ul>
</li> </li>
<li class="dropdown"> <li class="dropdown">
<a href="{{HOME_PATH}}api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a> <a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu"> <ul class="dropdown-menu">
<li><a href="configuration.html">Configuration</a></li> <li><a href="configuration.html">Configuration</a></li>
<li><a href="tuning.html">Tuning Guide</a></li> <li><a href="tuning.html">Tuning Guide</a></li>

View file

@ -5,6 +5,6 @@ title: Spark API documentation (Scaladoc)
Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt/sbt doc` from the Spark project home directory. Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt/sbt doc` from the Spark project home directory.
- [Core]({{HOME_PATH}}api/core/index.html) - [Core](api/core/index.html)
- [Examples]({{HOME_PATH}}api/examples/index.html) - [Examples](api/examples/index.html)
- [Bagel]({{HOME_PATH}}api/bagel/index.html) - [Bagel](api/bagel/index.html)

View file

@ -19,7 +19,7 @@ To write a Bagel application, you will need to add Spark, its dependencies, and
## Programming Model ## Programming Model
Bagel operates on a graph represented as a [distributed dataset]({{HOME_PATH}}scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages. Bagel operates on a graph represented as a [distributed dataset](scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to. For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.

View file

@ -23,7 +23,7 @@ the copy executable.
Inside `spark-env.sh`, you can set the following environment variables: Inside `spark-env.sh`, you can set the following environment variables:
* `SCALA_HOME` to point to your Scala installation. * `SCALA_HOME` to point to your Scala installation.
* `MESOS_NATIVE_LIBRARY` if you are [running on a Mesos cluster]({{HOME_PATH}}running-on-mesos.html). * `MESOS_NATIVE_LIBRARY` if you are [running on a Mesos cluster](running-on-mesos.html).
* `SPARK_MEM` to set the amount of memory used per node (this should be in the same format as the JVM's -Xmx option, e.g. `300m` or `1g`) * `SPARK_MEM` to set the amount of memory used per node (this should be in the same format as the JVM's -Xmx option, e.g. `300m` or `1g`)
* `SPARK_JAVA_OPTS` to add JVM options. This includes any system properties that you'd like to pass with `-D`. * `SPARK_JAVA_OPTS` to add JVM options. This includes any system properties that you'd like to pass with `-D`.
* `SPARK_CLASSPATH` to add elements to Spark's classpath. * `SPARK_CLASSPATH` to add elements to Spark's classpath.
@ -53,9 +53,9 @@ there are at least four properties that you will commonly want to control:
<td> <td>
Class to use for serializing objects that will be sent over the network or need to be cached Class to use for serializing objects that will be sent over the network or need to be cached
in serialized form. The default of Java serialization works with any Serializable Java object but is in serialized form. The default of Java serialization works with any Serializable Java object but is
quite slow, so we recommend <a href="{{HOME_PATH}}tuning.html">using <code>spark.KryoSerializer</code> quite slow, so we recommend <a href="tuning.html">using <code>spark.KryoSerializer</code>
and configuring Kryo serialization</a> when speed is necessary. Can be any subclass of and configuring Kryo serialization</a> when speed is necessary. Can be any subclass of
<a href="{{HOME_PATH}}api/core/index.html#spark.Serializer"><code>spark.Serializer</code></a>). <a href="api/core/index.html#spark.Serializer"><code>spark.Serializer</code></a>).
</td> </td>
</tr> </tr>
<tr> <tr>
@ -64,8 +64,8 @@ there are at least four properties that you will commonly want to control:
<td> <td>
If you use Kryo serialization, set this class to register your custom classes with Kryo. If you use Kryo serialization, set this class to register your custom classes with Kryo.
You need to set it to a class that extends You need to set it to a class that extends
<a href="{{HOME_PATH}}api/core/index.html#spark.KryoRegistrator"><code>spark.KryoRegistrator</code></a>). <a href="api/core/index.html#spark.KryoRegistrator"><code>spark.KryoRegistrator</code></a>).
See the <a href="{{HOME_PATH}}tuning.html#data-serialization">tuning guide</a> for more details. See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
</td> </td>
</tr> </tr>
<tr> <tr>
@ -81,8 +81,8 @@ there are at least four properties that you will commonly want to control:
<td>spark.cores.max</td> <td>spark.cores.max</td>
<td>(infinite)</td> <td>(infinite)</td>
<td> <td>
When running on a <a href="{{HOME_PATH}}spark-standalone.html">standalone deploy cluster</a> or a When running on a <a href="spark-standalone.html">standalone deploy cluster</a> or a
<a href="{{HOME_PATH}}running-on-mesos.html#mesos-run-modes">Mesos cluster in "coarse-grained" <a href="running-on-mesos.html#mesos-run-modes">Mesos cluster in "coarse-grained"
sharing mode</a>, how many CPU cores to request at most. The default will use all available cores. sharing mode</a>, how many CPU cores to request at most. The default will use all available cores.
</td> </td>
</tr> </tr>
@ -98,7 +98,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td> <td>false</td>
<td> <td>
If set to "true", runs over Mesos clusters in If set to "true", runs over Mesos clusters in
<a href="{{HOME_PATH}}running-on-mesos.html#mesos-run-modes">"coarse-grained" sharing mode</a>, <a href="running-on-mesos.html#mesos-run-modes">"coarse-grained" sharing mode</a>,
where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task.
This gives lower-latency scheduling for short queries, but leaves resources in use for the whole This gives lower-latency scheduling for short queries, but leaves resources in use for the whole
duration of the Spark job. duration of the Spark job.

View file

@ -12,7 +12,7 @@ The Spark team welcomes contributions in the form of GitHub pull requests. Here
* Always import packages using absolute paths (e.g. `scala.collection.Map` instead of `collection.Map`). * Always import packages using absolute paths (e.g. `scala.collection.Map` instead of `collection.Map`).
* No "infix" syntax for methods other than operators. For example, don't write `table containsKey myKey`; replace it with `table.containsKey(myKey)`. * No "infix" syntax for methods other than operators. For example, don't write `table containsKey myKey`; replace it with `table.containsKey(myKey)`.
- Make sure that your code passes the unit tests. You can run the tests with `sbt/sbt test` in the root directory of Spark. - Make sure that your code passes the unit tests. You can run the tests with `sbt/sbt test` in the root directory of Spark.
But first, make sure that you have [configured a spark-env.sh]({{HOME_PATH}}configuration.html) with at least But first, make sure that you have [configured a spark-env.sh](configuration.html) with at least
`SCALA_HOME`, as some of the tests try to spawn subprocesses using this. `SCALA_HOME`, as some of the tests try to spawn subprocesses using this.
- Add new unit tests for your code. We use [ScalaTest](http://www.scalatest.org/) for testing. Just add a new Suite in `core/src/test`, or methods to an existing Suite. - Add new unit tests for your code. We use [ScalaTest](http://www.scalatest.org/) for testing. Just add a new Suite in `core/src/test`, or methods to an existing Suite.
- If you'd like to report a bug but don't have time to fix it, you can still post it to our [issues page](https://github.com/mesos/spark/issues), or email the [mailing list](http://www.spark-project.org/mailing-lists.html). - If you'd like to report a bug but don't have time to fix it, you can still post it to our [issues page](https://github.com/mesos/spark/issues), or email the [mailing list](http://www.spark-project.org/mailing-lists.html).

View file

@ -106,7 +106,7 @@ This file needs to be copied to **every machine** to reflect the change. The eas
is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master, is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master,
then run `~/mesos-ec2/copy-dir /root/spark/conf` to RSYNC it to all the workers. then run `~/mesos-ec2/copy-dir /root/spark/conf` to RSYNC it to all the workers.
The [configuration guide]({{HOME_PATH}}configuration.html) describes the available configuration options. The [configuration guide](configuration.html) describes the available configuration options.
# Terminating a Cluster # Terminating a Cluster
@ -146,7 +146,7 @@ section.
- Support for spot instances is limited. - Support for spot instances is limited.
If you have a patch or suggestion for one of these limitations, feel free to If you have a patch or suggestion for one of these limitations, feel free to
[contribute]({{HOME_PATH}}contributing-to-spark.html) it! [contribute](contributing-to-spark.html) it!
# Using a Newer Spark Version # Using a Newer Spark Version

View file

@ -37,7 +37,7 @@ For example, `./run spark.examples.SparkPi` will run a sample program that estim
examples prints usage help if no params are given. examples prints usage help if no params are given.
Note that all of the sample programs take a `<master>` parameter specifying the cluster URL Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
to connect to. This can be a [URL for a distributed cluster]({{HOME_PATH}}scala-programming-guide.html#master-urls), to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing. `local` for testing.
@ -56,27 +56,27 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
**Quick start:** **Quick start:**
* [Spark Quick Start]({{HOME_PATH}}quick-start.html): a quick intro to the Spark API * [Spark Quick Start](quick-start.html): a quick intro to the Spark API
**Programming guides:** **Programming guides:**
* [Spark Programming Guide]({{HOME_PATH}}scala-programming-guide.html): how to get started using Spark, and details on the Scala API * [Spark Programming Guide](scala-programming-guide.html): how to get started using Spark, and details on the Scala API
* [Java Programming Guide]({{HOME_PATH}}java-programming-guide.html): using Spark from Java * [Java Programming Guide](java-programming-guide.html): using Spark from Java
**Deployment guides:** **Deployment guides:**
* [Running Spark on Amazon EC2]({{HOME_PATH}}ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes * [Running Spark on Amazon EC2](ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
* [Standalone Deploy Mode]({{HOME_PATH}}spark-standalone.html): launch a standalone cluster quickly without Mesos * [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without Mesos
* [Running Spark on Mesos]({{HOME_PATH}}running-on-mesos.html): deploy a private cluster using * [Running Spark on Mesos](running-on-mesos.html): deploy a private cluster using
[Apache Mesos](http://incubator.apache.org/mesos) [Apache Mesos](http://incubator.apache.org/mesos)
* [Running Spark on YARN]({{HOME_PATH}}running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN) * [Running Spark on YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
**Other documents:** **Other documents:**
* [Configuration]({{HOME_PATH}}configuration.html): customize Spark via its configuration system * [Configuration](configuration.html): customize Spark via its configuration system
* [Tuning Guide]({{HOME_PATH}}tuning.html): best practices to optimize performance and memory use * [Tuning Guide](tuning.html): best practices to optimize performance and memory use
* [API Docs (Scaladoc)]({{HOME_PATH}}api/core/index.html) * [API Docs (Scaladoc)](api/core/index.html)
* [Bagel]({{HOME_PATH}}bagel-programming-guide.html): an implementation of Google's Pregel on Spark * [Bagel](bagel-programming-guide.html): an implementation of Google's Pregel on Spark
* [Contributing to Spark](contributing-to-spark.html) * [Contributing to Spark](contributing-to-spark.html)
**External resources:** **External resources:**
@ -96,4 +96,4 @@ To get help using Spark or keep up with Spark development, sign up for the [spar
If you're in the San Francisco Bay Area, there's a regular [Spark meetup](http://www.meetup.com/spark-users/) every few weeks. Come by to meet the developers and other users. If you're in the San Francisco Bay Area, there's a regular [Spark meetup](http://www.meetup.com/spark-users/) every few weeks. Come by to meet the developers and other users.
Finally, if you'd like to contribute code to Spark, read [how to contribute]({{HOME_PATH}}contributing-to-spark.html). Finally, if you'd like to contribute code to Spark, read [how to contribute](contributing-to-spark.html).

View file

@ -5,14 +5,14 @@ title: Java Programming Guide
The Spark Java API exposes all the Spark features available in the Scala version to Java. The Spark Java API exposes all the Spark features available in the Scala version to Java.
To learn the basics of Spark, we recommend reading through the To learn the basics of Spark, we recommend reading through the
[Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html) first; it should be [Scala Programming Guide](scala-programming-guide.html) first; it should be
easy to follow even if you don't know Scala. easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Java. This guide will show how to use the Spark features described there in Java.
The Spark Java API is defined in the The Spark Java API is defined in the
[`spark.api.java`]({{HOME_PATH}}api/core/index.html#spark.api.java.package) package, and includes [`spark.api.java`](api/core/index.html#spark.api.java.package) package, and includes
a [`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) for a [`JavaSparkContext`](api/core/index.html#spark.api.java.JavaSparkContext) for
initializing Spark and [`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes, initializing Spark and [`JavaRDD`](api/core/index.html#spark.api.java.JavaRDD) classes,
which support the same methods as their Scala counterparts but take Java functions and return which support the same methods as their Scala counterparts but take Java functions and return
Java data and collection types. The main differences have to do with passing functions to RDD Java data and collection types. The main differences have to do with passing functions to RDD
operations (e.g. map) and handling RDDs of different types, as discussed next. operations (e.g. map) and handling RDDs of different types, as discussed next.
@ -23,12 +23,12 @@ There are a few key differences between the Java and Scala APIs:
* Java does not support anonymous or first-class functions, so functions must * Java does not support anonymous or first-class functions, so functions must
be implemented by extending the be implemented by extending the
[`spark.api.java.function.Function`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function), [`spark.api.java.function.Function`](api/core/index.html#spark.api.java.function.Function),
[`Function2`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function2), etc. [`Function2`](api/core/index.html#spark.api.java.function.Function2), etc.
classes. classes.
* To maintain type safety, the Java API defines specialized Function and RDD * To maintain type safety, the Java API defines specialized Function and RDD
classes for key-value pairs and doubles. For example, classes for key-value pairs and doubles. For example,
[`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD) [`JavaPairRDD`](api/core/index.html#spark.api.java.JavaPairRDD)
stores key-value pairs. stores key-value pairs.
* RDD methods like `collect()` and `countByKey()` return Java collections types, * RDD methods like `collect()` and `countByKey()` return Java collections types,
such as `java.util.List` and `java.util.Map`. such as `java.util.List` and `java.util.Map`.
@ -44,8 +44,8 @@ In the Scala API, these methods are automatically added using Scala's
[implicit conversions](http://www.scala-lang.org/node/130) mechanism. [implicit conversions](http://www.scala-lang.org/node/130) mechanism.
In the Java API, the extra methods are defined in the In the Java API, the extra methods are defined in the
[`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD) [`JavaPairRDD`](api/core/index.html#spark.api.java.JavaPairRDD)
and [`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD) and [`JavaDoubleRDD`](api/core/index.html#spark.api.java.JavaDoubleRDD)
classes. RDD methods like `map` are overloaded by specialized `PairFunction` classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by types. Common methods like `filter` and `sample` are implemented by
@ -76,9 +76,9 @@ class has a single abstract method, `call()`, that must be implemented.
# Other Features # Other Features
The Java API supports other Spark features, including The Java API supports other Spark features, including
[accumulators]({{HOME_PATH}}scala-programming-guide.html#accumulators), [accumulators](scala-programming-guide.html#accumulators),
[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast-variables), and [broadcast variables](scala-programming-guide.html#broadcast-variables), and
[caching]({{HOME_PATH}}scala-programming-guide.html#rdd-persistence). [caching](scala-programming-guide.html#rdd-persistence).
# Example # Example
@ -173,7 +173,7 @@ just a matter of style.
# Javadoc # Javadoc
We currently provide documentation for the Java API as Scaladoc, in the We currently provide documentation for the Java API as Scaladoc, in the
[`spark.api.java` package]({{HOME_PATH}}api/core/index.html#spark.api.java.package), because [`spark.api.java` package](api/core/index.html#spark.api.java.package), because
some of the classes are implemented in Scala. The main downside is that the types and function some of the classes are implemented in Scala. The main downside is that the types and function
definitions show Scala syntax (for example, `def reduce(func: Function2[T, T]): T` instead of definitions show Scala syntax (for example, `def reduce(func: Function2[T, T]): T` instead of
`T reduce(Function2<T, T> func)`). `T reduce(Function2<T, T> func)`).

View file

@ -8,7 +8,7 @@ title: Spark Quick Start
# Introduction # Introduction
This document provides a quick-and-dirty look at Spark's API. See the [programming guide]({{HOME_PATH}}/scala-programming-guide.html) for a complete reference. To follow along with this guide, you only need to have successfully [built spark]({{HOME_PATH}}) on one machine. Building Spark is as simple as running This document provides a quick-and-dirty look at Spark's API. See the [programming guide](scala-programming-guide.html) for a complete reference. To follow along with this guide, you only need to have successfully [built spark]() on one machine. Building Spark is as simple as running
{% highlight bash %} {% highlight bash %}
$ sbt/sbt package $ sbt/sbt package
@ -29,7 +29,7 @@ scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3 textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
{% endhighlight %} {% endhighlight %}
RDD's have _[actions]({{HOME_PATH}}/scala-programming-guide.html#actions)_, which return values, and _[transformations]({{HOME_PATH}}/scala-programming-guide.html#transformations)_, which return pointers to new RDD's. Let's start with a few actions: RDD's have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDD's. Let's start with a few actions:
{% highlight scala %} {% highlight scala %}
scala> textFile.count() // Number of items in this RDD scala> textFile.count() // Number of items in this RDD
@ -39,7 +39,7 @@ scala> textFile.first() // First item in this RDD
res1: String = # Spark res1: String = # Spark
{% endhighlight %} {% endhighlight %}
Now let's use a transformation. We will use the [filter]({{HOME_PATH}}/scala-programming-guide.html#transformations)() transformation to return a new RDD with a subset of the items in the file. Now let's use a transformation. We will use the [filter](scala-programming-guide.html#transformations)() transformation to return a new RDD with a subset of the items in the file.
{% highlight scala %} {% highlight scala %}
scala> val sparkLinesOnly = textFile.filter(line => line.contains("Spark")) scala> val sparkLinesOnly = textFile.filter(line => line.contains("Spark"))
@ -61,7 +61,7 @@ scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a < b) {b
res4: Long = 16 res4: Long = 16
{% endhighlight %} {% endhighlight %}
This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to [map]({{HOME_PATH}}/scala-programming-guide.html#transformations)() and [reduce]({{HOME_PATH}}/scala-programming-guide.html#actions)() are scala closures. We can easily include functions declared elsewhere, or include existing functions in our anonymous closures. For instance, we can use `Math.max()` to make this code easier to understand. This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to [map](scala-programming-guide.html#transformations)() and [reduce](scala-programming-guide.html#actions)() are scala closures. We can easily include functions declared elsewhere, or include existing functions in our anonymous closures. For instance, we can use `Math.max()` to make this code easier to understand.
{% highlight scala %} {% highlight scala %}
scala> import java.lang.Math; scala> import java.lang.Math;
@ -78,7 +78,7 @@ scala> val wordCountRDD = textFile.flatMap(line => line.split(" ")).map(word =>
wordCountRDD: spark.RDD[(java.lang.String, Int)] = spark.ShuffledAggregatedRDD@71f027b8 wordCountRDD: spark.RDD[(java.lang.String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
{% endhighlight %} {% endhighlight %}
Here, we combined the [flatMap]({{HOME_PATH}}/scala-programming-guide.html#transformations)(), [map]({{HOME_PATH}}/scala-programming-guide.html#transformations)() and [reduceByKey]({{HOME_PATH}}/scala-programming-guide.html#transformations)() transformations to create per-word counts in the file. To collect the word counts in our shell, we can use the [collect]({{HOME_PATH}}/scala-programming-guide.html#actions)() action: Here, we combined the [flatMap](scala-programming-guide.html#transformations)(), [map](scala-programming-guide.html#transformations)() and [reduceByKey](scala-programming-guide.html#transformations)() transformations to create per-word counts in the file. To collect the word counts in our shell, we can use the [collect](scala-programming-guide.html#actions)() action:
{% highlight scala %} {% highlight scala %}
scala> wordCountRDD.collect() scala> wordCountRDD.collect()
@ -158,7 +158,7 @@ $ sbt run
Lines with a: 8422, Lines with b: 1836 Lines with a: 8422, Lines with b: 1836
{% endhighlight %} {% endhighlight %}
This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode]({{HOME_PATH}}/spark-standalone.html) documentation and consider using a distributed input source, such as HDFS. This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation and consider using a distributed input source, such as HDFS.
# A Spark Job In Java # A Spark Job In Java
Now say we wanted to write custom job using the Spark API. We will walk through a simple job in both Scala (with sbt) and Java (with maven). If you using other build systems, please reference the Spark assembly jar in the developer guide. The first step is to publish Spark to our local Ivy/Maven repositories. From the Spark directory: Now say we wanted to write custom job using the Spark API. We will walk through a simple job in both Scala (with sbt) and Java (with maven). If you using other build systems, please reference the Spark assembly jar in the developer guide. The first step is to publish Spark to our local Ivy/Maven repositories. From the Spark directory:
@ -235,5 +235,5 @@ $ mvn exec:java -Dexec.mainClass="SimpleJob"
Lines with a: 8422, Lines with b: 1836 Lines with a: 8422, Lines with b: 1836
{% endhighlight %} {% endhighlight %}
This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode]({{HOME_PATH}}/spark-standalone.html) documentation and consider using a distributed input source, such as HDFS. This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation and consider using a distributed input source, such as HDFS.

View file

@ -5,7 +5,7 @@ title: Running Spark on Mesos
Spark can run on private clusters managed by the [Apache Mesos](http://incubator.apache.org/mesos/) resource manager. Follow the steps below to install Mesos and Spark: Spark can run on private clusters managed by the [Apache Mesos](http://incubator.apache.org/mesos/) resource manager. Follow the steps below to install Mesos and Spark:
1. Download and build Spark using the instructions [here]({{HOME_PATH}}index.html). 1. Download and build Spark using the instructions [here](index.html).
2. Download Mesos 0.9.0 from a [mirror](http://www.apache.org/dyn/closer.cgi/incubator/mesos/mesos-0.9.0-incubating/). 2. Download Mesos 0.9.0 from a [mirror](http://www.apache.org/dyn/closer.cgi/incubator/mesos/mesos-0.9.0-incubating/).
3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.macosx`, that you can run. See the README file in Mesos for other options. **Note:** If you want to run Mesos without installing it into the default paths on your system (e.g. if you don't have administrative privileges to install it), you should also pass the `--prefix` option to `configure` to tell it where to install. For example, pass `--prefix=/home/user/mesos`. By default the prefix is `/usr/local`. 3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.macosx`, that you can run. See the README file in Mesos for other options. **Note:** If you want to run Mesos without installing it into the default paths on your system (e.g. if you don't have administrative privileges to install it), you should also pass the `--prefix` option to `configure` to tell it where to install. For example, pass `--prefix=/home/user/mesos`. By default the prefix is `/usr/local`.
4. Build Mesos using `make`, and then install it using `make install`. 4. Build Mesos using `make`, and then install it using `make install`.
@ -24,7 +24,7 @@ Spark can run on private clusters managed by the [Apache Mesos](http://incubator
new SparkContext("mesos://HOST:5050", "My Job Name", "/home/user/spark", List("my-job.jar")) new SparkContext("mesos://HOST:5050", "My Job Name", "/home/user/spark", List("my-job.jar"))
{% endhighlight %} {% endhighlight %}
If you want to run Spark on Amazon EC2, you can use the Spark [EC2 launch scripts]({{HOME_PATH}}ec2-scripts.html), which provide an easy way to launch a cluster with Mesos, Spark, and HDFS pre-configured. This will get you a cluster in about five minutes without any configuration on your part. If you want to run Spark on Amazon EC2, you can use the Spark [EC2 launch scripts](ec2-scripts.html), which provide an easy way to launch a cluster with Mesos, Spark, and HDFS pre-configured. This will get you a cluster in about five minutes without any configuration on your part.
# Mesos Run Modes # Mesos Run Modes

View file

@ -35,7 +35,7 @@ This is done through the following constructor:
new SparkContext(master, jobName, [sparkHome], [jars]) new SparkContext(master, jobName, [sparkHome], [jars])
{% endhighlight %} {% endhighlight %}
The `master` parameter is a string specifying a [Mesos]({{HOME_PATH}}running-on-mesos.html) cluster to connect to, or a special "local" string to run in local mode, as described below. `jobName` is a name for your job, which will be shown in the Mesos web UI when running on a cluster. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described later. The `master` parameter is a string specifying a [Mesos](running-on-mesos.html) cluster to connect to, or a special "local" string to run in local mode, as described below. `jobName` is a name for your job, which will be shown in the Mesos web UI when running on a cluster. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described later.
In the Spark interpreter, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. Making your own SparkContext will not work. You can set which master the context connects to using the `MASTER` environment variable. For example, run `MASTER=local[4] ./spark-shell` to run locally with four cores. In the Spark interpreter, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. Making your own SparkContext will not work. You can set which master the context connects to using the `MASTER` environment variable. For example, run `MASTER=local[4] ./spark-shell` to run locally with four cores.
@ -48,16 +48,16 @@ The master URL passed to Spark can be in one of the following formats:
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr> <tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). <tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
</td></tr> </td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="{{HOME_PATH}}spark-standalone.html">Spark standalone <tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default. cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr> </td></tr>
<tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="{{HOME_PATH}}running-on-mesos.html">Mesos</a> cluster. <tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
The host parameter is the hostname of the Mesos master. The port must be whichever one the master is configured to use, The host parameter is the hostname of the Mesos master. The port must be whichever one the master is configured to use,
which is 5050 by default. which is 5050 by default.
</td></tr> </td></tr>
</table> </table>
For running on YARN, Spark launches an instance of the standalone deploy cluster within YARN; see [running on YARN]({{HOME_PATH}}running-on-yarn.html) for details. For running on YARN, Spark launches an instance of the standalone deploy cluster within YARN; see [running on YARN](running-on-yarn.html) for details.
### Deploying Code on a Cluster ### Deploying Code on a Cluster
@ -116,7 +116,7 @@ All transformations in Spark are <i>lazy</i>, in that they do not compute their
By default, each transformed RDD is recomputed each time you run an action on it. However, you may also *persist* an RDD in memory using the `persist` (or `cache`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting datasets on disk, or replicated across the cluster. The next section in this document describes these options. By default, each transformed RDD is recomputed each time you run an action on it. However, you may also *persist* an RDD in memory using the `persist` (or `cache`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting datasets on disk, or replicated across the cluster. The next section in this document describes these options.
The following tables list the transformations and actions currently supported (see also the [RDD API doc]({{HOME_PATH}}api/core/index.html#spark.RDD) for details): The following tables list the transformations and actions currently supported (see also the [RDD API doc](api/core/index.html#spark.RDD) for details):
### Transformations ### Transformations
@ -185,7 +185,7 @@ The following tables list the transformations and actions currently supported (s
</tr> </tr>
</table> </table>
A complete list of transformations is available in the [RDD API doc]({{HOME_PATH}}api/core/index.html#spark.RDD). A complete list of transformations is available in the [RDD API doc](api/core/index.html#spark.RDD).
### Actions ### Actions
@ -233,7 +233,7 @@ A complete list of transformations is available in the [RDD API doc]({{HOME_PATH
</tr> </tr>
</table> </table>
A complete list of actions is available in the [RDD API doc]({{HOME_PATH}}api/core/index.html#spark.RDD). A complete list of actions is available in the [RDD API doc](api/core/index.html#spark.RDD).
## RDD Persistence ## RDD Persistence
@ -241,7 +241,7 @@ One of the most important capabilities in Spark is *persisting* (or *caching*) a
You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant -- if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it. You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant -- if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or even replicate it across nodes. These levels are chosen by passing a [`spark.storage.StorageLevel`]({{HOME_PATH}}api/core/index.html#spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of available storage levels is: In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or even replicate it across nodes. These levels are chosen by passing a [`spark.storage.StorageLevel`](api/core/index.html#spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of available storage levels is:
<table class="table"> <table class="table">
<tr><th style="width:23%">Storage Level</th><th>Meaning</th></tr> <tr><th style="width:23%">Storage Level</th><th>Meaning</th></tr>
@ -259,7 +259,7 @@ In addition, each RDD can be stored using a different *storage level*, allowing
<td> MEMORY_ONLY_SER </td> <td> MEMORY_ONLY_SER </td>
<td> Store RDD as <i>serialized</i> Java objects (one byte array per partition). <td> Store RDD as <i>serialized</i> Java objects (one byte array per partition).
This is generally more space-efficient than deserialized objects, especially when using a This is generally more space-efficient than deserialized objects, especially when using a
<a href="{{HOME_PATH}}tuning.html">fast serializer</a>, but more CPU-intensive to read. <a href="tuning.html">fast serializer</a>, but more CPU-intensive to read.
</td> </td>
</tr> </tr>
<tr> <tr>
@ -284,7 +284,7 @@ We recommend going through the following process to select one:
* If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY`), leave them that way. This is the most * If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY`), leave them that way. This is the most
CPU-efficient option, allowing operations on the RDDs to run as fast as possible. CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
* If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library]({{HOME_PATH}}tuning.html) to make the objects * If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library](tuning.html) to make the objects
much more space-efficient, but still reasonably fast to access. much more space-efficient, but still reasonably fast to access.
* Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large * Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large
amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk. amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
@ -339,6 +339,6 @@ res2: Int = 10
You can see some [example Spark programs](http://www.spark-project.org/examples.html) on the Spark website. You can see some [example Spark programs](http://www.spark-project.org/examples.html) on the Spark website.
In addition, Spark includes several sample programs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them using by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments. In addition, Spark includes several sample programs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them using by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.
For help on optimizing your program, the [configuration]({{HOME_PATH}}configuration.html) and For help on optimizing your program, the [configuration](configuration.html) and
[tuning]({{HOME_PATH}}tuning.html) guides provide information on best practices. They are especially important for [tuning](tuning.html) guides provide information on best practices. They are especially important for
making sure that your data is stored in memory in an efficient format. making sure that your data is stored in memory in an efficient format.

View file

@ -68,7 +68,7 @@ Finally, the following configuration options can be passed to the master and wor
To launch a Spark standalone cluster with the deploy scripts, you need to set up two files, `conf/spark-env.sh` and `conf/slaves`. The `conf/spark-env.sh` file lets you specify global settings for the master and slave instances, such as memory, or port numbers to bind to, while `conf/slaves` is a list of slave nodes. The system requires that all the slave machines have the same configuration files, so *copy these files to each machine*. To launch a Spark standalone cluster with the deploy scripts, you need to set up two files, `conf/spark-env.sh` and `conf/slaves`. The `conf/spark-env.sh` file lets you specify global settings for the master and slave instances, such as memory, or port numbers to bind to, while `conf/slaves` is a list of slave nodes. The system requires that all the slave machines have the same configuration files, so *copy these files to each machine*.
In `conf/spark-env.sh`, you can set the following parameters, in addition to the [standard Spark configuration settings]({{HOME_PATH}}configuration.html): In `conf/spark-env.sh`, you can set the following parameters, in addition to the [standard Spark configuration settings](configuration.html):
<table class="table"> <table class="table">
<tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr> <tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr>
@ -123,7 +123,7 @@ Note that the scripts must be executed on the machine you want to run the Spark
# Connecting a Job to the Cluster # Connecting a Job to the Cluster
To run a job on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext` To run a job on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master as to the [`SparkContext`
constructor]({{HOME_PATH}}scala-programming-guide.html#initializing-spark). constructor](scala-programming-guide.html#initializing-spark).
To run an interactive Spark shell against the cluster, run the following command: To run an interactive Spark shell against the cluster, run the following command:

View file

@ -7,15 +7,15 @@ Because of the in-memory nature of most Spark computations, Spark programs can b
by any resource in the cluster: CPU, network bandwidth, or memory. by any resource in the cluster: CPU, network bandwidth, or memory.
Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you
also need to do some tuning, such as also need to do some tuning, such as
[storing RDDs in serialized form]({{HOME_PATH}}/scala-programming-guide.html#rdd-persistence), to [storing RDDs in serialized form](scala-programming-guide.html#rdd-persistence), to
make the memory usage smaller. make the memory usage smaller.
This guide will cover two main topics: data serialization, which is crucial for good network This guide will cover two main topics: data serialization, which is crucial for good network
performance, and memory tuning. We also sketch several smaller topics. performance, and memory tuning. We also sketch several smaller topics.
This document assumes that you have familiarity with the Spark API and have already read the [Scala]({{HOME_PATH}}/scala-programming-guide.html) or [Java]({{HOME_PATH}}/java-programming-guide.html) programming guides. After reading this guide, do not hesitate to reach out to the [Spark mailing list](http://groups.google.com/group/spark-users) with performance related concerns. This document assumes that you have familiarity with the Spark API and have already read the [Scala](scala-programming-guide.html) or [Java](java-programming-guide.html) programming guides. After reading this guide, do not hesitate to reach out to the [Spark mailing list](http://groups.google.com/group/spark-users) with performance related concerns.
# The Spark Storage Model # The Spark Storage Model
Spark's key abstraction is a distributed dataset, or RDD. RDD's consist of partitions. RDD partitions are stored either in memory or on disk, with replication or without replication, depending on the chosen [persistence options]({{HOME_PATH}}/scala-programming-guide.html#rdd-persistence). When RDD's are stored in memory, they can be stored as deserialized Java objects, or in a serialized form, again depending on the persistence option chosen. When RDD's are transferred over the network, or spilled to disk, they are always serialized. Spark can use different serializers, configurable with the `spark.serializer` option. Spark's key abstraction is a distributed dataset, or RDD. RDD's consist of partitions. RDD partitions are stored either in memory or on disk, with replication or without replication, depending on the chosen [persistence options](scala-programming-guide.html#rdd-persistence). When RDD's are stored in memory, they can be stored as deserialized Java objects, or in a serialized form, again depending on the persistence option chosen. When RDD's are transferred over the network, or spilled to disk, they are always serialized. Spark can use different serializers, configurable with the `spark.serializer` option.
# Serialization Options # Serialization Options
@ -45,7 +45,7 @@ You can switch to using Kryo by calling `System.setProperty("spark.serializer",
registration requirement, but we recommend trying it in any network-intensive application. registration requirement, but we recommend trying it in any network-intensive application.
Finally, to register your classes with Kryo, create a public class that extends Finally, to register your classes with Kryo, create a public class that extends
[`spark.KryoRegistrator`]({{HOME_PATH}}api/core/index.html#spark.KryoRegistrator) and set the [`spark.KryoRegistrator`](api/core/index.html#spark.KryoRegistrator) and set the
`spark.kryo.registrator` system property to point to it, as follows: `spark.kryo.registrator` system property to point to it, as follows:
{% highlight scala %} {% highlight scala %}
@ -107,11 +107,11 @@ There are several ways to reduce this cost and still make Java objects efficient
3. If you have less than 32 GB of RAM, set the JVM flag `-XX:+UseCompressedOops` to make pointers be 3. If you have less than 32 GB of RAM, set the JVM flag `-XX:+UseCompressedOops` to make pointers be
four bytes instead of eight. Also, on Java 7 or later, try `-XX:+UseCompressedStrings` to store four bytes instead of eight. Also, on Java 7 or later, try `-XX:+UseCompressedStrings` to store
ASCII strings as just 8 bits per character. You can add these options in ASCII strings as just 8 bits per character. You can add these options in
[`spark-env.sh`]({{HOME_PATH}}configuration.html#environment-variables-in-spark-envsh). [`spark-env.sh`](configuration.html#environment-variables-in-spark-envsh).
When your objects are still too large to efficiently store despite this tuning, a much simpler way When your objects are still too large to efficiently store despite this tuning, a much simpler way
to reduce memory usage is to store them in *serialized* form, using the serialized StorageLevels in to reduce memory usage is to store them in *serialized* form, using the serialized StorageLevels in
the [RDD persistence API]({{HOME_PATH}}scala-programming-guide#rdd-persistence). the [RDD persistence API](scala-programming-guide#rdd-persistence).
Spark will then store each RDD partition as one large byte array. Spark will then store each RDD partition as one large byte array.
The only downside of storing data in serialized form is slower access times, due to having to The only downside of storing data in serialized form is slower access times, due to having to
deserialize each object on the fly. deserialize each object on the fly.
@ -196,7 +196,7 @@ enough. Spark automatically sets the number of "map" tasks to run on each file a
(though you can control it through optional parameters to `SparkContext.textFile`, etc), but for (though you can control it through optional parameters to `SparkContext.textFile`, etc), but for
distributed "reduce" operations, such as `groupByKey` and `reduceByKey`, it uses a default value of 8. distributed "reduce" operations, such as `groupByKey` and `reduceByKey`, it uses a default value of 8.
You can pass the level of parallelism as a second argument (see the You can pass the level of parallelism as a second argument (see the
[`spark.PairRDDFunctions`]({{HOME_PATH}}api/core/index.html#spark.PairRDDFunctions) documentation), [`spark.PairRDDFunctions`](api/core/index.html#spark.PairRDDFunctions) documentation),
or set the system property `spark.default.parallelism` to change the default. or set the system property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster. In general, we recommend 2-3 tasks per CPU core in your cluster.
@ -213,7 +213,7 @@ number of cores in your clusters.
## Broadcasting Large Variables ## Broadcasting Large Variables
Using the [broadcast functionality]({{HOME_PATH}}scala-programming-guide#broadcast-variables) Using the [broadcast functionality](scala-programming-guide#broadcast-variables)
available in `SparkContext` can greatly reduce the size of each serialized task, and the cost available in `SparkContext` can greatly reduce the size of each serialized task, and the cost
of launching a job over a cluster. If your tasks use any large object from the driver program of launching a job over a cluster. If your tasks use any large object from the driver program
inside of them (e.g. a static lookup table), consider turning it into a broadcast variable. inside of them (e.g. a static lookup table), consider turning it into a broadcast variable.