Organize configuration docs

This PR improves and organizes the config option page
and makes a few other changes to config docs. See a preview here:
http://people.apache.org/~pwendell/config-improvements/configuration.html

The biggest changes are:
1. The configs for the standalone master/workers were moved to the
standalone page and out of the general config doc.
2. SPARK_LOCAL_DIRS was added to the standalone docs (it was previously missing).
3. Expanded discussion of injecting configs with spark-submit, including an
example (a related sketch follows this list).
4. Config options were organized into the following categories:
- Runtime Environment
- Shuffle Behavior
- Spark UI
- Compression and Serialization
- Execution Behavior
- Networking
- Scheduling
- Security
- Spark Streaming
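
To sketch what config injection looks like from the application side (a minimal illustration,
separate from the example added to the configuration page itself; the app name and settings
below are placeholders): anything left out of the application's own SparkConf can be supplied
at launch time by spark-submit, while values set explicitly in code typically take the
highest precedence.

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Values set directly on SparkConf take precedence; anything left unset
// here can be injected when the application is launched with spark-submit.
val conf = new SparkConf()
  .setAppName("ConfigDemo")               // placeholder application name
  .set("spark.executor.memory", "1g")     // example runtime setting

val sc = new SparkContext(conf)
{% endhighlight %}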

Author: Patrick Wendell <pwendell@gmail.com>

Closes #880 from pwendell/config-cleanup and squashes the following commits:

93f56c3 [Patrick Wendell] Feedback from Matei
6f66efc [Patrick Wendell] More feedback
16ae776 [Patrick Wendell] Adding back header section
d9c264f [Patrick Wendell] Small fix
e0c1728 [Patrick Wendell] Response to Matei's review
27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896)
e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a374369 [Patrick Wendell] Line wrapping fixes
fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
3289ea4 [Patrick Wendell] Pulling in changes from #856
106ee31 [Patrick Wendell] Small link fix
f7e79bc [Patrick Wendell] Re-organizing config options.
54b184d [Patrick Wendell] Adding standalone configs to the standalone page
592e94a [Patrick Wendell] Stash
29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs
2d719ef [Patrick Wendell] Small fix
4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs
204b248 [Patrick Wendell] Small fixes

File diff suppressed because it is too large.


@@ -252,11 +252,11 @@ we initialize a SparkContext as part of the program.
We pass the SparkContext constructor a
[SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
object which contains information about our
application.

Our application depends on the Spark API, so we'll also include an sbt configuration file,
`simple.sbt` which explains that Spark is a dependency. This file also adds a repository that
Spark depends on:

{% highlight scala %}
name := "Simple Project"


@@ -93,7 +93,15 @@ You can optionally configure the cluster further by setting environment variable
</tr>
<tr>
<td><code>SPARK_MASTER_OPTS</code></td>
<td>Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See below for a list of possible options.</td>
</tr>
<tr>
<td><code>SPARK_LOCAL_DIRS</code></td>
<td>
Directory to use for "scratch" space in Spark, including map output files and RDDs that get
stored on disk. This should be on a fast, local disk in your system. It can also be a
comma-separated list of multiple directories on different disks.
</td>
</tr>
<tr>
<td><code>SPARK_WORKER_CORES</code></td>
@@ -126,7 +134,7 @@ You can optionally configure the cluster further by setting environment variable
</tr>
<tr>
<td><code>SPARK_WORKER_OPTS</code></td>
<td>Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options.</td>
</tr>
<tr>
<td><code>SPARK_DAEMON_MEMORY</code></td>
@@ -144,6 +152,73 @@ You can optionally configure the cluster further by setting environment variable
**Note:** The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
SPARK_MASTER_OPTS supports the following system properties:
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.deploy.spreadOut</code></td>
<td>true</td>
<td>
Whether the standalone cluster manager should spread applications out across nodes or try
to consolidate them onto as few nodes as possible. Spreading out is usually better for
data locality in HDFS, but consolidating is more efficient for compute-intensive workloads. <br/>
</td>
</tr>
<tr>
<td><code>spark.deploy.defaultCores</code></td>
<td>(infinite)</td>
<td>
Default number of cores to give to applications in Spark's standalone mode if they don't
set <code>spark.cores.max</code>. If not set, applications always get all available
cores unless they configure <code>spark.cores.max</code> themselves.
Set this lower on a shared cluster to prevent users from grabbing
the whole cluster by default. <br/>
</td>
</tr>
<tr>
<td><code>spark.worker.timeout</code></td>
<td>60</td>
<td>
Number of seconds after which the standalone deploy master considers a worker lost if it
receives no heartbeats.
</td>
</tr>
</table>
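
For example, an application on a shared cluster can cap its own usage by setting
`spark.cores.max` itself rather than relying on `spark.deploy.defaultCores`; a minimal sketch
(the master URL, application name, and the cap of 4 cores are placeholders):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Ask for at most 4 cores instead of falling back to
// spark.deploy.defaultCores (which may be the entire cluster).
val conf = new SparkConf()
  .setMaster("spark://master:7077")   // placeholder master URL
  .setAppName("CappedApp")            // placeholder application name
  .set("spark.cores.max", "4")

val sc = new SparkContext(conf)
{% endhighlight %}
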
SPARK_WORKER_OPTS supports the following system properties:
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.worker.cleanup.enabled</code></td>
<td>false</td>
<td>
Enable periodic cleanup of worker / application directories. Note that this only affects standalone
mode, as YARN works differently. Application directories are cleaned up regardless of whether
the application is still running.
</td>
</tr>
<tr>
<td><code>spark.worker.cleanup.interval</code></td>
<td>1800 (30 minutes)</td>
<td>
Controls the interval, in seconds, at which the worker cleans up old application work dirs
on the local machine.
</td>
</tr>
<tr>
<td><code>spark.worker.cleanup.appDataTtl</code></td>
<td>7 * 24 * 3600 (7 days)</td>
<td>
The number of seconds to retain application work directories on each worker. This is a Time To Live
and should depend on the amount of available disk space you have. Application logs and jars are
downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space,
especially if you run jobs very frequently.
</td>
</tr>
</table>
# Connecting an Application to the Cluster
To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master to the [`SparkContext`
@@ -212,6 +287,94 @@ In addition, detailed log output for each job is also written to the work direct
You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use an hdfs:// URL (typically `hdfs://<namenode>:9000/path`, but you can find the right URL on your Hadoop Namenode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).
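
For example, a minimal sketch of reading a file over such a URL (the master URL, namenode
host, and path are placeholders):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")   // placeholder master URL
  .setAppName("HdfsCount")            // placeholder application name
val sc = new SparkContext(conf)

// Read a file from HDFS using the namenode URL described above
// and count its lines.
val lines = sc.textFile("hdfs://namenode:9000/path/to/data.txt")
println(lines.count())
{% endhighlight %}
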
# Configuring Ports for Network Security
Spark makes heavy use of the network, and some environments have strict requirements for using tight
firewall settings. Below are the primary ports that Spark uses for its communication and how to
configure those ports.
<table class="table">
<tr>
<th>From</th><th>To</th><th>Default Port</th><th>Purpose</th><th>Configuration
Setting</th><th>Notes</th>
</tr>
<!-- Web UIs -->
<tr>
<td>Browser</td>
<td>Standalone Cluster Master</td>
<td>8080</td>
<td>Web UI</td>
<td><code>master.ui.port</code></td>
<td>Jetty-based</td>
</tr>
<tr>
<td>Browser</td>
<td>Driver</td>
<td>4040</td>
<td>Web UI</td>
<td><code>spark.ui.port</code></td>
<td>Jetty-based</td>
</tr>
<tr>
<td>Browser</td>
<td>History Server</td>
<td>18080</td>
<td>Web UI</td>
<td><code>spark.history.ui.port</code></td>
<td>Jetty-based</td>
</tr>
<tr>
<td>Browser</td>
<td>Worker</td>
<td>8081</td>
<td>Web UI</td>
<td><code>worker.ui.port</code></td>
<td>Jetty-based</td>
</tr>
<!-- Cluster interactions -->
<tr>
<td>Application</td>
<td>Standalone Cluster Master</td>
<td>7077</td>
<td>Submit job to cluster</td>
<td><code>spark.driver.port</code></td>
<td>Akka-based. Set to "0" to choose a port randomly</td>
</tr>
<tr>
<td>Worker</td>
<td>Standalone Cluster Master</td>
<td>7077</td>
<td>Join cluster</td>
<td><code>spark.driver.port</code></td>
<td>Akka-based. Set to "0" to choose a port randomly</td>
</tr>
<tr>
<td>Application</td>
<td>Worker</td>
<td>(random)</td>
<td>Join cluster</td>
<td><code>SPARK_WORKER_PORT</code> (standalone cluster)</td>
<td>Akka-based</td>
</tr>
<!-- Other misc stuff -->
<tr>
<td>Driver and other Workers</td>
<td>Worker</td>
<td>(random)</td>
<td>
<ul>
<li>File server for files and jars</li>
<li>Http Broadcast</li>
<li>Class file server (Spark Shell only)</li>
</ul>
</td>
<td>None</td>
<td>Jetty-based. Each of these services starts on a random port that cannot be configured</td>
</tr>
</table>
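
If firewall rules must be written ahead of time, the driver-side ports in the table above can
be pinned rather than left to random choices; a minimal sketch (the master URL, application
name, and specific port values are arbitrary):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Pin the driver's Akka port and the application web UI port so that
// firewall rules can allow them explicitly.
val conf = new SparkConf()
  .setMaster("spark://master:7077")   // placeholder master URL
  .setAppName("FirewalledApp")        // placeholder application name
  .set("spark.driver.port", "51000")  // arbitrary fixed driver port
  .set("spark.ui.port", "4050")       // arbitrary fixed UI port (default is 4040)

val sc = new SparkContext(conf)
{% endhighlight %}
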
# High Availability
By default, standalone scheduling clusters are resilient to Worker failures (insofar as Spark itself is resilient to losing work by moving it to other workers). However, the scheduler uses a Master to make scheduling decisions, and this (by default) creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, we have two high availability schemes, detailed below.