Organize configuration docs

This PR improves and organizes the config option page
and makes a few other changes to config docs. See a preview here:
http://people.apache.org/~pwendell/config-improvements/configuration.html

The biggest changes are:
1. The configs for the standalone master/workers were moved to the
standalone page and out of the general config doc.
2. SPARK_LOCAL_DIRS was added to the standalone docs (it was previously missing).
3. Expanded discussion of injecting configs with spark-submit, including an
example (a related sketch follows this list).
4. Config options were organized into the following categories:
- Runtime Environment
- Shuffle Behavior
- Spark UI
- Compression and Serialization
- Execution Behavior
- Networking
- Scheduling
- Security
- Spark Streaming
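
To sketch what config injection looks like from the application side (a minimal illustration,
separate from the example added to the configuration page itself; the app name and settings
below are placeholders): anything left out of the application's own SparkConf can be supplied
at launch time by spark-submit, while values set explicitly in code typically take the
highest precedence.

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Values set directly on SparkConf take precedence; anything left unset
// here can be injected when the application is launched with spark-submit.
val conf = new SparkConf()
  .setAppName("ConfigDemo")               // placeholder application name
  .set("spark.executor.memory", "1g")     // example runtime setting

val sc = new SparkContext(conf)
{% endhighlight %}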

Author: Patrick Wendell <pwendell@gmail.com>

Closes #880 from pwendell/config-cleanup and squashes the following commits:

93f56c3 [Patrick Wendell] Feedback from Matei
6f66efc [Patrick Wendell] More feedback
16ae776 [Patrick Wendell] Adding back header section
d9c264f [Patrick Wendell] Small fix
e0c1728 [Patrick Wendell] Response to Matei's review
27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896)
e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a374369 [Patrick Wendell] Line wrapping fixes
fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
3289ea4 [Patrick Wendell] Pulling in changes from #856
106ee31 [Patrick Wendell] Small link fix
f7e79bc [Patrick Wendell] Re-organizing config options.
54b184d [Patrick Wendell] Adding standalone configs to the standalone page
592e94a [Patrick Wendell] Stash
29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs
2d719ef [Patrick Wendell] Small fix
4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs
204b248 [Patrick Wendell] Small fixes

File diff suppressed because it is too large.


@@ -252,11 +252,11 @@ we initialize a SparkContext as part of the program.
We pass the SparkContext constructor a
[SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
object which contains information about our
application.

Our application depends on the Spark API, so we'll also include an sbt configuration file,
`simple.sbt` which explains that Spark is a dependency. This file also adds a repository that
Spark depends on:

{% highlight scala %}
name := "Simple Project"


@@ -93,7 +93,15 @@ You can optionally configure the cluster further by setting environment variable
</tr>
<tr>
<td><code>SPARK_MASTER_OPTS</code></td>
<td>Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See below for a list of possible options.</td>
</tr>
<tr>
<td><code>SPARK_LOCAL_DIRS</code></td>
<td>
Directory to use for "scratch" space in Spark, including map output files and RDDs that get
stored on disk. This should be on a fast, local disk in your system. It can also be a
comma-separated list of multiple directories on different disks.
</td>
</tr>
<tr>
<td><code>SPARK_WORKER_CORES</code></td>
@@ -126,7 +134,7 @@ You can optionally configure the cluster further by setting environment variable
</tr>
<tr>
<td><code>SPARK_WORKER_OPTS</code></td>
<td>Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options.</td>
</tr>
<tr>
<td><code>SPARK_DAEMON_MEMORY</code></td>
@@ -144,6 +152,73 @@ You can optionally configure the cluster further by setting environment variable
**Note:** The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
SPARK_MASTER_OPTS supports the following system properties:
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.deploy.spreadOut</code></td>
<td>true</td>
<td>
Whether the standalone cluster manager should spread applications out across nodes or try
to consolidate them onto as few nodes as possible. Spreading out is usually better for
data locality in HDFS, but consolidating is more efficient for compute-intensive workloads. <br/>
</td>
</tr>
<tr>
<td><code>spark.deploy.defaultCores</code></td>
<td>(infinite)</td>
<td>
Default number of cores to give to applications in Spark's standalone mode if they don't
set <code>spark.cores.max</code>. If not set, applications always get all available
cores unless they configure <code>spark.cores.max</code> themselves.
Set this lower on a shared cluster to prevent users from grabbing
the whole cluster by default. <br/>
</td>
</tr>
<tr>
<td><code>spark.worker.timeout</code></td>
<td>60</td>
<td>
Number of seconds after which the standalone deploy master considers a worker lost if it
receives no heartbeats.
</td>
</tr>
</table>
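
For example, an application on a shared cluster can cap its own usage by setting
`spark.cores.max` itself rather than relying on `spark.deploy.defaultCores`; a minimal sketch
(the master URL, application name, and the cap of 4 cores are placeholders):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Ask for at most 4 cores instead of falling back to
// spark.deploy.defaultCores (which may be the entire cluster).
val conf = new SparkConf()
  .setMaster("spark://master:7077")   // placeholder master URL
  .setAppName("CappedApp")            // placeholder application name
  .set("spark.cores.max", "4")

val sc = new SparkContext(conf)
{% endhighlight %}
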
SPARK_WORKER_OPTS supports the following system properties:
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.worker.cleanup.enabled</code></td>
<td>false</td>
<td>
Enable periodic cleanup of worker / application directories. Note that this only affects standalone
mode, as YARN works differently. Application directories are cleaned up regardless of whether
the application is still running.
</td>
</tr>
<tr>
<td><code>spark.worker.cleanup.interval</code></td>
<td>1800 (30 minutes)</td>
<td>
Controls the interval, in seconds, at which the worker cleans up old application work dirs
on the local machine.
</td>
</tr>
<tr>
<td><code>spark.worker.cleanup.appDataTtl</code></td>
<td>7 * 24 * 3600 (7 days)</td>
<td>
The number of seconds to retain application work directories on each worker. This is a Time To Live
and should depend on the amount of available disk space you have. Application logs and jars are
downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space,
especially if you run jobs very frequently.
</td>
</tr>
</table>
# Connecting an Application to the Cluster
To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master to the [`SparkContext`
@@ -212,6 +287,94 @@ In addition, detailed log output for each job is also written to the work direct
You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use an hdfs:// URL (typically `hdfs://<namenode>:9000/path`, but you can find the right URL on your Hadoop Namenode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).
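
For example, a minimal sketch of reading a file over such a URL (the master URL, namenode
host, and path are placeholders):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setMaster("spark://master:7077")   // placeholder master URL
  .setAppName("HdfsCount")            // placeholder application name
val sc = new SparkContext(conf)

// Read a file from HDFS using the namenode URL described above
// and count its lines.
val lines = sc.textFile("hdfs://namenode:9000/path/to/data.txt")
println(lines.count())
{% endhighlight %}
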
# Configuring Ports for Network Security
Spark makes heavy use of the network, and some environments have strict requirements for using tight
firewall settings. Below are the primary ports that Spark uses for its communication and how to
configure those ports.
<table class="table">
<tr>
<th>From</th><th>To</th><th>Default Port</th><th>Purpose</th><th>Configuration
Setting</th><th>Notes</th>
</tr>
<!-- Web UIs -->
<tr>
<td>Browser</td>
<td>Standalone Cluster Master</td>
<td>8080</td>
<td>Web UI</td>
<td><code>master.ui.port</code></td>
<td>Jetty-based</td>
</tr>
<tr>
<td>Browser</td>
<td>Driver</td>
<td>4040</td>
<td>Web UI</td>
<td><code>spark.ui.port</code></td>
<td>Jetty-based</td>
</tr>
<tr>
<td>Browser</td>
<td>History Server</td>
<td>18080</td>
<td>Web UI</td>
<td><code>spark.history.ui.port</code></td>
<td>Jetty-based</td>
</tr>
<tr>
<td>Browser</td>
<td>Worker</td>
<td>8081</td>
<td>Web UI</td>
<td><code>worker.ui.port</code></td>
<td>Jetty-based</td>
</tr>
<!-- Cluster interactions -->
<tr>
<td>Application</td>
<td>Standalone Cluster Master</td>
<td>7077</td>
<td>Submit job to cluster</td>
<td><code>spark.driver.port</code></td>
<td>Akka-based. Set to "0" to choose a port randomly</td>
</tr>
<tr>
<td>Worker</td>
<td>Standalone Cluster Master</td>
<td>7077</td>
<td>Join cluster</td>
<td><code>spark.driver.port</code></td>
<td>Akka-based. Set to "0" to choose a port randomly</td>
</tr>
<tr>
<td>Application</td>
<td>Worker</td>
<td>(random)</td>
<td>Join cluster</td>
<td><code>SPARK_WORKER_PORT</code> (standalone cluster)</td>
<td>Akka-based</td>
</tr>
<!-- Other misc stuff -->
<tr>
<td>Driver and other Workers</td>
<td>Worker</td>
<td>(random)</td>
<td>
<ul>
<li>File server for files and jars</li>
<li>Http Broadcast</li>
<li>Class file server (Spark Shell only)</li>
</ul>
</td>
<td>None</td>
<td>Jetty-based. Each of these services starts on a random port that cannot be configured</td>
</tr>
</table>
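
If firewall rules must be written ahead of time, the driver-side ports in the table above can
be pinned rather than left to random choices; a minimal sketch (the master URL, application
name, and specific port values are arbitrary):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Pin the driver's Akka port and the application web UI port so that
// firewall rules can allow them explicitly.
val conf = new SparkConf()
  .setMaster("spark://master:7077")   // placeholder master URL
  .setAppName("FirewalledApp")        // placeholder application name
  .set("spark.driver.port", "51000")  // arbitrary fixed driver port
  .set("spark.ui.port", "4050")       // arbitrary fixed UI port (default is 4040)

val sc = new SparkContext(conf)
{% endhighlight %}
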
# High Availability
By default, standalone scheduling clusters are resilient to Worker failures (insofar as Spark itself is resilient to losing work by moving it to other workers). However, the scheduler uses a Master to make scheduling decisions, and this (by default) creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, we have two high availability schemes, detailed below.