Organize configuration docs
This PR improves and organizes the config option page and makes a few other changes to config docs. See a preview here: http://people.apache.org/~pwendell/config-improvements/configuration.html

The biggest changes are:

1. The configs for the standalone master/workers were moved to the standalone page and out of the general config doc.
2. SPARK_LOCAL_DIRS, which was missing from the standalone docs, was added.
3. Expanded discussion of injecting configs with spark-submit, including an example.
4. Config options were organized into the following categories:
   - Runtime Environment
   - Shuffle Behavior
   - Spark UI
   - Compression and Serialization
   - Execution Behavior
   - Networking
   - Scheduling
   - Security
   - Spark Streaming

Author: Patrick Wendell <pwendell@gmail.com>

Closes #880 from pwendell/config-cleanup and squashes the following commits:

93f56c3 [Patrick Wendell] Feedback from Matei
6f66efc [Patrick Wendell] More feedback
16ae776 [Patrick Wendell] Adding back header section
d9c264f [Patrick Wendell] Small fix
e0c1728 [Patrick Wendell] Response to Matei's review
27d57db [Patrick Wendell] Reverting changes to index.html (covered in #896)
e230ef9 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
a374369 [Patrick Wendell] Line wrapping fixes
fdff7fc [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
3289ea4 [Patrick Wendell] Pulling in changes from #856
106ee31 [Patrick Wendell] Small link fix
f7e79bc [Patrick Wendell] Re-organizing config options.
54b184d [Patrick Wendell] Adding standalone configs to the standalone page
592e94a [Patrick Wendell] Stash
29b5446 [Patrick Wendell] Better discussion of spark-submit in configuration docs
2d719ef [Patrick Wendell] Small fix
4af9e07 [Patrick Wendell] Adding SPARK_LOCAL_DIRS docs
204b248 [Patrick Wendell] Small fixes
parent 82eadc3b07
commit 7801d44fd3
File diff suppressed because it is too large.
@@ -252,11 +252,11 @@ we initialize a SparkContext as part of the program.
 
 We pass the SparkContext constructor a
 [SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
 object which contains information about our
-application. We also call sc.addJar to make sure that when our application is launched in cluster
-mode, the jar file containing it will be shipped automatically to worker nodes.
+application.
 
-This file depends on the Spark API, so we'll also include an sbt configuration file, `simple.sbt`
-which explains that Spark is a dependency. This file also adds a repository that Spark depends on:
+Our application depends on the Spark API, so we'll also include an sbt configuration file,
+`simple.sbt` which explains that Spark is a dependency. This file also adds a repository that
+Spark depends on:
 
 {% highlight scala %}
 name := "Simple Project"
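For reference, a minimal sketch of the pattern the rewritten passage describes, constructing a `SparkConf` and passing it to the `SparkContext` constructor; the app name and master URL are illustrative, not taken from this diff:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only: name the application and point it at a master.
val conf = new SparkConf()
  .setAppName("Simple Application")
  .setMaster("local[2]")

// The SparkContext receives the SparkConf, as the quick start describes.
val sc = new SparkContext(conf)
{% endhighlight %}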
@@ -93,7 +93,15 @@ You can optionally configure the cluster further by setting environment variable
   </tr>
   <tr>
     <td><code>SPARK_MASTER_OPTS</code></td>
-    <td>Configuration properties that apply only to the master in the form "-Dx=y" (default: none).</td>
+    <td>Configuration properties that apply only to the master in the form "-Dx=y" (default: none). See below for a list of possible options.</td>
+  </tr>
+  <tr>
+    <td><code>SPARK_LOCAL_DIRS</code></td>
+    <td>
+    Directory to use for "scratch" space in Spark, including map output files and RDDs that get
+    stored on disk. This should be on a fast, local disk in your system. It can also be a
+    comma-separated list of multiple directories on different disks.
+    </td>
   </tr>
   <tr>
     <td><code>SPARK_WORKER_CORES</code></td>
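As an aside (not part of this commit): `SPARK_LOCAL_DIRS` plays the same role as the `spark.local.dir` property, which can also be set in application code. A hedged sketch with hypothetical paths:

{% highlight scala %}
import org.apache.spark.SparkConf

// spark.local.dir takes a comma-separated list of scratch directories,
// mirroring the SPARK_LOCAL_DIRS description above. Paths are hypothetical;
// in standalone mode the cluster's SPARK_LOCAL_DIRS setting takes precedence.
val conf = new SparkConf()
  .set("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark")
{% endhighlight %}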
@@ -126,7 +134,7 @@ You can optionally configure the cluster further by setting environment variable
   </tr>
   <tr>
     <td><code>SPARK_WORKER_OPTS</code></td>
-    <td>Configuration properties that apply only to the worker in the form "-Dx=y" (default: none).</td>
+    <td>Configuration properties that apply only to the worker in the form "-Dx=y" (default: none). See below for a list of possible options.</td>
   </tr>
   <tr>
     <td><code>SPARK_DAEMON_MEMORY</code></td>
@@ -144,6 +152,73 @@ You can optionally configure the cluster further by setting environment variable
 
 **Note:** The launch scripts do not currently support Windows. To run a Spark cluster on Windows, start the master and workers by hand.
 
+SPARK_MASTER_OPTS supports the following system properties:
+
+<table class="table">
+  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr>
+    <td><code>spark.deploy.spreadOut</code></td>
+    <td>true</td>
+    <td>
+      Whether the standalone cluster manager should spread applications out across nodes or try
+      to consolidate them onto as few nodes as possible. Spreading out is usually better for
+      data locality in HDFS, but consolidating is more efficient for compute-intensive workloads. <br/>
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.deploy.defaultCores</code></td>
+    <td>(infinite)</td>
+    <td>
+      Default number of cores to give to applications in Spark's standalone mode if they don't
+      set <code>spark.cores.max</code>. If not set, applications always get all available
+      cores unless they configure <code>spark.cores.max</code> themselves.
+      Set this lower on a shared cluster to prevent users from grabbing
+      the whole cluster by default. <br/>
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.timeout</code></td>
+    <td>60</td>
+    <td>
+      Number of seconds after which the standalone deploy master considers a worker lost if it
+      receives no heartbeats.
+    </td>
+  </tr>
+</table>
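Since `spark.deploy.defaultCores` only applies when an application does not set `spark.cores.max`, here is a sketch of the application-side override; the master URL and the cap of 4 cores are illustrative:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Cap this application at 4 cores so spark.deploy.defaultCores never applies.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")  // hypothetical master URL
  .setAppName("CoreCappedApp")
  .set("spark.cores.max", "4")
val sc = new SparkContext(conf)
{% endhighlight %}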
+
+SPARK_WORKER_OPTS supports the following system properties:
+
+<table class="table">
+  <tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+  <tr>
+    <td><code>spark.worker.cleanup.enabled</code></td>
+    <td>false</td>
+    <td>
+      Enable periodic cleanup of worker / application directories. Note that this only affects standalone
+      mode, as YARN works differently. Application directories are cleaned up regardless of whether
+      the application is still running.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.cleanup.interval</code></td>
+    <td>1800 (30 minutes)</td>
+    <td>
+      Controls the interval, in seconds, at which the worker cleans up old application work dirs
+      on the local machine.
+    </td>
+  </tr>
+  <tr>
+    <td><code>spark.worker.cleanup.appDataTtl</code></td>
+    <td>7 * 24 * 3600 (7 days)</td>
+    <td>
+      The number of seconds to retain application work directories on each worker. This is a Time To Live
+      and should depend on the amount of available disk space you have. Application logs and jars are
+      downloaded to each application work dir. Over time, the work dirs can quickly fill up disk space,
+      especially if you run jobs very frequently.
+    </td>
+  </tr>
+</table>
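To make the cleanup defaults concrete, a trivial sketch that just spells out the table's values in seconds (no new settings are introduced here):

{% highlight scala %}
// Defaults from the table above, expressed in seconds.
val cleanupInterval = 1800          // spark.worker.cleanup.interval: 30 minutes
val appDataTtl = 7 * 24 * 3600      // spark.worker.cleanup.appDataTtl: 604800 s = 7 days
{% endhighlight %}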
+
 # Connecting an Application to the Cluster
 
 To run an application on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master to the [`SparkContext`
@@ -212,6 +287,94 @@ In addition, detailed log output for each job is also written to the work direct
 You can run Spark alongside your existing Hadoop cluster by just launching it as a separate service on the same machines. To access Hadoop data from Spark, just use an hdfs:// URL (typically `hdfs://<namenode>:9000/path`, but you can find the right URL on your Hadoop Namenode's web UI). Alternatively, you can set up a separate cluster for Spark, and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on).
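Tying the two points above together, a minimal sketch (hostnames hypothetical) of an application that connects to a standalone master and reads Hadoop data with an hdfs:// URL:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical master and namenode hosts; substitute your own.
val conf = new SparkConf()
  .setMaster("spark://master-host:7077")
  .setAppName("HdfsAccessExample")
val sc = new SparkContext(conf)

// Access Hadoop data with an hdfs:// URL, as the paragraph above describes.
val lines = sc.textFile("hdfs://namenode-host:9000/path/to/file.txt")
println(lines.count())
{% endhighlight %}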
 
+
+# Configuring Ports for Network Security
+
+Spark makes heavy use of the network, and some environments have strict requirements for using tight
+firewall settings. Below are the primary ports that Spark uses for its communication and how to
+configure those ports.
+
+<table class="table">
+  <tr>
+    <th>From</th><th>To</th><th>Default Port</th><th>Purpose</th><th>Configuration
+    Setting</th><th>Notes</th>
+  </tr>
+  <!-- Web UIs -->
+  <tr>
+    <td>Browser</td>
+    <td>Standalone Cluster Master</td>
+    <td>8080</td>
+    <td>Web UI</td>
+    <td><code>master.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>Driver</td>
+    <td>4040</td>
+    <td>Web UI</td>
+    <td><code>spark.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>History Server</td>
+    <td>18080</td>
+    <td>Web UI</td>
+    <td><code>spark.history.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <tr>
+    <td>Browser</td>
+    <td>Worker</td>
+    <td>8081</td>
+    <td>Web UI</td>
+    <td><code>worker.ui.port</code></td>
+    <td>Jetty-based</td>
+  </tr>
+  <!-- Cluster interactions -->
+  <tr>
+    <td>Application</td>
+    <td>Standalone Cluster Master</td>
+    <td>7077</td>
+    <td>Submit job to cluster</td>
+    <td><code>spark.driver.port</code></td>
+    <td>Akka-based. Set to "0" to choose a port randomly</td>
+  </tr>
+  <tr>
+    <td>Worker</td>
+    <td>Standalone Cluster Master</td>
+    <td>7077</td>
+    <td>Join cluster</td>
+    <td><code>spark.driver.port</code></td>
+    <td>Akka-based. Set to "0" to choose a port randomly</td>
+  </tr>
+  <tr>
+    <td>Application</td>
+    <td>Worker</td>
+    <td>(random)</td>
+    <td>Join cluster</td>
+    <td><code>SPARK_WORKER_PORT</code> (standalone cluster)</td>
+    <td>Akka-based</td>
+  </tr>
+
+  <!-- Other misc stuff -->
+  <tr>
+    <td>Driver and other Workers</td>
+    <td>Worker</td>
+    <td>(random)</td>
+    <td>
+      <ul>
+        <li>File server for files and jars</li>
+        <li>HTTP broadcast</li>
+        <li>Class file server (Spark Shell only)</li>
+      </ul>
+    </td>
+    <td>None</td>
+    <td>Jetty-based. Each of these services starts on a random port that cannot be configured</td>
+  </tr>
+
+</table>
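Two of the settings in the table, `spark.ui.port` and `spark.driver.port`, are ordinary Spark properties that an application can set itself. A hedged sketch with illustrative port numbers:

{% highlight scala %}
import org.apache.spark.SparkConf

// Pin the two application-configurable ports from the table above.
// Values are illustrative; setting spark.driver.port to "0" picks a random port.
val conf = new SparkConf()
  .set("spark.ui.port", "4040")
  .set("spark.driver.port", "51000")
{% endhighlight %}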
+
 # High Availability
 
 By default, standalone scheduling clusters are resilient to Worker failures (insofar as Spark itself is resilient to losing work by moving it to other workers). However, the scheduler uses a Master to make scheduling decisions, and this (by default) creates a single point of failure: if the Master crashes, no new applications can be created. In order to circumvent this, we have two high availability schemes, detailed below.