Added cluster overview doc, made logo higher-resolution, and added more
details on monitoring
parent 651a96adf7
commit f261d2a60f
@@ -51,7 +51,7 @@
<div class="navbar-inner">
<div class="container">
<div class="brand"><a href="index.html">
<img src="img/spark-logo-77x50px-hd.png" /></a><span class="version">{{site.SPARK_VERSION_SHORT}}</span>
<img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">{{site.SPARK_VERSION_SHORT}}</span>
</div>
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li some how.-->
@@ -103,6 +103,7 @@
<li><a href="hadoop-third-party-distributions.html">Running with CDH/HDP</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li class="divider"></li>
<li><a href="building-with-maven.html">Building Spark with Maven</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
</ul>
@@ -3,3 +3,68 @@ layout: global
title: Cluster Mode Overview
---

This document gives a short overview of how Spark runs on clusters, to make it easier to understand
the components involved.

# Components

Spark applications run as independent sets of processes on a cluster, coordinated by the SparkContext
object in your main program (called the _driver program_).
Specifically, to run on a cluster, the SparkContext can connect to several types of _cluster managers_
(either Spark's own standalone cluster manager or Mesos/YARN), which allocate resources across
applications. Once connected, Spark acquires *executors* on nodes in the cluster, which are
worker processes that run computations and store data for your application.
Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to
the executors. Finally, SparkContext sends *tasks* for the executors to run.

<p style="text-align: center;">
  <img src="img/cluster-overview.png" title="Spark cluster components" alt="Spark cluster components" />
</p>

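To make these pieces concrete, here is a minimal Scala sketch of a driver program, assuming a standalone cluster; the master URL, Spark installation path, and JAR name are placeholders rather than values prescribed by this guide.

```scala
import org.apache.spark.SparkContext

object SimpleDriver {
  def main(args: Array[String]) {
    // The driver program creates a SparkContext, which connects to the cluster manager
    // (a standalone master in this sketch) and acquires executors on worker nodes.
    val sc = new SparkContext(
      "spark://master.example.com:7077",  // cluster manager URL (placeholder)
      "Simple Driver",                    // application name shown in the web UIs
      "/opt/spark",                       // Spark installation path on the workers (placeholder)
      Seq("target/simple-driver.jar"))    // application code shipped to the executors (placeholder)

    // Actions on RDDs are broken into tasks that the SparkContext sends to the executors.
    val evens = sc.parallelize(1 to 100000, 8).filter(_ % 2 == 0).count()
    println("Counted " + evens + " even numbers across the cluster")

    sc.stop()
  }
}
```

Run this way, the application would hold its executors for its whole lifetime, as described in the notes below.
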
There are several useful things to note about this architecture:

1. Each application gets its own executor processes, which stay up for the duration of the whole
   application and run tasks in multiple threads. This has the benefit of isolating applications
   from each other, on both the scheduling side (each driver schedules its own tasks) and executor
   side (tasks from different applications run in different JVMs). However, it also means that
   data cannot be shared across different Spark applications (instances of SparkContext) without
   writing it to an external storage system.
2. Spark is agnostic to the underlying cluster manager. As long as it can acquire executor
   processes, and these communicate with each other, it is relatively easy to run it even on a
   cluster manager that also supports other applications (e.g. Mesos/YARN).
3. Because the driver schedules tasks on the cluster, it should be run close to the worker
   nodes, preferably on the same local area network. If you'd like to send requests to the
   cluster remotely, it's better to open an RPC to the driver and have it submit operations
   from nearby than to run a driver far away from the worker nodes.

# Cluster Manager Types

The system currently supports three cluster managers:

* [Standalone](spark-standalone.html) -- a simple cluster manager included with Spark that makes it
  easy to set up a cluster.
* [Apache Mesos](running-on-mesos.html) -- a general cluster manager that can also run Hadoop MapReduce
  and service applications.
* [Hadoop YARN](running-on-yarn.html) -- the resource manager in Hadoop 2.0.

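For orientation, the cluster manager is chosen through the master URL given to SparkContext. The Scala snippet below is only a sketch of the common URL forms; the host names are placeholders, and 7077 and 5050 are the conventional default ports for the standalone master and Mesos.

```scala
// Master URL forms that select a cluster manager (hosts are placeholders):
val standaloneMaster = "spark://master.example.com:7077"  // Spark's own standalone cluster manager
val mesosMaster      = "mesos://mesos.example.com:5050"   // Apache Mesos
val localMode        = "local[4]"                         // no cluster manager: run locally with 4 threads
// Hadoop YARN applications are launched through a separate client; see running-on-yarn.html.
```
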
In addition, Spark's [EC2 launch scripts](ec2-scripts.html) make it easy to launch a standalone
cluster on Amazon EC2.

# Shipping Code to the Cluster

The recommended way to ship your code to the cluster is to pass it through SparkContext's constructor,
which takes a list of JAR files (Java/Scala) or .egg and .zip libraries (Python) to disseminate to
worker nodes. You can also dynamically add new files to be sent to executors with `SparkContext.addJar`
and `addFile`.

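A brief sketch of both mechanisms in Scala; the master URL, paths, and file names are placeholders, while `SparkContext.addJar`, `addFile`, and the constructor's JAR list are the calls referred to above.

```scala
import org.apache.spark.SparkContext

// Ship dependencies up front through the SparkContext constructor (paths are placeholders)...
val sc = new SparkContext(
  "spark://master.example.com:7077",
  "Code Shipping Example",
  "/opt/spark",
  Seq("target/my-app.jar", "lib/my-dependency.jar"))

// ...or add them dynamically after the context has been created.
sc.addJar("lib/extra-algorithms.jar")  // classes needed by tasks running on the executors
sc.addFile("data/lookup-table.csv")    // a plain file copied to every worker node
                                       // (tasks can locate it with SparkFiles.get("lookup-table.csv"))
```

Either way, each executor fetches the JARs and files before running tasks for the application.
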
# Monitoring

Each driver program has a web UI, typically on port 3030, that displays information about running
tasks, executors, and storage usage. Simply go to `http://<driver-node>:3030` in a web browser to
access this UI. The [monitoring guide](monitoring.html) also describes other monitoring options.

# Job Scheduling

Spark gives control over resource allocation both _across_ applications (at the level of the cluster
manager) and _within_ applications (if multiple computations are happening on the same SparkContext).
The [job scheduling overview](job-scheduling.html) describes this in more detail.
BIN  docs/img/cluster-overview.png   (new file; binary not shown; size 28 KiB)
BIN  docs/img/cluster-overview.pptx  (new file; binary not shown)
BIN  docs/img/spark-logo-hd.png      (new file; binary not shown; size 13 KiB)
@@ -48,11 +48,6 @@ options for deployment:
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)

There is a script, `./make-distribution.sh`, which will create a binary distribution of Spark for deployment
to any machine with only the Java runtime as a necessary dependency.
Running the script creates a distribution directory in `dist/`; pass the `-tgz` option to create a .tgz file instead.
Check the script for additional options.

# A Note About Hadoop Versions

Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported
@@ -3,19 +3,30 @@ layout: global
title: Monitoring and Instrumentation
---

There are several ways to monitor the progress of Spark jobs.
There are several ways to monitor Spark applications.

# Web Interfaces
When a SparkContext is initialized, it launches a web server (by default at port 3030) which
displays useful information. This includes a list of active and completed scheduler stages,
a summary of RDD blocks and partitions, and environmental information. If multiple SparkContexts
are running on the same host, they will bind to successive ports beginning with 3030 (3031, 3032,
etc).

Spark's Standalone Mode scheduler also has its own
[web interface](spark-standalone.html#monitoring-and-logging).

Every SparkContext launches a web UI, by default on port 3030, that
displays useful information about the application. This includes:

* A list of scheduler stages and tasks
* A summary of RDD sizes and memory usage
* Information about the running executors
* Environmental information.

You can access this interface by simply opening `http://<driver-node>:3030` in a web browser.
If multiple SparkContexts are running on the same host, they will bind to successive ports
beginning with 3030 (3031, 3032, etc).

Spark's Standalone Mode cluster manager also has its own
[web UI](spark-standalone.html#monitoring-and-logging).

Note that in both of these UIs, the tables are sortable by clicking their headers,
making it easy to identify slow tasks, data skew, etc.

# Metrics

# Spark Metrics
Spark has a configurable metrics system based on the
[Coda Hale Metrics Library](http://metrics.codahale.com/).
This allows users to report Spark metrics to a variety of sinks including HTTP, JMX, and CSV
@@ -35,6 +46,7 @@ The syntax of the metrics configuration file is defined in an example configurat
`$SPARK_HOME/conf/metrics.conf.template`.

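As a rough illustration of that syntax, the sketch below enables a console sink and a JVM metrics source; the class names and options are assumptions drawn from the shipped template, so check `metrics.conf.template` for the authoritative list.

```
# syntax: [instance].sink|source.[name].[options]=[value]

# Report metrics from every instance to the console every 10 seconds (assumed options).
*.sink.console.class=org.apache.spark.metrics.sink.ConsoleSink
*.sink.console.period=10
*.sink.console.unit=seconds

# Also enable the JVM source on the master, to report memory and GC statistics.
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
```
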
# Advanced Instrumentation

Several external tools can be used to help profile the performance of Spark jobs:

* Cluster-wide monitoring tools, such as [Ganglia](http://ganglia.sourceforge.net/), can provide