---
layout: global
title: Spark Overview
---
Apache Spark is a fast and general-purpose cluster computing system.
It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

# Downloading

Get Spark by visiting the [downloads page](http://spark.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}.

Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). All you need to run it is to have `java` installed on your system `PATH`, or the `JAVA_HOME` environment variable pointing to a Java installation.
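
As a minimal sketch of that requirement, the following shell function looks for a usable `java` executable. The helper name `find_java` is hypothetical and not part of Spark, and checking `JAVA_HOME` before `PATH` is an assumption made for illustration; Spark's own launch scripts may resolve Java differently.

```shell
# find_java: hypothetical helper (not part of Spark).
# Prints the path of a usable java executable, or fails with a message.
# Assumption: JAVA_HOME is consulted first, then PATH.
find_java() {
  if [ -n "${JAVA_HOME:-}" ] && [ -x "$JAVA_HOME/bin/java" ]; then
    # JAVA_HOME points to a Java installation with an executable bin/java
    printf '%s\n' "$JAVA_HOME/bin/java"
  elif command -v java >/dev/null 2>&1; then
    # Fall back to whatever `java` is found on PATH
    command -v java
  else
    echo "java not found: install Java or set JAVA_HOME" >&2
    return 1
  fi
}
```

Calling `find_java` before building or launching Spark gives an early, readable error when neither location provides Java.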

# Building

Spark uses [Simple Build Tool](http://www.scala-sbt.org), which is bundled with it. To compile the code, go into the top-level Spark directory and run:

    sbt/sbt assembly

For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_BINARY_VERSION}}. If you write applications in Scala, you will need to use a compatible Scala version (e.g. {{site.SCALA_BINARY_VERSION}}.X); newer major versions may not work. You can get the right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/).
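
One way to sanity-check a Scala version against the binary version Spark was built for is a simple prefix comparison. This is only a sketch: `scala_is_compatible` is a hypothetical helper, and the version numbers in the comments are example values, not the versions this release requires.

```shell
# scala_is_compatible: hypothetical helper (not part of Spark).
# $1 = full Scala version of your installation (example value: 2.10.3)
# $2 = Scala binary version Spark depends on    (example value: 2.10)
# Succeeds only when the full version has the form "<binary>.X".
scala_is_compatible() {
  full=$1
  binary=$2
  case $full in
    "$binary".*) return 0 ;;   # same binary version, any patch release
    *)           return 1 ;;   # different (for example newer) major version
  esac
}
```

For instance, `scala_is_compatible 2.10.3 2.10` succeeds, while `scala_is_compatible 2.11.0 2.10` fails.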

# Running the Examples and Shell

Spark comes with several sample programs in the `examples` directory.
To run one of the samples, use `./bin/run-example <class> <params>` in the top-level Spark directory
(the `bin/run-example` script sets up the appropriate paths and launches that program).
For example, try `./bin/run-example org.apache.spark.examples.SparkPi local`.
Each example prints usage help when run with no parameters.

Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing.
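
The local forms of the `<master>` parameter can be generated from a thread count, as in this small sketch. `local_master` is a hypothetical helper name introduced here for illustration; it is not part of Spark.

```shell
# local_master: hypothetical helper (not part of Spark).
# Prints "local" for one thread and "local[N]" for N threads.
local_master() {
  threads=${1:-1}   # default to a single thread
  case $threads in
    ''|*[!0-9]*|0*)
      echo "thread count must be a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$threads" -eq 1 ]; then
    echo "local"
  else
    echo "local[$threads]"
  fi
}
```

A possible invocation is `./bin/run-example org.apache.spark.examples.SparkPi "$(local_master 4)"`. Quoting the argument matters because an unquoted `local[4]` contains a bracket expression that the shell may try to expand as a filename pattern.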

Finally, you can run Spark interactively through modified versions of the Scala shell (`./bin/spark-shell`) or
Python interpreter (`./bin/pyspark`). These are a great way to learn the framework.

# Launching on a Cluster

The Spark [cluster mode overview](cluster-overview.html) explains the key concepts in running on a cluster.
Spark can run either by itself or on top of several existing cluster managers. It currently provides several
options for deployment:

* [Amazon EC2](ec2-scripts.html): our EC2 scripts let you launch a cluster in about 5 minutes
* [Standalone Deploy Mode](spark-standalone.html): simplest way to deploy Spark on a private cluster
* [Apache Mesos](running-on-mesos.html)
* [Hadoop YARN](running-on-yarn.html)

# A Note About Hadoop Versions

Spark uses the Hadoop-client library to talk to HDFS and other Hadoop-supported
storage systems. Because the HDFS protocol has changed in different versions of
Hadoop, you must build Spark against the same version that your cluster uses.
By default, Spark links to Hadoop 1.0.4. You can change this by setting the
`SPARK_HADOOP_VERSION` variable when compiling:

    SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

In addition, if you wish to run Spark on [YARN](running-on-yarn.html), set
`SPARK_YARN` to `true`:

    SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

Note that on Windows, you need to set the environment variables on separate lines, e.g., `set SPARK_HADOOP_VERSION=1.2.1`.

For this version of Spark (0.8.1), Hadoop 2.2.x (or newer) users will have to build Spark and publish it locally. See [Launching Spark on YARN](running-on-yarn.html). This is needed because Hadoop 2.2 has non-backwards-compatible API changes.
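
To keep the build in step with the cluster, the version string can be derived from the cluster itself rather than typed by hand. The sketch below assumes that the first line of `hadoop version` output has the form `Hadoop X.Y.Z`; both function names are hypothetical helpers, not part of Spark or Hadoop, and the second one only prints the build command instead of running it.

```shell
# hadoop_version_from_output: hypothetical helper.
# Reads `hadoop version` output on stdin and prints the version number.
# Assumption: the first line looks like "Hadoop 2.2.0".
hadoop_version_from_output() {
  sed -n '1s/^Hadoop \([^ ]*\).*$/\1/p'
}

# spark_build_command: hypothetical helper.
# $1 = Hadoop version, $2 = "yarn" to also set SPARK_YARN=true.
# Prints the sbt command line described above without executing it.
spark_build_command() {
  cmd="SPARK_HADOOP_VERSION=$1"
  if [ "${2:-}" = "yarn" ]; then
    cmd="$cmd SPARK_YARN=true"
  fi
  echo "$cmd sbt/sbt assembly"
}
```

On a machine with the Hadoop client installed, `spark_build_command "$(hadoop version | hadoop_version_from_output)"` would print the command to run from the top-level Spark directory. Printing first makes it easy to review the version before starting a long build.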

# Where to Go from Here

**Programming guides:**

* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
* [Spark Programming Guide](scala-programming-guide.html): an overview of Spark concepts, and details on the Scala API
* [Java Programming Guide](java-programming-guide.html): using Spark from Java
* [Python Programming Guide](python-programming-guide.html): using Spark from Python
* [Spark Streaming](streaming-programming-guide.html): Spark's API for processing data streams
* [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
* [GraphX (Graphs on Spark)](graphx-programming-guide.html): Spark's new API for graphs

**API Docs:**

* [Spark for Java/Scala (Scaladoc)](api/core/index.html)
* [Spark for Python (Epydoc)](api/pyspark/index.html)
* [Spark Streaming for Java/Scala (Scaladoc)](api/streaming/index.html)
* [MLlib (Machine Learning) for Java/Scala (Scaladoc)](api/mllib/index.html)
* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html)
* [GraphX (Graphs on Spark) for Scala (Scaladoc)](api/graphx/index.html)

**Deployment guides:**

* [Cluster Overview](cluster-overview.html): overview of concepts and components when running on a cluster
* [Amazon EC2](ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
* [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager
* [Mesos](running-on-mesos.html): deploy a private cluster using [Apache Mesos](http://mesos.apache.org)
* [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)

**Other documents:**

* [Configuration](configuration.html): customize Spark via its configuration system
* [Tuning Guide](tuning.html): best practices to optimize performance and memory use
* [Security](security.html): Spark security support
* [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
* [Job Scheduling](job-scheduling.html): scheduling resources across and within Spark applications
* [Building Spark with Maven](building-with-maven.html): build Spark using the Maven system
* [Contributing to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)

**External resources:**

* [Spark Homepage](http://spark.apache.org)
* [Shark](http://shark.cs.berkeley.edu): Apache Hive over Spark
* [Mailing Lists](http://spark.apache.org/mailing-lists.html): ask questions about Spark here
* [AMP Camps](http://ampcamp.berkeley.edu/): a series of training camps at UC Berkeley that featured talks and
  exercises about Spark, Shark, Mesos, and more. [Videos](http://ampcamp.berkeley.edu/agenda-2012),
  [slides](http://ampcamp.berkeley.edu/agenda-2012) and [exercises](http://ampcamp.berkeley.edu/exercises-2012) are
  available online for free.
* [Code Examples](http://spark.apache.org/examples.html): more are also available in the [examples subfolder](https://github.com/apache/spark/tree/master/examples/src/main/scala/) of Spark
* [Paper Describing Spark](http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf)
* [Paper Describing Spark Streaming](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf)

# Community

To get help using Spark or keep up with Spark development, sign up for the [user mailing list](http://spark.apache.org/mailing-lists.html).

If you're in the San Francisco Bay Area, there's a regular [Spark meetup](http://www.meetup.com/spark-users/) every few weeks. Come by to meet the developers and other users.

Finally, if you'd like to contribute code to Spark, read [how to contribute](contributing-to-spark.html).