# Apache Spark

Lightning-Fast Cluster Computing - <http://spark.apache.org/>

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the project webpage at <http://spark.apache.org/documentation.html>.
This README file only contains basic setup instructions.

## Building Spark

Spark is built on Scala 2.10. To build Spark and its example programs, run:

    ./sbt/sbt assembly

(You do not need to do this if you downloaded a pre-built package.)

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1000:

    scala> sc.parallelize(1 to 1000).count()

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1000:

    >>> sc.parallelize(range(1000)).count()

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn-cluster" or "yarn-client" to run on YARN, "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.
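As a hedged illustration of the other master URL forms described above (the thread count and YARN mode shown here are just examples; both commands assume a built Spark checkout):

```shell
# Run the Pi example locally with 4 threads
MASTER=local[4] ./bin/run-example SparkPi

# Run the Pi example on YARN in client mode (requires a YARN-enabled build)
MASTER=yarn-client ./bin/run-example SparkPi
```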

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./sbt/sbt test
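To iterate on a single suite rather than running everything, sbt's `test-only` task can be used — a sketch, where the suite name below is only an illustrative example:

```shell
# Run just one test suite; the class name here is illustrative
./sbt/sbt "test-only org.apache.spark.rdd.RDDSuite"
```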

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
You can change the version by setting the `SPARK_HADOOP_VERSION` environment
variable when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

    # Apache Hadoop 1.2.1
    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v1
    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

    # Apache Hadoop 2.0.5-alpha
    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v2
    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

    # Apache Hadoop 2.2.X and newer
    $ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
using Hadoop 1.2.1 and build your application using SBT, add this entry to
`libraryDependencies`:

    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
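In context, a minimal `build.sbt` might look like the following sketch — the project name, Scala version, and Spark artifact coordinates shown are illustrative assumptions, not prescriptions:

```scala
// Illustrative minimal build.sbt; names and versions are examples only
name := "my-spark-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "org.apache.hadoop" % "hadoop-client" % "1.2.1"
)
```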

If your project is built with Maven, add this to your POM file's `<dependencies>` section:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>1.2.1</version>
    </dependency>

## Configuration

Please refer to the [Configuration guide](http://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview of how to configure Spark.

## Contributing to Spark

Contributions via GitHub pull requests are gladly accepted from their original
author. Along with any pull requests, please state that the contribution is
your original work and that you license the work to the project under the
project's open source license. Whether or not you state this explicitly, by
submitting any copyrighted material via pull request, email, or other means
you agree to license the material under the project's open source license and
warrant that you have the legal authority to do so.