# Apache Spark

Lightning-Fast Cluster Computing - <http://spark.incubator.apache.org/>

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the project webpage at <http://spark.incubator.apache.org/documentation.html>.
This README file only contains basic setup instructions.

## Building

Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The project is
built using Simple Build Tool (SBT), which is packaged with it. To build
Spark and its example programs, run:

    sbt/sbt assembly

Once you've built Spark, the easiest way to start using it is the shell:

    ./spark-shell

Or, for the Python API, the Python shell (`./pyspark`).
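
For a quick end-to-end check, a classic word count works well in the Scala
shell. This is only a sketch: inside `./spark-shell` a `SparkContext` is
already bound to `sc`, and the file path is a placeholder for any local text
file you have:

    // count word occurrences in a local text file
    val lines = sc.textFile("README.md")
    val counts = lines.flatMap(line => line.split(" "))
                      .map(word => (word, 1))
                      .reduceByKey(_ + _)
    counts.take(5).foreach(println)   // print a few (word, count) pairs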

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./run-example <class> <params>`. For example:

    ./run-example org.apache.spark.examples.SparkLR local[2]

will run the Logistic Regression example locally on 2 CPUs.

Each of the example programs prints usage help if no params are given.

All of the Spark samples take a `<master>` parameter that is the cluster URL
to connect to. This can be a mesos:// or spark:// URL, or "local" to run
locally with one thread, or "local[N]" to run locally with N threads.
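
As an illustration of the different master URLs, the same example class can be
launched in several modes (the `spark://` host and port below are placeholders
for your own cluster master):

    ./run-example org.apache.spark.examples.SparkPi local                 # one local thread
    ./run-example org.apache.spark.examples.SparkPi local[4]              # four local threads
    ./run-example org.apache.spark.examples.SparkPi spark://myhost:7077   # standalone cluster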

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
You can change the version by setting the `SPARK_HADOOP_VERSION` environment
variable when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

    # Apache Hadoop 1.2.1
    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v1
    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

    # Apache Hadoop 2.0.5-alpha
    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v2
    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

For convenience, these variables may also be set through the `conf/spark-env.sh` file
described below.

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
using Hadoop 1.2.1 and build your application using SBT, add this entry to
`libraryDependencies`:

    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
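
Putting this together, a minimal SBT build definition might look like the
following sketch. The project name and version numbers here are illustrative
assumptions; match the artifacts to the Spark and Hadoop releases you actually
use:

    name := "my-spark-app"

    version := "0.1"

    scalaVersion := "2.9.3"

    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-core" % "0.8.0-incubating",
      "org.apache.hadoop" % "hadoop-client" % "1.2.1"
    )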

If your project is built with Maven, add this to your POM file's `<dependencies>` section:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>1.2.1</version>
    </dependency>

## Configuration

Please refer to the [Configuration guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.
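
As a rough sketch of the `conf/spark-env.sh` file mentioned earlier, you can
copy the bundled template and set the build and runtime variables in one place
(the values below are examples, not recommendations):

    # created via: cp conf/spark-env.sh.template conf/spark-env.sh
    SPARK_HADOOP_VERSION=1.2.1   # Hadoop version to build against
    SPARK_MEM=2g                 # memory to allocate per node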

## Apache Incubator Notice

Apache Spark is an effort undergoing incubation at The Apache Software
Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of
all newly accepted projects until a further review indicates that the
infrastructure, communications, and decision making process have stabilized in
a manner consistent with other successful ASF projects. While incubation status
is not necessarily a reflection of the completeness or stability of the code,
it does indicate that the project has yet to be fully endorsed by the ASF.

## Contributing to Spark

Contributions via GitHub pull requests are gladly accepted from their original
author. Along with any pull requests, please state that the contribution is
your original work and that you license the work to the project under the
project's open source license. Whether or not you state this explicitly, by
submitting any copyrighted material via pull request, email, or other means
you agree to license the material under the project's open source license and
warrant that you have the legal authority to do so.