# Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Structured Streaming for stream processing.

<http://spark.apache.org/>

[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7)
[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.

## Building Spark

Spark is built using [Apache Maven](http://maven.apache.org/).
To build Spark and its example programs, run:

    build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).
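As a sketch, a parallel build invocation might look like the following; the thread count of 4 is illustrative, so pick a value suited to your machine (Maven also accepts per-core counts such as `-T 1C`):

```shell
# Build Spark with 4 Maven threads, skipping tests (thread count is illustrative)
./build/mvn -T 4 -DskipTests clean package
```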

More detailed documentation is available from the project site, at
["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).

For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](http://spark.apache.org/developer-tools.html).

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1,000,000,000:

    scala> spark.range(1000 * 1000 * 1000).count()

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1,000,000,000:

    >>> spark.range(1000 * 1000 * 1000).count()

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.
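As a sketch of the other master URL forms described above (the thread count here is illustrative, and the YARN variant assumes a configured Hadoop environment):

```shell
# Run the Pi example locally with 4 threads
MASTER=local[4] ./bin/run-example SparkPi

# Submit the same example to a YARN cluster
MASTER=yarn ./bin/run-example SparkPi
```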

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](http://spark.apache.org/developer-tools.html#individual-tests).

There is also a Kubernetes integration test; see
`resource-managers/kubernetes/integration-tests/README.md`.

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version and Enabling YARN"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
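As a sketch, a build against a specific Hadoop version with YARN enabled might look like the following; the version number is illustrative, so consult the "Building Spark" page for the profiles and versions your release actually supports:

```shell
# Build with YARN support against a specific Hadoop version (version is illustrative)
./build/mvn -Pyarn -Dhadoop.version=2.7.7 -DskipTests clean package
```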

## Configuration

Please refer to the [Configuration Guide](http://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.

## Contributing

Please review the [Contribution to Spark guide](http://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.