# Apache Spark

Lightning-Fast Cluster Computing - <http://spark.apache.org/>

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the project webpage at <http://spark.apache.org/documentation.html>.
This README file only contains basic setup instructions.

## Building Spark

Spark is built on Scala 2.10. To build Spark and its example programs, run:

    ./sbt/sbt assembly

(You do not need to do this if you downloaded a pre-built package.)

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1000:

    scala> sc.parallelize(1 to 1000).count()

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1000:

    >>> sc.parallelize(range(1000)).count()

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn-cluster" or "yarn-client" to run on YARN, "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi

Many of the example programs print usage help if no params are given.
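As a hedged illustration of the other master URL forms described above (the thread count and YARN mode shown here are just examples; both commands assume a built Spark checkout):

```shell
# Run the Pi example locally with 4 threads
MASTER=local[4] ./bin/run-example SparkPi

# Run the Pi example on YARN in client mode (requires a YARN-enabled build)
MASTER=yarn-client ./bin/run-example SparkPi
```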

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./sbt/sbt test
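To iterate on a single suite rather than running everything, sbt's `test-only` task can be used — a sketch, where the suite name below is only an illustrative example:

```shell
# Run just one test suite; the class name here is illustrative
./sbt/sbt "test-only org.apache.spark.rdd.RDDSuite"
```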

## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
You can change the version by setting the `SPARK_HADOOP_VERSION` environment
variable when building Spark.

For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:

    # Apache Hadoop 1.2.1
    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v1
    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly

For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
with YARN, also set `SPARK_YARN=true`:

    # Apache Hadoop 2.0.5-alpha
    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly

    # Cloudera CDH 4.2.0 with MapReduce v2
    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

    # Apache Hadoop 2.2.X and newer
    $ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
using Hadoop 1.2.1 and build your application using SBT, add this entry to
`libraryDependencies`:

    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
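In context, a minimal `build.sbt` might look like the following sketch — the project name, Scala version, and Spark artifact coordinates shown are illustrative assumptions, not prescriptions:

```scala
// Illustrative minimal build.sbt; names and versions are examples only
name := "my-spark-app"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.0.0",
  "org.apache.hadoop" % "hadoop-client" % "1.2.1"
)
```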

If your project is built with Maven, add this to your POM file's `<dependencies>` section:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>1.2.1</version>
    </dependency>

## Configuration

Please refer to the [Configuration guide](http://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview of how to configure Spark.

## Contributing to Spark

Contributions via GitHub pull requests are gladly accepted from their original
author. Along with any pull requests, please state that the contribution is
your original work and that you license the work to the project under the
project's open source license. Whether or not you state this explicitly, by
submitting any copyrighted material via pull request, email, or other means
you agree to license the material under the project's open source license and
warrant that you have the legal authority to do so.