# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.

<http://spark.apache.org/>

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.

## Building Spark

Spark is built using [Apache Maven](http://maven.apache.org/).
To build Spark and its example programs, run:

    build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

You can build Spark using more than one thread by using the -T option with Maven, see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).
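
As a concrete sketch, a four-thread parallel build (the thread count here is an arbitrary illustration, not a project recommendation) would look like:

    # Build with 4 Maven threads; adjust to your machine
    build/mvn -T 4 -DskipTests clean package
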
More detailed documentation is available from the project site, at
["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).

For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](http://spark.apache.org/developer-tools.html).

## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1000:

    scala> sc.parallelize(1 to 1000).count()

## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1000:

    >>> sc.parallelize(range(1000)).count()

## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, "local" to run locally with one thread,
or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi
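
Similarly, to run the same example locally with multiple threads (the thread count of 4 is an arbitrary choice):

    # "local[4]" runs Spark locally with 4 worker threads
    MASTER="local[4]" ./bin/run-example SparkPi
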

Many of the example programs print usage help if no params are given.

## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](http://spark.apache.org/developer-tools.html#individual-tests).
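
As an illustrative sketch only (the module name is an example, and whether `-pl` alone suffices depends on which modules are already installed locally — see the linked page for the supported workflows), a single module's tests can in principle be run with Maven's project-list flag:

    # Run only the tests of the "core" module directory
    build/mvn -pl core test
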
## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
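
For example, a build targeting a specific Hadoop version might look like the following sketch (the profile and version number shown are illustrative assumptions — check the linked build documentation for the values that match your cluster):

    # Example: build against Hadoop 2.7.x with YARN support
    build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.3 -DskipTests clean package
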
## Configuration

Please refer to the [Configuration Guide](http://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.

## Contributing

Please review the [Contribution to Spark guide](http://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.