# Apache Spark

Spark is a unified analytics engine for large-scale data processing. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Structured Streaming for stream processing.

<https://spark.apache.org/>

[![Jenkins Build](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3/badge/icon)](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7-hive-2.3)
[![AppVeyor Build](https://img.shields.io/appveyor/ci/ApacheSoftwareFoundation/spark/master.svg?style=plastic&logo=appveyor)](https://ci.appveyor.com/project/ApacheSoftwareFoundation/spark)
[![PySpark Coverage](https://img.shields.io/badge/dynamic/xml.svg?label=pyspark%20coverage&url=https%3A%2F%2Fspark-test.github.io%2Fpyspark-coverage-site&query=%2Fhtml%2Fbody%2Fdiv%5B1%5D%2Fdiv%2Fh1%2Fspan&colorB=brightgreen&style=plastic)](https://spark-test.github.io/pyspark-coverage-site)
## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](https://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.
## Building Spark

Spark is built using [Apache Maven](https://maven.apache.org/).
To build Spark and its example programs, run:

    ./build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

More detailed documentation is available from the project site, at
["Building Spark"](https://spark.apache.org/docs/latest/building-spark.html).

For general development tips, including info on developing Spark using an IDE, see ["Useful Developer Tools"](https://spark.apache.org/developer-tools.html).
## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell

Try the following command, which should return 1,000,000,000:

    scala> spark.range(1000 * 1000 * 1000).count()
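The same `spark` session object also exposes the DataFrame API from the shell. As a purely illustrative follow-up (this expression is our own example, not part of the standard walkthrough), the following counts only the even values and should return 500,000,000:

    scala> spark.range(1000 * 1000 * 1000).filter("id % 2 == 0").count()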
## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark

And run the following command, which should also return 1,000,000,000:

    >>> spark.range(1000 * 1000 * 1000).count()
## Example Programs

Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:

    ./bin/run-example SparkPi

will run the Pi example locally.

You can set the MASTER environment variable when running examples to submit
them to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, "local" to run locally with one thread, or
"local[N]" to run locally with N threads. You can also use an abbreviated
class name if the class is in the `examples` package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi
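In the same way, a hedged illustration of running the example locally with four threads (the thread count is arbitrary):

    MASTER="local[4]" ./bin/run-example SparkPi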
Many of the example programs print usage help if no params are given.
## Running Tests

Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests

Please see the guidance on how to
[run tests for a module, or individual tests](https://spark.apache.org/developer-tools.html#individual-tests).
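For instance, a minimal sketch of running a single Scala suite through the Maven build (the suite name and flags here are illustrative; consult the linked guide for the authoritative commands):

    ./build/mvn test -Dtest=none -DwildcardSuites=org.apache.spark.repl.ReplSuite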
There is also a Kubernetes integration test; see `resource-managers/kubernetes/integration-tests/README.md`.
## A Note About Hadoop Versions

Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version and Enabling YARN"](https://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version-and-enabling-yarn)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
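As a minimal sketch of what such a build looks like (the Hadoop version shown is illustrative; pick the one your cluster runs, per the linked documentation):

    ./build/mvn -Pyarn -Dhadoop.version=2.7.7 -DskipTests clean package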
## Configuration

Please refer to the [Configuration Guide](https://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.
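Many properties can also be set programmatically at runtime. A minimal illustrative example from the Scala shell (the property and value here are arbitrary):

    scala> spark.conf.set("spark.sql.shuffle.partitions", "64")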
## Contributing

Please review the [Contribution to Spark guide](https://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.