---
layout: global
title: Third-Party Hadoop Distributions
---
Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and
the Hortonworks Data Platform (HDP). There are a few things to keep in mind when using Spark
with these distributions:

# Compile-time Hadoop Version

When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
property. For certain versions, you will need to specify additional profiles. For more detail,
see the guide on [building with Maven](building-spark.html#specifying-the-hadoop-version):

    mvn -Dhadoop.version=1.0.4 -DskipTests clean package
    mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
distribution may "just work" without you needing to compile. That said, we recommend compiling with
the _exact_ Hadoop version you are running to avoid any compatibility errors.

<table>
  <tr valign="top">
    <td>
      <h3>CDH Releases</h3>
      <table class="table" style="width:350px; margin-right: 20px;">
        <tr><th>Release</th><th>Version code</th></tr>
        <tr><td>CDH 4.X.X (YARN mode)</td><td>2.0.0-cdh4.X.X</td></tr>
        <tr><td>CDH 4.X.X</td><td>2.0.0-mr1-cdh4.X.X</td></tr>
      </table>
    </td>
    <td>
      <h3>HDP Releases</h3>
      <table class="table" style="width:350px;">
        <tr><th>Release</th><th>Version code</th></tr>
        <tr><td>HDP 1.3</td><td>1.2.0</td></tr>
        <tr><td>HDP 1.2</td><td>1.1.2</td></tr>
        <tr><td>HDP 1.1</td><td>1.0.3</td></tr>
        <tr><td>HDP 1.0</td><td>1.0.3</td></tr>
        <tr><td>HDP 2.0</td><td>2.2.0</td></tr>
      </table>
    </td>
  </tr>
</table>
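
If you build for several clusters, it can help to script the lookup from the tables above. The following is a minimal sketch, not part of Spark: the function name `hadoop_version_for` and its argument convention (distribution, release, optional `yarn` mode) are assumptions made for illustration, and only the version codes come from the tables. Depending on the Hadoop version, the resulting Maven command may still need an additional profile, as noted earlier.

```shell
# Hypothetical helper (not shipped with Spark): print the hadoop.version code
# for a given distribution and release, following the tables above.
# Usage: hadoop_version_for <cdh|hdp> <release> [yarn]
hadoop_version_for() {
  local distro="$1" release="$2" mode="${3:-}"
  case "$distro:$release" in
    hdp:1.0|hdp:1.1) echo "1.0.3" ;;
    hdp:1.2)         echo "1.1.2" ;;
    hdp:1.3)         echo "1.2.0" ;;
    hdp:2.0)         echo "2.2.0" ;;
    cdh:4.*)
      # CDH 4 uses a different code in YARN mode than in MR1 mode.
      if [ "$mode" = "yarn" ]; then
        echo "2.0.0-cdh${release}"
      else
        echo "2.0.0-mr1-cdh${release}"
      fi
      ;;
    *)
      echo "unknown release: $distro $release" >&2
      return 1
      ;;
  esac
}

# Print (rather than run) the Maven command for an HDP 1.3 cluster.
echo "mvn -Dhadoop.version=$(hadoop_version_for hdp 1.3) -DskipTests clean package"
```

Printing the command instead of running it keeps the sketch safe to try outside a Spark checkout.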

In SBT, the equivalent can be achieved by setting the `hadoop.version` property:

    build/sbt -Dhadoop.version=1.0.4 assembly

# Linking Applications to the Hadoop Version

In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
version of `hadoop-client` to any Spark applications you run, so they can also talk to the HDFS version
on the cluster. If you are using CDH, you also need to add the Cloudera Maven repository.
This looks as follows in SBT:

{% highlight scala %}
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<version>"

// If using CDH, also add Cloudera repo
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
{% endhighlight %}

Or in Maven:

{% highlight xml %}
<project>
  <dependencies>
    ...
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>[version]</version>
    </dependency>
  </dependencies>

  <!-- If using CDH, also add Cloudera repo -->
  <repositories>
    ...
    <repository>
      <id>Cloudera repository</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
</project>
{% endhighlight %}

# Where to Run Spark

As described in the [Hardware Provisioning](hardware-provisioning.html#storage-systems) guide,
Spark can run in a variety of deployment modes:

* Using a dedicated set of Spark nodes in your cluster. These nodes should be co-located with your
  Hadoop installation.
* Running on the same nodes as an existing Hadoop installation, with a fixed amount of memory and
  cores dedicated to Spark on each node.
* Running Spark alongside Hadoop using a cluster resource manager, such as YARN or Mesos.

These options are identical for those using CDH and HDP.

# Inheriting Cluster Configuration

If you plan to read from and write to HDFS using Spark, there are two Hadoop configuration files that
should be included on Spark's classpath:

* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
* `core-site.xml`, which sets the default filesystem name.

The location of these configuration files varies across CDH and HDP versions, but
a common location is `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
configurations on-the-fly, but offer a mechanism to download copies of them.

To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/conf/spark-env.sh`
to a location containing the configuration files.
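
As a rough illustration, the lines below show what such an entry in `spark-env.sh` might look like. This is a sketch under assumptions: the helper name `has_hadoop_conf` is hypothetical, and `/etc/hadoop/conf` is used as a fallback only because it is the common location mentioned above; substitute the directory your distribution actually uses.

```shell
# Hypothetical helper: succeed only if the directory contains both
# Hadoop configuration files that Spark needs on its classpath.
has_hadoop_conf() {
  [ -f "$1/hdfs-site.xml" ] && [ -f "$1/core-site.xml" ]
}

# Assumed fallback: the common location named above. Adjust for your cluster.
conf_dir="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"

if has_hadoop_conf "$conf_dir"; then
  export HADOOP_CONF_DIR="$conf_dir"
else
  echo "warning: hdfs-site.xml or core-site.xml not found in $conf_dir" >&2
fi
```

Checking for the files before exporting the variable is a design choice for this sketch: it surfaces a wrong path when the environment is loaded instead of later, when a job first touches HDFS.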