---
layout: global
title: Third-Party Hadoop Distributions
---
Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and
the Hortonworks Data Platform (HDP). There are a few things to keep in mind when using Spark
with these distributions:

# Compile-time Hadoop Version

When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
property. For certain versions, you will need to specify additional profiles. For more detail,
see the guide on [building with maven](building-spark.html#specifying-the-hadoop-version):

    mvn -Dhadoop.version=1.0.4 -DskipTests clean package
    mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
distribution may "just work" without you needing to compile. That said, we recommend compiling with
the _exact_ Hadoop version you are running to avoid any compatibility errors.

<table>
  <tr valign="top">
    <td>
      <h3>CDH Releases</h3>
      <table class="table" style="width:350px; margin-right: 20px;">
        <tr><th>Release</th><th>Version code</th></tr>
        <tr><td>CDH 4.X.X (YARN mode)</td><td>2.0.0-cdh4.X.X</td></tr>
        <tr><td>CDH 4.X.X</td><td>2.0.0-mr1-cdh4.X.X</td></tr>
        <tr><td>CDH 3u6</td><td>0.20.2-cdh3u6</td></tr>
        <tr><td>CDH 3u5</td><td>0.20.2-cdh3u5</td></tr>
        <tr><td>CDH 3u4</td><td>0.20.2-cdh3u4</td></tr>
      </table>
    </td>
    <td>
      <h3>HDP Releases</h3>
      <table class="table" style="width:350px;">
        <tr><th>Release</th><th>Version code</th></tr>
        <tr><td>HDP 1.3</td><td>1.2.0</td></tr>
        <tr><td>HDP 1.2</td><td>1.1.2</td></tr>
        <tr><td>HDP 1.1</td><td>1.0.3</td></tr>
        <tr><td>HDP 1.0</td><td>1.0.3</td></tr>
        <tr><td>HDP 2.0</td><td>2.2.0</td></tr>
      </table>
    </td>
  </tr>
</table>
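
For example, to build against a CDH 4 release, substitute its version code from the table above into the Maven command shown earlier. The specific release digits below (`4.2.0`) are illustrative; use the digits of the CDH release actually deployed on your cluster:

```shell
# Hypothetical CDH 4 build; 4.2.0 is a placeholder for your actual CDH release.
# The version code follows the 2.0.0-mr1-cdh4.X.X pattern from the table above.
mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 -DskipTests clean package
```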

In SBT, the equivalent can be achieved by setting the `hadoop.version` property:

    sbt/sbt -Dhadoop.version=1.0.4 assembly
# Linking Applications to the Hadoop Version

In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
version of `hadoop-client` to any Spark applications you run, so they can also talk to the HDFS version
on the cluster. If you are using CDH, you also need to add the Cloudera Maven repository.
This looks as follows in SBT:

{% highlight scala %}
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<version>"

// If using CDH, also add Cloudera repo
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
{% endhighlight %}

Or in Maven:

{% highlight xml %}
<project>
  <dependencies>
    ...
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>[version]</version>
    </dependency>
  </dependencies>

  <!-- If using CDH, also add Cloudera repo -->
  <repositories>
    ...
    <repository>
      <id>Cloudera repository</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
</project>
{% endhighlight %}
# Where to Run Spark

As described in the [Hardware Provisioning](hardware-provisioning.html#storage-systems) guide,
Spark can run in a variety of deployment modes:

* Using a dedicated set of Spark nodes in your cluster. These nodes should be co-located with your
Hadoop installation.
* Running on the same nodes as an existing Hadoop installation, with a fixed amount of memory and
cores dedicated to Spark on each node.
* Running Spark alongside Hadoop using a cluster resource manager, such as YARN or Mesos.

These options are identical for those using CDH and HDP.

# Inheriting Cluster Configuration
If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that
should be included on Spark's classpath:

* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
* `core-site.xml`, which sets the default filesystem name.
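
As a sketch of what the second file contains, a minimal `core-site.xml` might set only the default filesystem. The namenode host and port below are placeholders for your cluster, and on Hadoop 2 releases the property is named `fs.defaultFS` rather than `fs.default.name`:

```xml
<!-- Minimal core-site.xml sketch; host and port are placeholders -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
```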
The location of these configuration files varies across CDH and HDP versions, but
a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
configurations on-the-fly, but offer a mechanism to download copies of them.

To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/spark-env.sh`
to a location containing the configuration files.
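
For example, assuming the common configuration location mentioned above, the relevant line in `spark-env.sh` might look like:

```shell
# Point Spark at the directory holding hdfs-site.xml and core-site.xml;
# /etc/hadoop/conf is a common default but the path varies by distribution.
export HADOOP_CONF_DIR=/etc/hadoop/conf
```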