67ed0aa0fd
Updated Hadoop dependencies to resolve version inconsistencies. The global properties are now the ones used by the hadoop-2.2 profile, and the profile itself was emptied but kept for backwards-compatibility reasons.
These changes were proposed by vanzin in the previous pull request (https://github.com/apache/spark/pull/5783), which did not fix the problem correctly.
Please let me know if this is the correct way of doing this; vanzin's comments are in the pull request mentioned above.
Author: FavioVazquez <favio.vazquezp@gmail.com>
Closes #5786 from FavioVazquez/update-hadoop-dependencies and squashes the following commits:
11670e5 [FavioVazquez] - Added missing instance of -Phadoop-2.2 in create-release.sh
379f50d [FavioVazquez] - Added instances of -Phadoop-2.2 in create-release.sh, run-tests, scalastyle and building-spark.md - Reconstructed docs to not ask users to rely on default behavior
3f9249d [FavioVazquez] Merge branch 'master' of https://github.com/apache/spark into update-hadoop-dependencies
31bdafa [FavioVazquez] - Added missing instances of -Phadoop-1 in create-release.sh, run-tests and in the building-spark documentation
cbb93e8 [FavioVazquez] - Added comment related to SPARK-3710 about hadoop-yarn-server-tests in Hadoop 2.2 that fails to pull some needed dependencies
83dc332 [FavioVazquez] - Cleaned up the main POM concerning the yarn profile - Erased hadoop-2.2 profile from yarn/pom.xml and its content was integrated into the yarn profile in the root pom.xml
93f7624 [FavioVazquez] - Deleted unnecessary comments and <activation> tag on the YARN profile in the main POM
668d126 [FavioVazquez] - Moved <dependencies> <activation> and <properties> sections of the hadoop-2.2 profile in the YARN POM to the YARN profile in the root POM - Erased unnecessary hadoop-2.2 profile from the YARN POM
fda6a51 [FavioVazquez] - Updated hadoop1 releases in create-release.sh due to changes in the default hadoop version set - Erased unnecessary instance of -Dyarn.version=2.2.0 in create-release.sh - Prettify comment in yarn/pom.xml
0470587 [FavioVazquez] - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in create-release.sh - Updated how the releases are made in create-release.sh now that the default hadoop version is 2.2.0 - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in scalastyle - Erased unnecessary instance of -Phadoop-2.2 -Dhadoop.version=2.2.0 in run-tests - Better example given in hadoop-third-party-distributions.md now that the default hadoop version is 2.2.0
a650779 [FavioVazquez] - Default value of avro.mapred.classifier has been set to hadoop2 in pom.xml - Cleaned up hadoop-2.3 and 2.4 profiles due to change in the default set in avro.mapred.classifier in pom.xml
199f40b [FavioVazquez] - Erased unnecessary CDH5-specific note in docs/building-spark.md - Removed example of instance -Phadoop-2.2 -Dhadoop.version=2.2.0 in docs/building-spark.md - Enabled hadoop-2.2 profile when the Hadoop version is 2.2.0, which is now the default - Added comment in yarn/pom.xml to specify that.
88a8b88 [FavioVazquez] - Simplified Hadoop profiles due to new setting of global properties in the pom.xml file - Added comment to specify that the hadoop-2.2 profile is now the default hadoop profile in the pom.xml file - Erased hadoop-2.2 from related hadoop profiles now that it is a no-op in the make-distribution.sh file
70b8344 [FavioVazquez] - Fixed typo in the make-distribution.sh file and added hadoop-1 in the Related profiles
287fa2f [FavioVazquez] - Updated documentation about specifying the hadoop version in building-spark. Now it is clear that Spark will build against Hadoop 2.2.0 by default. - Added Cloudera CDH 5.3.3 without MapReduce example in the building-spark doc.
1354292 [FavioVazquez] - Fixed hadoop-1 version to match jenkins build profile in hadoop1.0 tests and documentation
6b4bfaf [FavioVazquez] - Cleanup in hadoop-2.x profiles since they contained mostly redundant stuff.
7e9955d [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
660decc [FavioVazquez] - Updated Hadoop dependencies due to inconsistency in the versions. Now the global properties are the ones used by the hadoop-2.2 profile, and the profile was set to empty but kept for backwards compatibility reasons
ec91ce3 [FavioVazquez] - Updated protobuf-java version of the com.google.protobuf dependency to fix blocking error when connecting to HDFS via the Hadoop Cloudera HDFS CDH5 (fix for 2.5.0-cdh5.3.3 version)
(cherry picked from commit 7fb715de6d)
Signed-off-by: Sean Owen <sowen@cloudera.com>
---
layout: global
title: Third-Party Hadoop Distributions
---

Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and
the Hortonworks Data Platform (HDP). There are a few things to keep in mind when using Spark
with these distributions:

# Compile-time Hadoop Version

When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
property. For certain versions, you will need to specify additional profiles. For more detail,
see the guide on [building with maven](building-spark.html#specifying-the-hadoop-version):

    mvn -Dhadoop.version=1.0.4 -DskipTests clean package
    mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package

The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
distribution may "just work" without you needing to compile. That said, we recommend compiling with
the _exact_ Hadoop version you are running to avoid any compatibility errors.

<table>
  <tr valign="top">
    <td>
      <h3>CDH Releases</h3>
      <table class="table" style="width:350px; margin-right: 20px;">
        <tr><th>Release</th><th>Version code</th></tr>
        <tr><td>CDH 4.X.X (YARN mode)</td><td>2.0.0-cdh4.X.X</td></tr>
        <tr><td>CDH 4.X.X</td><td>2.0.0-mr1-cdh4.X.X</td></tr>
      </table>
    </td>
    <td>
      <h3>HDP Releases</h3>
      <table class="table" style="width:350px;">
        <tr><th>Release</th><th>Version code</th></tr>
        <tr><td>HDP 1.3</td><td>1.2.0</td></tr>
        <tr><td>HDP 1.2</td><td>1.1.2</td></tr>
        <tr><td>HDP 1.1</td><td>1.0.3</td></tr>
        <tr><td>HDP 1.0</td><td>1.0.3</td></tr>
        <tr><td>HDP 2.0</td><td>2.2.0</td></tr>
      </table>
    </td>
  </tr>
</table>
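
For example, to compile against a CDH 4 cluster running in YARN mode, you would take the version code from the table above (`2.0.0-cdh4.X.X`) and fill in the minor version. The sketch below assumes a hypothetical CDH 4.2.0 release; substitute the code that matches your own:

    mvn -Dhadoop.version=2.0.0-cdh4.2.0 -DskipTests clean package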

In SBT, the equivalent can be achieved by setting the `hadoop.version` property:

    build/sbt -Dhadoop.version=1.0.4 assembly
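
If your version also requires a build profile, it can be passed to SBT the same way as to Maven. The sketch below mirrors the second Maven example above and assumes the `hadoop-2.3` profile applies to your release:

    build/sbt -Phadoop-2.3 -Dhadoop.version=2.3.0 assembly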

# Linking Applications to the Hadoop Version

In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
version of `hadoop-client` to any Spark applications you run, so they can also talk to the HDFS version
on the cluster. If you are using CDH, you also need to add the Cloudera Maven repository.
This looks as follows in SBT:

{% highlight scala %}
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<version>"

// If using CDH, also add Cloudera repo
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
{% endhighlight %}

Or in Maven:

{% highlight xml %}
<project>
  <dependencies>
    ...
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>[version]</version>
    </dependency>
  </dependencies>

  <!-- If using CDH, also add Cloudera repo -->
  <repositories>
    ...
    <repository>
      <id>Cloudera repository</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
</project>
{% endhighlight %}

# Where to Run Spark

As described in the [Hardware Provisioning](hardware-provisioning.html#storage-systems) guide,
Spark can run in a variety of deployment modes:

* Using a dedicated set of Spark nodes in your cluster. These nodes should be co-located with your
  Hadoop installation.
* Running on the same nodes as an existing Hadoop installation, with a fixed amount of memory and
  cores dedicated to Spark on each node.
* Running Spark alongside Hadoop using a cluster resource manager, such as YARN or Mesos.

These options are identical for those using CDH and HDP.

# Inheriting Cluster Configuration

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that
should be included on Spark's classpath:

* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
* `core-site.xml`, which sets the default filesystem name (see the sketch after this list).
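
As a concrete illustration of the second file, a minimal `core-site.xml` might look like the following sketch. The namenode host and port are placeholders, not values from this guide, and on Hadoop 1.x releases the property is named `fs.default.name` instead of `fs.defaultFS`:

{% highlight xml %}
<configuration>
  <!-- Default filesystem; replace with your cluster's namenode address -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
{% endhighlight %}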

The location of these configuration files varies across CDH and HDP versions, but
a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
configurations on-the-fly, but offer a mechanism to download copies of them.

To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/spark-env.sh`
to a location containing the configuration files.
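
For example, using the common location mentioned above (adjust the path to wherever your distribution keeps its client configuration):

    # In spark-env.sh: point Spark at the directory containing hdfs-site.xml and core-site.xml
    export HADOOP_CONF_DIR=/etc/hadoop/conf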