---
layout: global
title: Third-Party Hadoop Distributions
---

Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and
the Hortonworks Data Platform (HDP). There are a few things to keep in mind when using Spark
with these distributions:

# Compile-time Hadoop Version

When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
property. For certain versions, you will need to specify additional profiles. For more detail,
see the guide on [building with Maven](building-spark.html#specifying-the-hadoop-version):

    mvn -Dhadoop.version=1.0.4 -DskipTests clean package
    mvn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package

The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
distribution may "just work" without you needing to compile. That said, we recommend compiling with
the _exact_ Hadoop version you are running to avoid any compatibility errors.

<table>
  <tr valign="top">
    <td>
      <h3>CDH Releases</h3>
      <table class="table" style="width:350px; margin-right: 20px;">
        <tr><th>Release</th><th>Version code</th></tr>
        <tr><td>CDH 4.X.X (YARN mode)</td><td>2.0.0-cdh4.X.X</td></tr>
        <tr><td>CDH 4.X.X</td><td>2.0.0-mr1-cdh4.X.X</td></tr>
        <tr><td>CDH 3u6</td><td>0.20.2-cdh3u6</td></tr>
        <tr><td>CDH 3u5</td><td>0.20.2-cdh3u5</td></tr>
        <tr><td>CDH 3u4</td><td>0.20.2-cdh3u4</td></tr>
      </table>
    </td>
    <td>
      <h3>HDP Releases</h3>
      <table class="table" style="width:350px;">
        <tr><th>Release</th><th>Version code</th></tr>
        <tr><td>HDP 1.3</td><td>1.2.0</td></tr>
        <tr><td>HDP 1.2</td><td>1.1.2</td></tr>
        <tr><td>HDP 1.1</td><td>1.0.3</td></tr>
        <tr><td>HDP 1.0</td><td>1.0.3</td></tr>
        <tr><td>HDP 2.0</td><td>2.2.0</td></tr>
      </table>
    </td>
  </tr>
</table>

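For scripted builds, the table's mapping can be captured in a small shell helper; this is only a sketch (the `version_code` function and the release strings it knows about are illustrative, not part of Spark):

```shell
#!/bin/sh
# Map a distribution release name to its hadoop.version code from the table above.
version_code() {
  case "$1" in
    "CDH 3u4") echo "0.20.2-cdh3u4" ;;
    "CDH 3u5") echo "0.20.2-cdh3u5" ;;
    "CDH 3u6") echo "0.20.2-cdh3u6" ;;
    "HDP 1.3") echo "1.2.0" ;;
    "HDP 2.0") echo "2.2.0" ;;
    *)         echo "unknown" ;;
  esac
}

# Example: feed the code into the Maven build (shown as a comment for illustration):
# mvn -Phadoop-2.2 -Dhadoop.version="$(version_code 'HDP 2.0')" -DskipTests clean package
```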
In SBT, the equivalent can be achieved by setting the `hadoop.version` property:

    build/sbt -Dhadoop.version=1.0.4 assembly

# Linking Applications to the Hadoop Version

In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
version of `hadoop-client` to any Spark applications you run, so they can also talk to the HDFS version
on the cluster. If you are using CDH, you also need to add the Cloudera Maven repository.
This looks as follows in SBT:

{% highlight scala %}
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<version>"

// If using CDH, also add Cloudera repo
resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
{% endhighlight %}

Or in Maven:

{% highlight xml %}
<project>
  <dependencies>
    ...
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>[version]</version>
    </dependency>
  </dependencies>

  <!-- If using CDH, also add Cloudera repo -->
  <repositories>
    ...
    <repository>
      <id>Cloudera repository</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>
</project>
{% endhighlight %}

# Where to Run Spark

As described in the [Hardware Provisioning](hardware-provisioning.html#storage-systems) guide,
Spark can run in a variety of deployment modes:

* Using a dedicated set of Spark nodes in your cluster. These nodes should be co-located with your
  Hadoop installation.
* Running on the same nodes as an existing Hadoop installation, with a fixed amount of memory and
  cores dedicated to Spark on each node.
* Running Spark alongside Hadoop using a cluster resource manager, such as YARN or Mesos.

These options are identical for those using CDH and HDP.

# Inheriting Cluster Configuration

If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that
should be included on Spark's classpath:

* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
* `core-site.xml`, which sets the default filesystem name.

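For reference, the default filesystem entry in `core-site.xml` typically looks like the following sketch (the property name shown, `fs.default.name`, is the Hadoop 1.x form; Hadoop 2.x uses `fs.defaultFS`, and the host and port are placeholders for your NameNode):

```xml
<configuration>
  <!-- Default filesystem URI; replace namenode-host:8020 with your NameNode address -->
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode-host:8020</value>
  </property>
</configuration>
```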
The location of these configuration files varies across CDH and HDP versions, but
a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
configurations on-the-fly but offer a mechanism to download copies of them.

To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/spark-env.sh`
to a location containing the configuration files.
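Concretely, the setting can be sketched like this in `spark-env.sh` (the `/etc/hadoop/conf` path is the common default mentioned above; adjust it for your distribution):

```shell
# Added to $SPARK_HOME/conf/spark-env.sh so the driver and executors can find
# hdfs-site.xml and core-site.xml:
export HADOOP_CONF_DIR=/etc/hadoop/conf
```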