---
layout: global
title: Launching Spark on YARN
---

Experimental support for running over a [YARN (Hadoop NextGen)](http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html) cluster was added to Spark in version 0.6.0 and was merged into master as part of the 0.7 effort.
To build Spark core with YARN support, use the hadoop2-yarn profile. For example:

    mvn -Phadoop2-yarn clean install

# Building the Spark core consolidated jar

We need a consolidated Spark core jar (which bundles all the required dependencies) to run Spark jobs on a YARN cluster.
This can be built either through sbt or through Maven.

- Building the Spark assembled jar via sbt.
This is a manual process of enabling it in project/SparkBuild.scala.
Comment out the HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN variable declarations that appear before the line 'For Hadoop 2 YARN support', then uncomment the subsequent three declarations of the same variables, which enable Hadoop YARN support.
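
After the edit, the relevant block of project/SparkBuild.scala would look roughly like the sketch below; the exact version strings are assumptions (the Hadoop 2 version follows the release linked above) and may differ in your checkout:

    // Sketch of project/SparkBuild.scala after the edit; the version
    // strings are illustrative assumptions, not prescribed values.
    //val HADOOP_VERSION = "1.0.4"
    //val HADOOP_MAJOR_VERSION = "1"
    //val HADOOP_YARN = false

    // For Hadoop 2 YARN support
    val HADOOP_VERSION = "2.0.2-alpha"
    val HADOOP_MAJOR_VERSION = "2"
    val HADOOP_YARN = true
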
To assemble the jar, run:

    ./sbt/sbt clean assembly

The assembled jar will typically be something like:
`./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar`

- Building the Spark assembled jar via Maven.
Use the hadoop2-yarn profile and execute the package target. For example:

    mvn -Phadoop2-yarn clean package -DskipTests=true

This will build the shaded (consolidated) jar, typically something like:
`./repl-bin/target/spark-repl-bin-<VERSION>-shaded-hadoop2-yarn.jar`

# Preparations

- Build the Spark core assembled jar (see above).
- Package your application code into a separate JAR file.

If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt package`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we assume here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will differ accordingly.

# Launching Spark on YARN

Ensure that HADOOP_CONF_DIR or YARN_CONF_DIR points to the directory containing the (client-side) configuration files for the Hadoop cluster.
These files are used to connect to the cluster, write to the DFS, and submit jobs to the ResourceManager.

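For example (the path below is an assumption; point it at wherever your cluster's client configuration actually lives):

    # Hypothetical location of the Hadoop client-side configuration files
    export HADOOP_CONF_DIR=/etc/hadoop/conf
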
The command to launch the YARN Client is as follows:

    SPARK_JAR=<SPARK_JAR_FILE> ./run spark.deploy.yarn.Client \
      --jar <YOUR_APP_JAR_FILE> \
      --class <APP_MAIN_CLASS> \
      --args <APP_MAIN_ARGUMENTS> \
      --num-workers <NUMBER_OF_WORKER_MACHINES> \
      --master-memory <MEMORY_FOR_MASTER> \
      --worker-memory <MEMORY_PER_WORKER> \
      --worker-cores <CORES_PER_WORKER> \
      --user <hadoop_user> \
      --queue <queue_name>

For example:

    SPARK_JAR=./core/target/spark-core-assembly-{{site.SPARK_VERSION}}.jar ./run spark.deploy.yarn.Client \
      --jar examples/target/scala-{{site.SCALA_VERSION}}/spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}.jar \
      --class spark.examples.SparkPi \
      --args yarn-standalone \
      --num-workers 3 \
      --master-memory 4g \
      --worker-memory 2g \
      --worker-cores 1

The above starts a YARN client program which periodically polls the Application Master for status updates and displays them in the console. The client will exit once your application has finished running.

# Important Notes

- When your application instantiates a Spark context it must use a special "yarn-standalone" master URL. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "yarn-standalone" as an argument to your program, as shown in the example above and sketched in the code after this list.
- We do not request container resources based on the number of cores, so the number of cores given via command-line arguments cannot be guaranteed.
- Currently, we have not yet integrated with Hadoop security. If --user is present, the specified hadoop_user will be used to run the tasks on the cluster; if unspecified, the current user will be used (which should be valid on the cluster).
Once Hadoop security support is added, and if the Hadoop cluster has security enabled, additional restrictions will apply via the delegation tokens passed.
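
As a minimal sketch of this pattern (the application name `MyYarnApp` and the trivial job are illustrative, not part of Spark):

    import spark.SparkContext

    object MyYarnApp {
      def main(args: Array[String]) {
        // args(0) is "yarn-standalone" when launched through
        // spark.deploy.yarn.Client with --args yarn-standalone;
        // locally you could pass e.g. "local[2]" instead.
        val sc = new SparkContext(args(0), "MyYarnApp")
        // Trivial job to verify the context is working on the cluster.
        println(sc.parallelize(1 to 1000).count())
        sc.stop()
      }
    }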