---
layout: global
title: Launching Spark on YARN
---

Spark allows you to launch jobs on an existing YARN cluster.

# Preparations

- In order to distribute Spark within the cluster, it must be packaged into a single JAR file. This can be done by running `sbt/sbt assembly`.
- Your application code must be packaged into a separate JAR file.

If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_2.9.1-0.6.0-SNAPSHOT.jar` file can be generated by running `sbt/sbt package`.

# Launching Spark on YARN

The command to launch the YARN Client is as follows:

    SPARK_JAR=<SPARK_JAR_FILE> ./run spark.deploy.yarn.Client \
      --jar <YOUR_APP_JAR_FILE> \
      --class <APP_MAIN_CLASS> \
      --args <APP_MAIN_ARGUMENTS> \
      --num-workers <NUMBER_OF_WORKER_MACHINES> \
      --worker-memory <MEMORY_PER_WORKER> \
      --worker-cores <CORES_PER_WORKER>

For example:

    SPARK_JAR=./core/target/spark-core-assembly-0.6.0-SNAPSHOT.jar ./run spark.deploy.yarn.Client \
      --jar examples/target/scala-2.9.1/spark-examples_2.9.1-0.6.0-SNAPSHOT.jar \
      --class spark.examples.SparkPi \
      --args standalone \
      --num-workers 3 \
      --worker-memory 2g \
      --worker-cores 2

The above starts a YARN client program which periodically polls the Application Master for status updates and displays them in the console. The client will exit once your application has finished running.

# Important Notes

- When your application instantiates a Spark context, it must use a special "standalone" master URL. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above and in the sketch after this list.
- YARN does not support requesting container resources based on the number of cores. Thus the number of cores given via command line arguments cannot be guaranteed.
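
For illustration, here is a minimal sketch of such an application. The object name `YarnApp` and its job are hypothetical, but the pattern of taking the master URL as the first program argument matches the bundled `spark.examples.SparkPi` example:

    package example

    import spark.SparkContext

    object YarnApp {
      def main(args: Array[String]) {
        // The first program argument is the master URL; under YARN the
        // client passes "standalone" here via --args, as shown above.
        val sc = new SparkContext(args(0), "YarnApp")

        // A trivial job, just to confirm the workers came up.
        val count = sc.parallelize(1 to 1000).count()
        println("Counted " + count + " elements")
      }
    }

Packaged into its own JAR (see Preparations above), such a class could then be launched with `--class example.YarnApp --args standalone`.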