Add some basic documentation
parent
5ee2f5c483
commit
ac2e8e8720
|
@ -94,9 +94,11 @@ class ClientArguments(val args: Array[String]) {
|
|||
" Mutliple invocations are possible, each will be passed in order.\n" +
|
||||
" Note that first argument will ALWAYS be yarn-standalone : will be added if missing.\n" +
|
||||
" --num-workers NUM Number of workers to start (Default: 2)\n" +
|
||||
" --worker-cores NUM Number of cores for the workers (Default: 1)\n" +
|
||||
" --worker-cores NUM Number of cores for the workers (Default: 1). This is unsused right now.\n" +
|
||||
" --master-memory MEM Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)\n" +
|
||||
" --worker-memory MEM Memory per Worker (e.g. 1000M, 2G) (Default: 1G)\n" +
|
||||
" --user USERNAME Run the ApplicationMaster as a different user\n"
|
||||
" --queue QUEUE The hadoop queue to use for allocation requests (Default: 'default')\n" +
|
||||
" --user USERNAME Run the ApplicationMaster (and slaves) as a different user\n"
|
||||
)
|
||||
System.exit(exitCode)
|
||||
}
|
||||
|
|
|
@ -5,18 +5,25 @@ title: Launching Spark on YARN
|
|||
|
||||
Experimental support for running over a [YARN (Hadoop
|
||||
NextGen)](http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html)
|
||||
cluster was added to Spark in version 0.6.0. Because YARN depends on version
|
||||
2.0 of the Hadoop libraries, this currently requires checking out a separate
|
||||
branch of Spark, called `yarn`, which you can do as follows:
|
||||
cluster was added to Spark in version 0.6.0. This was merged into master as part of the 0.7 effort.
|
||||
To build Spark core with YARN support, use the hadoop2-yarn profile.
|
||||
Ex: mvn -Phadoop2-yarn clean install
|
||||
|
||||
git clone git://github.com/mesos/spark
|
||||
cd spark
|
||||
git checkout -b yarn --track origin/yarn
|
||||
# Building the Spark core consolidated jar
|
||||
|
||||
Currently, only sbt can build a consolidated jar containing the entire Spark code, which is required for launching jars on YARN.
|
||||
With sbt this is, for now, a manual step: YARN support has to be enabled by editing project/SparkBuild.scala.
|
||||
Please comment out the
|
||||
HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN
|
||||
variables before the line 'For Hadoop 2 YARN support'
|
||||
Next, uncomment the three declarations of the same variables that follow that line; these enable Hadoop YARN support.
|
||||
|
||||
Currently, support for building the assembly with Maven is still a TODO.
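
As a rough illustration of the manual sbt step described above, the edited part of project/SparkBuild.scala might end up looking like the fragment below. The three variable names and the 'For Hadoop 2 YARN support' marker come from the text; the concrete version strings and the boolean values are assumptions and may not match your checkout.

```scala
// project/SparkBuild.scala (fragment only; all values below are assumed, not copied from the repository)

// Default declarations: comment these three out.
//val HADOOP_VERSION = "1.0.4"
//val HADOOP_MAJOR_VERSION = "1"
//val HADOOP_YARN = false

// For Hadoop 2 YARN support
// Uncomment the three declarations that follow this marker.
val HADOOP_VERSION = "2.0.2-alpha"
val HADOOP_MAJOR_VERSION = "2"
val HADOOP_YARN = true
```

Only one set of the three declarations should be active at a time, since they define the same names.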
|
||||
|
||||
|
||||
# Preparations
|
||||
|
||||
- In order to distribute Spark within the cluster, it must be packaged into a single JAR file. This can be done by running `sbt/sbt assembly`
|
||||
- Build the Spark core assembled jar (see above).
|
||||
- Your application code must be packaged into a separate JAR file.
|
||||
|
||||
If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt package`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.
|
||||
|
@ -30,8 +37,11 @@ The command to launch the YARN Client is as follows:
|
|||
--class <APP_MAIN_CLASS> \
|
||||
--args <APP_MAIN_ARGUMENTS> \
|
||||
--num-workers <NUMBER_OF_WORKER_MACHINES> \
|
||||
--master-memory <MEMORY_FOR_MASTER> \
|
||||
--worker-memory <MEMORY_PER_WORKER> \
|
||||
--worker-cores <CORES_PER_WORKER>
|
||||
--worker-cores <CORES_PER_WORKER> \
|
||||
--user <hadoop_user> \
|
||||
--queue <queue_name>
|
||||
|
||||
For example:
|
||||
|
||||
|
@ -40,8 +50,9 @@ For example:
|
|||
--class spark.examples.SparkPi \
|
||||
--args standalone \
|
||||
--num-workers 3 \
|
||||
--master-memory 4g \
|
||||
--worker-memory 2g \
|
||||
--worker-cores 2
|
||||
--worker-cores 1
|
||||
|
||||
The above starts a YARN Client program that periodically polls the Application Master for status updates and displays them in the console. The client will exit once your application has finished running.
|
||||
|
||||
|
@ -49,3 +60,5 @@ The above starts a YARN Client programs which periodically polls the Application
|
|||
|
||||
- When your application instantiates a Spark context it must use a special "standalone" master url. This starts the scheduler without forcing it to connect to a cluster. A good way to handle this is to pass "standalone" as an argument to your program, as shown in the example above.
|
||||
- YARN does not support requesting container resources based on the number of cores. Thus the number of cores given via command line arguments cannot be guaranteed.
|
||||
- Hadoop security is not yet integrated. If --user is present, the specified hadoop_user is used to run the tasks on the cluster. If it is not specified, the current user is used (and must be a valid user on the cluster).
|
||||
Once Hadoop security support is added, and if security is enabled on the Hadoop cluster, additional restrictions will apply through the delegation tokens that are passed.
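
The first note above can be sketched as a small application that takes its master URL from the first program argument, so that the value given via `--args` reaches the Spark context. This is a minimal sketch, not the project's own example: the object name `YarnPiSketch`, the job name, and the sample size are hypothetical, and it assumes the `spark.SparkContext` class of this Spark generation with a `(master, jobName)` constructor.

```scala
import spark.SparkContext

// Hypothetical application; names and constants are illustrative only.
object YarnPiSketch {
  def main(args: Array[String]) {
    if (args.length == 0) {
      System.err.println("Usage: YarnPiSketch <master> [<slices>]")
      System.exit(1)
    }
    // args(0) is the master URL; on YARN the client passes it in via --args
    // ("standalone" in the launch example above).
    val sc = new SparkContext(args(0), "YarnPiSketch")
    val slices = if (args.length > 1) args(1).toInt else 2
    val n = 100000 * slices
    // Monte Carlo estimate: count random points that fall inside the unit circle.
    val inside = sc.parallelize(1 to n, slices).map { _ =>
      val x = math.random * 2 - 1
      val y = math.random * 2 - 1
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)
    println("Pi is roughly " + 4.0 * inside / n)
    sc.stop()
  }
}
```

Because the master string is read from the arguments rather than hard-coded, the same jar can be run locally (for instance with `local` as the first argument) and on YARN without recompiling.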
|
||||
|
|