[SPARK-31960][YARN][BUILD] Only populate Hadoop classpath for no-hadoop build

### What changes were proposed in this pull request?
If a Spark distribution has a built-in Hadoop runtime, Spark will not populate the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to YARN. Users can override this behavior by setting `spark.yarn.populateHadoopClasspath` to `true`.
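For example, a user of a `with-hadoop` build who still wants YARN's classpath could opt back in at submit time with `--conf spark.yarn.populateHadoopClasspath=true`, or programmatically — a minimal sketch, where the app name is a placeholder:

```scala
import org.apache.spark.SparkConf

// Sketch: opt back into Hadoop classpath population on a with-hadoop build.
// `spark.yarn.populateHadoopClasspath` is the config this PR changes;
// the app name below is only a placeholder.
val conf = new SparkConf()
  .setAppName("populate-classpath-demo")
  .set("spark.yarn.populateHadoopClasspath", "true")
```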

### Why are the changes needed?
Without this, Spark populates the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` even when the Spark distribution has built-in Hadoop. This results in jar conflicts and many unexpected runtime behaviors.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested with two builds: a with-hadoop build and a no-hadoop build.

Closes #28788 from dbtsai/yarn-classpath.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
Commit 9b792518b2 (parent 4badef38a5), authored by DB Tsai, 2020-06-18 06:08:40 +00:00
5 changed files with 56 additions and 4 deletions

dev/.rat-excludes

@@ -123,3 +123,4 @@ SessionManager.java
SessionHandler.java
GangliaReporter.java
application_1578436911597_0052
config.properties

docs/running-on-yarn.md

@@ -82,6 +82,18 @@ In `cluster` mode, the driver runs on a different machine than the client, so `S
Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
There are two variants of Spark binary distributions you can download. One is pre-built with a certain
version of Apache Hadoop; this Spark distribution contains a built-in Hadoop runtime, so we call it the `with-hadoop` Spark
distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution
doesn't contain a built-in Hadoop runtime, it's smaller, but users have to provide a Hadoop installation separately.
We call this variant the `no-hadoop` Spark distribution. Since the `with-hadoop` Spark distribution
already contains a built-in Hadoop runtime, by default, when a job is submitted to a Hadoop YARN cluster,
it will not populate YARN's classpath into Spark, in order to prevent jar conflicts. To override this behavior, you can set <code>spark.yarn.populateHadoopClasspath=true</code>.
For the `no-hadoop` Spark distribution, Spark will populate YARN's classpath by default in order to get the Hadoop runtime. For the `with-hadoop` Spark distribution,
if your application depends on a certain library that is only available in the cluster, you can try to populate the YARN classpath by setting
the property mentioned above. If you run into a jar conflict issue by doing so, you will need to turn it off and include this library
in your application jar.
To build Spark yourself, refer to [Building Spark](building-spark.html).
To make Spark runtime jars accessible from YARN side, you can specify `spark.yarn.archive` or `spark.yarn.jars`. For details please refer to [Spark Properties](running-on-yarn.html#spark-properties). If neither `spark.yarn.archive` nor `spark.yarn.jars` is specified, Spark will create a zip file with all jars under `$SPARK_HOME/jars` and upload it to the distributed cache.
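As a hedged illustration of that last point, one might pre-upload the Spark jars once and point every job at the archive, so the zip-and-upload step is skipped on each submission. The HDFS path below is a placeholder:

```scala
import org.apache.spark.SparkConf

// Sketch: reuse a pre-staged archive of $SPARK_HOME/jars so Spark skips
// the zip-and-upload step described above. The path is hypothetical.
val conf = new SparkConf()
  .set("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip")
```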
@@ -396,7 +408,10 @@ To use a custom metrics.properties for the application master and executors, upd
</tr>
<tr>
<td><code>spark.yarn.populateHadoopClasspath</code></td>
-  <td>true</td>
+  <td>
+    For <code>with-hadoop</code> Spark distribution, this is set to false;
+    for <code>no-hadoop</code> distribution, this is set to true.
+  </td>
<td>
Whether to populate Hadoop classpath from <code>yarn.application.classpath</code> and
<code>mapreduce.application.classpath</code>. Note that if this is set to <code>false</code>,

resource-managers/yarn/pom.xml

@@ -30,8 +30,18 @@
<properties>
<sbt.project.name>yarn</sbt.project.name>
<jersey-1.version>1.19</jersey-1.version>
<spark.yarn.isHadoopProvided>false</spark.yarn.isHadoopProvided>
</properties>
<profiles>
<profile>
<id>hadoop-provided</id>
<properties>
<spark.yarn.isHadoopProvided>true</spark.yarn.isHadoopProvided>
</properties>
</profile>
</profiles>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
@@ -201,6 +211,12 @@
<build>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
</resources>
</build>
</project>
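The `<filtering>true</filtering>` setting makes Maven substitute `${spark.yarn.isHadoopProvided}` in resource files at build time, so building with the `hadoop-provided` profile (e.g. `./build/mvn -Phadoop-provided ...`) bakes `true` into the properties file shown next, and a default build bakes in `false`. A minimal sketch of inspecting which value a given build baked in — the resource path and key are the ones added by this commit; the check itself is illustrative:

```scala
import java.util.Properties

// Illustrative check: print the value Maven substituted for
// ${spark.yarn.isHadoopProvided} in the packaged resource.
val path = "org/apache/spark/deploy/yarn/config.properties"
val props = new Properties()
props.load(ClassLoader.getSystemClassLoader.getResourceAsStream(path))
println(props.getProperty("spark.yarn.isHadoopProvided"))
```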

resource-managers/yarn/src/main/resources/org/apache/spark/deploy/yarn/config.properties (new file)

@@ -0,0 +1 @@
spark.yarn.isHadoopProvided = ${spark.yarn.isHadoopProvided}

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala

@@ -17,12 +17,14 @@
package org.apache.spark.deploy.yarn
import java.util.Properties
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.Logging
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit
-package object config {
+package object config extends Logging {
/* Common app configuration. */
@@ -74,10 +76,11 @@ package object config {
.doc("Whether to populate Hadoop classpath from `yarn.application.classpath` and " +
"`mapreduce.application.classpath` Note that if this is set to `false`, it requires " +
"a `with-Hadoop` Spark distribution that bundles Hadoop runtime or user has to provide " +
"a Hadoop installation separately.")
"a Hadoop installation separately. By default, for `with-hadoop` Spark distribution, " +
"this is set to `false`; for `no-hadoop` distribution, this is set to `true`.")
.version("2.4.6")
.booleanConf
-  .createWithDefault(true)
+  .createWithDefault(isHadoopProvided())
private[spark] val GATEWAY_ROOT_PATH = ConfigBuilder("spark.yarn.config.gatewayPath")
.doc("Root of configuration paths that is present on gateway nodes, and will be replaced " +
@@ -394,4 +397,20 @@ package object config {
private[yarn] val YARN_DRIVER_RESOURCE_TYPES_PREFIX = "spark.yarn.driver.resource."
private[yarn] val YARN_AM_RESOURCE_TYPES_PREFIX = "spark.yarn.am.resource."
def isHadoopProvided(): Boolean = IS_HADOOP_PROVIDED
private lazy val IS_HADOOP_PROVIDED: Boolean = {
val configPath = "org/apache/spark/deploy/yarn/config.properties"
val propertyKey = "spark.yarn.isHadoopProvided"
try {
val prop = new Properties()
prop.load(ClassLoader.getSystemClassLoader.getResourceAsStream(configPath))
prop.getProperty(propertyKey).toBoolean
} catch {
case e: Exception =>
log.warn(s"Can not load the default value of `$propertyKey` from " +
s"`$configPath` with error, ${e.toString}. Using `false` as a default value.")
false
}
}
}
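The loading code above falls back to `false` when the resource is missing or unreadable, which preserves the pre-change behavior of populating the classpath. A standalone restatement of the pattern, with names chosen for illustration (not Spark API):

```scala
import java.util.Properties

// Generalized sketch of the pattern above: read a boolean flag that the build
// baked into a classpath resource, falling back to a default if the resource
// is missing, unreadable, or malformed. Names here are illustrative.
def flagFromResource(path: String, key: String, default: Boolean): Boolean =
  try {
    val in = ClassLoader.getSystemClassLoader.getResourceAsStream(path)
    if (in == null) default
    else
      try {
        val props = new Properties()
        props.load(in)
        Option(props.getProperty(key)).map(_.trim.toBoolean).getOrElse(default)
      } finally in.close()
  } catch {
    case _: Exception => default
  }
```

With this helper, the commit's lookup would read roughly as `flagFromResource("org/apache/spark/deploy/yarn/config.properties", "spark.yarn.isHadoopProvided", default = false)`.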