[SPARK-31960][YARN][BUILD] Only populate Hadoop classpath for no-hadoop build

### What changes were proposed in this pull request?
If a Spark distribution has a built-in Hadoop runtime, Spark will not populate the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to YARN. Users can override this behavior by setting `spark.yarn.populateHadoopClasspath` to `true`.
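For example, a user of a `with-hadoop` build who still wants YARN's classpath could opt back in at submit time with `--conf spark.yarn.populateHadoopClasspath=true`, or programmatically — a minimal sketch, where the app name is a placeholder:

```scala
import org.apache.spark.SparkConf

// Sketch: opt back into Hadoop classpath population on a with-hadoop build.
// `spark.yarn.populateHadoopClasspath` is the config this PR changes;
// the app name below is only a placeholder.
val conf = new SparkConf()
  .setAppName("populate-classpath-demo")
  .set("spark.yarn.populateHadoopClasspath", "true")
```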

### Why are the changes needed?
Without this, Spark populates the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` even when the Spark distribution has built-in Hadoop. This results in jar conflicts and many unexpected runtime behaviors.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested with two builds: a with-hadoop build and a no-hadoop build.

Closes #28788 from dbtsai/yarn-classpath.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
Commit 9b792518b2 (parent 4badef38a5), authored by DB Tsai, 2020-06-18 06:08:40 +00:00
5 changed files with 56 additions and 4 deletions

dev/.rat-excludes

@@ -123,3 +123,4 @@ SessionManager.java
SessionHandler.java
GangliaReporter.java
application_1578436911597_0052
config.properties

docs/running-on-yarn.md

@@ -82,6 +82,18 @@ In `cluster` mode, the driver runs on a different machine than the client, so `S
Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
There are two variants of Spark binary distributions you can download. One is pre-built with a certain
version of Apache Hadoop; this Spark distribution contains a built-in Hadoop runtime, so we call it the `with-hadoop` Spark
distribution. The other one is pre-built with user-provided Hadoop; since this Spark distribution
doesn't contain a built-in Hadoop runtime, it's smaller, but users have to provide a Hadoop installation separately.
We call this variant the `no-hadoop` Spark distribution. Since the `with-hadoop` Spark distribution
already contains a built-in Hadoop runtime, by default, when a job is submitted to a Hadoop YARN cluster,
it will not populate YARN's classpath into Spark, in order to prevent jar conflicts. To override this behavior, you can set <code>spark.yarn.populateHadoopClasspath=true</code>.
For the `no-hadoop` Spark distribution, Spark will populate YARN's classpath by default in order to get the Hadoop runtime. For the `with-hadoop` Spark distribution,
if your application depends on a certain library that is only available in the cluster, you can try to populate the YARN classpath by setting
the property mentioned above. If you run into a jar conflict issue by doing so, you will need to turn it off and include this library
in your application jar.
To build Spark yourself, refer to [Building Spark](building-spark.html).
To make Spark runtime jars accessible from YARN side, you can specify `spark.yarn.archive` or `spark.yarn.jars`. For details please refer to [Spark Properties](running-on-yarn.html#spark-properties). If neither `spark.yarn.archive` nor `spark.yarn.jars` is specified, Spark will create a zip file with all jars under `$SPARK_HOME/jars` and upload it to the distributed cache.
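As a hedged illustration of that last point, one might pre-upload the Spark jars once and point every job at the archive, so the zip-and-upload step is skipped on each submission. The HDFS path below is a placeholder:

```scala
import org.apache.spark.SparkConf

// Sketch: reuse a pre-staged archive of $SPARK_HOME/jars so Spark skips
// the zip-and-upload step described above. The path is hypothetical.
val conf = new SparkConf()
  .set("spark.yarn.archive", "hdfs:///apps/spark/spark-libs.zip")
```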
@@ -396,7 +408,10 @@ To use a custom metrics.properties for the application master and executors, upd
</tr>
<tr>
<td><code>spark.yarn.populateHadoopClasspath</code></td>
-  <td>true</td>
+  <td>
+    For <code>with-hadoop</code> Spark distribution, this is set to false;
+    for <code>no-hadoop</code> distribution, this is set to true.
+  </td>
<td>
Whether to populate Hadoop classpath from <code>yarn.application.classpath</code> and
<code>mapreduce.application.classpath</code>. Note that if this is set to <code>false</code>,

resource-managers/yarn/pom.xml

@@ -30,8 +30,18 @@
<properties>
<sbt.project.name>yarn</sbt.project.name>
<jersey-1.version>1.19</jersey-1.version>
<spark.yarn.isHadoopProvided>false</spark.yarn.isHadoopProvided>
</properties>
<profiles>
<profile>
<id>hadoop-provided</id>
<properties>
<spark.yarn.isHadoopProvided>true</spark.yarn.isHadoopProvided>
</properties>
</profile>
</profiles>
<dependencies>
<dependency>
<groupId>org.apache.spark</groupId>
@@ -201,6 +211,12 @@
<build>
<outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
<testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
<resources>
<resource>
<directory>src/main/resources</directory>
<filtering>true</filtering>
</resource>
</resources>
</build>
</project>
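The `<filtering>true</filtering>` setting makes Maven substitute `${spark.yarn.isHadoopProvided}` in resource files at build time, so building with the `hadoop-provided` profile (e.g. `./build/mvn -Phadoop-provided ...`) bakes `true` into the properties file shown next, and a default build bakes in `false`. A minimal sketch of inspecting which value a given build baked in — the resource path and key are the ones added by this commit; the check itself is illustrative:

```scala
import java.util.Properties

// Illustrative check: print the value Maven substituted for
// ${spark.yarn.isHadoopProvided} in the packaged resource.
val path = "org/apache/spark/deploy/yarn/config.properties"
val props = new Properties()
props.load(ClassLoader.getSystemClassLoader.getResourceAsStream(path))
println(props.getProperty("spark.yarn.isHadoopProvided"))
```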

resource-managers/yarn/src/main/resources/org/apache/spark/deploy/yarn/config.properties (new file)

@@ -0,0 +1 @@
spark.yarn.isHadoopProvided = ${spark.yarn.isHadoopProvided}

resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala

@@ -17,12 +17,14 @@
package org.apache.spark.deploy.yarn
import java.util.Properties
import java.util.concurrent.TimeUnit
import org.apache.spark.internal.Logging
import org.apache.spark.internal.config.ConfigBuilder
import org.apache.spark.network.util.ByteUnit
-package object config {
+package object config extends Logging {
/* Common app configuration. */
@@ -74,10 +76,11 @@ package object config {
.doc("Whether to populate Hadoop classpath from `yarn.application.classpath` and " +
"`mapreduce.application.classpath` Note that if this is set to `false`, it requires " +
"a `with-Hadoop` Spark distribution that bundles Hadoop runtime or user has to provide " +
"a Hadoop installation separately.")
"a Hadoop installation separately. By default, for `with-hadoop` Spark distribution, " +
"this is set to `false`; for `no-hadoop` distribution, this is set to `true`.")
.version("2.4.6")
.booleanConf
-  .createWithDefault(true)
+  .createWithDefault(isHadoopProvided())
private[spark] val GATEWAY_ROOT_PATH = ConfigBuilder("spark.yarn.config.gatewayPath")
.doc("Root of configuration paths that is present on gateway nodes, and will be replaced " +
@@ -394,4 +397,20 @@ package object config {
private[yarn] val YARN_DRIVER_RESOURCE_TYPES_PREFIX = "spark.yarn.driver.resource."
private[yarn] val YARN_AM_RESOURCE_TYPES_PREFIX = "spark.yarn.am.resource."
def isHadoopProvided(): Boolean = IS_HADOOP_PROVIDED
private lazy val IS_HADOOP_PROVIDED: Boolean = {
val configPath = "org/apache/spark/deploy/yarn/config.properties"
val propertyKey = "spark.yarn.isHadoopProvided"
try {
val prop = new Properties()
prop.load(ClassLoader.getSystemClassLoader.getResourceAsStream(configPath))
prop.getProperty(propertyKey).toBoolean
} catch {
case e: Exception =>
log.warn(s"Can not load the default value of `$propertyKey` from " +
s"`$configPath` with error, ${e.toString}. Using `false` as a default value.")
false
}
}
}
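The loading code above falls back to `false` when the resource is missing or unreadable, which preserves the pre-change behavior of populating the classpath. A standalone restatement of the pattern, with names chosen for illustration (not Spark API):

```scala
import java.util.Properties

// Generalized sketch of the pattern above: read a boolean flag that the build
// baked into a classpath resource, falling back to a default if the resource
// is missing, unreadable, or malformed. Names here are illustrative.
def flagFromResource(path: String, key: String, default: Boolean): Boolean =
  try {
    val in = ClassLoader.getSystemClassLoader.getResourceAsStream(path)
    if (in == null) default
    else
      try {
        val props = new Properties()
        props.load(in)
        Option(props.getProperty(key)).map(_.trim.toBoolean).getOrElse(default)
      } finally in.close()
  } catch {
    case _: Exception => default
  }
```

With this helper, the commit's lookup would read roughly as `flagFromResource("org/apache/spark/deploy/yarn/config.properties", "spark.yarn.isHadoopProvided", default = false)`.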