[SPARK-31960][YARN][BUILD] Only populate Hadoop classpath for no-hadoop build
### What changes were proposed in this pull request?

If a Spark distribution has a built-in Hadoop runtime, Spark will not populate the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to Yarn. Users can override this behavior by setting `spark.yarn.populateHadoopClasspath` to `true`.

### Why are the changes needed?

Without this, Spark populates the Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` even if the Spark distribution has built-in Hadoop. This results in jar conflicts and many unexpected behaviors at runtime.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested with two builds, the with-hadoop and no-hadoop builds.

Closes #28788 from dbtsai/yarn-classpath.

Authored-by: DB Tsai &lt;d_tsai@apple.com&gt;
Signed-off-by: DB Tsai &lt;d_tsai@apple.com&gt;
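For illustration, a minimal sketch of overriding the new default on a `with-hadoop` build; the configuration key comes from this PR, while the surrounding `SparkConf` setup is generic boilerplate:

```scala
import org.apache.spark.SparkConf

// Re-enable population of YARN's Hadoop classpath on a with-hadoop build.
// In practice the flag is usually passed at submission time instead, e.g.
//   spark-submit --conf spark.yarn.populateHadoopClasspath=true ...
val conf = new SparkConf()
  .set("spark.yarn.populateHadoopClasspath", "true")
```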
Commit 9b792518b2 (parent 4badef38a5)
```diff
--- a/dev/.rat-excludes
+++ b/dev/.rat-excludes
@@ -123,3 +123,4 @@ SessionManager.java
 SessionHandler.java
 GangliaReporter.java
 application_1578436911597_0052
+config.properties
```
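The new `config.properties` resource is added to the RAT exclusion list above, presumably because it is a build-filtered template rather than a source file carrying a license header.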
```diff
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -82,6 +82,18 @@ In `cluster` mode, the driver runs on a different machine than the client, so `S
 Running Spark on YARN requires a binary distribution of Spark which is built with YARN support.
 Binary distributions can be downloaded from the [downloads page](https://spark.apache.org/downloads.html) of the project website.
+There are two variants of Spark binary distributions you can download. One is pre-built with a certain
+version of Apache Hadoop; this Spark distribution contains a built-in Hadoop runtime, so we call it the `with-hadoop` Spark
+distribution. The other is pre-built with user-provided Hadoop; since this Spark distribution
+doesn't contain a built-in Hadoop runtime, it's smaller, but users have to provide a Hadoop installation separately.
+We call this variant the `no-hadoop` Spark distribution. Since the `with-hadoop` Spark distribution
+already contains a built-in Hadoop runtime, by default, when a job is submitted to a Hadoop Yarn cluster, it will not
+populate Yarn's classpath into Spark, in order to prevent jar conflicts. To override this behavior, you can set <code>spark.yarn.populateHadoopClasspath=true</code>.
+For the `no-hadoop` Spark distribution, Spark populates Yarn's classpath by default in order to get the Hadoop runtime. For the `with-hadoop` Spark distribution,
+if your application depends on a certain library that is only available in the cluster, you can try to populate the Yarn classpath by setting
+the property mentioned above. If you run into jar conflict issues by doing so, you will need to turn it off and include this library
+in your application jar.
+
 To build Spark yourself, refer to [Building Spark](building-spark.html).
 
 To make Spark runtime jars accessible from YARN side, you can specify `spark.yarn.archive` or `spark.yarn.jars`. For details please refer to [Spark Properties](running-on-yarn.html#spark-properties). If neither `spark.yarn.archive` nor `spark.yarn.jars` is specified, Spark will create a zip file with all jars under `$SPARK_HOME/jars` and upload it to the distributed cache.
 
```
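As a sketch of the `spark.yarn.archive` option mentioned above (the HDFS path here is a hypothetical placeholder, not from this patch):

```scala
import org.apache.spark.SparkConf

// Point YARN at a pre-staged archive of Spark jars so the client does not
// zip and upload $SPARK_HOME/jars on every submission.
val conf = new SparkConf()
  .set("spark.yarn.archive", "hdfs:///apps/spark/spark-jars.zip")
```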
```diff
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -396,7 +408,10 @@ To use a custom metrics.properties for the application master and executors, upd
 </tr>
 <tr>
   <td><code>spark.yarn.populateHadoopClasspath</code></td>
-  <td>true</td>
+  <td>
+    For a <code>with-hadoop</code> Spark distribution, this is set to false;
+    for a <code>no-hadoop</code> distribution, this is set to true.
+  </td>
   <td>
     Whether to populate the Hadoop classpath from <code>yarn.application.classpath</code> and
     <code>mapreduce.application.classpath</code>. Note that if this is set to <code>false</code>,
```
```diff
--- a/resource-managers/yarn/pom.xml
+++ b/resource-managers/yarn/pom.xml
@@ -30,8 +30,18 @@
   <properties>
     <sbt.project.name>yarn</sbt.project.name>
     <jersey-1.version>1.19</jersey-1.version>
+    <spark.yarn.isHadoopProvided>false</spark.yarn.isHadoopProvided>
   </properties>
 
+  <profiles>
+    <profile>
+      <id>hadoop-provided</id>
+      <properties>
+        <spark.yarn.isHadoopProvided>true</spark.yarn.isHadoopProvided>
+      </properties>
+    </profile>
+  </profiles>
+
   <dependencies>
     <dependency>
       <groupId>org.apache.spark</groupId>
```
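With this profile in place, a build that activates `-Phadoop-provided` (the profile used for `no-hadoop` distributions, e.g. via `./build/mvn -Phadoop-provided ...`) flips `spark.yarn.isHadoopProvided` to `true`, while a default `with-hadoop` build leaves it at `false`.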
```diff
--- a/resource-managers/yarn/pom.xml
+++ b/resource-managers/yarn/pom.xml
@@ -201,6 +211,12 @@
   <build>
     <outputDirectory>target/scala-${scala.binary.version}/classes</outputDirectory>
     <testOutputDirectory>target/scala-${scala.binary.version}/test-classes</testOutputDirectory>
+    <resources>
+      <resource>
+        <directory>src/main/resources</directory>
+        <filtering>true</filtering>
+      </resource>
+    </resources>
   </build>
 
 </project>
```
```diff
--- /dev/null
+++ b/resource-managers/yarn/src/main/resources/org/apache/spark/deploy/yarn/config.properties
@@ -0,0 +1 @@
+spark.yarn.isHadoopProvided = ${spark.yarn.isHadoopProvided}
```
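Because `<filtering>` is enabled for `src/main/resources` in the pom change above, Maven substitutes the `${spark.yarn.isHadoopProvided}` placeholder at package time, so the bundled file ends up containing a literal `true` (hadoop-provided builds) or `false` (default builds).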
```diff
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/config.scala
@@ -17,12 +17,14 @@
 
 package org.apache.spark.deploy.yarn
 
+import java.util.Properties
 import java.util.concurrent.TimeUnit
 
+import org.apache.spark.internal.Logging
 import org.apache.spark.internal.config.ConfigBuilder
 import org.apache.spark.network.util.ByteUnit
 
-package object config {
+package object config extends Logging {
 
   /* Common app configuration. */
 
```
```diff
@@ -74,10 +76,11 @@ package object config {
     .doc("Whether to populate Hadoop classpath from `yarn.application.classpath` and " +
       "`mapreduce.application.classpath`. Note that if this is set to `false`, it requires " +
       "a `with-hadoop` Spark distribution that bundles a Hadoop runtime, or users have to provide " +
-      "a Hadoop installation separately.")
+      "a Hadoop installation separately. By default, for a `with-hadoop` Spark distribution, " +
+      "this is set to `false`; for a `no-hadoop` distribution, this is set to `true`.")
     .version("2.4.6")
     .booleanConf
-    .createWithDefault(true)
+    .createWithDefault(isHadoopProvided())
 
   private[spark] val GATEWAY_ROOT_PATH = ConfigBuilder("spark.yarn.config.gatewayPath")
     .doc("Root of configuration paths that is present on gateway nodes, and will be replaced " +
```
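For context, a sketch of how the resolved default might be consumed; the consumer code below is assumed for illustration and is not part of this diff:

```scala
import org.apache.spark.SparkConf

// Assumed consumer logic: merge YARN's Hadoop classpath only when the flag
// resolves to true (the real default comes from isHadoopProvided()).
val sparkConf = new SparkConf()
if (sparkConf.getBoolean("spark.yarn.populateHadoopClasspath", defaultValue = false)) {
  // add entries from yarn.application.classpath and
  // mapreduce.application.classpath to the container's classpath
}
```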
```diff
@@ -394,4 +397,20 @@ package object config {
   private[yarn] val YARN_DRIVER_RESOURCE_TYPES_PREFIX = "spark.yarn.driver.resource."
   private[yarn] val YARN_AM_RESOURCE_TYPES_PREFIX = "spark.yarn.am.resource."
 
+  def isHadoopProvided(): Boolean = IS_HADOOP_PROVIDED
+
+  private lazy val IS_HADOOP_PROVIDED: Boolean = {
+    val configPath = "org/apache/spark/deploy/yarn/config.properties"
+    val propertyKey = "spark.yarn.isHadoopProvided"
+    try {
+      val prop = new Properties()
+      prop.load(ClassLoader.getSystemClassLoader.getResourceAsStream(configPath))
+      prop.getProperty(propertyKey).toBoolean
+    } catch {
+      case e: Exception =>
+        log.warn(s"Can not load the default value of `$propertyKey` from " +
+          s"`$configPath` with error, ${e.toString}. Using `false` as a default value.")
+        false
+    }
+  }
 }
```