spark-instrumented-optimizer/yarn
Marcelo Vanzin f47dbf27fa [SPARK-14602][YARN] Use SparkConf to propagate the list of cached files.
This change avoids using the environment to pass this information, since
with many jars it's easy to hit limits on certain OSes. Instead, it encodes
the information into the Spark configuration propagated to the AM.

The first problem that needed to be solved is a chicken-and-egg issue: the
config file is distributed using the cache, and it needs to contain information
about the files that are being distributed. To solve that, the code now treats
the config archive specially, and uses slightly different code to distribute
it, so that only its cache path needs to be saved to the config file.
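
As an illustration of the approach described above, here is a minimal sketch that uses plain Python dictionaries instead of Spark's actual classes. The key names (`example.cache.*`), the `CachedFile` type, and the paths are hypothetical and chosen only for illustration; they are not taken from this change.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedFile:
    """Hypothetical record for one file uploaded to the distributed cache."""
    path: str        # remote location of the uploaded file
    size: int        # size in bytes
    timestamp: int   # modification time reported by the file system


def encode_cache_list(files):
    """Encode per-file metadata as comma-separated config values.

    Assumes paths contain no commas; a real implementation would need
    to escape or reject them.
    """
    return {
        "example.cache.filenames": ",".join(f.path for f in files),
        "example.cache.sizes": ",".join(str(f.size) for f in files),
        "example.cache.timestamps": ",".join(str(f.timestamp) for f in files),
    }


def build_conf(regular_files, conf_archive_path):
    """Build the config that will be written into the config archive.

    The archive's own size and timestamp are not known until after it has
    been written, so it is left out of the regular list and only its cache
    path is recorded under a separate key.
    """
    conf = encode_cache_list(regular_files)
    conf["example.cache.confArchive"] = conf_archive_path
    return conf
```

In this sketch the archive is the one entry handled outside the regular list, which is what breaks the circular dependency.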

The second problem is that the extra information would show up in the Web UI,
making the environment tab even noisier than it already is when lots
of jars are listed. This is solved by two changes. First, the list of cached
files is now read only once in the AM and propagated down to the ExecutorRunnable
code (which actually sends the list to the NMs when starting containers).
Second, those config entries are unset after the list is read, so that
the SparkContext never sees them.
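
The read-once-then-unset idea can be sketched as follows, again with a plain dictionary standing in for the Spark configuration. The function name and the `example.cache.` prefix are hypothetical; this is an illustration of the pattern, not the code from this change.

```python
def take_cache_entries(conf, prefix="example.cache."):
    """Remove every cache-related entry from conf and return them.

    After this call, conf no longer contains the entries, so anything
    built from it later (for example a UI listing) will not show them.
    """
    # Collect the keys first so the dict is not mutated while iterating.
    keys = [k for k in conf if k.startswith(prefix)]
    return {k: conf.pop(k) for k in keys}
```

The caller keeps the returned dictionary and hands it to whatever launches containers, while the trimmed `conf` is the only thing the rest of the application sees.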

Tested with both client and cluster mode by running "run-example SparkPi". This
uploads a whole lot of files when run from a build dir (instead of a distribution,
where the list is cleaned up), and I verified that the configs do not show
up in the UI.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12487 from vanzin/SPARK-14602.
2016-04-20 16:57:23 -07:00
src [SPARK-14602][YARN] Use SparkConf to propagate the list of cached files. 2016-04-20 16:57:23 -07:00
pom.xml [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version 2016-01-30 00:20:28 -08:00