[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files

Clarify what may cause long-running Spark apps to preserve shuffle files

Author: Sean Owen <sowen@cloudera.com>

Closes #6901 from srowen/SPARK-5836 and squashes the following commits:

a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files

(cherry picked from commit 4be53d0395)
Signed-off-by: Andrew Or <andrew@databricks.com>

@@ -1144,9 +1144,11 @@ generate these on the reduce side. When data does not fit in memory Spark will s
 to disk, incurring the additional overhead of disk I/O and increased garbage collection.
 
 Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
-are not cleaned up from Spark's temporary storage until Spark is stopped, which means that
-long-running Spark jobs may consume available disk space. This is done so the shuffle doesn't need
-to be re-computed if the lineage is re-computed. The temporary storage directory is specified by the
+are preserved until the corresponding RDDs are no longer used and are garbage collected.
+This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
+Garbage collection may happen only after a long period of time, if the application retains references
+to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may
+consume a large amount of disk space. The temporary storage directory is specified by the
 `spark.local.dir` configuration parameter when configuring the Spark context.
 
 Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the
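
To make the updated paragraph concrete, here is a minimal Scala sketch of the behavior it describes, assuming a local run; the application name, master, local-dir path, and the explicit `System.gc()` call are illustrative choices for the sketch, not anything prescribed by this commit.

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleFileCleanupSketch {
      def main(args: Array[String]): Unit = {
        // spark.local.dir controls where shuffle and other temporary files are written;
        // "/mnt/spark-tmp" is a placeholder path, not a value from this commit.
        val conf = new SparkConf()
          .setAppName("shuffle-file-cleanup-sketch")
          .setMaster("local[2]")
          .set("spark.local.dir", "/mnt/spark-tmp")
        val sc = new SparkContext(conf)

        // groupByKey introduces a shuffle, which writes intermediate files under spark.local.dir.
        var grouped = sc.parallelize(1 to 100000)
          .map(i => (i % 100, i))
          .groupByKey()
        grouped.count() // materializes the shuffle

        // As long as `grouped` is referenced, its shuffle files are kept so the lineage
        // can be re-used without re-creating the shuffle.
        grouped.count()

        // Dropping the reference lets the JVM garbage-collect the RDD object; only then
        // can Spark clean up the corresponding shuffle files. In a long-running app this
        // may not happen for a long time if GC does not kick in.
        grouped = null
        System.gc() // illustrative nudge only; real GC timing is up to the JVM

        sc.stop()
      }
    }

The point of the sketch is only that shuffle files outlive the action that produced them for as long as the RDD stays reachable; the trade-off documented here is disk space in exchange for not re-creating the shuffle when the lineage is re-computed.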