[SPARK-5836] [DOCS] [STREAMING] Clarify what may cause long-running Spark apps to preserve shuffle files
Clarify what may cause long-running Spark apps to preserve shuffle files
Author: Sean Owen <sowen@cloudera.com>
Closes #6901 from srowen/SPARK-5836 and squashes the following commits:
a9faef0 [Sean Owen] Clarify what may cause long-running Spark apps to preserve shuffle files
(cherry picked from commit 4be53d0395)
Signed-off-by: Andrew Or <andrew@databricks.com>
parent aedd893b42
commit 3415fb978b
@@ -1144,9 +1144,11 @@ generate these on the reduce side. When data does not fit in memory Spark will s
 to disk, incurring the additional overhead of disk I/O and increased garbage collection.
 
 Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
-are not cleaned up from Spark's temporary storage until Spark is stopped, which means that
-long-running Spark jobs may consume available disk space. This is done so the shuffle doesn't need
-to be re-computed if the lineage is re-computed. The temporary storage directory is specified by the
+are preserved until the corresponding RDDs are no longer used and are garbage collected.
+This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
+Garbage collection may happen only after a long period time, if the application retains references
+to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may
+consume a large amount of disk space. The temporary storage directory is specified by the
 `spark.local.dir` configuration parameter when configuring the Spark context.
 
 Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the
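The revised text warns that long-running jobs may consume a large amount of disk space under the directory named by `spark.local.dir`. One way to keep an eye on that from outside Spark is sketched below. It is not part of this commit and uses only the Python standard library. The helper names `dir_size_bytes` and `over_budget` are hypothetical, and the sketch assumes nothing about how Spark lays out files inside the directory: it simply sums the sizes of all regular files beneath the path it is given.

```python
import os


def dir_size_bytes(root):
    """Return the total size in bytes of regular files under root.

    Symbolic links are skipped, and files that disappear while the
    walk is in progress are ignored.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                continue  # do not count link targets
            try:
                total += os.path.getsize(path)
            except OSError:
                # File was removed between listing and stat; skip it.
                pass
    return total


def over_budget(root, budget_bytes):
    """Return True if files under root use more than budget_bytes."""
    return dir_size_bytes(root) > budget_bytes
```

A periodic check could call `over_budget(local_dir, budget)` with `local_dir` set to whatever the application configured as `spark.local.dir`, and a budget chosen by the operator. Both values are deployment-specific and are not given in the commit.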