spark-instrumented-optimizer/core/src/main
Eric Liang 649fa4bf1d [SPARK-17370] Shuffle service files not invalidated when a slave is lost
## What changes were proposed in this pull request?

DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because, when the shuffle service is on, the shuffle file lifetime can exceed the executor lifetime.

However, DAGScheduler also fails to invalidate shuffle files when the shuffle service itself is lost because the whole slave is lost. This can cause long hangs after a slave loss, since the missing files are not detected until a subsequent stage attempts to read them.

The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event.
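
The decision described above can be sketched with a small toy model. This is not Spark's code: `ToyScheduler`, `register_map_output`, `handle_executor_lost`, and the `slave_lost` flag are hypothetical names chosen for illustration, and the registry is a plain dictionary rather than Spark's map output tracker. It assumes only what the description states: files are lost with the executor when the shuffle service is off, and are lost with the slave when the service is on.

```python
from dataclasses import dataclass, field


@dataclass
class ToyScheduler:
    """Toy stand-in for a scheduler that tracks which executor wrote each map output."""

    external_shuffle_service_enabled: bool
    # Hypothetical registry: shuffle id -> {map id -> executor id}
    map_outputs: dict = field(default_factory=dict)

    def register_map_output(self, shuffle_id, map_id, exec_id):
        self.map_outputs.setdefault(shuffle_id, {})[map_id] = exec_id

    def handle_executor_lost(self, exec_id, slave_lost):
        # Without the shuffle service, shuffle files die with the executor.
        # With it, they survive unless the whole slave (and its service) is gone.
        files_lost = slave_lost or not self.external_shuffle_service_enabled
        if files_lost:
            for outputs in self.map_outputs.values():
                for map_id in [m for m, e in outputs.items() if e == exec_id]:
                    del outputs[map_id]
        return files_lost


sched = ToyScheduler(external_shuffle_service_enabled=True)
sched.register_map_output(0, 0, "exec-1")
sched.register_map_output(0, 1, "exec-2")
sched.handle_executor_lost("exec-1", slave_lost=False)  # outputs are kept
sched.handle_executor_lost("exec-1", slave_lost=True)   # outputs are invalidated
```

In this sketch, invalidating the entries at loss time is what lets the affected map stage be resubmitted immediately, instead of waiting for a later stage to fail while fetching the missing files.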

## How was this patch tested?

Unit tests. Also verified on an actual cluster that slave loss invalidates shuffle files immediately, as expected.

cc mateiz

Author: Eric Liang <ekl@databricks.com>

Closes #14931 from ericl/sc-4439.
2016-09-07 12:33:50 -07:00
java/org/apache/spark [SPARK-17371] Resubmitted shuffle outputs can get deleted by zombie map tasks 2016-09-06 16:55:22 -07:00
resources/org/apache/spark [SPARK-17342][WEBUI] Style of event timeline is broken 2016-09-02 08:46:15 +01:00
scala/org/apache/spark [SPARK-17370] Shuffle service files not invalidated when a slave is lost 2016-09-07 12:33:50 -07:00