[SPARK-35399][DOCUMENTATION] State is still needed in the event of executor failure

### What changes were proposed in this pull request?

Fix the incorrect statement that executor state is no longer needed in the event of executor failure, and document that it is still needed in the case of a flaky application causing occasional executor failures.

See the related Stack Overflow [discussion](https://stackoverflow.com/questions/67466878/can-spark-with-external-shuffle-service-use-saved-shuffle-files-in-the-event-of/67507439#67507439).

### Why are the changes needed?

To fix the documentation and guide users toward an additional use case for the external shuffle service.
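
For illustration only (not part of this patch), here is a minimal sketch of the use case in Scala: with the external shuffle service enabled, map outputs are served by the node-local shuffle service rather than by the executor that wrote them, so reduce-side fetches can survive an occasional executor failure. The application name and job are hypothetical; in a real deployment the master and service setup come from the cluster manager.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable the external shuffle service (assumes the service itself is
// already set up on each node, e.g. as a YARN auxiliary service).
val spark = SparkSession.builder()
  .appName("shuffle-service-demo") // hypothetical name
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()

// An illustrative wide transformation that writes shuffle files. If the
// executor that wrote a map output later fails, other executors can still
// fetch that output from the shuffle service on the same node.
val counts = spark.sparkContext
  .parallelize(1 to 1000000)
  .map(i => (i % 100, 1L))
  .reduceByKey(_ + _)
counts.count()
```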

### Does this PR introduce _any_ user-facing change?

Documentation only.

### How was this patch tested?

N/A.

Closes #32538 from chrisheaththomas/shuffle-service-and-executor-failure.

Authored-by: Chris Thomas <chrisheaththomas@hotmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
Commit ceb8122c40 (parent b4348b7e56), authored by Chris Thomas on 2021-05-17 08:58:46 -05:00 and committed by Sean Owen.
2 changed files with 8 additions and 9 deletions.

`docs/configuration.md`:

```diff
@@ -943,8 +943,8 @@ Apart from these, the following properties are also available, and may be useful
   <td>false</td>
   <td>
     Enables the external shuffle service. This service preserves the shuffle files written by
-    executors so the executors can be safely removed. The external shuffle service
-    must be set up in order to enable it. See
+    executors e.g. so that executors can be safely removed, or so that shuffle fetches can continue in
+    the event of executor failure. The external shuffle service must be set up in order to enable it. See
     <a href="job-scheduling.html#configuration-and-setup">dynamic allocation
     configuration and setup documentation</a> for more information.
   </td>
```
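
As a hedged sketch of how the documented flag above is typically supplied (the app name is hypothetical; `spark.shuffle.service.port` is optional and shown at its default):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: setting the flag programmatically. The flag alone is not enough;
// the external shuffle service process must be set up on each node, per the
// dynamic allocation configuration and setup documentation.
val conf = new SparkConf()
  .setAppName("external-shuffle-example") // hypothetical name
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.port", "7337") // default port

val sc = new SparkContext(conf)
```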

`docs/job-scheduling.md`:

```diff
@@ -142,13 +142,12 @@ an executor should not be idle if there are still pending tasks to be scheduled.
 ### Graceful Decommission of Executors
-Before dynamic allocation, a Spark executor exits either on failure or when the associated
-application has also exited. In both scenarios, all state associated with the executor is no
-longer needed and can be safely discarded. With dynamic allocation, however, the application
-is still running when an executor is explicitly removed. If the application attempts to access
-state stored in or written by the executor, it will have to perform a recompute the state. Thus,
-Spark needs a mechanism to decommission an executor gracefully by preserving its state before
-removing it.
+Before dynamic allocation, if a Spark executor exits when the associated application has also exited
+then all state associated with the executor is no longer needed and can be safely discarded.
+With dynamic allocation, however, the application is still running when an executor is explicitly
+removed. If the application attempts to access state stored in or written by the executor, it will
+have to perform a recompute the state. Thus, Spark needs a mechanism to decommission an executor
+gracefully by preserving its state before removing it.
 This requirement is especially important for shuffles. During a shuffle, the Spark executor first
 writes its own map outputs locally to disk, and then acts as the server for those files when other
```
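
Since the hunk above sits in the dynamic allocation documentation, a short sketch of the pairing it describes may help (the app name and executor counts are illustrative): dynamic allocation relies on the external shuffle service so that an executor removed at runtime leaves its shuffle state behind for still-running stages.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: dynamic allocation paired with the external shuffle service. When
// an idle executor is removed, its shuffle files remain readable through the
// shuffle service, so the application does not have to recompute that state.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo") // hypothetical name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")  // illustrative
  .config("spark.dynamicAllocation.maxExecutors", "10") // illustrative
  .getOrCreate()
```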