[SPARK-35399][DOCUMENTATION] State is still needed in the event of executor failure

### What changes were proposed in this pull request?

Fix the incorrect statement that executor state is no longer needed in the event of executor failure, and document that it is still needed in the case of a flaky application causing occasional executor failures.

See the related Stack Overflow [discussion](https://stackoverflow.com/questions/67466878/can-spark-with-external-shuffle-service-use-saved-shuffle-files-in-the-event-of/67507439#67507439).

### Why are the changes needed?

To fix the documentation and guide users toward an additional use case for the external shuffle service.
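
For illustration only (not part of this patch), here is a minimal sketch of the use case in Scala: with the external shuffle service enabled, map outputs are served by the node-local shuffle service rather than by the executor that wrote them, so reduce-side fetches can survive an occasional executor failure. The application name and job are hypothetical; in a real deployment the master and service setup come from the cluster manager.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable the external shuffle service (assumes the service itself is
// already set up on each node, e.g. as a YARN auxiliary service).
val spark = SparkSession.builder()
  .appName("shuffle-service-demo") // hypothetical name
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()

// An illustrative wide transformation that writes shuffle files. If the
// executor that wrote a map output later fails, other executors can still
// fetch that output from the shuffle service on the same node.
val counts = spark.sparkContext
  .parallelize(1 to 1000000)
  .map(i => (i % 100, 1L))
  .reduceByKey(_ + _)
counts.count()
```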

### Does this PR introduce _any_ user-facing change?

Documentation only.

### How was this patch tested?

N/A.

Closes #32538 from chrisheaththomas/shuffle-service-and-executor-failure.

Authored-by: Chris Thomas <chrisheaththomas@hotmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
Commit ceb8122c40 (parent b4348b7e56), authored by Chris Thomas on 2021-05-17 08:58:46 -05:00 and committed by Sean Owen.
2 changed files with 8 additions and 9 deletions.

`docs/configuration.md`:

```diff
@@ -943,8 +943,8 @@ Apart from these, the following properties are also available, and may be useful
   <td>false</td>
   <td>
     Enables the external shuffle service. This service preserves the shuffle files written by
-    executors so the executors can be safely removed. The external shuffle service
-    must be set up in order to enable it. See
+    executors e.g. so that executors can be safely removed, or so that shuffle fetches can continue in
+    the event of executor failure. The external shuffle service must be set up in order to enable it. See
     <a href="job-scheduling.html#configuration-and-setup">dynamic allocation
     configuration and setup documentation</a> for more information.
   </td>
```
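
As a hedged sketch of how the documented flag above is typically supplied (the app name is hypothetical; `spark.shuffle.service.port` is optional and shown at its default):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: setting the flag programmatically. The flag alone is not enough;
// the external shuffle service process must be set up on each node, per the
// dynamic allocation configuration and setup documentation.
val conf = new SparkConf()
  .setAppName("external-shuffle-example") // hypothetical name
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.service.port", "7337") // default port

val sc = new SparkContext(conf)
```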

`docs/job-scheduling.md`:

```diff
@@ -142,13 +142,12 @@ an executor should not be idle if there are still pending tasks to be scheduled.
 ### Graceful Decommission of Executors
-Before dynamic allocation, a Spark executor exits either on failure or when the associated
-application has also exited. In both scenarios, all state associated with the executor is no
-longer needed and can be safely discarded. With dynamic allocation, however, the application
-is still running when an executor is explicitly removed. If the application attempts to access
-state stored in or written by the executor, it will have to perform a recompute the state. Thus,
-Spark needs a mechanism to decommission an executor gracefully by preserving its state before
-removing it.
+Before dynamic allocation, if a Spark executor exits when the associated application has also exited
+then all state associated with the executor is no longer needed and can be safely discarded.
+With dynamic allocation, however, the application is still running when an executor is explicitly
+removed. If the application attempts to access state stored in or written by the executor, it will
+have to perform a recompute the state. Thus, Spark needs a mechanism to decommission an executor
+gracefully by preserving its state before removing it.
 This requirement is especially important for shuffles. During a shuffle, the Spark executor first
 writes its own map outputs locally to disk, and then acts as the server for those files when other
```
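
Since the hunk above sits in the dynamic allocation documentation, a short sketch of the pairing it describes may help (the app name and executor counts are illustrative): dynamic allocation relies on the external shuffle service so that an executor removed at runtime leaves its shuffle state behind for still-running stages.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: dynamic allocation paired with the external shuffle service. When
// an idle executor is removed, its shuffle files remain readable through the
// shuffle service, so the application does not have to recompute that state.
val spark = SparkSession.builder()
  .appName("dynamic-allocation-demo") // hypothetical name
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "1")  // illustrative
  .config("spark.dynamicAllocation.maxExecutors", "10") // illustrative
  .getOrCreate()
```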