spark-instrumented-optimizer/core
yi.wu 578b90cdec [SPARK-32091][CORE] Ignore timeout error when remove blocks on the lost executor
### What changes were proposed in this pull request?

This PR adds the check to see whether the executor is lost (by asking the `CoarseGrainedSchedulerBackend`) after timeout error raised in `BlockManagerMasterEndponit` due to removing blocks(e.g. RDD, broadcast, shuffle). If the executor is lost, we will ignore the error. Otherwise, throw the error.

### Why are the changes needed?

When removing blocks(e.g. RDD, broadcast, shuffle), `BlockManagerMaserEndpoint` will make RPC calls to each known `BlockManagerSlaveEndpoint` to remove the specific blocks. The PRC call sometimes could end in a timeout when the executor has been lost, but only notified the `BlockManagerMasterEndpoint` after the removing call has already happened. The timeout error could therefore fail the whole job.

In this case, we actually could just ignore the error since those blocks on the lost executor could be considered as removed already.

### Does this PR introduce _any_ user-facing change?

Yes. In case of users hits this issue, they will have the job executed successfully instead of throwing the exception.

### How was this patch tested?

Added unit tests.

Closes #28924 from Ngone51/ignore-timeout-error-for-inactive-executor.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-10 13:36:29 +00:00
..
benchmarks [SPARK-29576][CORE] Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus 2019-10-23 18:17:37 -07:00
src [SPARK-32091][CORE] Ignore timeout error when remove blocks on the lost executor 2020-07-10 13:36:29 +00:00
pom.xml [SPARK-31765][WEBUI][TEST-MAVEN] Upgrade HtmlUnit >= 2.37.0 2020-06-11 18:27:53 -05:00