[SPARK-36532][CORE] Fix deadlock in CoarseGrainedExecutorBackend.onDisconnected to avoid executor shutdown hang
### What changes were proposed in this pull request?

Instead of exiting the executor within the RpcEnv's thread, exit the executor in a separate thread.

### Why are the changes needed?

The current exit path in `onDisconnected` can cause a deadlock, which has the exact same root cause as https://github.com/apache/spark/pull/12012:

* `onDisconnected` -> `System.exit` are called in sequence in a thread of `MessageLoop.threadpool`.
* `System.exit` triggers the shutdown hooks, and `executor.stop` is one of those hooks.
* `executor.stop` stops the `Dispatcher`, which in turn waits for the `MessageLoop.threadpool` to shut down.
* Thus, the thread that runs `System.exit` waits for the hooks to finish, while the `MessageLoop.threadpool` inside the hook waits for that thread to finish. This mutual dependence results in the deadlock.

### Does this PR introduce _any_ user-facing change?

Yes, the executor shutdown won't hang.

### How was this patch tested?

Pass existing tests.

Closes #33759 from Ngone51/fix-executor-shutdown-hang.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
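The mutual wait described above can be sketched without any Spark code. This is a minimal, hypothetical model (none of these names are Spark APIs): a task running on a single-thread pool must never block on that pool's own termination, so the blocking stop call is handed off to a fresh thread, mirroring the pattern this commit applies.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object StopHandoffSketch {
  // Returns true if the pool drained cleanly. The blocking "stop" is handed
  // off to a fresh thread instead of running on the pool's own thread.
  def stopFromFreshThread(): Boolean = {
    val pool = Executors.newSingleThreadExecutor() // stand-in for MessageLoop.threadpool

    // Stand-in for the shutdown hook's executor.stop(): stop the pool and
    // wait for its tasks to finish.
    def stopPool(): Boolean = {
      pool.shutdown()
      pool.awaitTermination(2, TimeUnit.SECONDS) // false => pool never drained
    }

    val done = new CountDownLatch(1)
    var drained = false
    pool.submit(new Runnable {
      override def run(): Unit = {
        // Calling stopPool() right here would self-wait: awaitTermination
        // would wait for this very task, which would itself be blocked in
        // awaitTermination -- the deadlock shape described above.
        new Thread("stop-executor-sketch") {
          override def run(): Unit = {
            drained = stopPool()
            done.countDown()
          }
        }.start()
      }
    })
    done.await()
    drained
  }

  def main(args: Array[String]): Unit =
    println(s"pool drained cleanly: ${stopFromFreshThread()}")
}
```

Running `main` should report `true`; inlining `stopPool()` into the pool task instead would time out after two seconds, reproducing the self-wait.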
parent a1ecf83f2e · commit 996551fece
```diff
@@ -202,11 +202,17 @@ private[spark] class CoarseGrainedExecutorBackend(
       stopping.set(true)
       new Thread("CoarseGrainedExecutorBackend-stop-executor") {
         override def run(): Unit = {
-          // executor.stop() will call `SparkEnv.stop()` which waits until RpcEnv stops totally.
-          // However, if `executor.stop()` runs in some thread of RpcEnv, RpcEnv won't be able to
-          // stop until `executor.stop()` returns, which becomes a dead-lock (See SPARK-14180).
-          // Therefore, we put this line in a new thread.
-          executor.stop()
+          // `executor` can be null if there's any error in `CoarseGrainedExecutorBackend.onStart`
+          // or fail to create `Executor`.
+          if (executor == null) {
+            System.exit(1)
+          } else {
+            // executor.stop() will call `SparkEnv.stop()` which waits until RpcEnv stops totally.
+            // However, if `executor.stop()` runs in some thread of RpcEnv, RpcEnv won't be able to
+            // stop until `executor.stop()` returns, which becomes a dead-lock (See SPARK-14180).
+            // Therefore, we put this line in a new thread.
+            executor.stop()
+          }
         }
       }.start()
```
```diff
@@ -286,8 +292,7 @@ private[spark] class CoarseGrainedExecutorBackend(
       if (notifyDriver && driver.nonEmpty) {
         driver.get.send(RemoveExecutor(executorId, new ExecutorLossReason(reason)))
       }
-
-      System.exit(code)
+      self.send(Shutdown)
     } else {
       logInfo("Skip exiting executor since it's been already asked to exit before.")
     }
```