spark-instrumented-optimizer/core
Marcelo Vanzin dca838058f [SPARK-29950][K8S] Blacklist deleted executors in K8S with dynamic allocation
The issue here is that when Spark is downscaling the application and deletes
a few pod requests that aren't needed anymore, it may actually race with the
K8S scheduler, which may be bringing up those executors. They may have enough
time to connect back to the driver and register, only to be deleted soon after.
This wastes resources and causes misleading entries in the driver log.

The change (ab)uses the blacklisting mechanism to consider the deleted excess
pods as blacklisted, so that if they try to connect back, the driver will deny
it.
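
A minimal sketch of that idea follows, assuming hypothetical names (the real
logic lives in Spark's K8S scheduler backend and uses different identifiers):
the driver remembers the executor IDs of pods it deleted as excess, and denies
any later registration attempt from them.

    // Illustrative sketch only; deletedExcessExecutors and
    // isExecutorBlacklisted are hypothetical names, not the identifiers
    // used in Spark's KubernetesClusterSchedulerBackend.
    import java.util.concurrent.ConcurrentHashMap

    class ExcessPodTracker {
      // Executor IDs whose pods were deleted while downscaling.
      private val deletedExcessExecutors =
        ConcurrentHashMap.newKeySet[String]()

      // Called when the backend deletes pod requests that are no longer needed.
      def onExcessPodsDeleted(executorIds: Seq[String]): Unit = {
        executorIds.foreach(deletedExcessExecutors.add)
      }

      // Consulted when an executor tries to register with the driver; a deleted
      // pod that raced with the K8S scheduler and came up anyway is denied.
      def isExecutorBlacklisted(executorId: String): Boolean = {
        deletedExcessExecutors.contains(executorId)
      }
    }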

It also changes the executor registration slightly, since even with the above
change there were misleading logs. That was because the executor registration
message was an RPC that always succeeded (bar network issues), so the executor
would always try to send an unregistration message to the driver, which would
then log several messages about not knowing anything about the executor. The
change makes the registration RPC succeed or fail directly, instead of using
the separate failure message that would lead to this issue.
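
A rough sketch of the registration change, again with illustrative names
rather than the real CoarseGrainedSchedulerBackend messages: the registration
call answers with an explicit success or failure, so a denied executor never
ends up half-registered and later sending an unregistration message the driver
knows nothing about.

    // Illustrative sketch only; types and names are hypothetical.
    sealed trait RegistrationResult
    case object Registered extends RegistrationResult
    final case class RegistrationDenied(reason: String) extends RegistrationResult

    class DriverRegistrationSketch(isBlacklisted: String => Boolean) {
      private val knownExecutors = scala.collection.mutable.Set.empty[String]

      // Answer the registration RPC directly instead of replying with success
      // and sending a separate failure message afterwards.
      def registerExecutor(executorId: String): RegistrationResult = synchronized {
        if (isBlacklisted(executorId)) {
          RegistrationDenied(s"Executor $executorId belongs to a deleted pod")
        } else if (!knownExecutors.add(executorId)) {
          RegistrationDenied(s"Duplicate executor ID: $executorId")
        } else {
          Registered
        }
      }
    }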

Note the last change required some changes in a standalone test suite related
to dynamic allocation, since it relied on the driver not throwing exceptions
when a duplicate executor registration happened.

Tested with existing unit tests, and with a live cluster with dynamic allocation on.

Closes #26586 from vanzin/SPARK-29950.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2020-01-16 13:37:11 -08:00
benchmarks [SPARK-29576][CORE] Use Spark's CompressionCodec for Ser/Deser of MapOutputStatus 2019-10-23 18:17:37 -07:00
src [SPARK-29950][K8S] Blacklist deleted executors in K8S with dynamic allocation 2020-01-16 13:37:11 -08:00
pom.xml [INFRA] Reverts commit 56dcd79 and c216ef1 2019-12-16 19:57:44 -07:00