spark-instrumented-optimizer

History

Marcelo Vanzin b8ccd75524 [SPARK-29905][K8S] Improve pod lifecycle manager behavior with dynamic allocation

This issue mainly shows up when you enable dynamic allocation: because there are many executor state changes (because of executors being requested and starting to run, and later stopped), the lifecycle manager class could end up logging information about the same executor multiple times, since the different events would cause the same executor update to be present in multiple pod snapshots. On top of that, it could end up making multiple redundant calls into the API server for the same pod.

Another issue was when the config was set to not delete executor pods; with dynamic allocation, that means pods keep accumulating in the API server, and every time the full sync is done by the polling source, all executors, even the finished ones that Spark technically does not care about anymore, would be processed.

The change modifies the lifecycle monitor so that it:

- logs executor updates a single time, even if it shows up in multiple snapshots, by checking whether the state change happened before.
- marks finished-but-not-deleted-in-k8s executors with a label so that they can be easily filtered out.

This reduces the amount of logging done by the lifecycle manager, which is a minor thing in general since the logs are at debug level. But it also reduces the amount of data that needs to be fetched from the API server under certain configurations, and overall reduces interaction with the API server when dynamic allocation is on.

There's also a change in the snapshot store to ensure that the same subscriber is not called concurrently. That is kind of a bug, since it means subscribers could be processing snapshots out of order, or even that they could block multiple threads (e.g. the allocator callback was synchronized). I actually ran into the "concurrent calls" situation in the lifecycle manager during testing, and while it did not seem to cause problems, it did make for some head scratching while looking at the logs. It seemed safer to fix that.

Unit tests were updated to check for the changes. Also tested in real cluster with dynamic allocation on.

Closes #26535 from vanzin/SPARK-29905.

Lead-authored-by: Marcelo Vanzin <vanzin@apache.org>
Co-authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

2020-04-16 14:15:10 -07:00
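
The three ideas in the commit message above (act on an executor update only once, label finished pods so they can be filtered out, and never call the same subscriber concurrently) can be illustrated with a minimal Python sketch. This is not the Spark implementation, which is written in Scala; every name here (`LifecycleTracker`, `SerialSubscriber`, `active_pods`, and the `example-inactive` label) is hypothetical and chosen only for illustration.

```python
import threading
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List

# Hypothetical label name; the real label used by Spark is not shown in this listing.
INACTIVE_LABEL = "example-inactive"


@dataclass
class LifecycleTracker:
    """Remembers the last state handled per executor so repeats are skipped."""
    last_state: Dict[int, str] = field(default_factory=dict)
    log: List[str] = field(default_factory=list)

    def on_snapshots(self, snapshots: List[Dict[int, str]]) -> None:
        # Each snapshot maps executor ID -> pod state.
        for snapshot in snapshots:
            for exec_id, state in snapshot.items():
                if self.last_state.get(exec_id) == state:
                    continue  # same state seen before: no log line, no extra work
                self.last_state[exec_id] = state
                self.log.append(f"executor {exec_id} is now {state}")


def active_pods(pods: List[dict]) -> List[dict]:
    """Drops pods carrying the (hypothetical) inactive label."""
    return [p for p in pods if p.get("labels", {}).get(INACTIVE_LABEL) != "true"]


class SerialSubscriber:
    """Delivers snapshots to one callback, in order, from one thread at a time."""

    def __init__(self, callback: Callable[[List[object]], None]) -> None:
        self._callback = callback
        self._pending: Deque[object] = deque()
        self._lock = threading.Lock()
        self._draining = False

    def publish(self, snapshot: object) -> None:
        with self._lock:
            self._pending.append(snapshot)
            if self._draining:
                return  # the thread already draining will deliver this snapshot
            self._draining = True
        while True:
            with self._lock:
                if not self._pending:
                    self._draining = False
                    return
                batch = list(self._pending)
                self._pending.clear()
            # Called outside the lock, but only ever by the single draining thread.
            self._callback(batch)
```

In this sketch a publisher that arrives while another thread is delivering simply enqueues its snapshot and returns, so the callback never runs on two threads at once and snapshots keep their arrival order. A simpler alternative would be a lock held around the callback, at the cost of blocking every publishing thread.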
..
kubernetes	[SPARK-29905][K8S] Improve pod lifecycle manager behavior with dynamic allocation	2020-04-16 14:15:10 -07:00
mesos	[SPARK-18886][CORE] Make Locality wait time measure resource under utilization due to delay scheduling	2020-04-09 11:00:29 +00:00
yarn	[SPARK-31092][YARN][DOC] Add version information to the configuration of Yarn	2020-03-12 09:52:57 +09:00