spark-instrumented-optimizer/core
Andrew Or c9da466edb [SPARK-3015] Block on cleaning tasks to prevent Akka timeouts
More detail on the issue is described in [SPARK-3015](https://issues.apache.org/jira/browse/SPARK-3015), but the TLDR is if we send too many blocking Akka messages that are dependent on each other in quick successions, then we end up causing a few of these messages to time out and ultimately kill the executors. As of #1498, we broadcast each RDD whether or not it is persisted. This means if we create many RDDs (each of which becomes a broadcast) and the driver performs a GC that cleans up all of these broadcast blocks, then we end up sending many `RemoveBroadcast` messages in parallel and trigger the chain of blocking messages at high frequencies.

We do not know of the Akka-level root cause yet, so this is intended to be a temporary solution until we identify the real issue. I have done some preliminary testing of enabling blocking and observed that the queue length remains quite low (< 1000) even under very intensive workloads.

In the long run, we should do something more sophisticated to allow a limited degree of parallelism through batching clean up tasks or processing them in a sliding window. In the longer run, we should clean up the whole `BlockManager*` message passing interface to avoid unnecessarily awaiting on futures created from Akka asks.

tdas pwendell mengxr

Author: Andrew Or <andrewor14@gmail.com>

Closes #1931 from andrewor14/reference-blocking and squashes the following commits:

d0f7195 [Andrew Or] Merge branch 'master' of github.com:apache/spark into reference-blocking
ce9daf5 [Andrew Or] Remove logic for logging queue length
111192a [Andrew Or] Add missing space in log message (minor)
a183b83 [Andrew Or] Switch order of code blocks (minor)
9fd1fe6 [Andrew Or] Remove outdated log
104b366 [Andrew Or] Use the actual reference queue length
0b7e768 [Andrew Or] Block on cleaning tasks by default + log error on queue full
2014-08-15 22:55:32 -07:00
..
src [SPARK-3015] Block on cleaning tasks to prevent Akka timeouts 2014-08-15 22:55:32 -07:00
pom.xml [SPARK-2806] core - upgrade to json4s-jackson 3.2.10 2014-08-05 21:59:10 -07:00