spark-instrumented-optimizer/common
manuzhang 0d997e5156 [SPARK-31219][YARN] Enable closeIdleConnections in YarnShuffleService
### What changes were proposed in this pull request?
Close idle connections at shuffle server side when an `IdleStateEvent` is triggered after `spark.shuffle.io.connectionTimeout` or `spark.network.timeout` time. It's based on following investigations.

1. We found connections on our clusters building up continuously (> 10k for some nodes). Is that normal ? We don't think so.
2. We looked into the connections on one node and found there were a lot of half-open connections. (connections only existed on one node)
3. We also checked those connections were very old (> 21 hours). (FYI, https://superuser.com/questions/565991/how-to-determine-the-socket-connection-up-time-on-linux)
4. Looking at the code, TransportContext registers an IdleStateHandler which should fire an IdleStateEvent when timeout. We did a heap dump of the YarnShuffleService and checked the attributes of IdleStateHandler. It turned out firstAllIdleEvent of many IdleStateHandlers were already false so IdleStateEvent were already fired.
5. Finally, we realized the IdleStateEvent would not be handled since closeIdleConnections are hardcoded to false for YarnShuffleService.

### Why are the changes needed?
Idle connections to YarnShuffleService could never be closed, and will be accumulating and taking up memory and file descriptors.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #27998 from manuzhang/spark-31219.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-03-30 12:44:46 -05:00
..
kvstore [SPARK-31014][CORE] InMemoryStore: remove key from parentToChildrenMap when removing key from CountingRemoveIfForEach 2020-03-04 10:05:26 -08:00
network-common [SPARK-30623][CORE] Spark external shuffle allow disable of separate event loop group 2020-03-26 12:37:48 +08:00
network-shuffle [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00
network-yarn [SPARK-31219][YARN] Enable closeIdleConnections in YarnShuffleService 2020-03-30 12:44:46 -05:00
sketch [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00
tags [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT 2020-02-25 19:44:31 -08:00
unsafe [SPARK-30292][SQL][FOLLOWUP] ansi cast from strings to integral numbers (byte/short/int/long) should fail with fraction 2020-03-20 00:52:09 +09:00