[SPARK-5724] fix the misconfiguration in AkkaUtils

https://issues.apache.org/jira/browse/SPARK-5724

In AkkaUtils, we set several failure-detector-related parameters as follows:

```
val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
      .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
      s"""
      |akka.daemonic = on
      |akka.loggers = [""akka.event.slf4j.Slf4jLogger""]
      |akka.stdout-loglevel = "ERROR"
      |akka.jvm-exit-on-fatal-error = off
      |akka.remote.require-cookie = "$requireCookie"
      |akka.remote.secure-cookie = "$secureCookie"
      |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
      |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
      |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
      |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
      |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
      |akka.remote.netty.tcp.hostname = "$host"
      |akka.remote.netty.tcp.port = $port
      |akka.remote.netty.tcp.tcp-nodelay = on
      |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
      |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
      |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
      |akka.actor.default-dispatcher.throughput = $akkaBatchSize
      |akka.log-config-on-start = $logAkkaConfig
      |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
      |akka.log-dead-letters = $lifecycleEvents
      |akka.log-dead-letters-during-shutdown = $lifecycleEvents
      """.stripMargin))

```

Actually, there is no parameter named `akka.remote.transport-failure-detector.threshold`
(see: http://doc.akka.io/docs/akka/2.3.4/general/configuration.html).
What Akka does define is `akka.remote.watch-failure-detector.threshold`.
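For reference, in Akka 2.3 the phi-accrual `threshold` belongs only to the watch (remote death-watch) failure detector; the transport failure detector is deadline-based and exposes only interval/pause settings. A sketch of the relevant keys as a HOCON fragment (values are the Akka 2.3 documented defaults; check `reference.conf` for your exact version):

```
# Phi-accrual detector used by remote death watch -- this one has a threshold:
akka.remote.watch-failure-detector.threshold = 10.0
akka.remote.watch-failure-detector.heartbeat-interval = 1 s
akka.remote.watch-failure-detector.acceptable-heartbeat-pause = 10 s

# Deadline-based transport detector -- interval and pause only, no threshold key:
akka.remote.transport-failure-detector.heartbeat-interval = 4 s
akka.remote.transport-failure-detector.acceptable-heartbeat-pause = 10 s
```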

Author: CodingCat <zhunansjtu@gmail.com>

Closes #4512 from CodingCat/SPARK-5724 and squashes the following commits:

bafe56e [CodingCat] fix the grammar in configuration doc
338296e [CodingCat] remove failure-detector related info
8bfcfd4 [CodingCat] fix the misconfiguration in AkkaUtils
Committed by: CodingCat 2015-02-23 11:29:25 +00:00 (committed by Sean Owen)
parent 757b14b862
commit 242d49584c
2 changed files with 12 additions and 27 deletions


```diff
@@ -79,8 +79,6 @@ private[spark] object AkkaUtils extends Logging {
     val logAkkaConfig = if (conf.getBoolean("spark.akka.logAkkaConfig", false)) "on" else "off"
     val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
-    val akkaFailureDetector =
-      conf.getDouble("spark.akka.failure-detector.threshold", 300.0)
     val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)
     val secretKey = securityManager.getSecretKey()
```
```diff
@@ -106,7 +104,6 @@ private[spark] object AkkaUtils extends Logging {
       |akka.remote.secure-cookie = "$secureCookie"
       |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
       |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
-      |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
       |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
       |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
       |akka.remote.netty.tcp.hostname = "$host"
```
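A minimal, self-contained sketch of the `s`-interpolator + `stripMargin` pattern AkkaUtils uses to build this HOCON fragment, reduced to the two transport-failure-detector keys that survive the fix (the values here are hypothetical stand-ins, not read from a real `SparkConf`):

```scala
// Sketch only: hard-coded values stand in for SparkConf lookups.
object FailureDetectorConfigSketch {
  def fragment(heartBeatInterval: Int, heartBeatPauses: Int): String =
    s"""
    |akka.remote.transport-failure-detector.heartbeat-interval = $heartBeatInterval s
    |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $heartBeatPauses s
    """.stripMargin

  def main(args: Array[String]): Unit = {
    val conf = fragment(1000, 6000)
    // The non-existent key removed by this patch must not reappear.
    assert(!conf.contains("transport-failure-detector.threshold"))
    println(conf.trim)
  }
}
```

`stripMargin` drops everything up to and including the leading `|` on each line, which is why the generated fragment parses as clean HOCON despite the Scala-side indentation.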


```diff
@@ -903,36 +903,24 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.akka.heartbeat.pauses</code></td>
   <td>6000</td>
   <td>
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
-    enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause
-    in seconds for akka. This can be used to control sensitivity to gc pauses. Tune this in
-    combination of `spark.akka.heartbeat.interval` and `spark.akka.failure-detector.threshold`
-    if you need to.
-  </td>
-</tr>
-<tr>
-  <td><code>spark.akka.failure-detector.threshold</code></td>
-  <td>300.0</td>
-  <td>
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
-    enabled again, if you plan to use this feature (Not recommended). This maps to akka's
-    `akka.remote.transport-failure-detector.threshold`. Tune this in combination of
-    `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to.
+    This is set to a larger value to disable the transport failure detector that comes built in to Akka.
+    It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart
+    beat pause in seconds for Akka. This can be used to control sensitivity to GC pauses. Tune
+    this along with `spark.akka.heartbeat.interval` if you need to.
   </td>
 </tr>
 <tr>
   <td><code>spark.akka.heartbeat.interval</code></td>
   <td>1000</td>
   <td>
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
-    enabled again, if you plan to use this feature (Not recommended). A larger interval value in
-    seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for
-    akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and
-    `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using
-    failure detector can be, a sensistive failure detector can help evict rogue executors really
-    quick. However this is usually not the case as gc pauses and network lags are expected in a
-    real Spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats
-    between nodes leading to flooding the network with those.
+    This is set to a larger value to disable the transport failure detector that comes built in to Akka.
+    It can be enabled again, if you plan to use this feature (Not recommended). A larger interval
+    value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative
+    for Akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` if you need
+    to. A likely positive use case for using failure detector would be: a sensistive failure detector
+    can help evict rogue executors quickly. However this is usually not the case as GC pauses
+    and network lags are expected in a real Spark cluster. Apart from that enabling this leads to
+    a lot of exchanges of heart beats between nodes leading to flooding the network with those.
   </td>
 </tr>
 <tr>
```