[SPARK-5724] fix the misconfiguration in AkkaUtils
https://issues.apache.org/jira/browse/SPARK-5724

In AkkaUtils, we set several failure-detector-related parameters as follows:

```
val akkaConf = ConfigFactory.parseMap(conf.getAkkaConf.toMap[String, String])
  .withFallback(akkaSslConfig).withFallback(ConfigFactory.parseString(
  s"""
  |akka.daemonic = on
  |akka.loggers = ["akka.event.slf4j.Slf4jLogger"]
  |akka.stdout-loglevel = "ERROR"
  |akka.jvm-exit-on-fatal-error = off
  |akka.remote.require-cookie = "$requireCookie"
  |akka.remote.secure-cookie = "$secureCookie"
  |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
  |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
  |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
  |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
  |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
  |akka.remote.netty.tcp.hostname = "$host"
  |akka.remote.netty.tcp.port = $port
  |akka.remote.netty.tcp.tcp-nodelay = on
  |akka.remote.netty.tcp.connection-timeout = $akkaTimeout s
  |akka.remote.netty.tcp.maximum-frame-size = ${akkaFrameSize}B
  |akka.remote.netty.tcp.execution-pool-size = $akkaThreads
  |akka.actor.default-dispatcher.throughput = $akkaBatchSize
  |akka.log-config-on-start = $logAkkaConfig
  |akka.remote.log-remote-lifecycle-events = $lifecycleEvents
  |akka.log-dead-letters = $lifecycleEvents
  |akka.log-dead-letters-during-shutdown = $lifecycleEvents
  """.stripMargin))
```

However, Akka has no parameter named "akka.remote.transport-failure-detector.threshold" (see http://doc.akka.io/docs/akka/2.3.4/general/configuration.html). The parameter that does exist is "akka.remote.watch-failure-detector.threshold".

Author: CodingCat <zhunansjtu@gmail.com>

Closes #4512 from CodingCat/SPARK-5724 and squashes the following commits:

bafe56e [CodingCat] fix the grammar in configuration doc
338296e [CodingCat] remove failure-detector related info
8bfcfd4 [CodingCat] fix the misconfiguration in AkkaUtils
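The mismatch described in the commit message can be illustrated with a small, dependency-free sketch. Note the assumptions: `knownKeys` is a hand-written subset limited to the failure-detector keys named in this commit (Akka's real reference configuration defines many more), and `AkkaKeyCheck` and `unknownKeys` are hypothetical names, not part of Spark or Akka.

```scala
object AkkaKeyCheck {
  // Assumption: only the failure-detector keys mentioned in this commit,
  // not Akka's complete reference configuration.
  val knownKeys: Set[String] = Set(
    "akka.remote.transport-failure-detector.heartbeat-interval",
    "akka.remote.transport-failure-detector.acceptable-heartbeat-pause",
    "akka.remote.watch-failure-detector.threshold"
  )

  /** Hypothetical helper: returns the configured keys that are not in `knownKeys`. */
  def unknownKeys(configured: Seq[String]): Seq[String] =
    configured.filterNot(knownKeys.contains)
}
```

Passing the three failure-detector keys that AkkaUtils set before this change would flag only `akka.remote.transport-failure-detector.threshold`, which is the key this commit removes.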
parent 757b14b862
commit 242d49584c
@@ -79,8 +79,6 @@ private[spark] object AkkaUtils extends Logging {
     val logAkkaConfig = if (conf.getBoolean("spark.akka.logAkkaConfig", false)) "on" else "off"
 
     val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
-    val akkaFailureDetector =
-      conf.getDouble("spark.akka.failure-detector.threshold", 300.0)
     val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)
 
     val secretKey = securityManager.getSecretKey()
@@ -106,7 +104,6 @@ private[spark] object AkkaUtils extends Logging {
       |akka.remote.secure-cookie = "$secureCookie"
       |akka.remote.transport-failure-detector.heartbeat-interval = $akkaHeartBeatInterval s
       |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $akkaHeartBeatPauses s
-      |akka.remote.transport-failure-detector.threshold = $akkaFailureDetector
       |akka.actor.provider = "akka.remote.RemoteActorRefProvider"
       |akka.remote.netty.tcp.transport-class = "akka.remote.transport.netty.NettyTransport"
       |akka.remote.netty.tcp.hostname = "$host"
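As a rough sketch of what remains after the two AkkaUtils hunks, the snippet below renders only the two transport-failure-detector settings that are kept, using the defaults shown in the diff. It is an illustration under assumptions, not the code in AkkaUtils: `FailureDetectorSettings`, `render`, and the `Map`-based `getInt` stand-in for `SparkConf.getInt` are hypothetical.

```scala
object FailureDetectorSettings {
  // Defaults copied from the diff; the rendered values carry an "s" (seconds) suffix.
  val DefaultHeartBeatPauses = 6000
  val DefaultHeartBeatInterval = 1000

  // Hypothetical stand-in for SparkConf.getInt: read an Int property with a fallback.
  private def getInt(props: Map[String, String], key: String, default: Int): Int =
    props.get(key).map(_.trim.toInt).getOrElse(default)

  /** Renders the two failure-detector lines kept by this commit; no threshold line is emitted. */
  def render(props: Map[String, String]): String = {
    val pauses = getInt(props, "spark.akka.heartbeat.pauses", DefaultHeartBeatPauses)
    val interval = getInt(props, "spark.akka.heartbeat.interval", DefaultHeartBeatInterval)
    s"""|akka.remote.transport-failure-detector.heartbeat-interval = $interval s
        |akka.remote.transport-failure-detector.acceptable-heartbeat-pause = $pauses s""".stripMargin
  }
}
```

In this sketch, a leftover `spark.akka.failure-detector.threshold` entry in the property map is simply ignored, which mirrors the effect of removing the setting.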
@@ -903,36 +903,24 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.akka.heartbeat.pauses</code></td>
   <td>6000</td>
   <td>
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
-    enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause
-    in seconds for akka. This can be used to control sensitivity to gc pauses. Tune this in
-    combination of `spark.akka.heartbeat.interval` and `spark.akka.failure-detector.threshold`
-    if you need to.
-  </td>
-</tr>
-<tr>
-  <td><code>spark.akka.failure-detector.threshold</code></td>
-  <td>300.0</td>
-  <td>
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
-    enabled again, if you plan to use this feature (Not recommended). This maps to akka's
-    `akka.remote.transport-failure-detector.threshold`. Tune this in combination of
-    `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to.
+    This is set to a larger value to disable the transport failure detector that comes built in to Akka.
+    It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart
+    beat pause in seconds for Akka. This can be used to control sensitivity to GC pauses. Tune
+    this along with `spark.akka.heartbeat.interval` if you need to.
   </td>
 </tr>
 <tr>
   <td><code>spark.akka.heartbeat.interval</code></td>
   <td>1000</td>
   <td>
-    This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
-    enabled again, if you plan to use this feature (Not recommended). A larger interval value in
-    seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative for
-    akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` and
-    `spark.akka.failure-detector.threshold` if you need to. Only positive use case for using
-    failure detector can be, a sensistive failure detector can help evict rogue executors really
-    quick. However this is usually not the case as gc pauses and network lags are expected in a
-    real Spark cluster. Apart from that enabling this leads to a lot of exchanges of heart beats
-    between nodes leading to flooding the network with those.
+    This is set to a larger value to disable the transport failure detector that comes built in to Akka.
+    It can be enabled again, if you plan to use this feature (Not recommended). A larger interval
+    value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative
+    for Akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` if you need
+    to. A likely positive use case for using failure detector would be: a sensistive failure detector
+    can help evict rogue executors quickly. However this is usually not the case as GC pauses
+    and network lags are expected in a real Spark cluster. Apart from that enabling this leads to
+    a lot of exchanges of heart beats between nodes leading to flooding the network with those.
   </td>
 </tr>
 <tr>