[SPARK-27868][CORE] Better default value and documentation for socket server backlog.

First, this setting currently has no public documentation, so it is hard to
even know that it could be the cause when an application starts failing with
unexplained shuffle errors.

Second, the javadoc attached to the code was incorrect: the old default of -1
just falls back to the JRE default, which is 50, instead of giving an unbounded
queue as the comment implied.
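
As a rough illustration of that mismatch, here is a minimal JDK-only sketch.
It is an analogy rather than Spark's actual code path (Spark binds through
Netty), and the class name BacklogDemo and the helper bind are hypothetical.
The java.net.ServerSocket javadoc states that a backlog of 0 or less selects
an implementation-specific default, not an unbounded queue.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class BacklogDemo {
  /** Binds an ephemeral loopback port, requesting the given accept-queue length. */
  static ServerSocket bind(int backlog) throws IOException {
    // Per the ServerSocket javadoc, backlog <= 0 means "use the implementation
    // default"; it does not mean "no limit".
    return new ServerSocket(0, backlog, InetAddress.getLoopbackAddress());
  }

  public static void main(String[] args) throws IOException {
    try (ServerSocket implicit = bind(-1); ServerSocket explicit = bind(64)) {
      System.out.println("default backlog, port " + implicit.getLocalPort());
      System.out.println("backlog 64, port " + explicit.getLocalPort());
    }
  }
}
```

The requested backlog is only a hint to the operating system, which may cap or
round it, so the sketch does not try to read the effective value back.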

So use a default of 64, a "rounded" version of the JRE default, and provide
documentation explaining that this value may need to be adjusted. Also add a
debug-level log message for each accepted connection, which was very helpful
in debugging an issue caused by this problem.
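
The lookup-with-default behavior can be sketched as follows. This is a
simplified stand-in that uses java.util.Properties instead of Spark's own
configuration classes. The class BacklogConf is hypothetical, and the
"spark.<module>.io.backLog" key layout is an assumption inferred from the two
documented keys, spark.shuffle.io.backLog and spark.rpc.io.backLog.

```java
import java.util.Properties;

public class BacklogConf {
  /** Default introduced by this change. */
  static final int DEFAULT_BACKLOG = 64;

  /**
   * Returns the configured backlog for a module, or the default when unset.
   * The key layout is assumed from the documented shuffle and rpc keys.
   */
  static int backLog(Properties conf, String module) {
    String key = "spark." + module + ".io.backLog";
    String value = conf.getProperty(key);
    return value == null ? DEFAULT_BACKLOG : Integer.parseInt(value.trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // 256 is an arbitrary illustrative value, not a recommendation.
    conf.setProperty("spark.shuffle.io.backLog", "256");
    System.out.println(backLog(conf, "shuffle")); // prints 256
    System.out.println(backLog(conf, "rpc"));     // prints 64
  }
}
```

For the shuffle key, the value has to be set wherever the shuffle service
itself runs, which may be outside of the application.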

Closes #24732 from vanzin/SPARK-27868.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Marcelo Vanzin 2019-05-29 14:56:36 -07:00 committed by Dongjoon Hyun
parent c1007c2f7c
commit 09ed64d795
3 changed files with 24 additions and 2 deletions


@ -133,6 +133,8 @@ public class TransportServer implements Closeable {
bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
@Override
protected void initChannel(SocketChannel ch) {
logger.debug("New connection accepted for remote address {}.", ch.remoteAddress());
RpcHandler rpcHandler = appRpcHandler;
for (TransportServerBootstrap bootstrap : bootstraps) {
rpcHandler = bootstrap.doBootstrap(ch, rpcHandler);


@ -108,8 +108,8 @@ public class TransportConf {
return conf.getInt(SPARK_NETWORK_IO_NUMCONNECTIONSPERPEER_KEY, 1);
}
- /** Requested maximum length of the queue of incoming connections. Default -1 for no backlog. */
- public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, -1); }
+ /** Requested maximum length of the queue of incoming connections. Default is 64. */
+ public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 64); }
/** Number of threads used in the server thread pool. Default to 0, which is 2x#cores. */
public int serverThreads() { return conf.getInt(SPARK_NETWORK_IO_SERVERTHREADS_KEY, 0); }


@ -734,6 +734,17 @@ Apart from these, the following properties are also available, and may be useful
is 15 seconds by default, calculated as <code>maxRetries * retryWait</code>.
</td>
</tr>
<tr>
<td><code>spark.shuffle.io.backLog</code></td>
<td>64</td>
<td>
Length of the accept queue for the shuffle service. For large applications, this value may
need to be increased, so that incoming connections are not dropped if the service cannot keep
up with a large number of connections arriving in a short period of time. This needs to
be configured wherever the shuffle service itself is running, which may be outside of the
application (see <code>spark.shuffle.service.enabled</code> option below).
</td>
</tr>
<tr>
<td><code>spark.shuffle.service.enabled</code></td>
<td>false</td>
@ -1515,6 +1526,15 @@ Apart from these, the following properties are also available, and may be useful
This is used for communicating with the executors and the standalone Master.
</td>
</tr>
<tr>
<td><code>spark.rpc.io.backLog</code></td>
<td>64</td>
<td>
Length of the accept queue for the RPC server. For large applications, this value may
need to be increased, so that incoming connections are not dropped when a large number of
connections arrives in a short period of time.
</td>
</tr>
<tr>
<td><code>spark.network.timeout</code></td>
<td>120s</td>