[SPARK-27868][CORE] Better default value and documentation for socket server backlog.
First, there is currently no public documentation for this setting. So it's hard to even know that it could be a problem if your application starts failing with weird shuffle errors.

Second, the javadoc attached to the code was incorrect; the default value just uses the default value from the JRE, which is 50, instead of having an unbounded queue as the comment implies.

So use a default that is a "rounded" version of the JRE default, and provide documentation explaining that this value may need to be adjusted. Also added a log message that was very helpful in debugging an issue caused by this problem.

Closes #24732 from vanzin/SPARK-27868.

Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
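
To illustrate the backlog behavior the message describes, here is a minimal sketch using plain `java.net.ServerSocket`. It is not how Spark builds its server; the class `BacklogSketch` and the helper `boundPort` are hypothetical names introduced only for this example. The `ServerSocket` API documents that a requested backlog less than or equal to 0 selects an implementation-specific default, which is why a configured value of -1 does not mean "no limit".

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;

public class BacklogSketch {
  /**
   * Hypothetical helper: binds a listening socket on an ephemeral loopback port
   * with the requested accept-queue length, then closes it and returns the port.
   * A requestedBacklog <= 0 makes the JRE fall back to its own default
   * instead of creating an unbounded queue.
   */
  static int boundPort(int requestedBacklog) {
    try (ServerSocket server =
        new ServerSocket(0, requestedBacklog, InetAddress.getLoopbackAddress())) {
      return server.getLocalPort();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // -1 is accepted silently: the JRE substitutes its default backlog.
    System.out.println("JRE default backlog, port " + boundPort(-1));
    // An explicit value such as 64 is passed through as the requested length.
    System.out.println("backlog 64, port " + boundPort(64));
  }
}
```

The requested backlog is only a hint; the operating system may cap or round it, so the value that actually applies can differ from the one passed in.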
parent c1007c2f7c
commit 09ed64d795
@@ -133,6 +133,8 @@ public class TransportServer implements Closeable {
     bootstrap.childHandler(new ChannelInitializer<SocketChannel>() {
       @Override
       protected void initChannel(SocketChannel ch) {
+        logger.debug("New connection accepted for remote address {}.", ch.remoteAddress());
+
         RpcHandler rpcHandler = appRpcHandler;
         for (TransportServerBootstrap bootstrap : bootstraps) {
           rpcHandler = bootstrap.doBootstrap(ch, rpcHandler);
@@ -108,8 +108,8 @@ public class TransportConf {
     return conf.getInt(SPARK_NETWORK_IO_NUMCONNECTIONSPERPEER_KEY, 1);
   }
 
-  /** Requested maximum length of the queue of incoming connections. Default -1 for no backlog. */
-  public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, -1); }
+  /** Requested maximum length of the queue of incoming connections. Default is 64. */
+  public int backLog() { return conf.getInt(SPARK_NETWORK_IO_BACKLOG_KEY, 64); }
 
   /** Number of threads used in the server thread pool. Default to 0, which is 2x#cores. */
   public int serverThreads() { return conf.getInt(SPARK_NETWORK_IO_SERVERTHREADS_KEY, 0); }
@@ -734,6 +734,17 @@ Apart from these, the following properties are also available, and may be useful
     is 15 seconds by default, calculated as <code>maxRetries * retryWait</code>.
   </td>
 </tr>
+<tr>
+  <td><code>spark.shuffle.io.backLog</code></td>
+  <td>64</td>
+  <td>
+    Length of the accept queue for the shuffle service. For large applications, this value may
+    need to be increased, so that incoming connections are not dropped if the service cannot keep
+    up with a large number of connections arriving in a short period of time. This needs to
+    be configured wherever the shuffle service itself is running, which may be outside of the
+    application (see <code>spark.shuffle.service.enabled</code> option below).
+  </td>
+</tr>
 <tr>
   <td><code>spark.shuffle.service.enabled</code></td>
   <td>false</td>
@@ -1515,6 +1526,15 @@ Apart from these, the following properties are also available, and may be useful
     This is used for communicating with the executors and the standalone Master.
   </td>
 </tr>
+<tr>
+  <td><code>spark.rpc.io.backLog</code></td>
+  <td>64</td>
+  <td>
+    Length of the accept queue for the RPC server. For large applications, this value may
+    need to be increased, so that incoming connections are not dropped when a large number of
+    connections arrives in a short period of time.
+  </td>
+</tr>
 <tr>
   <td><code>spark.network.timeout</code></td>
   <td>120s</td>
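
The two documented properties fall back to 64 when unset. The following is a rough sketch of that lookup using `java.util.Properties` as a stand-in for the real configuration object. The class `BacklogConfSketch`, its `backLog` helper, and the way the key is assembled from a module name are assumptions made for illustration, not Spark's actual code; only the two property names and the default of 64 come from the diff above.

```java
import java.util.Properties;

public class BacklogConfSketch {
  // Default taken from the documentation added in this commit.
  static final int DEFAULT_BACKLOG = 64;

  /**
   * Hypothetical lookup: resolves the accept-queue length for a module
   * ("shuffle" or "rpc"), returning the default when the key is absent.
   */
  static int backLog(Properties conf, String module) {
    // Key layout assumed from the two documented names.
    String key = "spark." + module + ".io.backLog";
    String raw = conf.getProperty(key);
    return raw == null ? DEFAULT_BACKLOG : Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // A large application raising the shuffle accept queue; rpc is left unset.
    conf.setProperty("spark.shuffle.io.backLog", "256");

    System.out.println(backLog(conf, "shuffle")); // prints 256
    System.out.println(backLog(conf, "rpc"));     // prints 64
  }
}
```

As the shuffle entry notes, the setting has to be applied wherever the shuffle service runs, so a value set only in the application's own configuration may not reach an external shuffle service.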