[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc

### What changes were proposed in this pull request?

This adds guidance to the tuning doc on increasing the parallelism of directory listing.

### Why are the changes needed?

Sometimes, when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some information to the Spark tuning guide so the knowledge is better shared.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Committed by HyukjinKwon on 2020-08-21 16:48:54 +09:00
parent c75a82794f
commit bf221debd0
2 changed files with 33 additions and 0 deletions

docs/sql-performance-tuning.md

@@ -114,6 +114,28 @@ that these options will be deprecated in future release as more optimizations ar
</td>
<td>1.1.0</td>
</tr>
<tr>
<td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
<td>32</td>
<td>
Configures the threshold to enable parallel listing for job input paths. If the number of
input paths is larger than this threshold, Spark will list the files using a distributed job.
Otherwise, it will fall back to sequential listing. This configuration is only effective when
using file-based data sources such as Parquet, ORC and JSON.
</td>
<td>1.5.0</td>
</tr>
<tr>
<td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
<td>10000</td>
<td>
Configures the maximum listing parallelism for job input paths. If the number of input
paths is larger than this value, it will be throttled down to this value. As above,
this configuration is only effective when using file-based data sources such as Parquet, ORC
and JSON.
</td>
<td>2.1.1</td>
</tr>
</table>
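
Not part of the patch: a minimal sketch of how the two options above could be set when building a `SparkSession`. The application name, threshold/parallelism values, and input path are illustrative assumptions, not recommendations.

```scala
// Minimal sketch: tune parallel input-path listing for file-based data sources
// such as Parquet, ORC and JSON. Values and the path are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-listing-sketch") // hypothetical app name
  // Switch to a distributed listing job once a scan involves more than 16 input paths.
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "16")
  // Cap the distributed listing job at 1000 tasks even if there are more input paths.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
  .getOrCreate()

// Reading a heavily partitioned table is where listing cost typically shows up.
val df = spark.read.parquet("s3a://my-bucket/events/") // hypothetical path
```
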
## Join Strategy Hints for SQL Queries

docs/tuning.md

@@ -264,6 +264,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.
## Parallel Listing on Input Paths
Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories;
otherwise the process could take a very long time, especially against object stores like S3.
If your job works on an RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (the current default is 1).
For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
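
As a complement to the paragraph above (and not part of the patch), here is a minimal sketch of raising the Hadoop listing thread count for an RDD job that reads many directories through a Hadoop input format; the thread count, application name, and path are assumptions.

```scala
// Minimal sketch: increase input-path listing parallelism for an RDD job that
// uses a Hadoop input format. Values and the path are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("listing-parallelism-sketch") // hypothetical app name
  // Hadoop FileInputFormat lists job input paths with 8 threads instead of the default 1.
  .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "8")
  .getOrCreate()

// An RDD created via a Hadoop input format, e.g. SequenceFiles spread over many directories.
val rdd = spark.sparkContext.sequenceFile[String, String]("s3a://my-bucket/seq/*") // hypothetical path
println(rdd.count())
```
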
## Memory Usage of Reduce Tasks
Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the