[SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc

### What changes were proposed in this pull request?

This adds guidance to the tuning doc on increasing the parallelism of directory listing.

### Why are the changes needed?

Sometimes, when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds some information to the Spark tuning guide so the knowledge is better shared.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Committed by HyukjinKwon on 2020-08-21 16:48:54 +09:00
parent c75a82794f
commit bf221debd0
2 changed files with 33 additions and 0 deletions

docs/sql-performance-tuning.md

@@ -114,6 +114,28 @@ that these options will be deprecated in future release as more optimizations ar
</td>
<td>1.1.0</td>
</tr>
<tr>
<td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
<td>32</td>
<td>
Configures the threshold to enable parallel listing for job input paths. If the number of
input paths is larger than this threshold, Spark will list the files using a distributed job.
Otherwise, it will fall back to sequential listing. This configuration is only effective when
using file-based data sources such as Parquet, ORC and JSON.
</td>
<td>1.5.0</td>
</tr>
<tr>
<td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
<td>10000</td>
<td>
Configures the maximum listing parallelism for job input paths. If the number of input
paths is larger than this value, it will be throttled down to this value. As above,
this configuration is only effective when using file-based data sources such as Parquet, ORC
and JSON.
</td>
<td>2.1.1</td>
</tr>
</table>
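
Not part of the patch: a minimal sketch of how the two options above could be set when building a `SparkSession`. The application name, threshold/parallelism values, and input path are illustrative assumptions, not recommendations.

```scala
// Minimal sketch: tune parallel input-path listing for file-based data sources
// such as Parquet, ORC and JSON. Values and the path are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("parallel-listing-sketch") // hypothetical app name
  // Switch to a distributed listing job once a scan involves more than 16 input paths.
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "16")
  // Cap the distributed listing job at 1000 tasks even if there are more input paths.
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "1000")
  .getOrCreate()

// Reading a heavily partitioned table is where listing cost typically shows up.
val df = spark.read.parquet("s3a://my-bucket/events/") // hypothetical path
```
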
## Join Strategy Hints for SQL Queries

docs/tuning.md

@@ -264,6 +264,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.
## Parallel Listing on Input Paths
Sometimes you may also need to increase directory listing parallelism when the job input has a large number of directories;
otherwise the process could take a very long time, especially against object stores like S3.
If your job works on an RDD with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (the current default is 1).
For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
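
As a complement to the paragraph above (and not part of the patch), here is a minimal sketch of raising the Hadoop listing thread count for an RDD job that reads many directories through a Hadoop input format; the thread count, application name, and path are assumptions.

```scala
// Minimal sketch: increase input-path listing parallelism for an RDD job that
// uses a Hadoop input format. Values and the path are illustrative only.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("listing-parallelism-sketch") // hypothetical app name
  // Hadoop FileInputFormat lists job input paths with 8 threads instead of the default 1.
  .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "8")
  .getOrCreate()

// An RDD created via a Hadoop input format, e.g. SequenceFiles spread over many directories.
val rdd = spark.sparkContext.sequenceFile[String, String]("s3a://my-bucket/seq/*") // hypothetical path
println(rdd.count())
```
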
## Memory Usage of Reduce Tasks
Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the