From bf221debd02b11003b092201d0326302196e4ba5 Mon Sep 17 00:00:00 2001
From: Chao Sun
Date: Fri, 21 Aug 2020 16:48:54 +0900
Subject: [PATCH] [SPARK-32674][DOC] Add suggestion for parallel directory
 listing in tuning doc

### What changes were proposed in this pull request?

This adds some tuning guidance for increasing the parallelism of directory listing.

### Why are the changes needed?

Sometimes when the job input has a large number of directories, the listing can become a bottleneck. There are a few parameters to tune this. This adds the information to the Spark tuning guide so that the knowledge is better shared.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun
Signed-off-by: HyukjinKwon
---
 docs/sql-performance-tuning.md | 22 ++++++++++++++++++++++
 docs/tuning.md                 | 11 +++++++++++
 2 files changed, 33 insertions(+)

diff --git a/docs/sql-performance-tuning.md b/docs/sql-performance-tuning.md
index 5e6f049a51..5d8c3b698c 100644
--- a/docs/sql-performance-tuning.md
+++ b/docs/sql-performance-tuning.md
@@ -114,6 +114,28 @@ that these options will be deprecated in future release as more optimizations ar
     <td>1.1.0</td>
   </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.threshold</code></td>
+    <td>32</td>
+    <td>
+      Configures the threshold to enable parallel listing for job input paths. If the number of
+      input paths is larger than this threshold, Spark will list the files by using a distributed Spark job.
+      Otherwise, it will fall back to sequential listing. This configuration is only effective when
+      using file-based data sources such as Parquet, ORC and JSON.
+    </td>
+    <td>1.5.0</td>
+  </tr>
+  <tr>
+    <td><code>spark.sql.sources.parallelPartitionDiscovery.parallelism</code></td>
+    <td>10000</td>
+    <td>
+      Configures the maximum listing parallelism for job input paths. If the number of input
+      paths is larger than this value, it will be throttled down to this value. As above,
+      this configuration is only effective when using file-based data sources such as Parquet, ORC
+      and JSON.
+    </td>
+    <td>2.1.1</td>
+  </tr>
 </table>
 
 ## Join Strategy Hints for SQL Queries

diff --git a/docs/tuning.md b/docs/tuning.md
index 8e29e5d2e9..18d4a6205f 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -264,6 +264,17 @@ parent RDD's number of partitions. You can pass the level of parallelism as a se
 or set the config property `spark.default.parallelism` to change the default.
 In general, we recommend 2-3 tasks per CPU core in your cluster.
 
+## Parallel Listing on Input Paths
+
+Sometimes you may also need to increase the directory listing parallelism when the job input has a large number of
+directories; otherwise the process could take a very long time, especially when running against object stores like S3.
+If your job works on RDDs with Hadoop input formats (e.g., via `SparkContext.sequenceFile`), the parallelism is
+controlled via [`spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads`](https://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml) (currently the default is 1).
+
+For Spark SQL with file-based data sources, you can tune `spark.sql.sources.parallelPartitionDiscovery.threshold` and
+`spark.sql.sources.parallelPartitionDiscovery.parallelism` to improve listing parallelism. Please
+refer to the [Spark SQL performance tuning guide](sql-performance-tuning.html) for more details.
+
 ## Memory Usage of Reduce Tasks
 
 Sometimes, you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the
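
---

Editor's note: below is a minimal, illustrative sketch (not part of the patch) of how the listing-related configurations documented above might be set when building a `SparkSession`. The application name, the chosen values, and the `s3a://bucket/...` input path are placeholders, not values taken from the patch.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: the values below are placeholders, not recommendations.
val spark = SparkSession.builder()
  .appName("parallel-listing-example") // hypothetical application name
  // RDD / Hadoop input formats: number of threads used to list input paths
  .config("spark.hadoop.mapreduce.input.fileinputformat.list-status.num-threads", "20")
  // Spark SQL file sources: switch to a distributed listing job above this many input paths
  .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
  // Spark SQL file sources: upper bound on the parallelism of that listing job
  .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "10000")
  .getOrCreate()

// Reading a heavily partitioned dataset; listing its many leaf directories is
// where the settings above matter, especially on object stores like S3.
val df = spark.read.parquet("s3a://bucket/path/to/partitioned/table") // placeholder path
```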