[SPARK-8437] [DOCS] Using directory path without wildcard for filename slow for large number of files with wholeTextFiles and binaryFiles
Note that 'dir/*' can be more efficient in some Hadoop FS implementations than 'dir/'
Author: Sean Owen <sowen@cloudera.com>
Closes #7036 from srowen/SPARK-8437 and squashes the following commits:
0e813ae [Sean Owen] Note that 'dir/*' can be more efficient in some Hadoop FS implementations than 'dir/'
(cherry picked from commit 5d30eae560)
Signed-off-by: Andrew Or <andrew@databricks.com>
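
As an illustration of the note this commit adds, here is a minimal Scala sketch that reads a directory through a `dir/*` glob instead of the bare directory path. `toGlobPath` and the default input path `/tmp/input` are hypothetical names introduced only for this example and are not part of Spark. Whether the glob form is actually faster depends on the Hadoop FileSystem implementation in use.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object GlobPathExample {
  // Hypothetical helper (not part of Spark): turn "dir" or "dir/" into "dir/*".
  // A path that already ends in "/*" is returned unchanged.
  def toGlobPath(dir: String): String =
    if (dir.endsWith("/*")) dir
    else s"${dir.stripSuffix("/")}/*"

  def main(args: Array[String]): Unit = {
    // Local master is an assumption so the sketch can run without a cluster.
    val conf = new SparkConf().setAppName("glob-path-example").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Assumed input directory; pass a real one as the first argument.
      val dir = args.headOption.getOrElse("/tmp/input")
      val path = toGlobPath(dir)

      // Both calls accept a glob; minPartitions is only a suggestion to Spark.
      val texts = sc.wholeTextFiles(path, minPartitions = 4)
      val blobs = sc.binaryFiles(path, minPartitions = 4)

      println(s"text files: ${texts.count()}, binary files: ${blobs.count()}")
    } finally {
      sc.stop()
    }
  }
}
```

The helper only rewrites the path string. It does not check that the directory exists, and a `dir/*` glob matches the direct children of `dir` only.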
parent cdfa388dd0
commit b2684557fa
@@ -824,6 +824,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    * }}}
    *
    * @note Small files are preferred, large file is also allowable, but may cause bad performance.
+   * @note On some filesystems, `.../path/*` can be a more efficient way to read all files in a directory
+   *       rather than `.../path/` or `.../path`
    *
    * @param minPartitions A suggestion value of the minimal splitting number for input data.
    */
@@ -871,9 +873,11 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
    *   (a-hdfs-path/part-nnnnn, its content)
    * }}}
    *
-   * @param minPartitions A suggestion value of the minimal splitting number for input data.
-   *
    * @note Small files are preferred; very large files may cause bad performance.
+   * @note On some filesystems, `.../path/*` can be a more efficient way to read all files in a directory
+   *       rather than `.../path/` or `.../path`
+   *
+   * @param minPartitions A suggestion value of the minimal splitting number for input data.
    */
   @Experimental
   def binaryFiles(