78a403fab9
## What changes were proposed in this pull request? ### Background: The data source option `pathGlobFilter` was introduced for the binary file format in https://github.com/apache/spark/pull/24354. It can be used for filtering file names, e.g. reading only `.png` files when there are also `.json` files in the same directory. ### Proposal: Make `pathGlobFilter` a general option for all file sources. The path filtering should happen during path globbing on the driver. ### Motivation: Filtering file path names in file scan tasks on executors is kind of ugly. ### Impact: 1. The splitting of file partitions will be more balanced. 2. The metrics of file scans will be more accurate. 3. Users can use the option when reading other file sources. ## How was this patch tested? Unit tests. Closes #24518 from gengliangwang/globFilter. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
---|---|---|
.. | ||
benchmarks | ||
src | ||
v1.2.1/src | ||
v2.3.4/src | ||
pom.xml |
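
The commit message above argues that applying `pathGlobFilter` while listing files on the driver gives more balanced file partitions than filtering inside scan tasks. The following is a minimal, Spark-free sketch of that idea, not Spark's implementation: `filter_paths` and `plan_partitions` are hypothetical helper names, Python's `fnmatchcase` only approximates Hadoop glob syntax, and the greedy size-based split is an assumed stand-in for Spark's real partition planning.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def filter_paths(paths, path_glob_filter=None):
    """Keep paths whose file name matches the glob; None keeps everything.

    Hypothetical helper: mimics applying the filter at listing time.
    """
    if path_glob_filter is None:
        return list(paths)
    return [p for p in paths if fnmatchcase(basename(p), path_glob_filter)]


def plan_partitions(sizes_by_path, num_partitions):
    """Greedy split: largest file first, always into the lightest partition.

    Assumed stand-in for real partition planning; returns (bins, loads).
    """
    bins = [[] for _ in range(num_partitions)]
    loads = [0] * num_partitions
    for path, size in sorted(sizes_by_path.items(), key=lambda kv: -kv[1]):
        i = loads.index(min(loads))
        bins[i].append(path)
        loads[i] += size
    return bins, loads


# Illustrative directory: four small images next to one large JSON file.
files = {
    "/data/a.png": 40,
    "/data/b.png": 40,
    "/data/c.json": 400,
    "/data/d.png": 40,
    "/data/e.png": 40,
}

# Filtering later, inside tasks: the JSON file still shapes the split,
# so one partition holds only a file that will be discarded.
_, loads_unfiltered = plan_partitions(files, 2)

# Filtering first, at listing time: only matching files are planned.
kept = {p: files[p] for p in filter_paths(files, "*.png")}
_, loads_filtered = plan_partitions(kept, 2)

print(loads_unfiltered, loads_filtered)
```

With the filter applied before planning, the planned sizes cover only files that will actually be read, which is also why scan metrics come out more accurate. In Spark itself the option is passed on the reader, along the lines of `spark.read.format("json").option("pathGlobFilter", "*.json").load(path)`.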