26d03b62e2
## What changes were proposed in this pull request?

Log a warning in the driver when loading a single large unsplittable file via `sc.textFile` or the csv/json datasource. The warning is currently logged when all of the following hold:

* only one partition is generated
* the file is unsplittable; possible reasons are:
  - it is compressed by an unsplittable compression algorithm such as gzip
  - multiLine mode in the csv/json datasource
  - wholeText mode in the text datasource
* the file size exceeds the config threshold `spark.io.warning.largeFileThreshold` (default value is 1GB)

## How was this patch tested?

Manually tested. Generate one gzip file exceeding 1GB:

```
base64 -b 50 /dev/urandom | head -c 2000000000 > file1.txt
cat file1.txt | gzip > file1.gz
```

Then launch spark-shell and run:

```
sc.textFile("file:///path/to/file1.gz").count()
```

This prints a log like:

```
WARN HadoopRDD: Loading one large unsplittable file file:/.../f1.gz with only one partition, because the file is compressed by unsplittable compression codec
```

Run:

```
sc.textFile("file:///path/to/file1.txt").count()
```

This prints a log like:

```
WARN HadoopRDD: Loading one large file file:/.../f1.gz with only one partition, we can increase partition numbers by the `minPartitions` argument in method `sc.textFile
```

Run:

```
spark.read.csv("file:///path/to/file1.gz").count
```

This prints a log like:

```
WARN CSVScan: Loading one large unsplittable file file:/.../f1.gz with only one partition, the reason is: the file is compressed by unsplittable compression codec
```

Run:

```
spark.read.option("multiLine", true).csv("file:///path/to/file1.gz").count
```

This prints a log like:

```
WARN CSVScan: Loading one large unsplittable file file:/.../f1.gz with only one partition, the reason is: the csv datasource is set multiLine mode
```

The JSON and Text datasources were also tested with similar cases.

Closes #25134 from WeichenXu123/log_gz.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
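The three conditions listed in the PR description can be sketched as a single predicate. This is a minimal illustration, not the code in the patch: the object name `LargeFileWarningSketch`, the method `shouldWarn`, and its parameters are hypothetical, and the sketch assumes that "1GB" means 1024 * 1024 * 1024 bytes and that "exceeds" is a strict comparison.

```scala
// Hypothetical sketch of the warning condition described in the PR text.
// None of these names come from Spark; they are illustrative only.
object LargeFileWarningSketch {
  // Assumed byte value for the 1GB default of spark.io.warning.largeFileThreshold.
  val DefaultThresholdBytes: Long = 1024L * 1024L * 1024L

  // Returns true only when all three listed conditions hold:
  // a single partition, an unsplittable file, and a size above the threshold.
  def shouldWarn(
      numPartitions: Int,
      isSplittable: Boolean,
      fileSizeBytes: Long,
      thresholdBytes: Long = DefaultThresholdBytes): Boolean =
    numPartitions == 1 && !isSplittable && fileSizeBytes > thresholdBytes
}
```

Under these assumptions, `LargeFileWarningSketch.shouldWarn(1, isSplittable = false, fileSizeBytes = 2000000000L)` is true, while the same call with two partitions or with a splittable file is false. The sketch covers only the unsplittable case from the bullet list; the `sc.textFile` test above shows that a large splittable file read as one partition gets a separate message pointing at `minPartitions`.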