faa61980c4
Prior to this patch, all DataFrameReader.csv() calls would collect the first line from the CSV input iterator. This is done to allow schema inference from the header row. However, when a schema is already specified, this is wasteful: it results in an unnecessary compute step on the first partition. This can be expensive if the CSV itself is expensive to generate (e.g. it's the product of a long-running external pipe()). This patch short-circuits the first-line collection in DataFrameReader.csv() when a schema is specified, thereby improving CSV read performance in certain cases. ## What changes were proposed in this pull request? Short-circuiting the DataFrameReader.csv() first-line read when the schema is user-specified. ## How was this patch tested? Compiled and tested against several CSV datasets. Closes #23830 from Mister-Meeseeks/master. Authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com> |
||
---|---|---|
benchmarks | ||
src | ||
pom.xml |
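
The short-circuit described in the commit message can be illustrated with a toy sketch in plain Python. This is not Spark's implementation: `read_csv` and `CountingSource` are hypothetical names invented for the example, and the source is assumed to be a lazy iterator whose every pulled line stands in for expensive upstream work. The point is only that inference has to pull one line eagerly, while a user-supplied schema lets the read stay fully lazy.

```python
import itertools


class CountingSource:
    """Hypothetical lazy line source that counts how many lines were pulled."""

    def __init__(self, lines):
        self._lines = list(lines)
        self.pulled = 0

    def __iter__(self):
        for line in self._lines:
            self.pulled += 1  # stands in for expensive upstream computation
            yield line


def read_csv(source, schema=None):
    """Toy reader returning (schema, lazy row iterator)."""
    it = iter(source)
    if schema is not None:
        # Schema supplied by the caller: nothing to infer, so do not
        # touch the source until the rows are actually consumed.
        return schema, (line.split(",") for line in it)

    # No schema: pull the first line eagerly to learn the column count.
    first = next(it, None)
    if first is None:
        return [], iter(())
    inferred = [f"_c{i}" for i in range(len(first.split(",")))]
    # Put the first line back in front so it is still returned as data.
    rows = (line.split(",") for line in itertools.chain([first], it))
    return inferred, rows


if __name__ == "__main__":
    src = CountingSource(["1,2", "3,4"])
    read_csv(src, schema=["a", "b"])
    print("lines pulled with explicit schema:", src.pulled)

    src = CountingSource(["1,2", "3,4"])
    read_csv(src)
    print("lines pulled with inference:", src.pulled)
```

With an explicit schema, building the reader pulls no lines from the source; without one, exactly one line is pulled up front, which is the cost the patch avoids.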