spark-instrumented-optimizer/sql/core
Douglas R Colkitt faa61980c4 [SPARK-26935][SQL] Skip DataFrameReader's CSV first line scan when not used
Prior to this patch, all DataFrameReader.csv() calls would collect the first
line from the CSV input iterator. This is done to allow schema inference from the
header row.

However when schema is already specified this is a wasteful operation. It results
in an unncessary compute step on the first partition. This can be expensive if
the CSV itself is expensive to generate (e.g. it's the product of a long-running
external pipe()).

This patch short-circuits the first-line collection in DataFrameReader.csv() when
schema is specified. Thereby improving CSV read performance in certain cases.

## What changes were proposed in this pull request?

Short-circuiting DataFrameReader.csv() first-line read when schema is user-specified.

## How was this patch tested?

Compiled and tested against several CSV datasets.

Closes #23830 from Mister-Meeseeks/master.

Authored-by: Douglas R Colkitt <douglas.colkitt@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-02-23 14:00:57 -06:00
..
benchmarks [SPARK-26903][SQL] Remove the TimeZone cache 2019-02-23 09:44:22 -06:00
src [SPARK-26935][SQL] Skip DataFrameReader's CSV first line scan when not used 2019-02-23 14:00:57 -06:00
pom.xml [SPARK-25956] Make Scala 2.12 as default Scala version in Spark 3.0 2018-11-14 16:22:23 -08:00