4e50f0291f
### What changes were proposed in this pull request? In the PR, I propose to support pushed down filters in CSV datasource. The reason of pushing a filter up to `UnivocityParser` is to apply the filter as soon as all its attributes become available i.e. converted from CSV fields to desired values according to the schema. This allows to skip conversions of other values if the filter returns `false`. This can improve performance when pushed filters are highly selective and conversion of CSV string fields to desired values are comparably expensive ( for example, conversion to `TIMESTAMP` values). Here are details of the implementation: - `UnivocityParser.convert()` converts parsed CSV tokens one-by-one sequentially starting from index 0 up to `parsedSchema.length - 1`. At current index `i`, it applies filters that refer to attributes at row fields indexes `0..i`. If any filter returns `false`, it skips conversions of other input tokens. - Pushed filters are converted to expressions. The expressions are bound to row positions according to `requiredSchema`. The expressions are compiled to predicates via generating Java code. - To be able to apply predicates to partially initialized rows, the predicates are grouped, and combined via the `And` expression. Final predicate at index `N` can refer to row fields at the positions `0..N`, and can be applied to a row even if other fields at the positions `N+1..requiredSchema.lenght-1` are not set. ### Why are the changes needed? The changes improve performance on synthetic benchmarks more **than 9 times** (on JDK 8 & 11): ``` OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.15.2 Intel(R) Core(TM) i7-4850HQ CPU 2.30GHz Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ w/o filters 11889 11945 52 0.0 118893.1 1.0X pushdown disabled 11790 11860 115 0.0 117902.3 1.0X w/ filters 1240 1278 33 0.1 12400.8 9.6X ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added new test suite `CSVFiltersSuite` - Added tests to `CSVSuite` and `UnivocityParserSuite` Closes #26973 from MaxGekk/csv-filters-pushdown. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org> |
||
---|---|---|
.. | ||
AggregateBenchmark-jdk11-results.txt | ||
AggregateBenchmark-results.txt | ||
BloomFilterBenchmark-jdk11-results.txt | ||
BloomFilterBenchmark-results.txt | ||
BuiltInDataSourceWriteBenchmark-jdk11-results.txt | ||
BuiltInDataSourceWriteBenchmark-results.txt | ||
ColumnarBatchBenchmark-jdk11-results.txt | ||
ColumnarBatchBenchmark-results.txt | ||
CompressionSchemeBenchmark-jdk11-results.txt | ||
CompressionSchemeBenchmark-results.txt | ||
CSVBenchmark-jdk11-results.txt | ||
CSVBenchmark-results.txt | ||
DatasetBenchmark-jdk11-results.txt | ||
DatasetBenchmark-results.txt | ||
DataSourceReadBenchmark-jdk11-results.txt | ||
DataSourceReadBenchmark-results.txt | ||
DateTimeBenchmark-jdk11-results.txt | ||
DateTimeBenchmark-results.txt | ||
ExternalAppendOnlyUnsafeRowArrayBenchmark-jdk11-results.txt | ||
ExternalAppendOnlyUnsafeRowArrayBenchmark-results.txt | ||
ExtractBenchmark-jdk11-results.txt | ||
ExtractBenchmark-results.txt | ||
FilterPushdownBenchmark-jdk11-results.txt | ||
FilterPushdownBenchmark-results.txt | ||
HashedRelationMetricsBenchmark-jdk11-results.txt | ||
HashedRelationMetricsBenchmark-results.txt | ||
InExpressionBenchmark-jdk11-results.txt | ||
InExpressionBenchmark-results.txt | ||
IntervalBenchmark-jdk11-results.txt | ||
IntervalBenchmark-results.txt | ||
JoinBenchmark-jdk11-results.txt | ||
JoinBenchmark-results.txt | ||
JsonBenchmark-jdk11-results.txt | ||
JsonBenchmark-results.txt | ||
MakeDateTimeBenchmark-jdk11-results.txt | ||
MakeDateTimeBenchmark-results.txt | ||
MetricsAggregationBenchmark-jdk11-results.txt | ||
MetricsAggregationBenchmark-results.txt | ||
MiscBenchmark-jdk11-results.txt | ||
MiscBenchmark-results.txt | ||
OrcNestedSchemaPruningBenchmark-jdk11-results.txt | ||
OrcNestedSchemaPruningBenchmark-results.txt | ||
OrcV2NestedSchemaPruningBenchmark-jdk11-results.txt | ||
OrcV2NestedSchemaPruningBenchmark-results.txt | ||
ParquetNestedSchemaPruningBenchmark-jdk11-results.txt | ||
ParquetNestedSchemaPruningBenchmark-results.txt | ||
PrimitiveArrayBenchmark-jdk11-results.txt | ||
PrimitiveArrayBenchmark-results.txt | ||
RangeBenchmark-jdk11-results.txt | ||
RangeBenchmark-results.txt | ||
SortBenchmark-jdk11-results.txt | ||
SortBenchmark-results.txt | ||
TPCDSQueryBenchmark-jdk11-results.txt | ||
TPCDSQueryBenchmark-results.txt | ||
UDFBenchmark-jdk11-results.txt | ||
UDFBenchmark-results.txt | ||
UnsafeArrayDataBenchmark-jdk11-results.txt | ||
UnsafeArrayDataBenchmark-results.txt | ||
WideSchemaBenchmark-jdk11-results.txt | ||
WideSchemaBenchmark-results.txt | ||
WideTableBenchmark-jdk11-results.txt | ||
WideTableBenchmark-results.txt |