spark-instrumented-optimizer/sql/core/benchmarks/CSVBenchmark-results.txt
Maxim Gekk 4e50f0291f [SPARK-30323][SQL] Support filters pushdown in CSV datasource
### What changes were proposed in this pull request?

In this PR, I propose supporting pushed-down filters in the CSV datasource. The reason for pushing a filter down to `UnivocityParser` is to apply the filter as soon as all of its attributes become available, i.e. converted from CSV fields to the desired values according to the schema. This allows skipping the conversion of the remaining values if the filter returns `false`. This can improve performance when the pushed filters are highly selective and the conversion of CSV string fields to the desired values is comparatively expensive (for example, conversion to `TIMESTAMP` values).

Here are the details of the implementation:
- `UnivocityParser.convert()` converts parsed CSV tokens one by one, sequentially, starting from index 0 up to `parsedSchema.length - 1`. At the current index `i`, it applies the filters that refer to attributes at row field indexes `0..i`. If any filter returns `false`, it skips the conversion of the remaining input tokens.
- Pushed filters are converted to expressions. The expressions are bound to row positions according to `requiredSchema`. The expressions are compiled to predicates by generating Java code.
- To be able to apply predicates to partially initialized rows, the predicates are grouped and combined via the `And` expression. The final predicate at index `N` can refer to row fields at positions `0..N`, and can be applied to a row even if the other fields at positions `N+1..requiredSchema.length - 1` are not set.
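
The early-exit idea described in the list above can be illustrated with a small, language-agnostic sketch. This is not Spark's code: `group_filters`, `convert`, the `(indexes, predicate)` filter representation, and the `stats` counter are hypothetical names introduced only for illustration, and plain Python callables stand in for the generated Java predicates.

```python
from datetime import datetime

def conjunction(preds):
    """Combine predicates with logical AND (stand-in for the `And` expression)."""
    return lambda row: all(p(row) for p in preds)

def group_filters(filters, num_fields):
    """Group filters by the highest field index each one reads.

    Each filter is a hypothetical (indexes, predicate) pair and is assumed
    to read at least one field. The result has one slot per field: either
    a combined predicate or None when no filter becomes applicable there.
    """
    groups = [[] for _ in range(num_fields)]
    for indexes, pred in filters:
        groups[max(indexes)].append(pred)
    return [conjunction(g) if g else None for g in groups]

def convert(tokens, converters, predicates, stats):
    """Convert tokens left to right; stop as soon as a predicate fails."""
    row = [None] * len(converters)
    for i, (token, conv) in enumerate(zip(tokens, converters)):
        row[i] = conv(token)
        stats["conversions"] += 1
        pred = predicates[i]
        # The predicate at index i only reads fields 0..i, which are all set.
        if pred is not None and not pred(row):
            return None  # row filtered out; remaining tokens are never converted
    return row

# Schema: (int, string, timestamp); the timestamp conversion is the costly one.
converters = [int, str, lambda s: datetime.strptime(s, "%Y-%m-%d %H:%M:%S")]
filters = [([0], lambda r: r[0] > 10)]  # filter on the first column only
predicates = group_filters(filters, len(converters))

stats = {"conversions": 0}
lines = ["5,a,2020-01-16 13:10:08", "42,b,2020-01-16 13:10:08"]
rows = [convert(line.split(","), converters, predicates, stats) for line in lines]
kept = [r for r in rows if r is not None]
```

In this sketch the first line is rejected after a single conversion, so its timestamp is never parsed, while the second line is converted in full.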

### Why are the changes needed?
The changes improve performance on synthetic benchmarks by **more than 9 times** (on JDK 8 & 11):
```
OpenJDK 64-Bit Server VM 11.0.5+10 on Mac OS X 10.15.2
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       11889          11945          52          0.0      118893.1       1.0X
pushdown disabled                                 11790          11860         115          0.0      117902.3       1.0X
w/ filters                                         1240           1278          33          0.1       12400.8       9.6X
```
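
As a quick sanity check of the speedup figure, the `Relative` column can be reproduced from the `Per Row(ns)` column. The sketch below assumes that `Relative` is the first row's per-row time divided by each row's per-row time, rounded to one decimal place; that is an assumption about the benchmark harness rather than something stated in this PR, and the helper name `relative` is hypothetical.

```python
def relative(baseline_ns_per_row, case_ns_per_row):
    """Speedup of a case over the baseline, rounded to one decimal place."""
    return round(baseline_ns_per_row / case_ns_per_row, 1)

# Per Row(ns) values copied from the JDK 11 table above.
per_row_ns = {
    "w/o filters": 118893.1,
    "pushdown disabled": 117902.3,
    "w/ filters": 12400.8,
}

baseline = per_row_ns["w/o filters"]  # assumed baseline: the first row
speedups = {name: relative(baseline, ns) for name, ns in per_row_ns.items()}
```

Using the per-row times instead of the rounded `Best Time(ms)` values keeps more precision in the ratio.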

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added a new test suite `CSVFiltersSuite`
- Added tests to `CSVSuite` and `UnivocityParserSuite`

Closes #26973 from MaxGekk/csv-filters-pushdown.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-16 13:10:08 +09:00

================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 51602          51659          59          0.0      1032039.4       1.0X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                              191926         192879        1615          0.0      191925.6       1.0X
Select 100 columns                                46766          46846          69          0.0       46766.1       4.1X
Select one column                                 35877          35930          83          0.0       35876.8       5.3X
count()                                           11186          11262          65          0.1       11186.0      17.2X
Select 100 columns, one bad input field           59943          60107         232          0.0       59943.0       3.2X
Select 100 columns, corrupt record field          73062          73406         479          0.0       73062.2       2.6X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       22389          22447          87          0.4        2238.9       1.0X
Select 1 column + count()                         14844          14890          43          0.7        1484.4       1.5X
count()                                            5519           5538          18          1.8         551.9       4.1X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                     1949           1977          25          5.1         194.9       1.0X
to_csv(timestamp)                                 14944          15702         714          0.7        1494.4       0.1X
write timestamps to files                         12983          12998          14          0.8        1298.3       0.2X
Create a dataset of dates                          2156           2164           7          4.6         215.6       0.9X
to_csv(date)                                       9675           9709          41          1.0         967.5       0.2X
write dates to files                               7880           7897          15          1.3         788.0       0.2X

OpenJDK 64-Bit Server VM 1.8.0_232-8u232-b09-0ubuntu1~18.04.1-b09 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     2235           2245          10          4.5         223.5       1.0X
read timestamps from files                        54490          54690         283          0.2        5449.0       0.0X
infer timestamps from files                      104501         104737         236          0.1       10450.1       0.0X
read date text from files                          2035           2040           6          4.9         203.5       1.1X
read date from files                              39650          39707          52          0.3        3965.0       0.1X
infer date from files                             29235          29363         164          0.3        2923.5       0.1X
timestamp strings                                  3412           3426          18          2.9         341.2       0.7X
parse timestamps from Dataset[String]             66864          67804         981          0.1        6686.4       0.0X
infer timestamps from Dataset[String]            118780         119284         837          0.1       11878.0       0.0X
date strings                                       3730           3734           4          2.7         373.0       0.6X
parse dates from Dataset[String]                  48728          49071         309          0.2        4872.8       0.0X
from_csv(timestamp)                               62294          62493         260          0.2        6229.4       0.0X
from_csv(date)                                    44581          44665         117          0.2        4458.1       0.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.2
Intel(R) Core(TM) i7-4850HQ CPU @ 2.30GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       12557          12634          78          0.0      125572.9       1.0X
pushdown disabled                                 12449          12509          65          0.0      124486.4       1.0X
w/ filters                                         1372           1393          18          0.1       13724.8       9.1X