d65f534c5a
### What changes were proposed in this pull request?

With the original benchmark, where the timestamp values are valid for the new parser, the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5781 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 44764 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 93764 ms
[info] Running case: from_json(timestamp)
[info] Stopped after 3 iterations, 59021 ms
```

When we modify the benchmark to

```scala
def timestampStr: Dataset[String] = {
  spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
    iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
  }.select($"value".as("timestamp")).as[String]
}

readBench.addCase("timestamp strings", numIters) { _ =>
  timestampStr.noop()
}

readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
  spark.read.schema(tsSchema).json(timestampStr).noop()
}

readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
  spark.read.json(timestampStr).noop()
}
```

where the timestamp values are invalid for the new parser, which causes a fallback to the legacy parser (2.4), the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
```

This is about a 10x performance regression.

BUT if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]`, which makes all timestamp values valid for the new parser and so prevents the fallback, the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
```

### Why are the changes needed?

Fix the performance regression.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New tests added.

Closes #28181 from yaooqinn/SPARK-31414.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
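To put a number on the "about 10x" figure, the iteration totals quoted in the commit message above can be divided directly. This is a minimal sketch; `RegressionRatio` and `slowdown` are hypothetical names used only for illustration, and the inputs are the totals from the first two log excerpts.

```scala
// Iteration totals (ms) copied from the benchmark logs in the commit message.
object RegressionRatio {
  // Slowdown factor of the fallback run relative to the baseline run.
  def slowdown(baselineMs: Long, fallbackMs: Long): Double =
    fallbackMs.toDouble / baselineMs

  // "parse timestamps from Dataset[String]": 44764 ms vs 506637 ms
  val parseSlowdown: Double = slowdown(44764L, 506637L)

  // "infer timestamps from Dataset[String]": 93764 ms vs 509076 ms
  val inferSlowdown: Double = slowdown(93764L, 509076L)

  def main(args: Array[String]): Unit = {
    println(s"parse slowdown: $parseSlowdown")
    println(s"infer slowdown: $inferSlowdown")
  }
}
```

On these figures the parse case slows down by roughly 11.3x and the infer case by roughly 5.4x, so "about 10x" describes the parse case more closely than the infer case.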
68 lines
6.1 KiB
Plaintext
================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 24907          29374         NaN          0.0      498130.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                               62811          63690        1416          0.0       62811.4       1.0X
Select 100 columns                                23839          24064         230          0.0       23839.5       2.6X
Select one column                                 19936          20641         827          0.1       19936.4       3.2X
count()                                            4174           4380         206          0.2        4174.4      15.0X
Select 100 columns, one bad input field           41015          42380        1688          0.0       41015.4       1.5X
Select 100 columns, corrupt record field          46281          46338          93          0.0       46280.5       1.4X

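For readers who want to post-process these result tables, one way to read a data row is to treat the last six whitespace-separated tokens as the numeric columns and everything before them as the case name. The sketch below assumes that layout; `ResultRow`, `Row`, and `parse` are hypothetical names and not part of Spark's benchmark tooling.

```scala
import scala.util.Try

object ResultRow {
  // One data row of a results table; field names mirror the column headers.
  final case class Row(
      name: String,
      bestMs: Double,
      avgMs: Double,
      stdevMs: Double,
      rateMPerSec: Double,
      perRowNs: Double,
      relative: Double)

  // Returns None for headers, separators, and anything that is not a data row.
  def parse(line: String): Option[Row] = {
    val tokens = line.trim.split("\\s+")
    if (tokens.length < 7) None
    else {
      // Case names may contain spaces, so split off the six trailing columns.
      val (nameParts, cols) = tokens.splitAt(tokens.length - 6)
      Try {
        val nums = cols.init.map(_.toDouble) // "NaN" parses to Double.NaN
        Row(
          nameParts.mkString(" "),
          nums(0), nums(1), nums(2), nums(3), nums(4),
          cols.last.stripSuffix("X").toDouble) // "1.5X" -> 1.5
      }.toOption
    }
  }
}
```

Header lines fail the numeric conversion and separator lines have too few tokens, so both are skipped without special handling.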
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       10810          10997         163          0.9        1081.0       1.0X
Select 1 column + count()                          7608           7641          47          1.3         760.8       1.4X
count()                                            2415           2462          77          4.1         241.5       4.5X

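The derived columns in these tables appear to follow from the best time and the row count: in the table above, a best time of 10810 ms at 1081.0 ns per row implies 10,000,000 rows, and the Relative column matches the first case's best time divided by each case's best time. The sketch below encodes that reading; it is inferred from the numbers in this file rather than taken from the benchmark source, and `DerivedColumns` is a hypothetical name.

```scala
object DerivedColumns {
  // Row count implied by a best time (ms) and a per-row cost (ns).
  def impliedRows(bestMs: Double, perRowNs: Double): Double =
    bestMs * 1e6 / perRowNs

  // Throughput in millions of rows per second.
  def rateMPerSec(rows: Double, bestMs: Double): Double =
    rows / (bestMs / 1e3) / 1e6

  // Speed relative to the first case in the table (higher means faster).
  def relative(firstBestMs: Double, bestMs: Double): Double =
    firstBestMs / bestMs

  def main(args: Array[String]): Unit = {
    val rows = impliedRows(10810, 1081.0)          // "Select 10 columns + count()"
    println(s"rows     = $rows")
    println(s"rate     = ${rateMPerSec(rows, 2415)}")  // "count()" row
    println(s"relative = ${relative(10810, 2415)}")    // "count()" row
  }
}
```

For the `count()` row this gives about 4.14 M rows/s and about 4.48x, which round to the 4.1 and 4.5X shown. The same rounding explains entries such as 0.0X elsewhere in the file: the ratio is small but not zero.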
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                      874            914          37         11.4          87.4       1.0X
to_csv(timestamp)                                  7051           7223         250          1.4         705.1       0.1X
write timestamps to files                          6712           6741          31          1.5         671.2       0.1X
Create a dataset of dates                           909            945          35         11.0          90.9       1.0X
to_csv(date)                                       4222           4231           8          2.4         422.2       0.2X
write dates to files                               3799           3813          14          2.6         379.9       0.2X

Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     1342           1364          35          7.5         134.2       1.0X
read timestamps from files                        20300          20473         247          0.5        2030.0       0.1X
infer timestamps from files                       40705          40744          54          0.2        4070.5       0.0X
read date text from files                          1146           1151           6          8.7         114.6       1.2X
read date from files                              12278          12408         117          0.8        1227.8       0.1X
infer date from files                             12734          12872         220          0.8        1273.4       0.1X
timestamp strings                                  1467           1482          15          6.8         146.7       0.9X
parse timestamps from Dataset[String]             21708          22234         477          0.5        2170.8       0.1X
infer timestamps from Dataset[String]             42357          43253         922          0.2        4235.7       0.0X
date strings                                       1512           1532          18          6.6         151.2       0.9X
parse dates from Dataset[String]                  13436          13470          33          0.7        1343.6       0.1X
from_csv(timestamp)                               20390          20486          95          0.5        2039.0       0.1X
from_csv(date)                                    12592          12693         139          0.8        1259.2       0.1X

Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       12535          12606          67          0.0      125348.8       1.0X
pushdown disabled                                 12611          12672          91          0.0      126112.9       1.0X
w/ filters                                         1093           1099          11          0.1       10928.3      11.5X