d65f534c5a
### What changes were proposed in this pull request?

With the original benchmark, where the timestamp values are valid for the new parser, the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5781 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 44764 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 93764 ms
[info]   Running case: from_json(timestamp)
[info]   Stopped after 3 iterations, 59021 ms
```

When we modify the benchmark to

```scala
def timestampStr: Dataset[String] = {
  spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
    iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
  }.select($"value".as("timestamp")).as[String]
}

readBench.addCase("timestamp strings", numIters) { _ =>
  timestampStr.noop()
}

readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
  spark.read.schema(tsSchema).json(timestampStr).noop()
}

readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
  spark.read.json(timestampStr).noop()
}
```

where the timestamp values are invalid for the new parser, which causes a fallback to the legacy parser (2.4), the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
```

That is about a 10x performance regression. But if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]`, which makes all timestamp values valid for the new parser and so prevents the fallback, the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
```

### Why are the changes needed?

Fix performance regression.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

New tests added.

Closes #28181 from yaooqinn/SPARK-31414.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
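The fallback behaviour described in this message can be pictured with a minimal sketch. It is not Spark's actual implementation: a strict `java.time` formatter is tried first, and a `SimpleDateFormat` stands in for the legacy (2.4) parser when the strict parse fails. The object name `FallbackParseSketch`, the pattern, and the UTC handling are all assumptions made for illustration.

```scala
import java.text.SimpleDateFormat
import java.time.{LocalDateTime, ZoneOffset}
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import java.util.{Locale, TimeZone}

// Hypothetical illustration only; names and pattern are not taken from Spark.
object FallbackParseSketch {
  private val pattern = "yyyy-MM-dd'T'HH:mm:ss.SSS"

  // java.time formatter: "SSS" demands exactly three fraction digits.
  private val strict = DateTimeFormatter.ofPattern(pattern, Locale.US)

  // Legacy formatter: "SSS" accepts one or more digits and reads them as milliseconds.
  // SimpleDateFormat is not thread-safe, so this sketch builds a new one per call.
  private def legacy: SimpleDateFormat = {
    val f = new SimpleDateFormat(pattern, Locale.US)
    f.setTimeZone(TimeZone.getTimeZone("UTC"))
    f
  }

  /** Returns epoch milliseconds (UTC) and whether the legacy path was taken. */
  def parse(s: String): (Long, Boolean) =
    try {
      (LocalDateTime.parse(s, strict).toInstant(ZoneOffset.UTC).toEpochMilli, false)
    } catch {
      case _: DateTimeParseException =>
        // The strict parser rejected the input: fall back to the legacy parser.
        (legacy.parse(s).getTime, true)
    }

  def main(args: Array[String]): Unit = {
    println(parse("1970-01-01T01:02:03.123")) // strict path
    println(parse("1970-01-01T01:02:03.5"))   // legacy fallback path
  }
}
```

In this sketch, a value such as `1970-01-01T01:02:03.5` does not match the strict three-digit fraction, so every such row pays for a thrown exception plus a second parse.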
68 lines
6.1 KiB
Plaintext
================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 24073          24109          33          0.0      481463.5       1.0X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                               58415          59611        2071          0.0       58414.8       1.0X
Select 100 columns                                22568          23020         594          0.0       22568.0       2.6X
Select one column                                 18995          19058          99          0.1       18995.0       3.1X
count()                                            5301           5332          30          0.2        5300.9      11.0X
Select 100 columns, one bad input field           39736          40153         361          0.0       39736.1       1.5X
Select 100 columns, corrupt record field          47195          47826         590          0.0       47195.2       1.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                        9884           9904          25          1.0         988.4       1.0X
Select 1 column + count()                          6794           6835          46          1.5         679.4       1.5X
count()                                            2060           2065           5          4.9         206.0       4.8X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                      717            732          18         14.0          71.7       1.0X
to_csv(timestamp)                                  6994           7100         121          1.4         699.4       0.1X
write timestamps to files                          6417           6435          27          1.6         641.7       0.1X
Create a dataset of dates                           827            855          24         12.1          82.7       0.9X
to_csv(date)                                       4408           4438          32          2.3         440.8       0.2X
write dates to files                               3738           3758          28          2.7         373.8       0.2X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     1121           1176          52          8.9         112.1       1.0X
read timestamps from files                        21298          21366         105          0.5        2129.8       0.1X
infer timestamps from files                       41008          41051          39          0.2        4100.8       0.0X
read date text from files                           962            967           5         10.4          96.2       1.2X
read date from files                              11749          11772          22          0.9        1174.9       0.1X
infer date from files                             12426          12459          29          0.8        1242.6       0.1X
timestamp strings                                  1508           1519           9          6.6         150.8       0.7X
parse timestamps from Dataset[String]             21674          21997         455          0.5        2167.4       0.1X
infer timestamps from Dataset[String]             42141          42230         105          0.2        4214.1       0.0X
date strings                                       1694           1701           8          5.9         169.4       0.7X
parse dates from Dataset[String]                  12929          12951          25          0.8        1292.9       0.1X
from_csv(timestamp)                               20603          20786         166          0.5        2060.3       0.1X
from_csv(date)                                    12325          12338          12          0.8        1232.5       0.1X

Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       12455          12474          22          0.0      124553.8       1.0X
pushdown disabled                                 12462          12486          29          0.0      124624.9       1.0X
w/ filters                                         1073           1092          18          0.1       10727.6      11.6X
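
As a reading aid, the derived columns in these tables appear to be related as sketched below. This is inferred from the numbers themselves and is not the benchmark harness. The row count of 10,000,000 is an assumption back-computed from 9884 ms and 988.4 ns per row in the "Count a dataset with 10 columns" table, and the object and function names are hypothetical.

```scala
// Hypothetical helper, written only to show how the columns seem to relate.
object BenchmarkColumnsSketch {
  /** Per Row(ns): best time in ms converted to ns and spread over `rows` rows. */
  def perRowNs(bestMs: Double, rows: Long): Double = bestMs * 1e6 / rows

  /** Rate(M/s): millions of rows processed per second at the best time. */
  def rateMPerSec(bestMs: Double, rows: Long): Double = rows / (bestMs / 1000.0) / 1e6

  /** Relative: best time of the first case divided by this case's best time. */
  def relative(baselineBestMs: Double, bestMs: Double): Double = baselineBestMs / bestMs

  def main(args: Array[String]): Unit = {
    val rows = 10000000L // assumption: inferred from the table, not stated in it
    println(perRowNs(9884, rows))    // "Select 10 columns + count()" row
    println(rateMPerSec(2060, rows)) // "count()" row
    println(relative(9884, 2060))    // "count()" row against the first case
  }
}
```

Rounded to one decimal place, these values agree with the 988.4, 4.9 and 4.8X entries in that table.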