d65f534c5a
### What changes were proposed in this pull request?

With the original benchmark, where the timestamp values are valid for the new parser, the result is:

```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5781 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 44764 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 93764 ms
[info] Running case: from_json(timestamp)
[info] Stopped after 3 iterations, 59021 ms
```

When we modify the benchmark to

```scala
def timestampStr: Dataset[String] = {
  spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
    iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
  }.select($"value".as("timestamp")).as[String]
}

readBench.addCase("timestamp strings", numIters) { _ =>
  timestampStr.noop()
}

readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
  spark.read.schema(tsSchema).json(timestampStr).noop()
}

readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
  spark.read.json(timestampStr).noop()
}
```

the timestamp values become invalid for the new parser, which causes a fallback to the legacy parser (2.4). The result is:

```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
```

That is about a 10x performance regression for the parse case (44764 ms vs. 506637 ms).

But if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]`, which makes all timestamp values valid for the new parser and so prevents the fallback, the result is:

```scala
[info] Running benchmark: Read dates and timestamps
[info] Running case: timestamp strings
[info] Stopped after 3 iterations, 5623 ms
[info] Running case: parse timestamps from Dataset[String]
[info] Stopped after 3 iterations, 506637 ms
[info] Running case: infer timestamps from Dataset[String]
[info] Stopped after 3 iterations, 509076 ms
```

### Why are the changes needed?

Fix performance regression.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New tests added.

Closes #28181 from yaooqinn/SPARK-31414.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
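To illustrate why optional sections in the pattern avoid the fallback, here is a minimal sketch using plain `java.time`. It does not reproduce Spark's internal formatter or its legacy fallback path. The object name `OptionalSectionsSketch`, the `yyyy-MM-dd'T'` date prefix (elided as `....` in the commit message), and the stricter comparison pattern are all assumptions made for illustration.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper object for illustration; not part of Spark.
object OptionalSectionsSketch {
  // Assumed comparison pattern: fraction and zone offset are both mandatory.
  val strict: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")

  // Square brackets mark optional sections, so the fraction and the offset
  // may each be absent. The date prefix is an assumption (see lead-in).
  val withOptional: DateTimeFormatter =
    DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]")

  // True when the whole input parses with the given formatter.
  def parses(fmt: DateTimeFormatter, text: String): Boolean =
    Try(LocalDateTime.parse(text, fmt)).isSuccess

  def main(args: Array[String]): Unit = {
    val inputs = Seq(
      "1970-01-01T01:02:03.123+01:00", // fraction and offset present
      "1970-01-01T01:02:03.123",       // no offset
      "1970-01-01T01:02:03"            // no fraction, no offset
    )
    inputs.foreach { s =>
      println(s"$s strict=${parses(strict, s)} optional=${parses(withOptional, s)}")
    }
  }
}
```

In a parser that tries a new formatter first and only then falls back to a slower legacy one, every input rejected by the first formatter pays for both attempts, which is the cost the benchmark above exposes.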
113 lines
9.4 KiB
Plaintext
================================================================================================
Benchmark for performance of JSON parsing
================================================================================================

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
JSON schema inferring:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
No encoding                                       46010          46118         113          2.2         460.1       1.0X
UTF-8 is set                                      54407          55427        1718          1.8         544.1       0.8X

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
count a short column:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
No encoding                                       26614          28220        1461          3.8         266.1       1.0X
UTF-8 is set                                      42765          43400         550          2.3         427.6       0.6X

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
count a wide column:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
No encoding                                       35696          35821         113          0.3        3569.6       1.0X
UTF-8 is set                                      55441          56176        1037          0.2        5544.1       0.6X

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
select wide row:                          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
No encoding                                       61514          62968         NaN          0.0      123027.2       1.0X
UTF-8 is set                                      72096          72933        1162          0.0      144192.7       0.9X

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Select a subset of 10 columns:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns                                  9859           9913          79          1.0         985.9       1.0X
Select 1 column                                   10981          11003          36          0.9        1098.1       0.9X

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
creation of JSON parser per line:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Short column without encoding                      3555           3579          27          2.8         355.5       1.0X
Short column with UTF-8                            5204           5227          35          1.9         520.4       0.7X
Wide column without encoding                      60458          60637         164          0.2        6045.8       0.1X
Wide column with UTF-8                            77544          78111         551          0.1        7754.4       0.0X

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Text read                                           342            346           3         29.2          34.2       1.0X
from_json                                          7123           7318         179          1.4         712.3       0.0X
json_tuple                                         9843           9957         132          1.0         984.3       0.0X
get_json_object                                    7827           8046         194          1.3         782.7       0.0X

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Dataset of json strings:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Text read                                          1856           1884          32         26.9          37.1       1.0X
schema inferring                                  16734          16900         153          3.0         334.7       0.1X
parsing                                           14884          15203         470          3.4         297.7       0.1X

Preparing data for benchmarking ...
Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Json files in the per-line mode:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Text read                                          5932           6148         228          8.4         118.6       1.0X
Schema inferring                                  20836          21938        1086          2.4         416.7       0.3X
Parsing without charset                           18134          18661         457          2.8         362.7       0.3X
Parsing with UTF-8                                27734          28069         378          1.8         554.7       0.2X

Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                      889            914          28         11.2          88.9       1.0X
to_json(timestamp)                                 7920           8172         353          1.3         792.0       0.1X
write timestamps to files                          6726           6822         129          1.5         672.6       0.1X
Create a dataset of dates                           953            963          12         10.5          95.3       0.9X
to_json(date)                                      5370           5705         320          1.9         537.0       0.2X
write dates to files                               4109           4166          52          2.4         410.9       0.2X

Java HotSpot(TM) 64-Bit Server VM 11.0.5+10-LTS on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     1614           1675          55          6.2         161.4       1.0X
read timestamps from files                        16640          16858         209          0.6        1664.0       0.1X
infer timestamps from files                       33239          33388         227          0.3        3323.9       0.0X
read date text from files                          1310           1340          44          7.6         131.0       1.2X
read date from files                               9470           9513          41          1.1         947.0       0.2X
timestamp strings                                  1303           1342          47          7.7         130.3       1.2X
parse timestamps from Dataset[String]             17650          18073         380          0.6        1765.0       0.1X
infer timestamps from Dataset[String]             32623          34065        1330          0.3        3262.3       0.0X
date strings                                       1864           1871           7          5.4         186.4       0.9X
parse dates from Dataset[String]                  10914          11316         482          0.9        1091.4       0.1X
from_json(timestamp)                              21102          21990         929          0.5        2110.2       0.1X
from_json(date)                                   15275          15961         598          0.7        1527.5       0.1X