spark-instrumented-optimizer/sql/core/benchmarks/CSVBenchmark-results.txt
Kent Yao d65f534c5a [SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing
### What changes were proposed in this pull request?

With benchmark original, where the timestamp values are valid to the new parser

the result is
```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5781 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 44764 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 93764 ms
[info]   Running case: from_json(timestamp)
[info]   Stopped after 3 iterations, 59021 ms
```
When we modify the benchmark to

```scala
     def timestampStr: Dataset[String] = {
        spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
          iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
        }.select($"value".as("timestamp")).as[String]
      }

      readBench.addCase("timestamp strings", numIters) { _ =>
        timestampStr.noop()
      }

      readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
        spark.read.schema(tsSchema).json(timestampStr).noop()
      }

      readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
        spark.read.json(timestampStr).noop()
      }
```
where the timestamp values are invalid for the new parser which causes a fallback to legacy parser(2.4).
the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
```
About 10x perf-regression

BUT if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]` which make all timestamp values valid for the new parser to prohibit fallback, the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
```

### Why are the changes needed?

 Fix performance regression.

### Does this PR introduce any user-facing change?

NO
### How was this patch tested?

new tests added.

Closes #28181 from yaooqinn/SPARK-31414.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-13 03:11:28 +00:00

68 lines
6.1 KiB
Plaintext

================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string 24073 24109 33 0.0 481463.5 1.0X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns 58415 59611 2071 0.0 58414.8 1.0X
Select 100 columns 22568 23020 594 0.0 22568.0 2.6X
Select one column 18995 19058 99 0.1 18995.0 3.1X
count() 5301 5332 30 0.2 5300.9 11.0X
Select 100 columns, one bad input field 39736 40153 361 0.0 39736.1 1.5X
Select 100 columns, corrupt record field 47195 47826 590 0.0 47195.2 1.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count() 9884 9904 25 1.0 988.4 1.0X
Select 1 column + count() 6794 6835 46 1.5 679.4 1.5X
count() 2060 2065 5 4.9 206.0 4.8X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps 717 732 18 14.0 71.7 1.0X
to_csv(timestamp) 6994 7100 121 1.4 699.4 0.1X
write timestamps to files 6417 6435 27 1.6 641.7 0.1X
Create a dataset of dates 827 855 24 12.1 82.7 0.9X
to_csv(date) 4408 4438 32 2.3 440.8 0.2X
write dates to files 3738 3758 28 2.7 373.8 0.2X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files 1121 1176 52 8.9 112.1 1.0X
read timestamps from files 21298 21366 105 0.5 2129.8 0.1X
infer timestamps from files 41008 41051 39 0.2 4100.8 0.0X
read date text from files 962 967 5 10.4 96.2 1.2X
read date from files 11749 11772 22 0.9 1174.9 0.1X
infer date from files 12426 12459 29 0.8 1242.6 0.1X
timestamp strings 1508 1519 9 6.6 150.8 0.7X
parse timestamps from Dataset[String] 21674 21997 455 0.5 2167.4 0.1X
infer timestamps from Dataset[String] 42141 42230 105 0.2 4214.1 0.0X
date strings 1694 1701 8 5.9 169.4 0.7X
parse dates from Dataset[String] 12929 12951 25 0.8 1292.9 0.1X
from_csv(timestamp) 20603 20786 166 0.5 2060.3 0.1X
from_csv(date) 12325 12338 12 0.8 1232.5 0.1X
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU @ 2.40GHz
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters 12455 12474 22 0.0 124553.8 1.0X
pushdown disabled 12462 12486 29 0.0 124624.9 1.0X
w/ filters 1073 1092 18 0.1 10727.6 11.6X