spark-instrumented-optimizer/sql/core/benchmarks/CSVBenchmark-results.txt
Max Gekk 92685c0148 [SPARK-31755][SQL][FOLLOWUP] Update date-time, CSV and JSON benchmark results
### What changes were proposed in this pull request?
Re-generate results of:
- DateTimeBenchmark
- CSVBenchmark
- JsonBenchmark

in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?
1. The PR https://github.com/apache/spark/pull/28576 changed date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing.
2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a base line for future optimizations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running benchmarks via the script:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #28613 from MaxGekk/missing-hour-year-benchmarks.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-25 15:00:11 +00:00

68 lines
6.2 KiB
Plaintext

================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string 45457 45731 344 0.0 909136.8 1.0X
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns 129646 130527 1412 0.0 129646.3 1.0X
Select 100 columns 42444 42551 119 0.0 42444.0 3.1X
Select one column 35415 35428 20 0.0 35414.6 3.7X
count() 11114 11128 16 0.1 11113.6 11.7X
Select 100 columns, one bad input field 93353 93670 275 0.0 93352.6 1.4X
Select 100 columns, corrupt record field 113569 113952 373 0.0 113568.8 1.1X
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count() 18498 18589 87 0.5 1849.8 1.0X
Select 1 column + count() 11078 11095 27 0.9 1107.8 1.7X
count() 3928 3950 22 2.5 392.8 4.7X
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps 1933 1940 11 5.2 193.3 1.0X
to_csv(timestamp) 18078 18243 255 0.6 1807.8 0.1X
write timestamps to files 12668 12786 134 0.8 1266.8 0.2X
Create a dataset of dates 2196 2201 5 4.6 219.6 0.9X
to_csv(date) 9583 9597 21 1.0 958.3 0.2X
write dates to files 7091 7110 20 1.4 709.1 0.3X
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files 2166 2177 10 4.6 216.6 1.0X
read timestamps from files 53212 53402 281 0.2 5321.2 0.0X
infer timestamps from files 109788 110372 570 0.1 10978.8 0.0X
read date text from files 1921 1929 8 5.2 192.1 1.1X
read date from files 25470 25499 25 0.4 2547.0 0.1X
infer date from files 27201 27342 134 0.4 2720.1 0.1X
timestamp strings 3638 3653 19 2.7 363.8 0.6X
parse timestamps from Dataset[String] 61894 62532 555 0.2 6189.4 0.0X
infer timestamps from Dataset[String] 125171 125430 236 0.1 12517.1 0.0X
date strings 3736 3749 14 2.7 373.6 0.6X
parse dates from Dataset[String] 30787 30829 43 0.3 3078.7 0.1X
from_csv(timestamp) 60842 61035 209 0.2 6084.2 0.0X
from_csv(date) 30123 30196 95 0.3 3012.3 0.1X
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters 28985 29042 80 0.0 289852.9 1.0X
pushdown disabled 29080 29146 58 0.0 290799.4 1.0X
w/ filters 2072 2084 17 0.0 20722.3 14.0X