92685c0148
### What changes were proposed in this pull request?

Re-generate results of:
- DateTimeBenchmark
- CSVBenchmark
- JsonBenchmark

in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?

1. The PR https://github.com/apache/spark/pull/28576 changed the date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing.
2. The CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to provide a baseline for future optimizations.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

By running benchmarks via the script:

```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #28613 from MaxGekk/missing-hour-year-benchmarks.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
68 lines
6.2 KiB
Plaintext
================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 45457          45731         344          0.0      909136.8       1.0X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                              129646         130527        1412          0.0      129646.3       1.0X
Select 100 columns                                42444          42551         119          0.0       42444.0       3.1X
Select one column                                 35415          35428          20          0.0       35414.6       3.7X
count()                                           11114          11128          16          0.1       11113.6      11.7X
Select 100 columns, one bad input field           93353          93670         275          0.0       93352.6       1.4X
Select 100 columns, corrupt record field         113569         113952         373          0.0      113568.8       1.1X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       18498          18589          87          0.5        1849.8       1.0X
Select 1 column + count()                         11078          11095          27          0.9        1107.8       1.7X
count()                                            3928           3950          22          2.5         392.8       4.7X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                     1933           1940          11          5.2         193.3       1.0X
to_csv(timestamp)                                 18078          18243         255          0.6        1807.8       0.1X
write timestamps to files                         12668          12786         134          0.8        1266.8       0.2X
Create a dataset of dates                          2196           2201           5          4.6         219.6       0.9X
to_csv(date)                                       9583           9597          21          1.0         958.3       0.2X
write dates to files                               7091           7110          20          1.4         709.1       0.3X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     2166           2177          10          4.6         216.6       1.0X
read timestamps from files                        53212          53402         281          0.2        5321.2       0.0X
infer timestamps from files                      109788         110372         570          0.1       10978.8       0.0X
read date text from files                          1921           1929           8          5.2         192.1       1.1X
read date from files                              25470          25499          25          0.4        2547.0       0.1X
infer date from files                             27201          27342         134          0.4        2720.1       0.1X
timestamp strings                                  3638           3653          19          2.7         363.8       0.6X
parse timestamps from Dataset[String]             61894          62532         555          0.2        6189.4       0.0X
infer timestamps from Dataset[String]            125171         125430         236          0.1       12517.1       0.0X
date strings                                       3736           3749          14          2.7         373.6       0.6X
parse dates from Dataset[String]                  30787          30829          43          0.3        3078.7       0.1X
from_csv(timestamp)                               60842          61035         209          0.2        6084.2       0.0X
from_csv(date)                                    30123          30196          95          0.3        3012.3       0.1X

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       28985          29042          80          0.0      289852.9       1.0X
pushdown disabled                                 29080          29146          58          0.0      290799.4       1.0X
w/ filters                                         2072           2084          17          0.0       20722.3      14.0X