92685c0148
### What changes were proposed in this pull request? Re-generate results of: - DateTimeBenchmark - CSVBenchmark - JsonBenchmark in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 | ### Why are the changes needed? 1. The PR https://github.com/apache/spark/pull/28576 changed date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing. 2. CSV/JSON datasources are affected by the above PR too. This PR updates the benchmark results in the same environment as other benchmarks to have a base line for future optimizations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running benchmarks via the script: ```python #!/usr/bin/env python3 import os from sparktestsupport.shellutils import run_cmd benchmarks = [ ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'], ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark'] ] print('Set SPARK_GENERATE_BENCHMARK_FILES=1') os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1' for b in benchmarks: print("Run benchmark: %s" % b[1]) run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])]) ``` Closes #28613 from MaxGekk/missing-hour-year-benchmarks. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
68 lines
6.2 KiB
Plaintext
68 lines
6.2 KiB
Plaintext
================================================================================================
|
|
Benchmark to measure CSV read/write performance
|
|
================================================================================================
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Parsing quoted values: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
One quoted string 46568 46683 198 0.0 931358.6 1.0X
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Wide rows with 1000 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
Select 1000 columns 129836 130796 1404 0.0 129836.0 1.0X
|
|
Select 100 columns 40444 40679 261 0.0 40443.5 3.2X
|
|
Select one column 33429 33475 73 0.0 33428.6 3.9X
|
|
count() 7967 8047 73 0.1 7966.7 16.3X
|
|
Select 100 columns, one bad input field 90639 90832 266 0.0 90638.6 1.4X
|
|
Select 100 columns, corrupt record field 109023 109084 74 0.0 109023.3 1.2X
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Count a dataset with 10 columns: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
Select 10 columns + count() 20685 20707 35 0.5 2068.5 1.0X
|
|
Select 1 column + count() 13096 13149 49 0.8 1309.6 1.6X
|
|
count() 3994 4001 7 2.5 399.4 5.2X
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Write dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
Create a dataset of timestamps 2169 2203 32 4.6 216.9 1.0X
|
|
to_csv(timestamp) 14401 14591 168 0.7 1440.1 0.2X
|
|
write timestamps to files 13209 13276 59 0.8 1320.9 0.2X
|
|
Create a dataset of dates 2231 2248 17 4.5 223.1 1.0X
|
|
to_csv(date) 10406 10473 68 1.0 1040.6 0.2X
|
|
write dates to files 7970 7976 9 1.3 797.0 0.3X
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Read dates and timestamps: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
read timestamp text from files 2387 2391 6 4.2 238.7 1.0X
|
|
read timestamps from files 53503 53593 124 0.2 5350.3 0.0X
|
|
infer timestamps from files 107988 108668 647 0.1 10798.8 0.0X
|
|
read date text from files 2121 2133 12 4.7 212.1 1.1X
|
|
read date from files 29983 30039 48 0.3 2998.3 0.1X
|
|
infer date from files 30196 30436 218 0.3 3019.6 0.1X
|
|
timestamp strings 3098 3109 10 3.2 309.8 0.8X
|
|
parse timestamps from Dataset[String] 63331 63426 84 0.2 6333.1 0.0X
|
|
infer timestamps from Dataset[String] 124003 124463 490 0.1 12400.3 0.0X
|
|
date strings 3423 3429 11 2.9 342.3 0.7X
|
|
parse dates from Dataset[String] 34235 34314 76 0.3 3423.5 0.1X
|
|
from_csv(timestamp) 60829 61600 668 0.2 6082.9 0.0X
|
|
from_csv(date) 33047 33173 139 0.3 3304.7 0.1X
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
w/o filters 28752 28765 16 0.0 287516.5 1.0X
|
|
pushdown disabled 28856 28880 22 0.0 288556.3 1.0X
|
|
w/ filters 1714 1731 15 0.1 17137.3 16.8X
|
|
|
|
|