spark-instrumented-optimizer/sql/core/benchmarks/CSVBenchmark-jdk11-results.txt
Max Gekk 92685c0148 [SPARK-31755][SQL][FOLLOWUP] Update date-time, CSV and JSON benchmark results
### What changes were proposed in this pull request?
Re-generate results of:
- DateTimeBenchmark
- CSVBenchmark
- JsonBenchmark

in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

### Why are the changes needed?
1. The PR https://github.com/apache/spark/pull/28576 changed the date-time parser. The `DateTimeBenchmark` should confirm that the PR didn't slow down date/timestamp parsing.
2. The CSV/JSON datasources are affected by the above PR too (see the sketch below). This PR updates their benchmark results, generated in the same environment as the other benchmarks, to establish a baseline for future optimizations.
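For context, the affected code path in the CSV datasource is the parsing of date/timestamp columns out of CSV text, which is what the read benchmarks below measure. A minimal PySpark sketch of that path (the schema, pattern and input path are illustrative only, not taken from the benchmark):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-timestamp-parsing-sketch").getOrCreate()

# Reading CSV with a TIMESTAMP column goes through the date-time parser that
# https://github.com/apache/spark/pull/28576 changed; inferring the schema over
# the same data exercises it as well.
df = (spark.read
      .schema("ts TIMESTAMP")                            # parse, don't infer
      .option("timestampFormat", "yyyy-MM-dd HH:mm:ss")  # illustrative pattern
      .csv("/tmp/timestamps.csv"))                       # hypothetical input path
df.count()
```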

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the benchmarks via the following script:
```python
#!/usr/bin/env python3

import os
# run_cmd is a helper from Spark's dev/sparktestsupport module.
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.benchmark.DateTimeBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

# With this flag set, each benchmark writes its results file under the
# module's benchmarks/ directory instead of only printing to stdout.
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
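For orientation when reading the regenerated results below, the CSV benchmark cases such as `to_csv(timestamp)`, `from_csv(timestamp)` and the filters-pushdown rows roughly correspond to operations like the following. This is only a hedged PySpark sketch, not the benchmark source (the benchmark itself is the Scala class `CSVBenchmark` listed above); the app name, row count and column names are made up:
```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_csv, struct, to_csv

spark = SparkSession.builder.appName("csv-benchmark-ops-sketch").getOrCreate()

# A dataset of timestamps, similar in spirit to what the benchmark generates.
ts_df = spark.range(1000000).selectExpr("CAST(id AS TIMESTAMP) AS ts")

# "to_csv(timestamp)": format a struct column as a CSV line.
csv_text = ts_df.select(to_csv(struct(col("ts"))).alias("value"))

# "from_csv(timestamp)": parse the CSV text back into a typed column.
parsed = csv_text.select(from_csv(col("value"), "ts TIMESTAMP").alias("row"))
parsed.count()

# "Filters pushdown": toggled by the SQL config below (enabled by default in
# Spark 3.0+); disabling it corresponds to the "pushdown disabled" row.
spark.conf.set("spark.sql.csv.filterPushdown.enabled", "true")
```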

Closes #28613 from MaxGekk/missing-hour-year-benchmarks.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-25 15:00:11 +00:00

================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 46568          46683         198          0.0      931358.6       1.0X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                              129836         130796        1404          0.0      129836.0       1.0X
Select 100 columns                                40444          40679         261          0.0       40443.5       3.2X
Select one column                                 33429          33475          73          0.0       33428.6       3.9X
count()                                            7967           8047          73          0.1        7966.7      16.3X
Select 100 columns, one bad input field           90639          90832         266          0.0       90638.6       1.4X
Select 100 columns, corrupt record field         109023         109084          74          0.0      109023.3       1.2X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       20685          20707          35          0.5        2068.5       1.0X
Select 1 column + count()                         13096          13149          49          0.8        1309.6       1.6X
count()                                            3994           4001           7          2.5         399.4       5.2X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                     2169           2203          32          4.6         216.9       1.0X
to_csv(timestamp)                                 14401          14591         168          0.7        1440.1       0.2X
write timestamps to files                         13209          13276          59          0.8        1320.9       0.2X
Create a dataset of dates                          2231           2248          17          4.5         223.1       1.0X
to_csv(date)                                      10406          10473          68          1.0        1040.6       0.2X
write dates to files                               7970           7976           9          1.3         797.0       0.3X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     2387           2391           6          4.2         238.7       1.0X
read timestamps from files                        53503          53593         124          0.2        5350.3       0.0X
infer timestamps from files                      107988         108668         647          0.1       10798.8       0.0X
read date text from files                          2121           2133          12          4.7         212.1       1.1X
read date from files                              29983          30039          48          0.3        2998.3       0.1X
infer date from files                             30196          30436         218          0.3        3019.6       0.1X
timestamp strings                                  3098           3109          10          3.2         309.8       0.8X
parse timestamps from Dataset[String]             63331          63426          84          0.2        6333.1       0.0X
infer timestamps from Dataset[String]            124003         124463         490          0.1       12400.3       0.0X
date strings                                       3423           3429          11          2.9         342.3       0.7X
parse dates from Dataset[String]                  34235          34314          76          0.3        3423.5       0.1X
from_csv(timestamp)                               60829          61600         668          0.2        6082.9       0.0X
from_csv(date)                                    33047          33173         139          0.3        3304.7       0.1X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       28752          28765          16          0.0      287516.5       1.0X
pushdown disabled                                 28856          28880          22          0.0      288556.3       1.0X
w/ filters                                         1714           1731          15          0.1       17137.3      16.8X