854a0f752e
### What changes were proposed in this pull request?

This PR regenerates the `sql/core` benchmarks on JDK8 and JDK11 to compare the results. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered a rough comparison.

**A. EXPECTED CASES (JDK11 is faster in general)**

- [x] BloomFilterBenchmark (JDK11 is faster except one case)
- [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC)
- [x] CSVBenchmark (JDK11 is faster except five cases)
- [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`)
- [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type)
- [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases)
- [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS)
- [x] HashedRelationMetricsBenchmark (JDK11 is faster)
- [x] JSONBenchmark (JDK11 is much faster except eight cases)
- [x] JoinBenchmark (JDK11 is faster except five cases)
- [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases)
- [x] PrimitiveArrayBenchmark (JDK11 is faster)
- [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case)
- [x] UDFBenchmark (N/A, values are too small)
- [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case)
- [x] WideTableBenchmark (JDK11 is faster except two cases)

**B. CASES WE NEED TO INVESTIGATE MORE LATER**

- [x] AggregateBenchmark (JDK11 is slower in general)
- [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`)
- [x] DataSourceReadBenchmark (JDK11 is slower in general)
- [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`)
- [x] MakeDateTimeBenchmark (JDK11 is slower except two cases)
- [x] MiscBenchmark (JDK11 is slower except ten cases)
- [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower)
- [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases)
- [x] RangeBenchmark (JDK11 is slower except one case)

`FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because they take a long time to run.

### Why are the changes needed?

According to the results, there are some differences between JDK8 and JDK11. This will be a baseline for future improvement and comparison. Also, as a reproducible environment, the following environment is used.

- Instance: `r3.xlarge`
- OS: `CentOS Linux release 7.5.1804 (Core)`
- JDK:
  - `OpenJDK Runtime Environment (build 1.8.0_222-b10)`
  - `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)`

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

This is a test-only PR. We need to run the benchmarks.

Closes #26003 from dongjoon-hyun/SPARK-29320.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
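The average-time comparison described above could be scripted roughly as follows. This is a minimal sketch, not part of the PR: the names `BenchmarkRow`, `parse_row`, and `avg_time_ratio` are hypothetical, and the parser assumes each result row ends with the six numeric columns used in the result files (Best Time, Avg Time, Stdev, Rate, Per Row, Relative). Header and separator lines are not handled.

```python
from typing import Dict, NamedTuple


class BenchmarkRow(NamedTuple):
    """One case from a benchmark result file (hypothetical helper type)."""
    name: str
    best_ms: float
    avg_ms: float
    stdev_ms: float
    rate_m_per_s: float
    per_row_ns: float
    relative: float


def parse_row(line: str) -> BenchmarkRow:
    """Split one result row into the case name and its six numeric columns."""
    # Case names may contain spaces, so peel the six numeric fields off the right.
    name, best, avg, stdev, rate, per_row, relative = line.rstrip().rsplit(None, 6)
    return BenchmarkRow(
        name=name.strip(),
        best_ms=float(best),
        avg_ms=float(avg),
        stdev_ms=float(stdev),  # a literal "NaN" parses to float("nan")
        rate_m_per_s=float(rate),
        per_row_ns=float(per_row),
        relative=float(relative.rstrip("X")),  # drop the trailing "X"
    )


def avg_time_ratio(
    jdk8: Dict[str, BenchmarkRow], jdk11: Dict[str, BenchmarkRow]
) -> Dict[str, float]:
    """JDK11 avg time divided by JDK8 avg time, per case present in both runs.

    Values below 1.0 mean JDK11 was faster on average for that case.
    """
    return {
        name: jdk11[name].avg_ms / row.avg_ms
        for name, row in jdk8.items()
        if name in jdk11
    }


# A row taken from the JDK8 CSVBenchmark results in this commit.
row = parse_row(
    "Select 100 columns, one bad input field 66864 66968 174 0.0 66864.1 3.4X"
)
print(row.name, row.avg_ms)
```

To compare two runs, one would parse the JDK8 and JDK11 result files into dictionaries keyed by case name and pass both to `avg_time_ratio`.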
60 lines
5.4 KiB
Plaintext
================================================================================================
Benchmark to measure CSV read/write performance
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Parsing quoted values:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
One quoted string                                 62603          62755         133          0.0     1252055.6       1.0X

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Wide rows with 1000 columns:              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 1000 columns                              225032         225919         782          0.0      225031.7       1.0X
Select 100 columns                                51982          52290         286          0.0       51982.1       4.3X
Select one column                                 40167          40283         133          0.0       40167.4       5.6X
count()                                           11435          11593         176          0.1       11435.1      19.7X
Select 100 columns, one bad input field           66864          66968         174          0.0       66864.1       3.4X
Select 100 columns, corrupt record field          79570          80418        1080          0.0       79569.5       2.8X

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Count a dataset with 10 columns:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Select 10 columns + count()                       23271          23389         103          0.4        2327.1       1.0X
Select 1 column + count()                         18206          19772         NaN          0.5        1820.6       1.3X
count()                                            8500           8521          18          1.2         850.0       2.7X

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Write dates and timestamps:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Create a dataset of timestamps                     2025           2068          66          4.9         202.5       1.0X
to_csv(timestamp)                                 22192          22983         879          0.5        2219.2       0.1X
write timestamps to files                         15949          16030          72          0.6        1594.9       0.1X
Create a dataset of dates                          2200           2234          32          4.5         220.0       0.9X
to_csv(date)                                      18268          18341          73          0.5        1826.8       0.1X
write dates to files                              10495          10722         214          1.0        1049.5       0.2X

OpenJDK 64-Bit Server VM 1.8.0_222-b10 on Linux 3.10.0-862.3.2.el7.x86_64
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Read dates and timestamps:                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
read timestamp text from files                     6491           6503          18          1.5         649.1       1.0X
read timestamps from files                        56069          56795         874          0.2        5606.9       0.1X
infer timestamps from files                      113383         114203         825          0.1       11338.3       0.1X
read date text from files                          6411           6419          10          1.6         641.1       1.0X
read date from files                              46245          46371         138          0.2        4624.5       0.1X
infer date from files                             43623          43906         291          0.2        4362.3       0.1X
timestamp strings                                  4951           4959           7          2.0         495.1       1.3X
parse timestamps from Dataset[String]             65786          66309         663          0.2        6578.6       0.1X
infer timestamps from Dataset[String]            130891         133861        1928          0.1       13089.1       0.0X
date strings                                       3814           3895          84          2.6         381.4       1.7X
parse dates from Dataset[String]                  52259          52960         614          0.2        5225.9       0.1X
from_csv(timestamp)                               63013          63306         291          0.2        6301.3       0.1X
from_csv(date)                                    49840          52352         NaN          0.2        4984.0       0.1X
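
As a reading aid for the tables above, the derived columns appear to follow from the best time alone: `Rate(M/s)` and `Per Row(ns)` from the best time and the row count, and `Relative` from the first row's best time divided by the case's best time. The sketch below reproduces the `count()` row of the "Count a dataset with 10 columns" table under that reading. The function name `derived_columns` is hypothetical, and the row count of 10,000,000 is an assumption inferred from the table itself (best time divided by per-row time), not a value stated in the results.

```python
from typing import Tuple


def derived_columns(
    best_ms: float, baseline_best_ms: float, num_rows: int
) -> Tuple[float, float, float]:
    """Recompute Rate(M/s), Per Row(ns), and Relative from a best time.

    Assumes all three columns are derived from Best Time(ms), rounded to
    one decimal place as in the result files.
    """
    rate_m_per_s = num_rows / (best_ms * 1e-3) / 1e6  # million rows per second
    per_row_ns = best_ms * 1e6 / num_rows             # nanoseconds per row
    relative = baseline_best_ms / best_ms             # speed-up over the first row
    return round(rate_m_per_s, 1), round(per_row_ns, 1), round(relative, 1)


# Assumption: row count inferred from 23271 ms / 2327.1 ns per row.
NUM_ROWS = 10_000_000

# count() case (best 8500 ms) against the baseline (best 23271 ms).
print(derived_columns(8500, 23271, NUM_ROWS))  # (1.2, 850.0, 2.7)
```

Because `Relative` is computed within a single run, it cannot be compared directly across JDK8 and JDK11 files when the baseline row itself changes speed, which is one reason a comparison of absolute average times is only a rough guide.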