spark-instrumented-optimizer/sql/core/benchmarks/DateTimeRebaseBenchmark-results.txt
Maxim Gekk bb0b416f0b [SPARK-31297][SQL] Speed up dates rebasing
### What changes were proposed in this pull request?
In this PR, I propose replacing the current implementation of the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions in `DateTimeUtils` with a new one, based on the fact that the difference between the Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars changes only 14 times over the entire supported range of valid dates `[0001-01-01, 9999-12-31]`:

| date | Proleptic Greg. days | Hybrid (Julian+Greg) days | diff (Hybrid - Proleptic Greg.) |
| ---- | ----|----|----|
|0001-01-01|-719162|-719164|-2|
|0100-03-01|-682944|-682945|-1|
|0200-03-01|-646420|-646420|0|
|0300-03-01|-609896|-609895|1|
|0500-03-01|-536847|-536845|2|
|0600-03-01|-500323|-500320|3|
|0700-03-01|-463799|-463795|4|
|0900-03-01|-390750|-390745|5|
|1000-03-01|-354226|-354220|6|
|1100-03-01|-317702|-317695|7|
|1300-03-01|-244653|-244645|8|
|1400-03-01|-208129|-208120|9|
|1500-03-01|-171605|-171595|10|
|1582-10-15|-141427|-141427|0|
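
The day counts in the table can be cross-checked with independent calendar arithmetic. The sketch below is not the code from this PR: it assumes Python's `datetime.date` (a proleptic Gregorian calendar) for the first column and the standard Julian Day Number formula for Julian-calendar dates for the second. The helper names are hypothetical.

```python
from datetime import date

EPOCH_ORDINAL = date(1970, 1, 1).toordinal()
EPOCH_JDN = 2440588  # Julian Day Number of 1970-01-01 (proleptic Gregorian)
GREGORIAN_CUTOVER = (1582, 10, 15)  # first Gregorian date of the hybrid calendar


def proleptic_gregorian_days(year, month, day):
    """Days since 1970-01-01 for a date label read in the proleptic Gregorian calendar."""
    return date(year, month, day).toordinal() - EPOCH_ORDINAL


def julian_calendar_days(year, month, day):
    """Days since 1970-01-01 for a date label read in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    jdn = day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083
    return jdn - EPOCH_JDN


def hybrid_days(year, month, day):
    """Julian calendar before the cutover, Gregorian from the cutover on."""
    if (year, month, day) >= GREGORIAN_CUTOVER:
        return proleptic_gregorian_days(year, month, day)
    return julian_calendar_days(year, month, day)


# The 14 switch dates listed in the table above.
SWITCH_DATES = [
    (1, 1, 1), (100, 3, 1), (200, 3, 1), (300, 3, 1), (500, 3, 1),
    (600, 3, 1), (700, 3, 1), (900, 3, 1), (1000, 3, 1), (1100, 3, 1),
    (1300, 3, 1), (1400, 3, 1), (1500, 3, 1), (1582, 10, 15),
]


def table_rows():
    """Return (Proleptic Gregorian days, hybrid days, diff) per switch date."""
    rows = []
    for ymd in SWITCH_DATES:
        greg = proleptic_gregorian_days(*ymd)
        hybrid = hybrid_days(*ymd)
        rows.append((greg, hybrid, hybrid - greg))
    return rows


for ymd, row in zip(SWITCH_DATES, table_rows()):
    print("%04d-%02d-%02d" % ymd, *row)
```

The diff moves by one day at each March 1 of a century year that is a leap year in the Julian calendar but not in the Gregorian one, which is why the years 400, 800 and 1200 are absent from the table.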

For a given number of days since the epoch, the proposed implementation finds the range of days that the input belongs to and adds that range's diff between the calendars to the input. The result is the rebased number of days since the epoch in the target calendar.

For example, suppose we need to rebase -650000 days from the Proleptic Gregorian calendar to the hybrid calendar. The input falls into the bucket [-682944, -646420), and the diff associated with that range is -1. To get the rebased days in the hybrid calendar (Julian for this date), we add -1 to -650000, and the result is -650001.
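
The lookup described above can be sketched in a few lines. This is an illustrative Python version, not the Scala code in `DateTimeUtils`. The switch days and diffs are copied from the table, while the function and constant names are hypothetical, as are the use of binary search (`bisect`) and the assumption that the reverse direction uses the hybrid-days column with negated diffs.

```python
from bisect import bisect_right

# Start of each range in Proleptic Gregorian days (second column of the table).
GREG_SWITCH_DAYS = [
    -719162, -682944, -646420, -609896, -536847, -500323, -463799,
    -390750, -354226, -317702, -244653, -208129, -171605, -141427,
]
# Start of each range in hybrid-calendar days (third column of the table).
HYBRID_SWITCH_DAYS = [
    -719164, -682945, -646420, -609895, -536845, -500320, -463795,
    -390745, -354220, -317695, -244645, -208120, -171595, -141427,
]
# Diff to add when going Gregorian -> hybrid (last column of the table).
GREG_TO_HYBRID_DIFFS = [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 0]


def _rebase(days, switch_days, diffs):
    # Index of the last switch day that is <= days, i.e. the range
    # [switch_days[i], switch_days[i + 1]) containing the input.
    i = bisect_right(switch_days, days) - 1
    if i < 0:
        raise ValueError("days %d is before 0001-01-01" % days)
    return days + diffs[i]


def rebase_gregorian_to_julian_days(days):
    return _rebase(days, GREG_SWITCH_DAYS, GREG_TO_HYBRID_DIFFS)


def rebase_julian_to_gregorian_days(days):
    # Assumption: the reverse direction negates the diffs.
    return _rebase(days, HYBRID_SWITCH_DAYS, [-d for d in GREG_TO_HYBRID_DIFFS])


# The worked example from the text: the bucket [-682944, -646420) has diff -1.
print(rebase_gregorian_to_julian_days(-650000))
```

With only 14 ranges, a linear scan would work as well as binary search. Either way, each input costs a table lookup and one addition instead of a full calendar conversion.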

### Why are the changes needed?
To make dates rebasing faster.

### Does this PR introduce any user-facing change?
No, the results should be the same for the valid range of the `DATE` type `[0001-01-01, 9999-12-31]`.

### How was this patch tested?
- Added 2 tests to `DateTimeUtilsSuite` for the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions. The tests check that the results of the old and the new (optimized) implementations are the same for all supported dates.
- Re-ran `DateTimeRebaseBenchmark` on:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28067 from MaxGekk/optimize-rebasing.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 17:38:47 +08:00

================================================================================================
Rebasing dates/timestamps in Parquet datasource
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save dates to parquet:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop                                   9582           9582           0         10.4          95.8       1.0X
before 1582, noop                                  9473           9473           0         10.6          94.7       1.0X
after 1582, rebase off                            21431          21431           0          4.7         214.3       0.4X
after 1582, rebase on                             22156          22156           0          4.5         221.6       0.4X
before 1582, rebase off                           21399          21399           0          4.7         214.0       0.4X
before 1582, rebase on                            22927          22927           0          4.4         229.3       0.4X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off, rebase off                   12637          12736         111          7.9         126.4       1.0X
after 1582, vec off, rebase on                    13463          13531          61          7.4         134.6       0.9X
after 1582, vec on, rebase off                     3693           3703           8         27.1          36.9       3.4X
after 1582, vec on, rebase on                      5242           5252           9         19.1          52.4       2.4X
before 1582, vec off, rebase off                  13055          13169         126          7.7         130.5       1.0X
before 1582, vec off, rebase on                   14067          14270         185          7.1         140.7       0.9X
before 1582, vec on, rebase off                    3697           3702           7         27.1          37.0       3.4X
before 1582, vec on, rebase on                     6058           6097          34         16.5          60.6       2.1X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save timestamps to parquet:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop                                   2713           2713           0         36.9          27.1       1.0X
before 1582, noop                                  2715           2715           0         36.8          27.2       1.0X
after 1582, rebase off                            16768          16768           0          6.0         167.7       0.2X
after 1582, rebase on                             82811          82811           0          1.2         828.1       0.0X
before 1582, rebase off                           17052          17052           0          5.9         170.5       0.2X
before 1582, rebase on                            95134          95134           0          1.1         951.3       0.0X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off, rebase off                   15200          15321         194          6.6         152.0       1.0X
after 1582, vec off, rebase on                    63160          63337         177          1.6         631.6       0.2X
after 1582, vec on, rebase off                     4891           4928          43         20.4          48.9       3.1X
after 1582, vec on, rebase on                     45474          45484          10          2.2         454.7       0.3X
before 1582, vec off, rebase off                  15203          15330         110          6.6         152.0       1.0X
before 1582, vec off, rebase on                   65588          65664          73          1.5         655.9       0.2X
before 1582, vec on, rebase off                    4844           4916         105         20.6          48.4       3.1X
before 1582, vec on, rebase on                    47815          47943         162          2.1         478.2       0.3X
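
The derived columns in the tables above appear to follow directly from `Best Time(ms)`. The Python sketch below is a reading aid rather than part of the benchmark. It assumes 100,000,000 rows per case, a figure inferred from the ratio of best time to per-row time and not stated in this file, and it takes the first row of each table as the `1.0X` baseline. The function name is hypothetical.

```python
ASSUMED_ROWS = 100_000_000  # assumption: inferred from the tables, not stated in the file


def derived_columns(best_ms, baseline_best_ms, rows=ASSUMED_ROWS):
    """Return (Rate(M/s), Per Row(ns), Relative) computed from a best time in ms."""
    rate_m_per_s = rows / (best_ms / 1000.0) / 1e6   # millions of rows per second
    per_row_ns = best_ms * 1e6 / rows                # nanoseconds spent per row
    relative = baseline_best_ms / best_ms            # speed relative to the first row
    return round(rate_m_per_s, 1), round(per_row_ns, 1), round(relative, 1)


# "Save dates to parquet", rows "after 1582, noop" and "after 1582, rebase off".
print(derived_columns(9582, 9582))
print(derived_columns(21431, 9582))
# "Save timestamps to parquet", row "after 1582, rebase on".
print(derived_columns(82811, 2713))
```

Under that reading, a `Relative` value of `0.0X` does not mean zero throughput: it is a ratio of about 0.03 rounded to one decimal place.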