820bb9985a
### What changes were proposed in this pull request? 1. Fix the `rebaseGregorianToJulianMicros()` function in `DateTimeUtils` by passing the daylight saving offset associated with the input `micros` to the constructed instance of `GregorianCalendar`. The problem is in `cal.getTimeInMillis` which returns earliest instant in the case of local date-time overlaps, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/master/jdk/src/share/classes/java/util/GregorianCalendar.java#L2783-L2786 . I fixed the issue by keeping the standard zone offset as is, and set the DST offset only. I don't set `ZONE_OFFSET` because time zone resolution works differently in Java 8 and Java 7 time APIs. So, if I would set the standard zone offsets too, this could change the behavior, and rebasing won't give the same result as Spark 2.4. 2. Fix `rebaseJulianToGregorianMicros()` by changing resulted zoned date-time if `DST_OFFSET` is zero which means the input date-time has passed an autumn daylight savings cutover. So, I take the latest local timestamp out of 2 overlapped timestamps. Otherwise I return a zoned date-time w/o any modification because it is equal to calling the `withEarlierOffsetAtOverlap()` method, so, we can optimize the case. ### Why are the changes needed? This fixes the bug of loosing of DST offset info in rebasing timestamps via local date-time. For example, there are 2 different timestamps in the `America/Los_Angeles` time zone: `2019-11-03T01:00:00-07:00` and `2019-11-03T01:00:00-08:00`, though they are mapped to the same local date-time `2019-11-03T01:00`, see <img width="456" alt="Screen Shot 2020-04-02 at 10 19 24" src="https://user-images.githubusercontent.com/1580697/78245697-95a7da00-74f0-11ea-9eba-c08138851cb3.png"> Currently, the UTC timestamp `2019-11-03T09:00:00Z` is converted to `2019-11-03T01:00:00-08:00`, and then to `2019-11-03T01:00:00` (in the original calendar, for instance Proleptic Gregorian calendar) and back to the UTC timestamp `2019-11-03T08:00:00Z` (in the hybrid calendar - Gregorian for the timestamp). That's wrong because the local timestamp must be converted to the original timestamp `2019-11-03T09:00:00Z`. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? - Added a test to `DateTimeUtilsSuite` which checks that rebased micros are the same as the input during DST. The result must be the same if Java 8 and 7 time API functions return the same time zone offsets. - Run the following code to check that there is no difference between rebased and original micros for modern timestamps: ```scala test("rebasing differences") { withDefaultTimeZone(getZoneId("America/Los_Angeles")) { val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0) .atZone(getZoneId("America/Los_Angeles")) .toInstant) var micros = start var diff = Long.MaxValue var counter = 0 while (micros < end) { val rebased = rebaseGregorianToJulianMicros(micros) val curDiff = rebased - micros if (curDiff != diff) { counter += 1 diff = curDiff val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes") } micros += 30 * MICROS_PER_MINUTE } println(s"counter = $counter") } } ``` ``` local date-time = 0001-01-01T00:00 diff = -2872 minutes local date-time = 0100-03-01T00:00 diff = -1432 minutes local date-time = 0200-03-01T00:00 diff = 7 minutes local date-time = 0300-03-01T00:00 diff = 1447 minutes local date-time = 0500-03-01T00:00 diff = 2887 minutes local date-time = 0600-03-01T00:00 diff = 4327 minutes local date-time = 0700-03-01T00:00 diff = 5767 minutes local date-time = 0900-03-01T00:00 diff = 7207 minutes local date-time = 1000-03-01T00:00 diff = 8647 minutes local date-time = 1100-03-01T00:00 diff = 10087 minutes local date-time = 1300-03-01T00:00 diff = 11527 minutes local date-time = 1400-03-01T00:00 diff = 12967 minutes local date-time = 1500-03-01T00:00 diff = 14407 minutes local date-time = 1582-10-15T00:00 diff = 7 minutes local date-time = 1883-11-18T12:22:58 diff = 0 minutes counter = 15 ``` The code is not added to `DateTimeUtilsSuite` because it takes > 30 seconds. - By running the updated benchmark `DateTimeRebaseBenchmark` via the command: ``` SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark" ``` in the environment: | Item | Description | | ---- | ----| | Region | us-west-2 (Oregon) | | Instance | r3.xlarge | | AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) | | Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 | Closes #28101 from MaxGekk/fix-local-date-overlap. Lead-authored-by: Maxim Gekk <max.gekk@gmail.com> Co-authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
95 lines
8.6 KiB
Plaintext
95 lines
8.6 KiB
Plaintext
================================================================================================
|
|
Rebasing dates/timestamps in Parquet datasource
|
|
================================================================================================
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Save dates to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
after 1582, noop 9488 9488 0 10.5 94.9 1.0X
|
|
before 1582, noop 9301 9301 0 10.8 93.0 1.0X
|
|
after 1582, rebase off 20109 20109 0 5.0 201.1 0.5X
|
|
after 1582, rebase on 20004 20004 0 5.0 200.0 0.5X
|
|
before 1582, rebase off 19906 19906 0 5.0 199.1 0.5X
|
|
before 1582, rebase on 20466 20466 0 4.9 204.7 0.5X
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
after 1582, vec off, rebase off 12593 12653 52 7.9 125.9 1.0X
|
|
after 1582, vec off, rebase on 13350 13489 121 7.5 133.5 0.9X
|
|
after 1582, vec on, rebase off 3665 3681 25 27.3 36.6 3.4X
|
|
after 1582, vec on, rebase on 5193 5210 16 19.3 51.9 2.4X
|
|
before 1582, vec off, rebase off 13023 13059 32 7.7 130.2 1.0X
|
|
before 1582, vec off, rebase on 13855 13937 115 7.2 138.6 0.9X
|
|
before 1582, vec on, rebase off 3651 3665 12 27.4 36.5 3.4X
|
|
before 1582, vec on, rebase on 5623 5671 45 17.8 56.2 2.2X
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Save timestamps to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
after 1582, noop 2798 2798 0 35.7 28.0 1.0X
|
|
before 1582, noop 2955 2955 0 33.8 29.6 0.9X
|
|
after 1582, rebase off 15889 15889 0 6.3 158.9 0.2X
|
|
after 1582, rebase on 84247 84247 0 1.2 842.5 0.0X
|
|
before 1582, rebase off 16134 16134 0 6.2 161.3 0.2X
|
|
before 1582, rebase on 100006 100006 0 1.0 1000.1 0.0X
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
after 1582, vec off, rebase off 14920 15045 116 6.7 149.2 1.0X
|
|
after 1582, vec off, rebase on 55062 55171 140 1.8 550.6 0.3X
|
|
after 1582, vec on, rebase off 4871 4952 72 20.5 48.7 3.1X
|
|
after 1582, vec on, rebase on 44955 44981 23 2.2 449.5 0.3X
|
|
before 1582, vec off, rebase off 15236 15386 142 6.6 152.4 1.0X
|
|
before 1582, vec off, rebase on 57290 57368 79 1.7 572.9 0.3X
|
|
before 1582, vec on, rebase off 4919 4930 15 20.3 49.2 3.0X
|
|
before 1582, vec on, rebase on 47351 47713 400 2.1 473.5 0.3X
|
|
|
|
|
|
================================================================================================
|
|
Rebasing dates/timestamps in ORC datasource
|
|
================================================================================================
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Save dates to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
after 1582, noop 9451 9451 0 10.6 94.5 1.0X
|
|
before 1582, noop 9765 9765 0 10.2 97.7 1.0X
|
|
after 1582 18722 18722 0 5.3 187.2 0.5X
|
|
before 1582 18864 18864 0 5.3 188.6 0.5X
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Load dates from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
after 1582, vec off 24897 25095 247 4.0 249.0 1.0X
|
|
after 1582, vec on 3719 3780 84 26.9 37.2 6.7X
|
|
before 1582, vec off 31290 31347 50 3.2 312.9 0.8X
|
|
before 1582, vec on 4166 4188 25 24.0 41.7 6.0X
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
after 1582, noop 2882 2882 0 34.7 28.8 1.0X
|
|
before 1582, noop 2991 2991 0 33.4 29.9 1.0X
|
|
after 1582 53951 53951 0 1.9 539.5 0.1X
|
|
before 1582 54276 54276 0 1.8 542.8 0.1X
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
------------------------------------------------------------------------------------------------------------------------
|
|
after 1582, vec off 41411 41514 97 2.4 414.1 1.0X
|
|
after 1582, vec on 32163 32201 36 3.1 321.6 1.3X
|
|
before 1582, vec off 43013 43111 131 2.3 430.1 1.0X
|
|
before 1582, vec on 34114 34152 45 2.9 341.1 1.2X
|
|
|
|
|