a0f8cc08a3
### What changes were proposed in this pull request?
Reuse the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` functions introduced by PR #28119 in `DateTimeUtils.toJavaTimestamp()` and `fromJavaTimestamp()`. In effect, the new implementation is derived from Spark 2.4 plus rebasing via pre-calculated rebasing maps.

### Why are the changes needed?
The changes speed up conversions to/from `java.sql.Timestamp`, and as a consequence the PR improves the performance of the ORC datasource in loading/saving timestamps:
- Saving is ~**x2.8 faster** in master, and -11% against Spark 2.4.6
- Loading is **x3.2-4.5 faster** in master, and -5% against Spark 2.4.6

Before:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        59877          59877           0          1.7         598.8       0.0X
before 1582                                       61361          61361           0          1.6         613.6       0.0X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48197          48288         118          2.1         482.0       1.0X
after 1582, vec on                                38247          38351         128          2.6         382.5       1.3X
before 1582, vec off                              53179          53359         249          1.9         531.8       0.9X
before 1582, vec on                               44076          44268         269          2.3         440.8       1.1X
```

After:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        21250          21250           0          4.7         212.5       0.1X
before 1582                                       22105          22105           0          4.5         221.0       0.1X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               14903          14933          40          6.7         149.0       1.0X
after 1582, vec on                                 8342           8426          73         12.0          83.4       1.8X
before 1582, vec off                              15528          15575          76          6.4         155.3       1.0X
before 1582, vec on                                9025           9075          61         11.1          90.2       1.7X
```

Spark 2.4.6-SNAPSHOT:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        18858          18858           0          5.3         188.6       1.0X
before 1582                                       18508          18508           0          5.4         185.1       1.0X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               14063          14177         143          7.1         140.6       1.0X
after 1582, vec on                                 5955           6029         100         16.8          59.5       2.4X
before 1582, vec off                              14119          14126           7          7.1         141.2       1.0X
before 1582, vec on                                5991           6007          25         16.7          59.9       2.3X
```

### Does this PR introduce any user-facing change?
Yes, the `to_utc_timestamp` function returns the later local timestamp in the case of overlapping local timestamps at daylight saving time. This restores the 2.4 behavior.

### How was this patch tested?
- By the existing test suites `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `ParquetIOSuite`, and `OrcHadoopFsRelationSuite`.
- By re-generating the results of the benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28189 from MaxGekk/optimize-to-from-java-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
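
For readers unfamiliar with why rebasing is needed at all, the following is a minimal sketch of the underlying calendar mismatch, not Spark's implementation. It assumes only the JDK: `java.util.GregorianCalendar` (the hybrid Julian + Gregorian calendar, which `java.sql.Timestamp` relies on for its date fields) and `java.time.LocalDate` (Proleptic Gregorian). The class `RebaseSketch` and its method names are hypothetical and work at day granularity rather than microseconds.

```java
import java.time.LocalDate;
import java.util.GregorianCalendar;
import java.util.TimeZone;

// Illustration only: RebaseSketch and its methods are hypothetical names,
// not Spark's DateTimeUtils or its rebasing functions.
class RebaseSketch {
    private static final long MILLIS_PER_DAY = 86_400_000L;

    // Epoch day of a local date interpreted in the hybrid Julian + Gregorian
    // calendar that java.util.GregorianCalendar uses by default.
    static long hybridEpochDay(int year, int month, int day) {
        GregorianCalendar cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        cal.clear();                    // reset all fields, so time of day is 00:00:00.000
        cal.set(year, month - 1, day);  // Calendar months are 0-based
        return Math.floorDiv(cal.getTimeInMillis(), MILLIS_PER_DAY);
    }

    // Epoch day of the same local date in the Proleptic Gregorian calendar (java.time).
    static long prolepticEpochDay(int year, int month, int day) {
        return LocalDate.of(year, month, day).toEpochDay();
    }

    // Days to add to a hybrid-calendar epoch day so that the Proleptic Gregorian
    // calendar shows the same year-month-day.
    static long julianToGregorianShiftDays(int year, int month, int day) {
        return prolepticEpochDay(year, month, day) - hybridEpochDay(year, month, day);
    }

    public static void main(String[] args) {
        int[][] dates = {{1000, 1, 1}, {1582, 10, 4}, {1582, 10, 15}, {2020, 4, 9}};
        for (int[] d : dates) {
            System.out.printf("%04d-%02d-%02d shift=%d days%n",
                d[0], d[1], d[2], julianToGregorianShiftDays(d[0], d[1], d[2]));
        }
    }
}
```

Dates on or after the 1582-10-15 cutover need no shift, while earlier dates need a shift that stays constant over long stretches of time. That piecewise-constant shape is presumably what makes a pre-calculated lookup workable, though the actual map layout used by the PR is not shown here.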
95 lines
8.6 KiB
Plaintext
================================================================================================
Rebasing dates/timestamps in Parquet datasource
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save dates to parquet:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop                                  24114          24114           0          4.1         241.1       1.0X
before 1582, noop                                 10250          10250           0          9.8         102.5       2.4X
after 1582, rebase off                            36672          36672           0          2.7         366.7       0.7X
after 1582, rebase on                             37123          37123           0          2.7         371.2       0.6X
before 1582, rebase off                           21925          21925           0          4.6         219.2       1.1X
before 1582, rebase on                            22341          22341           0          4.5         223.4       1.1X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off, rebase off                   12456          12601         126          8.0         124.6       1.0X
after 1582, vec off, rebase on                    13299          13336          32          7.5         133.0       0.9X
after 1582, vec on, rebase off                     3623           3660          40         27.6          36.2       3.4X
after 1582, vec on, rebase on                      5160           5177          15         19.4          51.6       2.4X
before 1582, vec off, rebase off                  13177          13264          76          7.6         131.8       0.9X
before 1582, vec off, rebase on                   14102          14149          46          7.1         141.0       0.9X
before 1582, vec on, rebase off                    3649           3670          34         27.4          36.5       3.4X
before 1582, vec on, rebase on                     5652           5667          15         17.7          56.5       2.2X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save timestamps to parquet:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop                                   2871           2871           0         34.8          28.7       1.0X
before 1582, noop                                  2753           2753           0         36.3          27.5       1.0X
after 1582, rebase off                            15927          15927           0          6.3         159.3       0.2X
after 1582, rebase on                             19138          19138           0          5.2         191.4       0.1X
before 1582, rebase off                           16137          16137           0          6.2         161.4       0.2X
before 1582, rebase on                            19584          19584           0          5.1         195.8       0.1X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off, rebase off                   14995          15047          47          6.7         150.0       1.0X
after 1582, vec off, rebase on                    18111          18146          37          5.5         181.1       0.8X
after 1582, vec on, rebase off                     4837           4873          44         20.7          48.4       3.1X
after 1582, vec on, rebase on                      9542           9669         111         10.5          95.4       1.6X
before 1582, vec off, rebase off                  14993          15090          94          6.7         149.9       1.0X
before 1582, vec off, rebase on                   18675          18712          64          5.4         186.7       0.8X
before 1582, vec on, rebase off                    4908           4923          15         20.4          49.1       3.1X
before 1582, vec on, rebase on                    10128          10148          19          9.9         101.3       1.5X


================================================================================================
Rebasing dates/timestamps in ORC datasource
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save dates to ORC:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop                                  23977          23977           0          4.2         239.8       1.0X
before 1582, noop                                 10094          10094           0          9.9         100.9       2.4X
after 1582                                        33115          33115           0          3.0         331.2       0.7X
before 1582                                       19430          19430           0          5.1         194.3       1.2X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               10217          10241          21          9.8         102.2       1.0X
after 1582, vec on                                 3671           3691          31         27.2          36.7       2.8X
before 1582, vec off                              10800          10874         114          9.3         108.0       0.9X
before 1582, vec on                                4118           4165          74         24.3          41.2       2.5X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop                                   2691           2691           0         37.2          26.9       1.0X
before 1582, noop                                  2743           2743           0         36.5          27.4       1.0X
after 1582                                        21409          21409           0          4.7         214.1       0.1X
before 1582                                       22554          22554           0          4.4         225.5       0.1X

OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               14752          14855         103          6.8         147.5       1.0X
after 1582, vec on                                 8146           8185          34         12.3          81.5       1.8X
before 1582, vec off                              15247          15294          46          6.6         152.5       1.0X
before 1582, vec on                                8414           8466          52         11.9          84.1       1.8X