bef5828e12
### What changes were proposed in this pull request?
Skip timestamp rebasing after a global threshold when there is no difference between the Julian and Gregorian calendars. This avoids checking the hash maps of switch points, and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`.

### Why are the changes needed?
The changes fix perf regressions in conversions to/from the external type `java.sql.Timestamp`.

Before (see the PR's results https://github.com/apache/spark/pull/28440):
```
================================================================================================
Conversion from/to external types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             376            388          10         13.3          75.2       1.1X
Collect java.sql.Timestamp                         1878           1937          64          2.7         375.6       0.2X
```

After:
```
================================================================================================
Conversion from/to external types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             249            264          24         20.1          49.8       1.7X
Collect java.sql.Timestamp                         1503           1523          24          3.3         300.5       0.3X
```

Average perf improvements:
1. From java.sql.Timestamp is ~34%
2. To java.sql.Timestamp is ~16%

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`.

Closes #28441 from MaxGekk/opt-rebase-common-threshold.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
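The threshold idea in the commit message can be illustrated with a small Java sketch. It is not Spark's code: the class `RebaseThresholdSketch`, its `register` and `rebase` methods, the zone names, and the switch-point values are all hypothetical. It assumes each zone has a sorted array of switch points and that the offset after a zone's last switch point is zero, so any timestamp at or past the largest last switch point can skip the per-zone hash map.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: none of these names come from Spark.
final class RebaseThresholdSketch {
    // Per-zone switch points (microseconds since the epoch, ascending) and the
    // offset applied from each switch point onward. Assumption: the last offset
    // is 0, i.e. the calendars agree after a zone's last switch point.
    private static final class ZoneInfo {
        final long[] switches;
        final long[] diffs;

        ZoneInfo(long[] switches, long[] diffs) {
            this.switches = switches;
            this.diffs = diffs;
        }
    }

    private final Map<String, ZoneInfo> byZone = new HashMap<>();
    // Largest "last switch point" over all registered zones.
    private long globalThreshold = Long.MIN_VALUE;
    // Counts slow-path map lookups, for demonstration only.
    int mapLookups = 0;

    void register(String zoneId, long[] switches, long[] diffs) {
        byZone.put(zoneId, new ZoneInfo(switches, diffs));
        globalThreshold = Math.max(globalThreshold, switches[switches.length - 1]);
    }

    long rebase(String zoneId, long micros) {
        // Fast path: at or past every zone's last switch point nothing changes,
        // so the per-zone map is never consulted.
        if (micros >= globalThreshold) {
            return micros;
        }
        mapLookups++;
        ZoneInfo info = byZone.get(zoneId);
        if (info == null) {
            return micros; // sketch only: unknown zones are left unchanged
        }
        // Find the last switch point that is <= micros; timestamps before the
        // first switch point fall back to the first offset (a simplification).
        int i = info.switches.length - 1;
        while (i > 0 && micros < info.switches[i]) {
            i--;
        }
        return micros + info.diffs[i];
    }

    public static void main(String[] args) {
        RebaseThresholdSketch r = new RebaseThresholdSketch();
        r.register("Zone/A", new long[] {-1000L, -500L}, new long[] {10L, 0L});
        r.register("Zone/B", new long[] {-800L, -100L}, new long[] {20L, 0L});
        System.out.println(r.rebase("Zone/A", 0L));    // fast path, no map lookup
        System.out.println(r.rebase("Zone/A", -700L)); // slow path, one map lookup
        System.out.println("map lookups: " + r.mapLookups);
    }
}
```

The single comparison against the global threshold is what keeps recent timestamps cheap in this sketch. Only values below it pay for the map lookup and the search through switch points.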
143 lines
13 KiB
Plaintext
================================================================================================
Rebasing dates/timestamps in Parquet datasource
================================================================================================

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save DATE to parquet:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop                                  20802          20802           0          4.8         208.0       1.0X
before 1582, noop                                 10728          10728           0          9.3         107.3       1.9X
after 1582, rebase off                            32924          32924           0          3.0         329.2       0.6X
after 1582, rebase on                             32627          32627           0          3.1         326.3       0.6X
before 1582, rebase off                           21576          21576           0          4.6         215.8       1.0X
before 1582, rebase on                            23115          23115           0          4.3         231.2       0.9X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load DATE from parquet:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off, rebase off                   12880          12984         178          7.8         128.8       1.0X
after 1582, vec off, rebase on                    13118          13255         174          7.6         131.2       1.0X
after 1582, vec on, rebase off                     3645           3698          76         27.4          36.4       3.5X
after 1582, vec on, rebase on                      3709           3727          15         27.0          37.1       3.5X
before 1582, vec off, rebase off                  13014          13051          36          7.7         130.1       1.0X
before 1582, vec off, rebase on                   14195          14242          48          7.0         142.0       0.9X
before 1582, vec on, rebase off                    3680           3773          92         27.2          36.8       3.5X
before 1582, vec on, rebase on                     4310           4381          87         23.2          43.1       3.0X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save TIMESTAMP_INT96 to parquet:          Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1900, noop                                   3026           3026           0         33.1          30.3       1.0X
before 1900, noop                                  2995           2995           0         33.4          30.0       1.0X
after 1900, rebase off                            24294          24294           0          4.1         242.9       0.1X
after 1900, rebase on                             24480          24480           0          4.1         244.8       0.1X
before 1900, rebase off                           31120          31120           0          3.2         311.2       0.1X
before 1900, rebase on                            31201          31201           0          3.2         312.0       0.1X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load TIMESTAMP_INT96 from parquet:        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1900, vec off, rebase off                   18283          18309          39          5.5         182.8       1.0X
after 1900, vec off, rebase on                    18235          18269          53          5.5         182.4       1.0X
after 1900, vec on, rebase off                     9563           9589          27         10.5          95.6       1.9X
after 1900, vec on, rebase on                      9463           9554          81         10.6          94.6       1.9X
before 1900, vec off, rebase off                  21377          21469         118          4.7         213.8       0.9X
before 1900, vec off, rebase on                   21265          21422         156          4.7         212.7       0.9X
before 1900, vec on, rebase off                   12481          12524          46          8.0         124.8       1.5X
before 1900, vec on, rebase on                    12360          12482         105          8.1         123.6       1.5X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save TIMESTAMP_MICROS to parquet:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1900, noop                                   2984           2984           0         33.5          29.8       1.0X
before 1900, noop                                  3003           3003           0         33.3          30.0       1.0X
after 1900, rebase off                            15814          15814           0          6.3         158.1       0.2X
after 1900, rebase on                             16250          16250           0          6.2         162.5       0.2X
before 1900, rebase off                           16026          16026           0          6.2         160.3       0.2X
before 1900, rebase on                            19735          19735           0          5.1         197.3       0.2X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load TIMESTAMP_MICROS from parquet:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1900, vec off, rebase off                   15292          15351          57          6.5         152.9       1.0X
after 1900, vec off, rebase on                    15753          15886         173          6.3         157.5       1.0X
after 1900, vec on, rebase off                     4879           4923          52         20.5          48.8       3.1X
after 1900, vec on, rebase on                      5018           5038          18         19.9          50.2       3.0X
before 1900, vec off, rebase off                  15257          15311          53          6.6         152.6       1.0X
before 1900, vec off, rebase on                   18459          18537          90          5.4         184.6       0.8X
before 1900, vec on, rebase off                    4929           4946          15         20.3          49.3       3.1X
before 1900, vec on, rebase on                     8254           8339          93         12.1          82.5       1.9X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save TIMESTAMP_MILLIS to parquet:         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1900, noop                                   2987           2987           0         33.5          29.9       1.0X
before 1900, noop                                  3002           3002           0         33.3          30.0       1.0X
after 1900, rebase off                            15215          15215           0          6.6         152.1       0.2X
after 1900, rebase on                             15577          15577           0          6.4         155.8       0.2X
before 1900, rebase off                           15505          15505           0          6.4         155.1       0.2X
before 1900, rebase on                            19143          19143           0          5.2         191.4       0.2X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load TIMESTAMP_MILLIS from parquet:       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1900, vec off, rebase off                   15330          15436         113          6.5         153.3       1.0X
after 1900, vec off, rebase on                    15515          15549          30          6.4         155.1       1.0X
after 1900, vec on, rebase off                     6056           6074          19         16.5          60.6       2.5X
after 1900, vec on, rebase on                      6376           6390          14         15.7          63.8       2.4X
before 1900, vec off, rebase off                  15490          15523          36          6.5         154.9       1.0X
before 1900, vec off, rebase on                   18613          18685         118          5.4         186.1       0.8X
before 1900, vec on, rebase off                    6065           6109          41         16.5          60.6       2.5X
before 1900, vec on, rebase on                     9052           9082          32         11.0          90.5       1.7X


================================================================================================
Rebasing dates/timestamps in ORC datasource
================================================================================================

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save DATE to ORC:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop                                  20653          20653           0          4.8         206.5       1.0X
before 1582, noop                                 10707          10707           0          9.3         107.1       1.9X
after 1582                                        28288          28288           0          3.5         282.9       0.7X
before 1582                                       19196          19196           0          5.2         192.0       1.1X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load DATE from ORC:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               10596          10621          37          9.4         106.0       1.0X
after 1582, vec on                                 3886           3938          61         25.7          38.9       2.7X
before 1582, vec off                              10955          10984          26          9.1         109.6       1.0X
before 1582, vec on                                4236           4258          24         23.6          42.4       2.5X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save TIMESTAMP to ORC:                    Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1900, noop                                   2988           2988           0         33.5          29.9       1.0X
before 1900, noop                                  3007           3007           0         33.3          30.1       1.0X
after 1900                                        18082          18082           0          5.5         180.8       0.2X
before 1900                                       22669          22669           0          4.4         226.7       0.1X

OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load TIMESTAMP from ORC:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1900, vec off                               12029          12035           9          8.3         120.3       1.0X
after 1900, vec on                                 5194           5197           3         19.3          51.9       2.3X
before 1900, vec off                              14853          14875          23          6.7         148.5       0.8X
before 1900, vec on                                7797           7836          60         12.8          78.0       1.5X