2020-03-30 04:46:31 -04:00
|
|
|
================================================================================================
|
|
|
|
Rebasing dates/timestamps in Parquet datasource
|
|
|
|
================================================================================================
|
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
2020-03-30 04:46:31 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Save DATE to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-03-30 04:46:31 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1582, noop 23567 23567 0 4.2 235.7 1.0X
|
|
|
|
before 1582, noop 10570 10570 0 9.5 105.7 2.2X
|
|
|
|
after 1582, rebase off 35335 35335 0 2.8 353.3 0.7X
|
|
|
|
after 1582, rebase on 35645 35645 0 2.8 356.5 0.7X
|
|
|
|
before 1582, rebase off 21824 21824 0 4.6 218.2 1.1X
|
|
|
|
before 1582, rebase on 22532 22532 0 4.4 225.3 1.0X
|
2020-03-30 04:46:31 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
2020-03-30 04:46:31 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Load DATE from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-03-30 04:46:31 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1582, vec off, rebase off 13194 13266 81 7.6 131.9 1.0X
|
|
|
|
after 1582, vec off, rebase on 13402 13466 89 7.5 134.0 1.0X
|
|
|
|
after 1582, vec on, rebase off 3627 3657 29 27.6 36.3 3.6X
|
|
|
|
after 1582, vec on, rebase on 3818 3839 26 26.2 38.2 3.5X
|
|
|
|
before 1582, vec off, rebase off 13075 13146 115 7.6 130.7 1.0X
|
|
|
|
before 1582, vec off, rebase on 13794 13804 13 7.2 137.9 1.0X
|
|
|
|
before 1582, vec on, rebase off 3655 3675 21 27.4 36.6 3.6X
|
|
|
|
before 1582, vec on, rebase on 4579 4634 72 21.8 45.8 2.9X
|
2020-03-30 04:46:31 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
2020-03-30 04:46:31 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Save TIMESTAMP_INT96 to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-03-30 04:46:31 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, noop 2671 2671 0 37.4 26.7 1.0X
|
|
|
|
before 1900, noop 2685 2685 0 37.2 26.8 1.0X
|
|
|
|
after 1900, rebase off 23899 23899 0 4.2 239.0 0.1X
|
|
|
|
after 1900, rebase on 24030 24030 0 4.2 240.3 0.1X
|
|
|
|
before 1900, rebase off 30178 30178 0 3.3 301.8 0.1X
|
|
|
|
before 1900, rebase on 30127 30127 0 3.3 301.3 0.1X
|
2020-03-30 04:46:31 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
2020-03-30 04:46:31 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Load TIMESTAMP_INT96 from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-03-30 04:46:31 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, vec off, rebase off 16613 16685 75 6.0 166.1 1.0X
|
|
|
|
after 1900, vec off, rebase on 16487 16541 47 6.1 164.9 1.0X
|
|
|
|
after 1900, vec on, rebase off 8840 8870 49 11.3 88.4 1.9X
|
|
|
|
after 1900, vec on, rebase on 8795 8813 20 11.4 87.9 1.9X
|
|
|
|
before 1900, vec off, rebase off 20400 20441 62 4.9 204.0 0.8X
|
|
|
|
before 1900, vec off, rebase on 20430 20481 60 4.9 204.3 0.8X
|
|
|
|
before 1900, vec on, rebase off 12211 12290 73 8.2 122.1 1.4X
|
|
|
|
before 1900, vec on, rebase on 12231 12321 95 8.2 122.3 1.4X
|
2020-05-05 01:40:15 -04:00
|
|
|
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Save TIMESTAMP_MICROS to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, noop 2836 2836 0 35.3 28.4 1.0X
|
|
|
|
before 1900, noop 2812 2812 0 35.6 28.1 1.0X
|
|
|
|
after 1900, rebase off 15976 15976 0 6.3 159.8 0.2X
|
|
|
|
after 1900, rebase on 16197 16197 0 6.2 162.0 0.2X
|
|
|
|
before 1900, rebase off 16140 16140 0 6.2 161.4 0.2X
|
|
|
|
before 1900, rebase on 20410 20410 0 4.9 204.1 0.1X
|
2020-05-05 01:40:15 -04:00
|
|
|
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Load TIMESTAMP_MICROS from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, vec off, rebase off 15297 15324 40 6.5 153.0 1.0X
|
|
|
|
after 1900, vec off, rebase on 15771 15832 59 6.3 157.7 1.0X
|
|
|
|
after 1900, vec on, rebase off 4922 4949 32 20.3 49.2 3.1X
|
|
|
|
after 1900, vec on, rebase on 5392 5411 17 18.5 53.9 2.8X
|
|
|
|
before 1900, vec off, rebase off 15227 15385 141 6.6 152.3 1.0X
|
|
|
|
before 1900, vec off, rebase on 19611 19658 41 5.1 196.1 0.8X
|
|
|
|
before 1900, vec on, rebase off 4965 5013 54 20.1 49.6 3.1X
|
|
|
|
before 1900, vec on, rebase on 9847 9873 43 10.2 98.5 1.6X
|
2020-05-05 01:40:15 -04:00
|
|
|
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Save TIMESTAMP_MILLIS to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, noop 2818 2818 0 35.5 28.2 1.0X
|
|
|
|
before 1900, noop 2805 2805 0 35.6 28.1 1.0X
|
|
|
|
after 1900, rebase off 15182 15182 0 6.6 151.8 0.2X
|
|
|
|
after 1900, rebase on 15614 15614 0 6.4 156.1 0.2X
|
|
|
|
before 1900, rebase off 15404 15404 0 6.5 154.0 0.2X
|
|
|
|
before 1900, rebase on 19747 19747 0 5.1 197.5 0.1X
|
2020-05-05 01:40:15 -04:00
|
|
|
|
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Load TIMESTAMP_MILLIS from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, vec off, rebase off 15622 15649 24 6.4 156.2 1.0X
|
|
|
|
after 1900, vec off, rebase on 15572 15677 119 6.4 155.7 1.0X
|
|
|
|
after 1900, vec on, rebase off 6345 6358 15 15.8 63.5 2.5X
|
|
|
|
after 1900, vec on, rebase on 6780 6834 92 14.8 67.8 2.3X
|
|
|
|
before 1900, vec off, rebase off 15540 15584 38 6.4 155.4 1.0X
|
|
|
|
before 1900, vec off, rebase on 19590 19653 55 5.1 195.9 0.8X
|
|
|
|
before 1900, vec on, rebase off 6374 6381 10 15.7 63.7 2.5X
|
|
|
|
before 1900, vec on, rebase on 10530 10544 25 9.5 105.3 1.5X
|
2020-04-01 03:02:26 -04:00
|
|
|
|
|
|
|
|
|
|
|
================================================================================================
|
|
|
|
Rebasing dates/timestamps in ORC datasource
|
|
|
|
================================================================================================
|
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
2020-04-01 03:02:26 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Save DATE to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-04-01 03:02:26 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1582, noop 23825 23825 0 4.2 238.2 1.0X
|
|
|
|
before 1582, noop 10501 10501 0 9.5 105.0 2.3X
|
|
|
|
after 1582 32134 32134 0 3.1 321.3 0.7X
|
|
|
|
before 1582 19947 19947 0 5.0 199.5 1.2X
|
2020-04-01 03:02:26 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
2020-04-01 03:02:26 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Load DATE from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-04-01 03:02:26 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1582, vec off 10034 10056 22 10.0 100.3 1.0X
|
|
|
|
after 1582, vec on 3664 3698 30 27.3 36.6 2.7X
|
|
|
|
before 1582, vec off 10472 10502 30 9.5 104.7 1.0X
|
|
|
|
before 1582, vec on 4052 4098 42 24.7 40.5 2.5X
|
2020-04-01 03:02:26 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
2020-04-01 03:02:26 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Save TIMESTAMP to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-04-01 03:02:26 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, noop 2812 2812 0 35.6 28.1 1.0X
|
|
|
|
before 1900, noop 2801 2801 0 35.7 28.0 1.0X
|
|
|
|
after 1900 18290 18290 0 5.5 182.9 0.2X
|
|
|
|
before 1900 22344 22344 0 4.5 223.4 0.1X
|
2020-04-01 03:02:26 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
|
2020-04-01 03:02:26 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Load TIMESTAMP from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-04-01 03:02:26 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, vec off 11257 11279 32 8.9 112.6 1.0X
|
|
|
|
after 1900, vec on 5296 5310 15 18.9 53.0 2.1X
|
|
|
|
before 1900, vec off 14700 14758 72 6.8 147.0 0.8X
|
|
|
|
before 1900, vec on 8576 8665 150 11.7 85.8 1.3X
|
2020-03-30 04:46:31 -04:00
|
|
|
|
|
|
|
|