2020-03-30 04:46:31 -04:00
|
|
|
================================================================================================
|
|
|
|
Rebasing dates/timestamps in Parquet datasource
|
|
|
|
================================================================================================
|
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
2020-03-30 04:46:31 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Save DATE to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-03-30 04:46:31 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1582, noop 20802 20802 0 4.8 208.0 1.0X
|
|
|
|
before 1582, noop 10728 10728 0 9.3 107.3 1.9X
|
|
|
|
after 1582, rebase off 32924 32924 0 3.0 329.2 0.6X
|
|
|
|
after 1582, rebase on 32627 32627 0 3.1 326.3 0.6X
|
|
|
|
before 1582, rebase off 21576 21576 0 4.6 215.8 1.0X
|
|
|
|
before 1582, rebase on 23115 23115 0 4.3 231.2 0.9X
|
2020-03-30 04:46:31 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
2020-03-30 04:46:31 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Load DATE from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-03-30 04:46:31 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1582, vec off, rebase off 12880 12984 178 7.8 128.8 1.0X
|
|
|
|
after 1582, vec off, rebase on 13118 13255 174 7.6 131.2 1.0X
|
|
|
|
after 1582, vec on, rebase off 3645 3698 76 27.4 36.4 3.5X
|
|
|
|
after 1582, vec on, rebase on 3709 3727 15 27.0 37.1 3.5X
|
|
|
|
before 1582, vec off, rebase off 13014 13051 36 7.7 130.1 1.0X
|
|
|
|
before 1582, vec off, rebase on 14195 14242 48 7.0 142.0 0.9X
|
|
|
|
before 1582, vec on, rebase off 3680 3773 92 27.2 36.8 3.5X
|
|
|
|
before 1582, vec on, rebase on 4310 4381 87 23.2 43.1 3.0X
|
2020-03-30 04:46:31 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
2020-03-30 04:46:31 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Save TIMESTAMP_INT96 to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-03-30 04:46:31 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, noop 3026 3026 0 33.1 30.3 1.0X
|
2020-05-05 01:40:15 -04:00
|
|
|
before 1900, noop 2995 2995 0 33.4 30.0 1.0X
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, rebase off 24294 24294 0 4.1 242.9 0.1X
|
|
|
|
after 1900, rebase on 24480 24480 0 4.1 244.8 0.1X
|
|
|
|
before 1900, rebase off 31120 31120 0 3.2 311.2 0.1X
|
|
|
|
before 1900, rebase on 31201 31201 0 3.2 312.0 0.1X
|
2020-03-30 04:46:31 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
2020-03-30 04:46:31 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Load TIMESTAMP_INT96 from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-03-30 04:46:31 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, vec off, rebase off 18283 18309 39 5.5 182.8 1.0X
|
|
|
|
after 1900, vec off, rebase on 18235 18269 53 5.5 182.4 1.0X
|
|
|
|
after 1900, vec on, rebase off 9563 9589 27 10.5 95.6 1.9X
|
|
|
|
after 1900, vec on, rebase on 9463 9554 81 10.6 94.6 1.9X
|
|
|
|
before 1900, vec off, rebase off 21377 21469 118 4.7 213.8 0.9X
|
|
|
|
before 1900, vec off, rebase on 21265 21422 156 4.7 212.7 0.9X
|
|
|
|
before 1900, vec on, rebase off 12481 12524 46 8.0 124.8 1.5X
|
|
|
|
before 1900, vec on, rebase on 12360 12482 105 8.1 123.6 1.5X
|
2020-05-05 01:40:15 -04:00
|
|
|
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Save TIMESTAMP_MICROS to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, noop 2984 2984 0 33.5 29.8 1.0X
|
|
|
|
before 1900, noop 3003 3003 0 33.3 30.0 1.0X
|
|
|
|
after 1900, rebase off 15814 15814 0 6.3 158.1 0.2X
|
|
|
|
after 1900, rebase on 16250 16250 0 6.2 162.5 0.2X
|
|
|
|
before 1900, rebase off 16026 16026 0 6.2 160.3 0.2X
|
|
|
|
before 1900, rebase on 19735 19735 0 5.1 197.3 0.2X
|
2020-05-05 01:40:15 -04:00
|
|
|
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Load TIMESTAMP_MICROS from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, vec off, rebase off 15292 15351 57 6.5 152.9 1.0X
|
|
|
|
after 1900, vec off, rebase on 15753 15886 173 6.3 157.5 1.0X
|
|
|
|
after 1900, vec on, rebase off 4879 4923 52 20.5 48.8 3.1X
|
|
|
|
after 1900, vec on, rebase on 5018 5038 18 19.9 50.2 3.0X
|
|
|
|
before 1900, vec off, rebase off 15257 15311 53 6.6 152.6 1.0X
|
|
|
|
before 1900, vec off, rebase on 18459 18537 90 5.4 184.6 0.8X
|
|
|
|
before 1900, vec on, rebase off 4929 4946 15 20.3 49.3 3.1X
|
|
|
|
before 1900, vec on, rebase on 8254 8339 93 12.1 82.5 1.9X
|
2020-05-05 01:40:15 -04:00
|
|
|
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Save TIMESTAMP_MILLIS to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, noop 2987 2987 0 33.5 29.9 1.0X
|
|
|
|
before 1900, noop 3002 3002 0 33.3 30.0 1.0X
|
|
|
|
after 1900, rebase off 15215 15215 0 6.6 152.1 0.2X
|
|
|
|
after 1900, rebase on 15577 15577 0 6.4 155.8 0.2X
|
|
|
|
before 1900, rebase off 15505 15505 0 6.4 155.1 0.2X
|
|
|
|
before 1900, rebase on 19143 19143 0 5.2 191.4 0.2X
|
2020-05-05 01:40:15 -04:00
|
|
|
|
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
|
|
|
Load TIMESTAMP_MILLIS from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, vec off, rebase off 15330 15436 113 6.5 153.3 1.0X
|
|
|
|
after 1900, vec off, rebase on 15515 15549 30 6.4 155.1 1.0X
|
|
|
|
after 1900, vec on, rebase off 6056 6074 19 16.5 60.6 2.5X
|
|
|
|
after 1900, vec on, rebase on 6376 6390 14 15.7 63.8 2.4X
|
|
|
|
before 1900, vec off, rebase off 15490 15523 36 6.5 154.9 1.0X
|
|
|
|
before 1900, vec off, rebase on 18613 18685 118 5.4 186.1 0.8X
|
|
|
|
before 1900, vec on, rebase off 6065 6109 41 16.5 60.6 2.5X
|
|
|
|
before 1900, vec on, rebase on 9052 9082 32 11.0 90.5 1.7X
|
2020-04-01 03:02:26 -04:00
|
|
|
|
|
|
|
|
|
|
|
================================================================================================
|
|
|
|
Rebasing dates/timestamps in ORC datasource
|
|
|
|
================================================================================================
|
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
2020-04-01 03:02:26 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Save DATE to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-04-01 03:02:26 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1582, noop 20653 20653 0 4.8 206.5 1.0X
|
|
|
|
before 1582, noop 10707 10707 0 9.3 107.1 1.9X
|
|
|
|
after 1582 28288 28288 0 3.5 282.9 0.7X
|
|
|
|
before 1582 19196 19196 0 5.2 192.0 1.1X
|
2020-04-01 03:02:26 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
2020-04-01 03:02:26 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Load DATE from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-04-01 03:02:26 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1582, vec off 10596 10621 37 9.4 106.0 1.0X
|
|
|
|
after 1582, vec on 3886 3938 61 25.7 38.9 2.7X
|
|
|
|
before 1582, vec off 10955 10984 26 9.1 109.6 1.0X
|
|
|
|
before 1582, vec on 4236 4258 24 23.6 42.4 2.5X
|
2020-04-01 03:02:26 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
2020-04-01 03:02:26 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Save TIMESTAMP to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-04-01 03:02:26 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, noop 2988 2988 0 33.5 29.9 1.0X
|
|
|
|
before 1900, noop 3007 3007 0 33.3 30.1 1.0X
|
|
|
|
after 1900 18082 18082 0 5.5 180.8 0.2X
|
|
|
|
before 1900 22669 22669 0 4.4 226.7 0.1X
|
2020-04-01 03:02:26 -04:00
|
|
|
|
[SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?
Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.
### Why are the changes needed?
Parquet vectorized reader is carefully implemented, to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrade the performance a lot, as it breaks vectorization, even if the datetime values don't need to rebase (this is very likely as dates before 1582 is rare).
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2677 2838 142 37.4 26.8 1.0X
[info] after 1582, vec on, rebase on 3828 4331 805 26.1 38.3 0.7X
[info] before 1582, vec on, rebase off 2903 2926 34 34.4 29.0 0.9X
[info] before 1582, vec on, rebase on 4163 4197 38 24.0 41.6 0.6X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3537 3627 104 28.3 35.4 1.0X
[info] after 1900, vec on, rebase on 6891 7010 105 14.5 68.9 0.5X
[info] before 1900, vec on, rebase off 3692 3770 72 27.1 36.9 1.0X
[info] before 1900, vec on, rebase on 7588 7610 30 13.2 75.9 0.5X
```
After this patch
```
[info] Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off 2758 2944 197 36.3 27.6 1.0X
[info] after 1582, vec on, rebase on 2908 2966 51 34.4 29.1 0.9X
[info] before 1582, vec on, rebase off 2840 2878 37 35.2 28.4 1.0X
[info] before 1582, vec on, rebase on 3407 3433 24 29.4 34.1 0.8X
[info] Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off 3861 4003 139 25.9 38.6 1.0X
[info] after 1900, vec on, rebase on 4194 4283 77 23.8 41.9 0.9X
[info] before 1900, vec on, rebase off 3849 3937 79 26.0 38.5 1.0X
[info] before 1900, vec on, rebase on 7512 7546 55 13.3 75.1 0.5X
```
Date type is 30% faster if the values don't need to rebase, 20% faster if need to rebase.
Timestamp type is 60% faster if the values don't need to rebase, no difference if need to rebase.
Closes #28406 from cloud-fan/perf.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 02:30:10 -04:00
|
|
|
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
|
2020-04-01 03:02:26 -04:00
|
|
|
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
|
2020-05-05 01:40:15 -04:00
|
|
|
Load TIMESTAMP from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
|
2020-04-01 03:02:26 -04:00
|
|
|
------------------------------------------------------------------------------------------------------------------------
|
2020-05-05 10:11:53 -04:00
|
|
|
after 1900, vec off 12029 12035 9 8.3 120.3 1.0X
|
|
|
|
after 1900, vec on 5194 5197 3 19.3 51.9 2.3X
|
|
|
|
before 1900, vec off 14853 14875 23 6.7 148.5 0.8X
|
|
|
|
before 1900, vec on 7797 7836 60 12.8 78.0 1.5X
|
2020-03-30 04:46:31 -04:00
|
|
|
|
|
|
|
|