spark-instrumented-optimizer/sql/core/benchmarks/DateTimeRebaseBenchmark-results.txt
Max Gekk cac8d1b352 [SPARK-31398][SQL] Fix perf regression of loading dates before 1582 year by non-vectorized ORC reader
### What changes were proposed in this pull request?
In regular ORC reader when `spark.sql.orc.enableVectorizedReader` is set to `false`, I propose to use `DaysWritable` in reading DATE values from ORC files. Currently, days from ORC files are converted to java.sql.Date, and then to days in Proleptic Gregorian calendar. So, the conversion to Java type can be eliminated.

### Why are the changes needed?
- The PR fixes regressions in loading dates before the 1582 year from ORC files by when vectorised ORC reader is off.
- The changes improve performance of regular ORC reader for DATE columns.
  - x3.6 faster comparing to the current master
  - x1.9-x4.3 faster against Spark 2.4.6

Before (on JDK 8):
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               39651          39686          31          2.5         396.5       1.0X
after 1582, vec on                                 3647           3660          13         27.4          36.5      10.9X
before 1582, vec off                              38155          38219          61          2.6         381.6       1.0X
before 1582, vec on                                4041           4046           6         24.7          40.4       9.8X
```

After (on JDK 8):
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               10947          10971          28          9.1         109.5       1.0X
after 1582, vec on                                 3677           3702          36         27.2          36.8       3.0X
before 1582, vec off                              11456          11472          21          8.7         114.6       1.0X
before 1582, vec on                                4079           4103          21         24.5          40.8       2.7X
```

Spark 2.4.6:
```
Load dates from ORC:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48169          48276          96          2.1         481.7       1.0X
after 1582, vec on                                 5375           5410          41         18.6          53.7       9.0X
before 1582, vec off                              22353          22482         198          4.5         223.5       2.2X
before 1582, vec on                                5474           5475           1         18.3          54.7       8.8X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing tests suites like `DateTimeUtilsSuite`
- Checked for `hive-1.2` by:
```
./build/sbt -Phive-1.2 "test:testOnly *OrcHadoopFsRelationSuite"
```
- Re-run `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28169 from MaxGekk/orc-optimize-dates.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-13 05:29:54 +00:00

95 lines
8.6 KiB
Plaintext

================================================================================================
Rebasing dates/timestamps in Parquet datasource
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save dates to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop 23453 23453 0 4.3 234.5 1.0X
before 1582, noop 10821 10821 0 9.2 108.2 2.2X
after 1582, rebase off 35558 35558 0 2.8 355.6 0.7X
after 1582, rebase on 35892 35892 0 2.8 358.9 0.7X
before 1582, rebase off 22700 22700 0 4.4 227.0 1.0X
before 1582, rebase on 23247 23247 0 4.3 232.5 1.0X
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load dates from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off, rebase off 12641 12690 43 7.9 126.4 1.0X
after 1582, vec off, rebase on 13318 13380 67 7.5 133.2 0.9X
after 1582, vec on, rebase off 3648 3659 11 27.4 36.5 3.5X
after 1582, vec on, rebase on 5160 5212 69 19.4 51.6 2.4X
before 1582, vec off, rebase off 13024 13065 36 7.7 130.2 1.0X
before 1582, vec off, rebase on 13810 13932 106 7.2 138.1 0.9X
before 1582, vec on, rebase off 3631 3695 57 27.5 36.3 3.5X
before 1582, vec on, rebase on 5791 5860 71 17.3 57.9 2.2X
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save timestamps to parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop 2750 2750 0 36.4 27.5 1.0X
before 1582, noop 2833 2833 0 35.3 28.3 1.0X
after 1582, rebase off 16832 16832 0 5.9 168.3 0.2X
after 1582, rebase on 19688 19688 0 5.1 196.9 0.1X
before 1582, rebase off 17548 17548 0 5.7 175.5 0.2X
before 1582, rebase on 21343 21343 0 4.7 213.4 0.1X
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load timestamps from parquet: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off, rebase off 15243 15329 82 6.6 152.4 1.0X
after 1582, vec off, rebase on 18296 18330 54 5.5 183.0 0.8X
after 1582, vec on, rebase off 4925 4927 2 20.3 49.2 3.1X
after 1582, vec on, rebase on 9647 9686 35 10.4 96.5 1.6X
before 1582, vec off, rebase off 14880 15105 267 6.7 148.8 1.0X
before 1582, vec off, rebase on 18474 18514 51 5.4 184.7 0.8X
before 1582, vec on, rebase off 4970 4978 10 20.1 49.7 3.1X
before 1582, vec on, rebase on 9938 10012 64 10.1 99.4 1.5X
================================================================================================
Rebasing dates/timestamps in ORC datasource
================================================================================================
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save dates to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop 23500 23500 0 4.3 235.0 1.0X
before 1582, noop 10788 10788 0 9.3 107.9 2.2X
after 1582 32237 32237 0 3.1 322.4 0.7X
before 1582 20187 20187 0 5.0 201.9 1.2X
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load dates from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off 10947 10971 28 9.1 109.5 1.0X
after 1582, vec on 3677 3702 36 27.2 36.8 3.0X
before 1582, vec off 11456 11472 21 8.7 114.6 1.0X
before 1582, vec on 4079 4103 21 24.5 40.8 2.7X
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Save timestamps to ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, noop 2891 2891 0 34.6 28.9 1.0X
before 1582, noop 2906 2906 0 34.4 29.1 1.0X
after 1582 55812 55812 0 1.8 558.1 0.1X
before 1582 57512 57512 0 1.7 575.1 0.1X
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz
Load timestamps from ORC: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off 46376 46410 33 2.2 463.8 1.0X
after 1582, vec on 35003 35189 163 2.9 350.0 1.3X
before 1582, vec off 52942 52979 34 1.9 529.4 0.9X
before 1582, vec on 42596 42747 158 2.3 426.0 1.1X