spark-instrumented-optimizer/sql/core
Max Gekk ce63bef1da [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns
### What changes were proposed in this pull request?
Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle `DateType` specially when the passed parameter `rebaseDateTime` is true. In that case, decoded days are rebased from the hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar using `RebaseDateTime.rebaseJulianToGregorianDays()`.
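The rebasing step can be illustrated outside Spark with a minimal sketch. It is not the code of `VectorizedColumnReader` or `RebaseDateTime`: the object `DictionaryRebaseSketch`, the plain `Array[Int]` dictionary, and both method bodies are assumptions made for illustration, built only on the JDK's `GregorianCalendar` (hybrid calendar) and `LocalDate` (Proleptic Gregorian calendar).

```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

object DictionaryRebaseSketch {
  private val MillisPerDay = 86400000L

  // Hypothetical stand-in for the rebase step (not Spark's implementation):
  // read `days` (days since 1970-01-01) as a local date in the hybrid
  // Julian + Gregorian calendar, then return the day count of the same
  // local date in the Proleptic Gregorian calendar.
  def julianToGregorianDays(days: Int): Int = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.setTimeInMillis(days * MillisPerDay)
    val yearOfEra = cal.get(Calendar.YEAR)
    // GregorianCalendar reports BC years as year-of-era; LocalDate wants a proleptic year.
    val year = if (cal.get(Calendar.ERA) == GregorianCalendar.BC) 1 - yearOfEra else yearOfEra
    val month = cal.get(Calendar.MONTH) + 1
    val day = cal.get(Calendar.DAY_OF_MONTH)
    // plusDays lets a Julian-only date such as 1000-02-29 roll over to 1000-03-01
    // instead of throwing; this rollover is a choice of the sketch.
    val localDate = LocalDate.of(year, month, 1).plusDays(day - 1)
    Math.toIntExact(localDate.toEpochDay)
  }

  // Resolve dictionary ids to day counts, rebasing only when requested.
  def decodeDictionaryIds(
      ids: Array[Int],
      dictionary: Array[Int],
      rebaseDateTime: Boolean): Array[Int] =
    ids.map { id =>
      val days = dictionary(id)
      if (rebaseDateTime) julianToGregorianDays(days) else days
    }
}
```

In this sketch, day counts on or after the cutover pass through unchanged, because `GregorianCalendar` and `LocalDate` agree from 1582-10-15 onward.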

### Why are the changes needed?
This fixes a bug in loading dates before the cutover day (1582-10-15) from dictionary-encoded columns in Parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```
Load the dates back:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```
Expected values **must be 1001-01-01**, not 1001-01-07.
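The six-day shift follows from plain calendar arithmetic and can be checked without Spark. The sketch below is illustrative only: `CalendarShiftSketch` and `hybridEpochDay` are hypothetical names, and the JDK's `GregorianCalendar` is assumed to stand in for the hybrid calendar used on the write side.

```scala
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

object CalendarShiftSketch {
  // Days since 1970-01-01 for a local date interpreted in the hybrid
  // Julian + Gregorian calendar (Julian before 1582-10-15).
  def hybridEpochDay(year: Int, month: Int, day: Int): Long = {
    val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
    cal.clear()                    // midnight, no leftover fields
    cal.set(year, month - 1, day)  // Calendar months are 0-based
    Math.floorDiv(cal.getTimeInMillis, 86400000L)
  }

  def main(args: Array[String]): Unit = {
    val written = hybridEpochDay(1001, 1, 1)             // day count stored with rebasing in write
    val proleptic = LocalDate.of(1001, 1, 1).toEpochDay  // day count the reader should produce
    println(written - proleptic)            // 6
    println(LocalDate.ofEpochDay(written))  // 1001-01-07, the value shown before the fix
  }
}
```

Reading the stored day count without rebasing interprets it in the Proleptic Gregorian calendar, which is six days ahead of the Julian calendar for this date.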

### Does this PR introduce _any_ user-facing change?
Yes. After the changes:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```

### How was this patch tested?
Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to check reading of dictionary-encoded dates.

Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-10 13:31:26 +09:00
..
benchmarks [SPARK-31630][SQL] Fix perf regression by skipping timestamps rebasing after some threshold 2020-05-05 14:11:53 +00:00
src [SPARK-31662][SQL] Fix loading of dates before 1582-10-15 from dictionary encoded Parquet columns 2020-05-10 13:31:26 +09:00
v1.2/src [SPARK-31489][SQL] Fix pushing down filters with java.time.LocalDate values in ORC 2020-04-26 15:49:00 -07:00
v2.3/src [SPARK-31489][SQL] Fix pushing down filters with java.time.LocalDate values in ORC 2020-04-26 15:49:00 -07:00
pom.xml [SPARK-31272][SQL] Support DB2 Kerberos login in JDBC connector 2020-04-22 17:10:30 -07:00