ce63bef1da
### What changes were proposed in this pull request?

Modified the `decodeDictionaryIds()` method of `VectorizedColumnReader` to handle `DateType` specially when the `rebaseDateTime` parameter is true. In that case, decoded days are rebased from the hybrid calendar to the Proleptic Gregorian calendar via `RebaseDateTime.rebaseJulianToGregorianDays()`.

### Why are the changes needed?

This fixes a bug in loading dates before the calendar cutover day from dictionary-encoded columns in Parquet files. The code below forces dictionary encoding:
```scala
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", true)
Seq.tabulate(8)(_ => "1001-01-01").toDF("dateS")
  .select($"dateS".cast("date").as("date")).repartition(1)
  .write
  .option("parquet.enable.dictionary", true)
  .parquet(path)
```

Loading the dates back:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-07|
...
|1001-01-07|
+----------+
```

The expected value **must be 1001-01-01**, not 1001-01-07.

### Does this PR introduce _any_ user-facing change?

Yes. After the changes:
```scala
spark.read.parquet(path).show(false)
+----------+
|date      |
+----------+
|1001-01-01|
...
|1001-01-01|
+----------+
```

### How was this patch tested?

Modified the test `SPARK-31159: rebasing dates in write` in `ParquetIOSuite` to check reading dictionary-encoded dates.

Closes #28479 from MaxGekk/fix-datetime-rebase-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
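The root cause of the 6-day shift can be illustrated with plain JVM APIs, independent of Spark: `java.util.GregorianCalendar` uses the legacy hybrid Julian/Gregorian calendar, while `java.time.LocalDate` is proleptic Gregorian, so the same days-since-epoch value denotes different dates before the 1582 cutover. A minimal sketch (class name is illustrative; the 6-day divergence in the 11th century is standard calendar arithmetic, not taken from the PR):

```java
import java.time.LocalDate;
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.TimeZone;

public class CalendarRebaseDemo {
    static final long MILLIS_PER_DAY = 24L * 60 * 60 * 1000;

    public static void main(String[] args) {
        // Days since 1970-01-01 for 1001-01-01 in the proleptic Gregorian calendar.
        int gregorianDays = (int) LocalDate.of(1001, 1, 1).toEpochDay();

        // Interpret the same day count with the legacy hybrid Julian/Gregorian
        // calendar, as old writers based on java.util APIs did.
        GregorianCalendar hybrid = new GregorianCalendar(TimeZone.getTimeZone("UTC"));
        hybrid.clear();
        hybrid.setTimeInMillis(gregorianDays * MILLIS_PER_DAY);

        // Prints 1000-12-26: the hybrid (Julian) reading of the same instant
        // is 6 days behind, which is exactly the shift seen in the bug.
        System.out.printf("hybrid reads: %04d-%02d-%02d%n",
            hybrid.get(Calendar.YEAR),
            hybrid.get(Calendar.MONTH) + 1,
            hybrid.get(Calendar.DAY_OF_MONTH));
    }
}
```

Reading a day count written under one calendar as if it were the other shifts the date by this divergence, hence the need to rebase each decoded value, including values decoded through the dictionary path.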