[SPARK-31211][SQL] Fix rebasing of 29 February of Julian leap years
### What changes were proposed in this pull request?
In this PR, I propose to fix the rebasing of dates that fall in Julian leap years but in non-leap years of the Proleptic Gregorian calendar. In the Julian calendar, every fourth year is a leap year, with a leap day added to February. In the Proleptic Gregorian calendar, every year exactly divisible by four is a leap year, except for years exactly divisible by 100; those centurial years are leap years only if they are also exactly divisible by 400. As a result, the date **1000-02-29** exists in the Julian calendar but not in the Proleptic Gregorian calendar.
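
A quick sketch of the two leap-year rules described above (the helper names are mine, not from the PR):

```scala
// Julian rule: every fourth year is a leap year.
def isJulianLeapYear(year: Int): Boolean = year % 4 == 0

// Proleptic Gregorian rule: divisible by 4, except centurial years
// that are not also divisible by 400.
def isGregorianLeapYear(year: Int): Boolean =
  year % 4 == 0 && (year % 100 != 0 || year % 400 == 0)

isJulianLeapYear(1000)    // true  -> 1000-02-29 exists in the Julian calendar
isGregorianLeapYear(1000) // false -> 1000-02-29 is invalid in Proleptic Gregorian
```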

I modified `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` in `DateTimeUtils` to pass 1 as the day of the month when forming a `LocalDate` or `LocalDateTime`, and then to add the remaining days via the `plusDays()` method. For example, **1000-02-29** doesn't exist in the Proleptic Gregorian calendar, and `LocalDate.of(1000, 2, 29)` throws an exception. To avoid the issue, I build the date `LocalDate.of(1000, 2, 1)` and add 28 days; `plusDays(28)` produces the next valid date after `1000-02-28`, which is **1000-03-01** (see the sketch below).
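
A minimal sketch of this trick using plain `java.time` (the function name is hypothetical; the actual change lives inside the rebase functions in `DateTimeUtils`):

```scala
import java.time.LocalDate

// Construct the date at day 1 of the month, then add the remaining days.
// An invalid Julian date such as 1000-02-29 then rolls over to the next
// valid Proleptic Gregorian date instead of throwing DateTimeException.
def toValidGregorianDate(year: Int, month: Int, dayOfMonth: Int): LocalDate =
  LocalDate.of(year, month, 1).plusDays(dayOfMonth - 1)

// LocalDate.of(1000, 2, 29) throws java.time.DateTimeException, while:
toValidGregorianDate(1000, 2, 29) // 1000-03-01
```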

### Why are the changes needed?
Before the changes, a `java.time.DateTimeException` is raised while loading the date `1000-02-29` from Parquet files saved by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
20/03/21 03:03:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year
```
The Parquet files were saved via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.mode("overwrite").parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
|      date|
+----------+
|1000-02-29|
+----------+
```
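
For context (not from the PR), the asymmetry can be reproduced with just the JDK: `java.sql.Date.valueOf` uses the legacy hybrid Julian/Gregorian calendar, which is why Spark 2.4.5 accepts this date, while `java.time.LocalDate` uses the Proleptic Gregorian calendar and rejects it:

```scala
scala> java.sql.Date.valueOf("1000-02-29") // hybrid calendar: 1000 is a Julian leap year
res0: java.sql.Date = 1000-02-29

scala> java.time.LocalDate.of(1000, 2, 29) // Proleptic Gregorian: not a leap year
java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year
```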

### Does this PR introduce any user-facing change?
Yes. After the fix, the date **1000-02-29** written by Spark 2.4.5 is read back as the next valid Proleptic Gregorian date:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
|      date|
+----------+
|1000-03-01|
+----------+
```

### How was this patch tested?
Added tests to `DateTimeUtilsSuite`.

Closes #27974 from MaxGekk/julian-date-29-feb.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>