From 1f3bb5175749816be1f0bc793ed5239abf986000 Mon Sep 17 00:00:00 2001
From: Kent Yao
Date: Tue, 25 Aug 2020 13:17:03 +0000
Subject: [PATCH] [SPARK-32683][DOCS][SQL] Fix doc error and add migration guide for datetime pattern F

### What changes were proposed in this pull request?

This PR fixes the doc error and adds a migration guide for datetime pattern `F`.

### Why are the changes needed?

This is a doc bug that we inherited from the JDK: https://bugs.openjdk.java.net/browse/JDK-8169482

The `SimpleDateFormat` (**F: day of week in month**) we used in 2.x and the `DateTimeFormatter` (**F: week-of-month**) we use now both behave in the opposite way to what their Java docs declare. Unfortunately, this also leads to a silent data change in Spark. The `week-of-month` concept actually corresponds to pattern `W` in `DateTimeFormatter`, which is banned in Spark 3.x. If we want to keep pattern `F`, we need to accept the behavior change, document it in a proper migration guide, and fix the doc in Spark.

### Does this PR introduce _any_ user-facing change?

Yes, doc changed.

### How was this patch tested?

Passing the CI doc-generation job.

Closes #29538 from yaooqinn/SPARK-32683.

Authored-by: Kent Yao
Signed-off-by: Wenchen Fan
---
 docs/sql-migration-guide.md      | 2 ++
 docs/sql-ref-datetime-pattern.md | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/docs/sql-migration-guide.md b/docs/sql-migration-guide.md
index d2671f266f..3b66694556 100644
--- a/docs/sql-migration-guide.md
+++ b/docs/sql-migration-guide.md
@@ -191,6 +191,8 @@ license: |
 
   - Since Spark 3.0, when using `EXTRACT` expression to extract the second field from date/timestamp values, the result will be a `DecimalType(8, 6)` value with 2 digits for second part, and 6 digits for the fractional part with microsecond precision. e.g. `extract(second from to_timestamp('2019-09-20 10:10:10.1'))` results `10.100000`. In Spark version 2.4 and earlier, it returns an `IntegerType` value and the result for the former example is `10`.
 
+  - In Spark 3.0, datetime pattern letter `F` is **aligned day of week in month** that represents the concept of the count of days within the period of a week where the weeks are aligned to the start of the month. In Spark version 2.4 and earlier, it is **week of month** that represents the concept of the count of weeks within the month where weeks start on a fixed day-of-week, e.g. `2020-07-30` is 30 days (4 weeks and 2 days) after the first day of the month, so `date_format(date '2020-07-30', 'F')` returns 2 in Spark 3.0, but as a week count in Spark 2.x, it returns 5 because it locates in the 5th week of July 2020, where week one is 2020-07-01 to 07-04.
+
 ### Data Sources
 
   - In Spark version 2.4 and below, when reading a Hive SerDe table with Spark native data sources(parquet/orc), Spark infers the actual file schema and update the table schema in metastore. In Spark 3.0, Spark doesn't infer the schema anymore. This should not cause any problems to end users, but if it does, set `spark.sql.hive.caseSensitiveInferenceMode` to `INFER_AND_SAVE`.
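The flipped `F` semantics described above can be reproduced against the two JDK formatters directly, without Spark. A minimal Scala sketch (the `PatternFDemo` object name is illustrative; the printed values follow the 2-vs-5 example in the migration note):

```scala
import java.text.SimpleDateFormat
import java.time.LocalDate
import java.time.format.DateTimeFormatter

object PatternFDemo extends App {
  // java.time.DateTimeFormatter, the Spark 3.0 code path: 'F' formats the
  // aligned day of week in month. 2020-07-30 is 30 days (4 aligned weeks and
  // 2 days) after the start of the month, so this prints 2.
  println(DateTimeFormatter.ofPattern("F").format(LocalDate.of(2020, 7, 30)))

  // java.text.SimpleDateFormat, the Spark 2.x code path: 'F' effectively
  // yields a week count within the month. 2020-07-30 falls in the 5th such
  // week of July 2020 (week one being 07-01 to 07-04), so this prints 5.
  val legacyDate = new SimpleDateFormat("yyyy-MM-dd").parse("2020-07-30")
  println(new SimpleDateFormat("F").format(legacyDate))
}
```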
diff --git a/docs/sql-ref-datetime-pattern.md b/docs/sql-ref-datetime-pattern.md
index d0299e5a99..4b02cdad36 100644
--- a/docs/sql-ref-datetime-pattern.md
+++ b/docs/sql-ref-datetime-pattern.md
@@ -37,7 +37,7 @@ Spark uses pattern letters in the following table for date and timestamp parsing
 |**d**|day-of-month|number(3)|28|
 |**Q/q**|quarter-of-year|number/text|3; 03; Q3; 3rd quarter|
 |**E**|day-of-week|text|Tue; Tuesday|
-|**F**|week-of-month|number(1)|3|
+|**F**|aligned day of week in month|number(1)|3|
 |**a**|am-pm-of-day|am-pm|PM|
 |**h**|clock-hour-of-am-pm (1-12)|number(2)|12|
 |**K**|hour-of-am-pm (0-11)|number(2)|0|
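The renamed table entry can also be checked end to end; a minimal `spark-shell` sketch, assuming Spark 3.0+ where `spark` is the session the shell provides:

```scala
// Run in spark-shell on Spark 3.0+; `spark` is the SparkSession provided by
// the shell. Pattern 'F' now produces the aligned day of week in month.
val f = spark.sql("SELECT date_format(date '2020-07-30', 'F')").first().getString(0)
println(f) // "2": the 2nd day of the month-aligned week containing 2020-07-30
```

Per the migration note above, Spark 2.x returns `5` for the same expression.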