[SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10
### What changes were proposed in this pull request?

As mentioned in https://github.com/apache/spark/pull/28673 and suggested by cloud-fan at https://github.com/apache/spark/pull/28673#discussion_r432817075, this PR disables datetime patterns of the form `y..y` and `Y..Y` whose length is greater than 10, to avoid a JDK bug described below. The new datetime formatter introduces a silent data change:

```sql
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
00000001970-01-01
spark-sql>
```

For patterns that use `SignStyle.EXCEEDS_PAD`, e.g. `y..y` (len >= 4), `NumberPrinterParser` formats them as follows:

```java
switch (signStyle) {
    case EXCEEDS_PAD:
        if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
            buf.append(decimalStyle.getPositiveSign());
        }
        break;
    ...
```

Here `minWidth == len(y..y)`, and `EXCEED_POINTS` is:

```java
/**
 * Array of 10 to the power of n.
 */
static final long[] EXCEED_POINTS = new long[] {
    0L,
    10L,
    100L,
    1000L,
    10000L,
    100000L,
    1000000L,
    10000000L,
    100000000L,
    1000000000L,
    10000000000L,
};
```

So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` is raised. On the caller side, for `from_unixtime` the exception is suppressed and a silent data change occurs; for `date_format`, the `ArrayIndexOutOfBoundsException` propagates to the user.

### Why are the changes needed?

Fix a silent data change.

### Does this PR introduce _any_ user-facing change?

Yes. A `SparkUpgradeException` is raised instead of a `null` result when the pattern contains 11 or more consecutive 'y' or 'Y' letters.

### How was this patch tested?

New tests.

Closes #28684 from yaooqinn/SPARK-31867-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
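The JDK behavior described above can be reproduced with plain `java.time`, with no Spark involved. A minimal sketch (the class name is illustrative), formatting a date with 10 and then 11 'y' letters:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class YearPatternOverflow {
    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2018, 11, 17);

        // 10 'y's: minWidth == 10 is still a valid index into the 11-element
        // EXCEED_POINTS array, so the year is simply zero-padded to width 10.
        System.out.println(DateTimeFormatter.ofPattern("yyyyyyyyyy").format(date));
        // 0000002018

        // 11 'y's: NumberPrinterParser evaluates EXCEED_POINTS[11] and throws.
        try {
            DateTimeFormatter.ofPattern("yyyyyyyyyyy").format(date);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("ArrayIndexOutOfBoundsException");
        }
    }
}
```

`from_unixtime` swallows this unchecked exception and returns `NULL` — the silent data change — while `date_format` lets it propagate; the patch rejects such patterns up front instead.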
parent 1b780f364b
commit 547c5bf552
@@ -74,7 +74,7 @@ The count of pattern letters determines the format.
 For formatting, the fraction length would be padded to the number of contiguous 'S' with zeros.
 Spark supports datetime of micro-of-second precision, which has up to 6 significant digits, but can parse nano-of-second with exceeded part truncated.
 
-- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when 'G' is not present.
+- Year: The count of letters determines the minimum field width below which padding is used. If the count of letters is two, then a reduced two digit form is used. For printing, this outputs the rightmost two digits. For parsing, this will parse using the base value of 2000, resulting in a year within the range 2000 to 2099 inclusive. If the count of letters is less than four (but not two), then the sign is only output for negative years. Otherwise, the sign is output if the pad width is exceeded when 'G' is not present. 11 or more letters will fail.
 
 - Month: It follows the rule of Number/Text. The text form is depend on letters - 'M' denotes the 'standard' form, and 'L' is for 'stand-alone' form. These two forms are different only in some certain languages. For example, in Russian, 'Июль' is the stand-alone form of July, and 'Июля' is the standard form. Here are examples for all supported pattern letters:
 - `'M'` or `'L'`: Month number in a year starting from 1. There is no difference between 'M' and 'L'. Month from 1 to 9 are printed without padding.
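The two-digit reduced-year behavior documented in the hunk above comes straight from `java.time`; a small sketch (the class name is illustrative):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class ReducedYearDemo {
    public static void main(String[] args) {
        DateTimeFormatter f = DateTimeFormatter.ofPattern("yy-MM-dd");

        // Printing with 'yy' outputs the rightmost two digits of the year.
        System.out.println(f.format(LocalDate.of(2018, 11, 17)));  // 18-11-17

        // Parsing with 'yy' uses the base value 2000, so every parsed year
        // lands in the range 2000 to 2099 inclusive.
        System.out.println(LocalDate.parse("99-01-01", f));        // 2099-01-01
    }
}
```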
@@ -227,8 +227,16 @@ private object DateTimeFormatterHelper {
     formatter.format(LocalDate.of(2000, 1, 1)) == "1 1"
   }
   final val unsupportedLetters = Set('A', 'c', 'e', 'n', 'N', 'p')
-  final val unsupportedNarrowTextStyle =
-    Seq("G", "M", "L", "E", "u", "Q", "q").map(_ * 5).toSet
+  final val unsupportedPatternLengths = {
+    // SPARK-31771: Disable Narrow-form TextStyle to avoid silent data change, as it is Full-form in
+    // 2.4
+    Seq("G", "M", "L", "E", "u", "Q", "q").map(_ * 5) ++
+      // SPARK-31867: Disable year patterns longer than 10, which make the Java time library throw
+      // an unchecked `ArrayIndexOutOfBoundsException` from `NumberPrinterParser` when formatting.
+      // That makes the call side difficult to handle, and the suppressed exceptions easily lead
+      // to silent data change.
+      Seq("y", "Y").map(_ * 11)
+  }.toSet

   /**
    * In Spark 3.0, we switch to the Proleptic Gregorian calendar and use DateTimeFormatter for
@@ -250,7 +258,7 @@ private object DateTimeFormatterHelper {
     for (c <- patternPart if unsupportedLetters.contains(c)) {
       throw new IllegalArgumentException(s"Illegal pattern character: $c")
     }
-    for (style <- unsupportedNarrowTextStyle if patternPart.contains(style)) {
+    for (style <- unsupportedPatternLengths if patternPart.contains(style)) {
       throw new IllegalArgumentException(s"Too many pattern letters: ${style.head}")
     }
     if (bugInStandAloneForm && (patternPart.contains("LLL") || patternPart.contains("qqq"))) {
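The rejection logic in the two hunks above can be sketched in plain Java (the diff itself is Scala; the list mirrors `unsupportedPatternLengths` from the diff, but the class and method names here are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class PatternLengthCheck {
    // Five-letter narrow text styles (SPARK-31771) plus 11-letter year fields (SPARK-31867).
    static final List<String> UNSUPPORTED_PATTERN_LENGTHS = new ArrayList<>();
    static {
        for (String letter : new String[]{"G", "M", "L", "E", "u", "Q", "q"}) {
            UNSUPPORTED_PATTERN_LENGTHS.add(letter.repeat(5));
        }
        for (String letter : new String[]{"y", "Y"}) {
            UNSUPPORTED_PATTERN_LENGTHS.add(letter.repeat(11));
        }
    }

    // Fails fast at pattern-conversion time, instead of letting java.time throw an
    // unchecked exception (or silently misbehave) later during formatting.
    static void check(String patternPart) {
        for (String style : UNSUPPORTED_PATTERN_LENGTHS) {
            // Any run of 11 or more 'y'/'Y' letters contains the 11-letter substring.
            if (patternPart.contains(style)) {
                throw new IllegalArgumentException(
                    "Too many pattern letters: " + style.charAt(0));
            }
        }
    }

    public static void main(String[] args) {
        check("yyyyyyyyyy-MM-dd");       // 10 'y's: accepted
        try {
            check("yyyyyyyyyyy-MM-dd");  // 11 'y's: rejected
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());  // Too many pattern letters: y
        }
    }
}
```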
@@ -40,7 +40,7 @@ class DateTimeFormatterHelperSuite extends SparkFunSuite {
       val e = intercept[IllegalArgumentException](convertIncompatiblePattern(s"yyyy-MM-dd $l G"))
       assert(e.getMessage === s"Illegal pattern character: $l")
     }
-    unsupportedNarrowTextStyle.foreach { style =>
+    unsupportedPatternLengths.foreach { style =>
       val e1 = intercept[IllegalArgumentException] {
         convertIncompatiblePattern(s"yyyy-MM-dd $style")
       }
@@ -406,8 +406,9 @@ class TimestampFormatterSuite extends SparkFunSuite with SQLHelper with Matchers
      intercept[IllegalArgumentException](TimestampFormatter(pattern, UTC).format(0))
    }
    // supported by the legacy one, then we will suggest users with SparkUpgradeException
-    Seq("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa").foreach { pattern =>
-      intercept[SparkUpgradeException](TimestampFormatter(pattern, UTC).format(0))
+    Seq("GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa", "y" * 11, "Y" * 11)
+      .foreach { pattern =>
+        intercept[SparkUpgradeException](TimestampFormatter(pattern, UTC).format(0))
+      }
    }
  }
@@ -160,3 +160,7 @@ select from_json('{"time":"26/October/2015"}', 'time Timestamp', map('timestampF
 select from_json('{"date":"26/October/2015"}', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy'));
 select from_csv('26/October/2015', 'time Timestamp', map('timestampFormat', 'dd/MMMMM/yyyy'));
 select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy'));
+
+select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
+select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss');
+select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd');
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 116
+-- Number of queries: 119


 -- !query
@@ -999,3 +999,29 @@ struct<>
 -- !query output
 org.apache.spark.SparkUpgradeException
 You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select from_unixtime(1, 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss')
+-- !query schema
+struct<date_format(TIMESTAMP '2018-11-17 13:33:33', yyyyyyyyyy-MM-dd HH:mm:ss):string>
+-- !query output
+0000002018-11-17 13:33:33
+
+
+-- !query
+select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 116
+-- Number of queries: 119


 -- !query
@@ -956,3 +956,27 @@ select from_csv('26/October/2015', 'date Date', map('dateFormat', 'dd/MMMMM/yyyy
 struct<from_csv(26/October/2015):struct<date:date>>
 -- !query output
 {"date":2015-10-26}
+
+
+-- !query
+select from_unixtime(1, 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<from_unixtime(CAST(1 AS BIGINT), yyyyyyyyyyy-MM-dd):string>
+-- !query output
+00000001969-12-31
+
+
+-- !query
+select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss')
+-- !query schema
+struct<date_format(TIMESTAMP '2018-11-17 13:33:33', yyyyyyyyyy-MM-dd HH:mm:ss):string>
+-- !query output
+0000002018-11-17 13:33:33
+
+
+-- !query
+select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<date_format(CAST(DATE '2018-11-17' AS TIMESTAMP), yyyyyyyyyyy-MM-dd):string>
+-- !query output
+00000002018-11-17
@@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
--- Number of queries: 116
+-- Number of queries: 119


 -- !query
@@ -971,3 +971,29 @@ struct<>
 -- !query output
 org.apache.spark.SparkUpgradeException
 You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'dd/MMMMM/yyyy' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select from_unixtime(1, 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
+
+
+-- !query
+select date_format(timestamp '2018-11-17 13:33:33', 'yyyyyyyyyy-MM-dd HH:mm:ss')
+-- !query schema
+struct<date_format(TIMESTAMP '2018-11-17 13:33:33', yyyyyyyyyy-MM-dd HH:mm:ss):string>
+-- !query output
+0000002018-11-17 13:33:33
+
+
+-- !query
+select date_format(date '2018-11-17', 'yyyyyyyyyyy-MM-dd')
+-- !query schema
+struct<>
+-- !query output
+org.apache.spark.SparkUpgradeException
+You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-dd' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html