spark-instrumented-optimizer/python/pyspark/sql/tests
yi.wu ff39c9271c [SPARK-30252][SQL] Disallow negative scale of Decimal
### What changes were proposed in this pull request?

This PR proposes to disallow negative `scale` of `Decimal` in Spark. It brings two behavior changes:

1) for literals like `1.23E4BD` or `1.23E4` (with `spark.sql.legacy.exponentLiteralAsDecimal.enabled=true`, see [SPARK-29956](https://issues.apache.org/jira/browse/SPARK-29956)), the `(precision, scale)` is now set to (5, 0) rather than (3, -2);
2) a negative-`scale` check is added inside any decimal method that allows setting `scale` explicitly; if the check fails, an `AnalysisException` is thrown (see the sketch right after this list).
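
For illustration, here is a minimal spark-shell-style sketch (not taken from the PR) of the two changes above; the printed schema and the exact exception message are expectations, not captured output:

```
import org.apache.spark.sql.{AnalysisException, SparkSession}
import org.apache.spark.sql.types.DecimalType

val spark = SparkSession.builder().master("local[1]").getOrCreate()

// 1) Exponent literals no longer carry a negative scale:
//    1.23E4BD is expected to be typed as decimal(5, 0) instead of decimal(3, -2).
spark.sql("SELECT 1.23E4BD AS x").printSchema()

// 2) Explicitly requesting a negative scale is expected to be rejected
//    (assuming the check fires when the DecimalType is constructed).
try {
  DecimalType(2, -1)
} catch {
  case e: AnalysisException => println(s"rejected: ${e.getMessage}")
}
```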

Users can still set `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=true` to restore the previous behavior.
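
A sketch of opting back in, assuming the legacy flag is simply set when the session is built (whether it can also be toggled at runtime is not claimed here):

```
import org.apache.spark.sql.SparkSession

// Sketch: restore the previous behavior via the legacy config named above.
val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.sql.legacy.allowNegativeScaleOfDecimal.enabled", "true")
  .getOrCreate()

// With the legacy flag on, the old typing should apply again, e.g. 1.23E4BD as decimal(3, -2).
spark.sql("SELECT 1.23E4BD").printSchema()
```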

### Why are the changes needed?

According to the SQL standard,
> 4.4.2 Characteristics of numbers
> An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer.

the scale of a Decimal should always be non-negative. Other mainstream databases, such as Presto and PostgreSQL, also disallow a negative scale.

Presto:
```
presto:default> create table t (i decimal(2, -1));
Query 20191213_081238_00017_i448h failed: line 1:30: mismatched input '-'. Expecting: <integer>, <type>
create table t (i decimal(2, -1))
```

PostgreSQL:
```
postgres=# create table t(i decimal(2, -1));
ERROR:  NUMERIC scale -1 must be between 0 and precision 2
LINE 1: create table t(i decimal(2, -1));
                         ^
```

In fact, Spark itself already disallows creating a table with a negative-scale decimal type using SQL:
```
scala> spark.sql("create table t(i decimal(2, -1))");
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'create table t(i decimal(2, -'(line 1, pos 28)

== SQL ==
create table t(i decimal(2, -1))
----------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 35 elided
```

However, it is still possible to create such a table or `DataFrame` using the Spark SQL programming API:
```
scala> import org.apache.spark.sql.catalyst.TableIdentifier
scala> import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
scala> import org.apache.spark.sql.types.{DecimalType, StructField, StructType}
scala> val tb = CatalogTable(
         TableIdentifier("test", None),
         CatalogTableType.MANAGED,
         CatalogStorageFormat.empty,
         StructType(StructField("i", DecimalType(2, -1)) :: Nil))
```
```
scala> spark.sql("SELECT 1.23E4BD")
res2: org.apache.spark.sql.DataFrame = [1.23E+4: decimal(3,-2)]
```
These two inconsistent behaviors could confuse users.

On the other hand, even if a user creates such a table or `DataFrame` with a negative-scale decimal type, the data can't be written out using formats like `parquet` or `orc`, because these formats have their own checks for negative scale and fail on it.
```
scala> spark.sql("SELECT 1.23E4BD").write.saveAsTable("parquet")
19/12/13 17:37:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Invalid DECIMAL scale: -2
	at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.decimalMetadata(Types.java:495)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:403)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:309)
	at org.apache.parquet.schema.Types$Builder.named(Types.java:290)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:428)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:334)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$2(ParquetSchemaConverter.scala:326)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convert(ParquetSchemaConverter.scala:326)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:97)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

So I think it would be better to disallow negative scale entirely and make the behaviors above consistent.

### Does this PR introduce any user-facing change?

Yes. With `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false`, users can no longer create a Decimal value with a negative scale.

### How was this patch tested?

Added new tests in `ExpressionParserSuite` and `DecimalSuite`;
Updated `SQLQueryTestSuite`.
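
As a rough illustration (hypothetical, not the actual test code), this is the kind of check such tests can add, assuming the negative-scale rejection surfaces as an `AnalysisException` and using Spark's internal `SparkFunSuite` base class:

```
import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.AnalysisException
import org.apache.spark.sql.types.DecimalType

// Hypothetical suite, not copied from the PR: with the legacy config disabled,
// constructing a DecimalType with a negative scale should fail analysis.
class NegativeScaleSketchSuite extends SparkFunSuite {
  test("negative scale of Decimal is disallowed") {
    intercept[AnalysisException] {
      DecimalType(2, -1)
    }
  }
}
```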

Closes #26881 from Ngone51/nonnegative-scale.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 21:09:48 +08:00
| File | Last commit | Date |
|------|-------------|------|
| `__init__.py` | [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files | 2018-11-14 14:51:11 +08:00 |
| `test_arrow.py` | [SPARK-30499][SQL] Remove SQL config spark.sql.execution.pandas.respectSessionTimeZone | 2020-01-17 11:44:49 +09:00 |
| `test_catalog.py` | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| `test_column.py` | [SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala | 2019-11-01 12:25:48 +09:00 |
| `test_conf.py` | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| `test_context.py` | [SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3 | 2019-09-09 10:19:40 -05:00 |
| `test_dataframe.py` | [SPARK-29188][PYTHON][FOLLOW-UP] Explicitly disable Arrow execution for all test of toPandas empty types | 2020-01-17 15:00:18 +09:00 |
| `test_datasources.py` | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| `test_functions.py` | [SPARK-28153][PYTHON] Use AtomicReference at InputFileBlockHolder (to support input_file_name with Python UDF) | 2019-07-31 22:40:01 +08:00 |
| `test_group.py` | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| `test_pandas_udf.py` | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| `test_pandas_udf_cogrouped_map.py` | [SPARK-27463][PYTHON][FOLLOW-UP] Miscellaneous documentation and code cleanup of cogroup pandas UDF | 2019-09-30 22:25:35 +09:00 |
| `test_pandas_udf_grouped_agg.py` | [SPARK-30188][SQL] Resolve the failed unit tests when enable AQE | 2020-01-13 22:55:19 +08:00 |
| `test_pandas_udf_grouped_map.py` | [SPARK-29402][PYTHON][TESTS] Added tests for grouped map pandas_udf with window | 2019-10-11 16:19:13 -07:00 |
| `test_pandas_udf_iter.py` | [SPARK-28198][PYTHON][FOLLOW-UP] Rename mapPartitionsInPandas to mapInPandas with a separate evaluation type | 2019-07-05 09:22:41 +09:00 |
| `test_pandas_udf_scalar.py` | [SPARK-30499][SQL] Remove SQL config spark.sql.execution.pandas.respectSessionTimeZone | 2020-01-17 11:44:49 +09:00 |
| `test_pandas_udf_window.py` | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| `test_readwriter.py` | [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored | 2019-07-18 13:37:59 +09:00 |
| `test_serde.py` | [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type | 2019-09-12 08:52:25 +09:00 |
| `test_session.py` | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| `test_streaming.py` | [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark | 2019-06-24 09:58:17 +09:00 |
| `test_types.py` | [SPARK-30252][SQL] Disallow negative scale of Decimal | 2020-01-21 21:09:48 +08:00 |
| `test_udf.py` | [SPARK-28978][ ] Support > 256 args to python udf | 2019-11-08 19:19:14 -08:00 |
| `test_utils.py` | [SPARK-19926][PYSPARK] make captured exception from JVM side user friendly | 2019-09-18 23:32:10 +09:00 |