spark-instrumented-optimizer/sql/core
Dongjoon Hyun dbf051c50a [SPARK-34212][SQL] Fix incorrect decimal reading from Parquet files
### What changes were proposed in this pull request?

This PR aims to fix the correctness issues that occur when reading decimal values from Parquet files.
- For the **MR** code path, `ParquetRowConverter` can read Parquet's decimal values with the original precision and scale written in the corresponding footer.
- For the **Vectorized** code path, `VectorizedColumnReader` throws `SchemaColumnConvertNotSupportedException`.

### Why are the changes needed?

Currently, Spark returns incorrect results when the Parquet file's decimal precision and scale differ from those in Spark's schema. This happens when there are multiple files with different decimal schemas or when HiveMetastore has a new schema.

**BEFORE (Simplified example for correctness)**

```scala
scala> sql("SELECT 1.0 a").write.parquet("/tmp/decimal")
scala> spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show
+----+
|   a|
+----+
|0.10|
+----+
```

This works correctly in the other data sources, `ORC/JSON/CSV`, like the following.
```scala
scala> sql("SELECT 1.0 a").write.orc("/tmp/decimal_orc")
scala> spark.read.schema("a DECIMAL(3,2)").orc("/tmp/decimal_orc").show
+----+
|   a|
+----+
|1.00|
+----+
```
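
The `0.10` in the Parquet case is consistent with the stored unscaled integer being reinterpreted with the reader's scale instead of the writer's. The following is a minimal sketch in plain Scala, not Spark's actual reader code. It assumes that `SELECT 1.0` is typed as `DECIMAL(2,1)`, so the file holds the unscaled value `10`; the variable names are hypothetical.

```scala
import java.math.{BigDecimal => JBigDecimal, BigInteger}

// Assumption: 1.0 is written as DECIMAL(2,1), i.e. unscaled value 10 with scale 1.
val unscaled = BigInteger.valueOf(10L)
val writtenScale = 1    // scale recorded in the Parquet footer
val requestedScale = 2  // scale of the user-specified schema DECIMAL(3,2)

// Reinterpreting the unscaled value with the requested scale changes the number.
val misread = new JBigDecimal(unscaled, requestedScale)

// Decoding with the written scale first, then rescaling, preserves the number.
val decoded = new JBigDecimal(unscaled, writtenScale)
val rescaled = decoded.setScale(requestedScale)

println(misread)   // 0.10
println(rescaled)  // 1.00
```

Decoding with the footer's precision and scale before converting to the requested type is what the MR code path change described above relies on.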

**AFTER**
1. **Vectorized** path: Instead of returning an incorrect result, Spark raises an explicit exception.
```scala
scala> spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show
java.lang.UnsupportedOperationException: Schema evolution not supported.
```

2. **MR** path (complex schema or explicit configuration): Spark returns correct results.
```scala
scala> spark.read.schema("a DECIMAL(3,2), b DECIMAL(18, 3), c MAP<INT,INT>").parquet("/tmp/decimal").show
+----+-------+--------+
|   a|      b|       c|
+----+-------+--------+
|1.00|100.000|{1 -> 2}|
+----+-------+--------+

scala> spark.read.schema("a DECIMAL(3,2), b DECIMAL(18, 3), c MAP<INT,INT>").parquet("/tmp/decimal").printSchema
root
 |-- a: decimal(3,2) (nullable = true)
 |-- b: decimal(18,3) (nullable = true)
 |-- c: map (nullable = true)
 |    |-- key: integer
 |    |-- value: integer (valueContainsNull = true)
```
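
The "explicit configuration" route to the MR path can be sketched as follows, assuming that it refers to turning off the vectorized Parquet reader through `spark.sql.parquet.enableVectorizedReader`. The source does not name the setting, so treat this as an illustration rather than the PR's own reproduction steps.

```scala
// Assumption: disabling the vectorized reader sends this read through the MR path,
// even though the schema contains only a simple decimal column.
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
spark.read.schema("a DECIMAL(3,2)").parquet("/tmp/decimal").show()
```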

### Does this PR introduce _any_ user-facing change?

Yes. This fixes the correctness issue.

### How was this patch tested?

Pass with the newly added test case.

Closes #31319 from dongjoon-hyun/SPARK-34212.

Lead-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-01-26 15:13:39 -08:00
benchmarks [SPARK-34192][SQL] Move char padding to write side and remove length check on read side too 2021-01-26 02:08:35 +08:00
src [SPARK-34212][SQL] Fix incorrect decimal reading from Parquet files 2021-01-26 15:13:39 -08:00
pom.xml [SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT 2020-12-04 14:10:42 -08:00