spark-instrumented-optimizer/sql/core
Fu Chen 09bebc8bde [SPARK-35912][SQL] Fix nullability of spark.read.json/spark.read.csv
### What changes were proposed in this pull request?

Rework of [PR](https://github.com/apache/spark/pull/33212) with the review suggestions applied.

This PR makes `spark.read.json()` behave the same as the Datasource API `spark.read.format("json").load("path")`. By default, Spark should turn a non-nullable schema into a nullable one when the `spark.read.json()` API is used.

Here is an example:

```scala
  import org.apache.spark.sql.types._
  import spark.implicits._ // needed for .toDS

  val schema = StructType(Seq(StructField("value",
    StructType(Seq(
      StructField("x", IntegerType, nullable = false),
      StructField("y", IntegerType, nullable = false)
    )),
    nullable = true
  )))

  val testDS = Seq("""{"value":{"x":1}}""").toDS
  spark.read
    .schema(schema)
    .json(testDS)
    .printSchema()

  spark.read
    .schema(schema)
    .format("json")
    .load("/tmp/json/t1")
    .printSchema()
  // output of spark.read.format("json").load()
  // root
  //  |-- value: struct (nullable = true)
  //  |    |-- x: integer (nullable = true)
  //  |    |-- y: integer (nullable = true)
```

Before this PR:
```
// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = false)
 |    |-- y: integer (nullable = false)
```

After this PR:
```
// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = true)
 |    |-- y: integer (nullable = true)
```

- `spark.read.csv()` has the same problem.
- The Datasource API `spark.read.format("json").load("path")` already applies this logic when it resolves the relation.

c77acf0bbc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L415-L421)
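
The relaxation described above can be illustrated with a small sketch. It assumes only the public `org.apache.spark.sql.types` classes; `NullableSchemaSketch` and `toNullable` are hypothetical names introduced here for illustration and are not the code this PR adds or the code Spark runs internally.

```scala
import org.apache.spark.sql.types._

object NullableSchemaSketch {
  // Hypothetical helper (not Spark's internal implementation): recursively
  // mark every struct field, array element, and map value as nullable.
  def toNullable(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map { f =>
        f.copy(dataType = toNullable(f.dataType), nullable = true)
      })
    case ArrayType(elementType, _) =>
      ArrayType(toNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(toNullable(keyType), toNullable(valueType), valueContainsNull = true)
    case other =>
      other // atomic types carry no nullability flag of their own
  }

  def main(args: Array[String]): Unit = {
    // Same schema as in the example above.
    val schema = StructType(Seq(StructField("value",
      StructType(Seq(
        StructField("x", IntegerType, nullable = false),
        StructField("y", IntegerType, nullable = false)
      )),
      nullable = true
    )))
    // Prints the schema tree with every field relaxed to nullable.
    println(toNullable(schema).asInstanceOf[StructType].treeString)
  }
}
```

The recursion matters because the non-nullable fields in the example sit inside a nested struct; relaxing only the top-level fields would leave `x` and `y` untouched.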

### Does this PR introduce _any_ user-facing change?

Yes. `spark.read.json()` and `spark.read.csv()` no longer respect the nullability of the user-given schema; by default they always turn it into a nullable schema.
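
For callers who want to know which parts of a schema this change affects, a minimal sketch follows. It lists the struct fields declared `nullable = false`, that is, the fields whose declared nullability will no longer be preserved. `NonNullableFieldsSketch` and `nonNullablePaths` are hypothetical names used only for illustration, and the helper deliberately ignores `containsNull` on arrays and `valueContainsNull` on maps.

```scala
import org.apache.spark.sql.types._

object NonNullableFieldsSketch {
  // Hypothetical helper: collect dotted paths of struct fields that are
  // declared nullable = false, descending into structs, arrays, and map values.
  def nonNullablePaths(dt: DataType, prefix: String = ""): Seq[String] = dt match {
    case StructType(fields) =>
      fields.toSeq.flatMap { f =>
        val path = if (prefix.isEmpty) f.name else s"$prefix.${f.name}"
        val here = if (f.nullable) Seq.empty[String] else Seq(path)
        here ++ nonNullablePaths(f.dataType, path)
      }
    case ArrayType(elementType, _) => nonNullablePaths(elementType, prefix)
    case MapType(_, valueType, _)  => nonNullablePaths(valueType, prefix)
    case _                         => Seq.empty
  }

  def main(args: Array[String]): Unit = {
    // Same schema as in the example above.
    val schema = StructType(Seq(StructField("value",
      StructType(Seq(
        StructField("x", IntegerType, nullable = false),
        StructField("y", IntegerType, nullable = false)
      )),
      nullable = true
    )))
    nonNullablePaths(schema).foreach(println)
  }
}
```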

### How was this patch tested?

New test.

Closes #33436 from cfmcgrady/SPARK-35912-v3.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 11:12:36 +09:00
..
benchmarks [SPARK-34981][SQL][FOLLOWUP] Use SpecificInternalRow in ApplyFunctionExpression 2021-05-24 17:25:24 +09:00
src [SPARK-35912][SQL] Fix nullability of spark.read.json/spark.read.csv 2021-07-22 11:12:36 +09:00
pom.xml [SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT 2021-07-02 13:47:36 -07:00