spark-instrumented-optimizer

Fu Chen 09bebc8bde [SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`	2021-07-22 11:12:36 +09:00

### What changes were proposed in this pull request?

Rework of [PR](https://github.com/apache/spark/pull/33212) with suggestions applied.

This PR makes `spark.read.json()` behave the same as the Datasource API `spark.read.format("json").load("path")`. By default, Spark should turn a non-nullable schema into a nullable one when `spark.read.json()` is used.

Here is an example:

```scala
val schema = StructType(Seq(
  StructField("value",
    StructType(Seq(
      StructField("x", IntegerType, nullable = false),
      StructField("y", IntegerType, nullable = false)
    )),
    nullable = true
  )))

val testDS = Seq("""{"value":{"x":1}}""").toDS
spark.read
  .schema(schema)
  .json(testDS)
  .printSchema()

spark.read
  .schema(schema)
  .format("json")
  .load("/tmp/json/t1")
  .printSchema()
// root
//  |-- value: struct (nullable = true)
//  |    |-- x: integer (nullable = true)
//  |    |-- y: integer (nullable = true)
```

Before this PR:

```
// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = false)
 |    |-- y: integer (nullable = false)
```

After this PR:

```
// output of spark.read.json()
root
 |-- value: struct (nullable = true)
 |    |-- x: integer (nullable = true)
 |    |-- y: integer (nullable = true)
```

- `spark.read.csv()` has the same problem.
- The Datasource API `spark.read.format("json").load("path")` applies this logic when resolving the relation: `c77acf0bbc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L415-L421)`

### Does this PR introduce _any_ user-facing change?

Yes. By default, `spark.read.json()` and `spark.read.csv()` no longer respect the nullability of the user-given schema and always turn it into a nullable schema.

### How was this patch tested?

New test.

Closes #33436 from cfmcgrady/SPARK-35912-v3.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
..
benchmarks	[SPARK-34981][SQL][FOLLOWUP] Use SpecificInternalRow in ApplyFunctionExpression	2021-05-24 17:25:24 +09:00
src	[SPARK-35912][SQL] Fix nullability of `spark.read.json/spark.read.csv`	2021-07-22 11:12:36 +09:00
pom.xml	[SPARK-35996][BUILD] Setting version to 3.3.0-SNAPSHOT	2021-07-02 13:47:36 -07:00
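
The nullability conversion described in the latest commit message can be illustrated without a Spark dependency. The following is a minimal sketch, assuming simplified stand-in types: `NullableSchemaSketch`, its `DataType`, `StructField`, and `StructType` definitions, and the `asNullable` helper here are hypothetical models written for this example, not the real `org.apache.spark.sql.types` classes or Spark's actual implementation. It only shows the idea of recursively marking every field, including nested struct fields, as nullable.

```scala
object NullableSchemaSketch {
  // Hypothetical, simplified stand-ins for Spark's schema types.
  sealed trait DataType
  case object IntegerType extends DataType
  final case class StructField(name: String, dataType: DataType, nullable: Boolean)
  final case class StructType(fields: Seq[StructField]) extends DataType

  // Recursively mark every field, including nested struct fields, as nullable.
  // Non-struct types carry no nullability of their own and are returned unchanged.
  def asNullable(dataType: DataType): DataType = dataType match {
    case StructType(fields) =>
      StructType(fields.map { f =>
        f.copy(dataType = asNullable(f.dataType), nullable = true)
      })
    case other => other
  }

  // The user-given schema from the commit message: `x` and `y` are declared non-nullable.
  val userSchema: StructType = StructType(Seq(
    StructField("value", StructType(Seq(
      StructField("x", IntegerType, nullable = false),
      StructField("y", IntegerType, nullable = false)
    )), nullable = true)
  ))
}
```

Calling `NullableSchemaSketch.asNullable(NullableSchemaSketch.userSchema)` yields a schema in which `value`, `x`, and `y` are all nullable, which mirrors the "After this PR" output in the commit message.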