spark-instrumented-optimizer

Wenchen Fan 0025ddeb1d [SPARK-22472][SQL] add null check for top-level primitive values	2017-11-09 21:56:20 -08:00

## What changes were proposed in this pull request?

One powerful feature of `Dataset` is that we can easily map SQL rows to Scala/Java objects and get a runtime null check automatically. For example, say we have a parquet file with schema `<a: int, b: string>` and a `case class Data(a: Int, b: String)`. Users can easily read this parquet file into `Data` objects, and Spark will throw an NPE if column `a` has null values.

However, the null check is missing for top-level primitive values. For example, say we have a parquet file with schema `<a: Int>` and we read it into Scala `Int`. If column `a` has null values, we get some weird results.

```
scala> val ds = spark.read.parquet(...).as[Int]
scala> ds.show()
+----+
|v   |
+----+
|null|
|1   |
+----+

scala> ds.collect
res0: Array[Long] = Array(0, 1)

scala> ds.map(_ * 2).show
+-----+
|value|
+-----+
|-2   |
|2    |
+-----+
```

This is because Spark internally uses special default values for primitive types, but never expects users to see or operate on these default values directly.

This PR adds a null check for top-level primitive values.

## How was this patch tested?

new test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #19707 from cloud-fan/bug.
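The difference between silently substituting a default and failing fast on null can be sketched in plain Scala, without Spark. This is only an illustrative model, not the code from the patch: the object `NullCheckSketch`, both helper functions, the placeholder default `0`, and the error message are all hypothetical.

```scala
// Hypothetical model of decoding a nullable column value into a primitive Int.
// It does not use Spark and does not reproduce Spark's internal encoders.
object NullCheckSketch {
  // Unchecked decoding: a null silently becomes a placeholder default
  // (0 is an arbitrary choice for this sketch), so callers cannot tell
  // a real 0 apart from a missing value.
  def decodeUnchecked(raw: java.lang.Integer): Int =
    if (raw == null) 0 else raw.intValue()

  // Checked decoding: a null in a top-level primitive position fails fast
  // instead of leaking a default value to the caller.
  def decodeChecked(raw: java.lang.Integer): Int = {
    if (raw == null) {
      throw new NullPointerException(
        "null value found for a non-nullable top-level primitive")
    }
    raw.intValue()
  }

  def main(args: Array[String]): Unit = {
    val column: Seq[java.lang.Integer] = Seq(null, Integer.valueOf(1))

    // The unchecked path hides the null behind the default value.
    println(column.map(decodeUnchecked))

    // The checked path surfaces the problem as a failure.
    println(scala.util.Try(column.map(decodeChecked)))
  }
}
```

With the checked variant, a null shows up as an explicit error at read time rather than as a plausible-looking number in later computations.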
benchmarks	[SPARK-17335][SQL] Fix ArrayType and MapType CatalogString.	2016-09-03 19:02:20 +02:00
src	[SPARK-22472][SQL] add null check for top-level primitive values	2017-11-09 21:56:20 -08:00
pom.xml	[SPARK-20978][SQL] Bump up Univocity version to 2.5.4	2017-09-05 23:21:43 +08:00