spark-instrumented-optimizer/sql/core
Liang-Chi Hsieh 3030b82c89
[SPARK-25363][SQL] Fix schema pruning in where clause by ignoring unnecessary root fields
## What changes were proposed in this pull request?

Schema pruning doesn't work when a nested column is used in the where clause.

For example,
```
sql("select name.first from contacts where name.first = 'David'")

== Physical Plan ==
*(1) Project [name#19.first AS first#40]
+- *(1) Filter (isnotnull(name#19) && (name#19.first = David))
   +- *(1) FileScan parquet [name#19] Batched: false, Format: Parquet, PartitionFilters: [],
    PushedFilters: [IsNotNull(name)], ReadSchema: struct<name:struct<first:string,middle:string,last:string>>
```

In the query plan above, the scan node reads the entire schema of the `name` column, even though the query only touches `name.first`.

This issue is reported by:
https://github.com/apache/spark/pull/21320#issuecomment-419290197

The cause is that we infer a root field from the expression `IsNotNull(name)`. However, such an expression doesn't actually use the nested fields of this root field, so we can ignore the unnecessary nested fields.
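
The idea can be illustrated with a small, self-contained sketch. This is a toy model under stated assumptions, not Spark's Catalyst code: the types `Expr`, `Attr`, `GetField`, `Lit`, `RootField` and the helper `requiredPaths` are hypothetical names introduced only for this example. A root field that comes from `IsNotNull` on a bare attribute is marked optional, and it is dropped whenever some other expression already reads a nested field under the same root.

```scala
// Toy expression tree. These are NOT Spark's Catalyst classes; all names
// here are hypothetical and exist only to illustrate the pruning rule.
sealed trait Expr
case class Attr(name: String) extends Expr
case class GetField(child: Expr, field: String) extends Expr
case class IsNotNull(child: Expr) extends Expr
case class EqualTo(left: Expr, right: Expr) extends Expr
case class Lit(value: String) extends Expr

// A path into the schema that an expression needs. `optional` marks a
// whole-root reference that does not read any nested field itself.
case class RootField(path: List[String], optional: Boolean)

object SchemaPruningSketch {
  def rootFields(e: Expr): Seq[RootField] = e match {
    // A null check on the bare root reads no nested field: mark it optional.
    case IsNotNull(Attr(n)) => Seq(RootField(List(n), optional = true))
    case IsNotNull(c)       => rootFields(c)
    case EqualTo(l, r)      => rootFields(l) ++ rootFields(r)
    case Attr(n)            => Seq(RootField(List(n), optional = false))
    case GetField(c, f)     => rootFields(c).map(r => RootField(r.path :+ f, optional = false))
    case Lit(_)             => Seq.empty
  }

  // Keep an optional root only if no required field shares its root name.
  def requiredPaths(exprs: Seq[Expr]): Set[List[String]] = {
    val (optional, required) = exprs.flatMap(rootFields).partition(_.optional)
    val requiredRoots = required.map(_.path.head).toSet
    val keptOptional = optional.filterNot(f => requiredRoots.contains(f.path.head))
    (required ++ keptOptional).map(_.path).toSet
  }

  def main(args: Array[String]): Unit = {
    val nameFirst = GetField(Attr("name"), "first")
    // select name.first ... where isnotnull(name) && name.first = 'David'
    val exprs = Seq(nameFirst, IsNotNull(Attr("name")), EqualTo(nameFirst, Lit("David")))
    println(requiredPaths(exprs))
  }
}
```

In this sketch, the example query yields only the path `name.first`, while a query whose sole reference is `IsNotNull(name)` still keeps the whole `name` root, since nothing narrower is requested.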

## How was this patch tested?

Unit tests.

Closes #22357 from viirya/SPARK-25363.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2018-09-12 17:43:40 +00:00
benchmarks [SPARK-25306][SQL] Avoid skewed filter trees to speed up createFilter in ORC 2018-09-05 10:24:13 +08:00
src [SPARK-25363][SQL] Fix schema pruning in where clause by ignoring unnecessary root fields 2018-09-12 17:43:40 +00:00
pom.xml [SPARK-25019][BUILD] Fix orc dependency to use the same exclusion rules 2018-08-06 12:00:39 -07:00