4cc7d9b8f1
### What changes were proposed in this pull request?
Fix the value-skipping logic in the Parquet vectorized reader when column index filtering is in effect, by accounting for nulls and calling `ParquetVectorUpdater.skipValues` only for non-null values.
### Why are the changes needed?
Currently, the Parquet vectorized reader may not read data correctly when column index filtering is in effect and a data page contains null values. For instance, let's say we have two columns `c1: BIGINT` and `c2: STRING`, and the following pages:
```
c1        500       500       500       500
 |---------|---------|---------|---------|
 |-------|-----|-----|---|---|---|---|---|
c2     400   300   300 200 200 200 200 200
```
and suppose we have a query like the following:
```sql
SELECT * FROM t WHERE c1 = 500
```
This creates a Parquet row range `[500, 1000)` which, when applied to `c2`, requires us to skip all the rows in `[400, 500)`, since the `c2` page that overlaps the start of the range begins at row 400. However, the current logic skips rows via `updater.skipValues(n, valueReader)`, which is incorrect because it skips the next `n` non-null values rather than the next `n` rows. When nulls are present among the skipped rows, this consumes too many values from the page.
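
To make the failure mode concrete, here is a minimal Python sketch of the idea. It is not the Spark code: `leading_rows_to_skip` and `read_after_skip` are hypothetical helpers, and the page model is a simplifying assumption (one definition level per row, `1` for non-null and `0` for null, with only non-null values stored in the value stream).

```python
from itertools import accumulate

PAST_END = "<past end of page>"  # marker for reading beyond the value stream


def leading_rows_to_skip(page_sizes, range_start):
    """Rows to drop at the front of the first page that overlaps the row range."""
    page_starts = [0] + list(accumulate(page_sizes))[:-1]
    for start, size in zip(page_starts, page_sizes):
        if start + size > range_start:
            return range_start - start
    raise ValueError("row range starts past the last page")


def read_after_skip(def_levels, values, skip_rows, null_aware):
    """Skip `skip_rows` rows of a page, then decode the remaining rows.

    Assumed model: def_levels[i] == 1 means row i is non-null and owns the
    next entry of `values`; def_levels[i] == 0 means row i is null.
    """
    if null_aware:
        # Only the non-null rows among the skipped ones hold a stored value.
        values_to_skip = sum(def_levels[:skip_rows])
    else:
        # Naive behaviour: treats every skipped row as a stored value.
        values_to_skip = skip_rows
    value_iter = iter(values[values_to_skip:])
    return [
        next(value_iter, PAST_END) if level == 1 else None
        for level in def_levels[skip_rows:]
    ]


# Page sizes of c2 from the diagram above; the row range starts at row 500.
c2_pages = [400, 300, 300, 200, 200, 200, 200, 200]
skip = leading_rows_to_skip(c2_pages, 500)

# A tiny page holding the rows: "a", null, "b", null, "c", "d".
def_levels = [1, 0, 1, 0, 1, 1]
values = ["a", "b", "c", "d"]
fixed = read_after_skip(def_levels, values, 2, null_aware=True)
naive = read_after_skip(def_levels, values, 2, null_aware=False)
```

With the diagram's page sizes, `skip` is 100, matching the rows `[400, 500)`. On the tiny page, the null-aware variant skips two rows but only one stored value, so decoding resumes at `"b"`; the naive variant drops two stored values and resumes at `"c"`, then runs out of values before the page ends.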
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a new test in `ParquetColumnIndexSuite`.
Closes #33330 from sunchao/SPARK-36123-skip-nulls.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit