spark-instrumented-optimizer/sql/core/src/main
Sameer Agarwal a2c9acb0e5 [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error
## What changes were proposed in this pull request?

This patch fixes a bug in the vectorized Parquet reader caused by reusing the same dictionary column vector while reading consecutive row groups. Specifically, the issue manifests for certain distributions of dictionary-encoded and plain-encoded data when the underlying bit-packed dictionary data is read into a column-vector-based data structure.
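
The description above does not spell out the exact failure, so the following is only a toy model of the general hazard: a buffer of dictionary ids that is shared across row groups must not carry stale ids into the next row group, because each row group has its own dictionary. Every name here (`ReusedDictionaryIdsSketch`, `DictionaryIds`, `decodeRowGroup`) is hypothetical and is not taken from Spark's vectorized reader.

```java
import java.util.Arrays;

/** Toy model only; these are NOT Spark's column vector or Parquet reader classes. */
public class ReusedDictionaryIdsSketch {

    /** Hypothetical buffer of dictionary ids that is reused across row groups. */
    static final class DictionaryIds {
        private int[] ids = new int[0];
        private int count = 0;

        /** Prepare for a new row group: grow if needed and forget the old ids. */
        void reserve(int capacity) {
            if (ids.length < capacity) {
                ids = new int[capacity];
            }
            count = 0; // without this reset, ids from the previous row group would leak through
        }

        void append(int id) {
            ids[count++] = id;
        }

        int get(int i) {
            if (i < 0 || i >= count) {
                throw new IndexOutOfBoundsException("no dictionary id at index " + i);
            }
            return ids[i];
        }

        int size() {
            return count;
        }
    }

    /** Decode one row group, resolving each id against that row group's own dictionary. */
    static String[] decodeRowGroup(DictionaryIds buffer, String[] dictionary, int[] encodedIds) {
        buffer.reserve(encodedIds.length);
        for (int id : encodedIds) {
            buffer.append(id);
        }
        String[] decoded = new String[buffer.size()];
        for (int i = 0; i < decoded.length; i++) {
            decoded[i] = dictionary[buffer.get(i)];
        }
        return decoded;
    }

    public static void main(String[] args) {
        DictionaryIds shared = new DictionaryIds(); // one buffer, two consecutive row groups

        String[] first = decodeRowGroup(shared, new String[] {"a", "b", "c"}, new int[] {2, 0, 1, 2});
        String[] second = decodeRowGroup(shared, new String[] {"x", "y"}, new int[] {1, 0});

        System.out.println(Arrays.toString(first));  // [c, a, b, c]
        System.out.println(Arrays.toString(second)); // [y, x]
    }
}
```

In this sketch the second row group has a smaller dictionary than the first, so a stale id of `2` left in the shared buffer would index past the end of the new dictionary. Resetting the buffer in `reserve` is one way to make reuse safe; whether the actual patch takes this approach is not stated in the description.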

## How was this patch tested?

Manually tested on datasets provided by the community. Thanks to Chris Perluss and Keith Kraus for their invaluable help in tracking down this issue!

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14941 from sameeragarwal/parquet-exception-2.
2016-09-02 15:16:16 -07:00
java/org/apache/spark/sql [SPARK-16334] Reusing same dictionary column for decoding consecutive row groups shouldn't throw an error 2016-09-02 15:16:16 -07:00
resources [SPARK-16031] Add debug-only socket source in Structured Streaming 2016-06-19 21:27:04 -07:00
scala/org/apache/spark/sql [SPARK-17230] [SQL] Should not pass optimized query into QueryExecution in DataFrameWriter 2016-09-02 15:10:12 -07:00