071bbad5db
This PR is inspired by #8063 authored by dguy. Especially, testing Parquet files added here are all taken from that PR. **Committer who merges this PR should attribute it to "Damian Guy <damian.guygmail.com>".** ---- SPARK-6776 and SPARK-6777 followed `parquet-avro` to implement backwards-compatibility rules defined in `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`: > This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field. One of the consequences is that, Parquet files generated by `parquet-protobuf` containing unannotated repeated fields are not correctly converted to Catalyst arrays. This PR fixes this issue by 1. Handling unannotated repeated fields in `CatalystSchemaConverter`. 2. Converting this kind of special repeated fields to Catalyst arrays in `CatalystRowConverter`. Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate actual conversion work to a child `elementConverter` and accumulates elements in an `ArrayBuffer`. Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater`. So that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields, and propagate converted array values to upstream. Author: Cheng Lian <lian@databricks.com> Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits: ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite f1c7bfd [Cheng Lian] Updates .rat-excludes 420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists |
||
---|---|---|
.. | ||
avro | ||
gen-java/org/apache/spark/sql/execution/datasources/parquet/test/avro | ||
java/test/org/apache/spark/sql | ||
resources | ||
scala/org/apache/spark/sql | ||
scripts | ||
thrift | ||
README.md |
Notes for Parquet compatibility tests
The following directories and files are used for Parquet compatibility tests:
.
├── README.md # This file
├── avro
│ ├── parquet-compat.avdl # Testing Avro IDL
│ └── parquet-compat.avpr # !! NO TOUCH !! Protocol file generated from parquet-compat.avdl
├── gen-java # !! NO TOUCH !! Generated Java code
├── scripts
│ └── gen-code.sh # Script used to generate Java code for Thrift and Avro
└── thrift
└── parquet-compat.thrift # Testing Thrift schema
Generated Java code are used in the following test suites:
org.apache.spark.sql.parquet.ParquetAvroCompatibilitySuite
org.apache.spark.sql.parquet.ParquetThriftCompatibilitySuite
To avoid code generation during build time, Java code generated from testing Thrift schema and Avro IDL are also checked in.
When updating the testing Thrift schema and Avro IDL, please run gen-code.sh
to update all the generated Java code.
Prerequisites
Please ensure avro-tools
and thrift
are installed. You may install these two on Mac OS X via:
$ brew install thrift avro-tools