spark-instrumented-optimizer/external
Erik Krogen 4dd41b9678 [SPARK-34365][AVRO] Add support for positional Catalyst-to-Avro schema matching
### What changes were proposed in this pull request?
Provide the (configurable) ability to perform Avro-to-Catalyst schema field matching using the position of the fields instead of their names. A new `option` is added for the Avro datasource, `positionalFieldMatching`, which instructs `AvroSerializer`/`AvroDeserializer` to perform positional field matching instead of matching by name.
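
To make the difference between the two strategies concrete, here is a minimal, self-contained sketch in plain Python. It is not Spark's implementation: the helper names `match_by_name` and `match_by_position` are hypothetical, schemas are reduced to flat lists of field names, and the requirement that both schemas have the same number of fields for positional matching is a simplifying assumption of this sketch.

```python
from typing import List, Tuple


def match_by_name(catalyst_fields: List[str], avro_fields: List[str]) -> List[Tuple[str, str]]:
    """Pair each Catalyst field with the Avro field that has the same name.

    Hypothetical helper for illustration only; field order is irrelevant.
    """
    avro_names = set(avro_fields)
    missing = [name for name in catalyst_fields if name not in avro_names]
    if missing:
        raise ValueError(f"No Avro field found for Catalyst field(s): {missing}")
    return [(name, name) for name in catalyst_fields]


def match_by_position(catalyst_fields: List[str], avro_fields: List[str]) -> List[Tuple[str, str]]:
    """Pair the i-th Catalyst field with the i-th Avro field, ignoring names.

    Hypothetical helper for illustration only. Requiring equal field counts
    is an assumption of this sketch, not a statement about Spark's rules.
    """
    if len(catalyst_fields) != len(avro_fields):
        raise ValueError(
            f"Field count mismatch: {len(catalyst_fields)} Catalyst vs {len(avro_fields)} Avro"
        )
    return list(zip(catalyst_fields, avro_fields))


if __name__ == "__main__":
    catalyst = ["id", "name"]
    avro = ["key", "label"]
    # Positional matching pairs fields by index even though the names differ.
    print(match_by_position(catalyst, avro))
    # By-name matching rejects the same pair of schemas.
    try:
        match_by_name(catalyst, avro)
    except ValueError as err:
        print(f"by-name matching failed: {err}")
```

In Spark itself, the behavior described above is selected through the datasource option named in this PR, for example by passing `positionalFieldMatching` set to `true` as an option on an Avro read or write.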

### Why are the changes needed?
This by-name matching is somewhat recent; prior to PR #24635, at least on the write path, schemas were matched positionally ("structural" comparison). While by-name matching is the better default, the behavior should be configurable by the user. Even at the time that PR #24635 was handled, there was [interest in making this behavior configurable](https://github.com/apache/spark/pull/24635#issuecomment-494205251), but it appears to have gone unaddressed.

There is precedent for making this behavior configurable, as seen in PR #29737, which added this support for ORC. Beyond this precedent, Hive performs matching positionally ([ref](https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles)), so this is behavior that Hadoop/Hive ecosystem users are familiar with.

### Does this PR introduce _any_ user-facing change?
Yes, a new option is provided for the Avro datasource, `positionalFieldMatching`, which provides compatibility with Hive and pre-3.0.0 Spark behavior.

### How was this patch tested?
New unit tests are added within `AvroSuite`, `AvroSchemaHelperSuite`, and `AvroSerdeSuite`, and most of the existing tests within `AvroSerdeSuite` are adapted to run the same test with both by-name and positional matching to ensure feature parity.

Closes #31490 from xkrogen/xkrogen-SPARK-34365-avro-positional-field-matching.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-30 16:20:45 +08:00
avro [SPARK-34365][AVRO] Add support for positional Catalyst-to-Avro schema matching 2021-06-30 16:20:45 +08:00
docker [SPARK-32353][TEST] Update docker/spark-test and clean up unused stuff 2020-07-17 12:05:45 -07:00
docker-integration-tests [SPARK-34302][FOLLOWUP][SQL][TESTS] Update jdbc.v2.*IntegrationSuite 2021-06-28 23:01:54 -07:00
kafka-0-10 [SPARK-35532][TESTS] Ensure mllib and kafka-0-10 module can be maven test independently in Scala 2.13 2021-05-30 16:36:17 -07:00
kafka-0-10-assembly [SPARK-27733][CORE] Upgrade Avro to version 1.10.1 2021-01-20 15:42:27 -08:00
kafka-0-10-sql [SPARK-35838][BUILD][TESTS] Ensure all modules can be maven test independently in Scala 2.13 2021-06-22 06:31:24 -07:00
kafka-0-10-token-provider [SPARK-35747][CORE] Avoid printing full Exception stack trace, if Hbase/Kafka/Hive services are not running in a secure cluster 2021-06-23 23:12:02 -07:00
kinesis-asl [SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT 2020-12-04 14:10:42 -08:00
kinesis-asl-assembly [SPARK-27733][CORE] Upgrade Avro to version 1.10.1 2021-01-20 15:42:27 -08:00
spark-ganglia-lgpl [SPARK-34520][CORE][FOLLOW-UP] Remove SecurityManager in GangliaSink 2021-03-01 11:18:57 +09:00