spark-instrumented-optimizer

Erik Krogen 4dd41b9678 [SPARK-34365][AVRO] Add support for positional Catalyst-to-Avro schema matching

### What changes were proposed in this pull request?
Provide the (configurable) ability to perform Avro-to-Catalyst schema field matching using the position of the fields instead of their names. A new `option` is added for the Avro datasource, `positionalFieldMatching`, which instructs `AvroSerializer`/`AvroDeserializer` to perform positional field matching instead of matching by name.

### Why are the changes needed?
This by-name matching is somewhat recent; prior to PR #24635, at least on the write path, schemas were matched positionally ("structural" comparison). While by-name is the better default behavior, it is better to make this configurable by the user. Even at the time that PR #24635 was handled, there was [interest in making this behavior configurable](https://github.com/apache/spark/pull/24635#issuecomment-494205251), but it appears it went unaddressed. There is precedent for making this behavior configurable, as seen in PR #29737, which added this support for ORC. Besides this precedent, Hive performs matching positionally ([ref](https://cwiki.apache.org/confluence/display/Hive/AvroSerDe#AvroSerDe-WritingtablestoAvrofiles)), so this is behavior that Hadoop/Hive ecosystem users are familiar with.

### Does this PR introduce _any_ user-facing change?
Yes, a new option is provided for the Avro datasource, `positionalFieldMatching`, which provides compatibility with Hive and pre-3.0.0 Spark behavior.

### How was this patch tested?
New unit tests are added within `AvroSuite`, `AvroSchemaHelperSuite`, and `AvroSerdeSuite`; and most of the existing tests within `AvroSerdeSuite` are adapted to perform the same test using by-name and positional matching to ensure feature parity.

Closes #31490 from xkrogen/xkrogen-SPARK-34365-avro-positional-field-matching.

Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-30 16:20:45 +08:00
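The difference between the by-name and positional matching described in the commit message can be illustrated with a small, self-contained Python sketch. This is not Spark's implementation: the helper names `match_by_name` and `match_by_position` are hypothetical, and schemas are reduced to plain lists of field names purely for illustration.

```python
from typing import Dict, List


def match_by_name(catalyst_fields: List[str], avro_fields: List[str]) -> Dict[str, str]:
    """Pair each Catalyst field with the Avro field of the same name (illustrative only)."""
    avro_names = set(avro_fields)
    missing = [name for name in catalyst_fields if name not in avro_names]
    if missing:
        raise ValueError(f"no Avro field found for Catalyst field(s): {missing}")
    return {name: name for name in catalyst_fields}


def match_by_position(catalyst_fields: List[str], avro_fields: List[str]) -> Dict[str, str]:
    """Pair the i-th Catalyst field with the i-th Avro field, ignoring names (illustrative only)."""
    if len(catalyst_fields) != len(avro_fields):
        raise ValueError("positional matching requires the same number of fields")
    return dict(zip(catalyst_fields, avro_fields))


# Hypothetical schemas: same structure, different field names.
catalyst = ["id", "name"]
avro = ["user_id", "user_name"]

# Positional matching succeeds because only the field order matters.
positional = match_by_position(catalyst, avro)

# By-name matching fails because no Avro field is called "id" or "name".
try:
    match_by_name(catalyst, avro)
    by_name_failed = False
except ValueError:
    by_name_failed = True
```

With reordered but identically named fields (for example `["name", "id"]` on the Avro side), the two modes both succeed but pair the fields differently, which is why the choice has to be explicit.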
avro	[SPARK-34365][AVRO] Add support for positional Catalyst-to-Avro schema matching	2021-06-30 16:20:45 +08:00
docker	[SPARK-32353][TEST] Update docker/spark-test and clean up unused stuff	2020-07-17 12:05:45 -07:00
docker-integration-tests	[SPARK-34302][FOLLOWUP][SQL][TESTS] Update jdbc.v2.*IntegrationSuite	2021-06-28 23:01:54 -07:00
kafka-0-10	[SPARK-35532][TESTS] Ensure mllib and kafka-0-10 module can be maven test independently in Scala 2.13	2021-05-30 16:36:17 -07:00
kafka-0-10-assembly	[SPARK-27733][CORE] Upgrade Avro to version 1.10.1	2021-01-20 15:42:27 -08:00
kafka-0-10-sql	[SPARK-35838][BUILD][TESTS] Ensure all modules can be maven test independently in Scala 2.13	2021-06-22 06:31:24 -07:00
kafka-0-10-token-provider	[SPARK-35747][CORE] Avoid printing full Exception stack trace, if Hbase/Kafka/Hive services are not running in a secure cluster	2021-06-23 23:12:02 -07:00
kinesis-asl	[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT	2020-12-04 14:10:42 -08:00
kinesis-asl-assembly	[SPARK-27733][CORE] Upgrade Avro to version 1.10.1	2021-01-20 15:42:27 -08:00
spark-ganglia-lgpl	[SPARK-34520][CORE][FOLLOW-UP] Remove SecurityManager in GangliaSink	2021-03-01 11:18:57 +09:00