spark-instrumented-optimizer/external
Bruce Robbins 66d5a0049a [SPARK-35817][SQL] Restore performance of queries against wide Avro tables
### What changes were proposed in this pull request?

When creating a record writer in an AvroDeserializer, or creating a struct converter in an AvroSerializer, look up Avro fields using a map rather than scanning the entire list of Avro fields.
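The difference between the two lookup strategies can be sketched as follows. This is a minimal, language-agnostic illustration in Python, not the code in the patch: the `Field` tuple and the helpers `find_by_scan` and `build_index` are hypothetical names, and the sketch ignores details of how Spark actually matches field names.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical stand-in for an Avro field: (name, position in the record).
Field = Tuple[str, int]


def find_by_scan(fields: List[Field], name: str) -> Optional[Field]:
    """Linear scan: O(n) per lookup, so O(n^2) to resolve all n columns."""
    for field in fields:
        if field[0] == name:
            return field
    return None


def build_index(fields: List[Field]) -> Dict[str, Field]:
    """Build the name -> field map once; later lookups are O(1) on average."""
    return {field[0]: field for field in fields}


# A wide schema, similar in width to the 6000-column example in this description.
fields = [(f"col_{i}", i) for i in range(6000)]
index = build_index(fields)

# Both strategies return the same field; only the cost per lookup differs.
wanted = "col_5999"
scan_result = find_by_scan(fields, wanted)
map_result = index.get(wanted)
```

With the scan, resolving every column of a wide schema repeats the O(n) search n times, and that cost is paid again for each split or output file. The map is built once per schema and amortizes the work.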

### Why are the changes needed?

A query against an Avro table can be quite slow when all of the following are true:

* There are many columns in the Avro file
* The query contains a wide projection
* There are many splits in the input
* Some of the splits are read serially (e.g., there are fewer executors than tasks)

A write to an Avro table can be quite slow when all of the following are true:

* There are many columns in the new rows
* The operation is creating many files

For example, a single-threaded query against a 6000-column Avro data set with 50K rows and 20 files takes less than a minute with Spark 3.0.1 but over 7 minutes with Spark 3.2.0-SNAPSHOT.

This PR restores the faster time.

For the 1000-column read benchmark:

* Before patch: 108447 ms
* After patch: 35925 ms
* Percent improvement: 66%

For the 1000-column write benchmark:

* Before patch: 123307 ms
* After patch: 42313 ms
* Percent improvement: 65%
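
The percentages above follow from the reported timings. A small sketch of the arithmetic, assuming the write timings are in milliseconds like the read timings and that the reported figures are truncated rather than rounded (the helper `percent_improvement` is a name introduced here for illustration):

```python
def percent_improvement(before_ms: int, after_ms: int) -> float:
    """Reduction in elapsed time, as a percentage of the original time."""
    return (before_ms - after_ms) / before_ms * 100


# Timings taken from the benchmark results in this description.
read_pct = percent_improvement(108447, 35925)
write_pct = percent_improvement(123307, 42313)

# Truncating to whole percent reproduces the reported 66% and 65%.
print(int(read_pct), int(write_pct))

# The same timings expressed as a speedup factor (before / after).
read_speedup = 108447 / 35925
write_speedup = 123307 / 42313
```

Expressed as a speedup, both benchmarks run roughly three times faster after the patch.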

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

* Ran existing unit tests
* Added new unit tests
* Added new benchmarks

Closes #32969 from bersprockets/SPARK-35817.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-06-23 22:36:56 +08:00
| Name | Last commit | Date |
| --- | --- | --- |
| avro | [SPARK-35817][SQL] Restore performance of queries against wide Avro tables | 2021-06-23 22:36:56 +08:00 |
| docker | [SPARK-32353][TEST] Update docker/spark-test and clean up unused stuff | 2020-07-17 12:05:45 -07:00 |
| docker-integration-tests | [SPARK-35577][TESTS] Allow to log container output for docker integration tests | 2021-06-01 22:44:48 +09:00 |
| kafka-0-10 | [SPARK-35532][TESTS] Ensure mllib and kafka-0-10 module can be maven test independently in Scala 2.13 | 2021-05-30 16:36:17 -07:00 |
| kafka-0-10-assembly | [SPARK-27733][CORE] Upgrade Avro to version 1.10.1 | 2021-01-20 15:42:27 -08:00 |
| kafka-0-10-sql | [SPARK-35838][BUILD][TESTS] Ensure all modules can be maven test independently in Scala 2.13 | 2021-06-22 06:31:24 -07:00 |
| kafka-0-10-token-provider | [SPARK-34650][BUILD][SS] Exclude zstd-jni transitive dependency from Kafka Client | 2021-03-07 13:53:55 +09:00 |
| kinesis-asl | [SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT | 2020-12-04 14:10:42 -08:00 |
| kinesis-asl-assembly | [SPARK-27733][CORE] Upgrade Avro to version 1.10.1 | 2021-01-20 15:42:27 -08:00 |
| spark-ganglia-lgpl | [SPARK-34520][CORE][FOLLOW-UP] Remove SecurityManager in GangliaSink | 2021-03-01 11:18:57 +09:00 |