66d5a0049a
### What changes were proposed in this pull request?

When creating a record writer in an AvroDeserializer, or creating a struct converter in an AvroSerializer, look up Avro fields using a map rather than scanning the entire list of Avro fields.

### Why are the changes needed?

A query against an Avro table can be quite slow when all of the following are true:

* There are many columns in the Avro file
* The query contains a wide projection
* There are many splits in the input
* Some of the splits are read serially (e.g., there are fewer executors than tasks)

A write to an Avro table can be quite slow when all of the following are true:

* There are many columns in the new rows
* The operation is creating many files

For example, a single-threaded query against a 6000-column Avro data set with 50K rows and 20 files takes less than a minute with Spark 3.0.1 but over 7 minutes with Spark 3.2.0-SNAPSHOT. This PR restores the faster time.

For the 1000-column read benchmark:

* Before patch: 108447 ms
* After patch: 35925 ms
* Improvement: 66%

For the 1000-column write benchmark:

* Before patch: 123307 ms
* After patch: 42313 ms
* Improvement: 65%

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

* Ran existing unit tests
* Added new unit tests
* Added new benchmarks

Closes #32969 from bersprockets/SPARK-35817.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
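The core of the change is replacing a per-column linear scan of the Avro field list with a one-time name-to-field map, turning an O(n) lookup per column (O(n²) for a wide schema) into O(1). A minimal sketch of the idea in Java follows; the class and method names here are illustrative assumptions, not the actual identifiers in the Spark Avro connector:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration of the optimization; names are assumptions.
class FieldLookup {
    record Field(String name, int pos) {}

    // Before: scan the whole field list for every requested column.
    // O(n) per lookup, O(n^2) for a projection of n columns.
    static Field findByScan(List<Field> fields, String name) {
        for (Field f : fields) {
            if (f.name().equalsIgnoreCase(name)) {
                return f;
            }
        }
        return null;
    }

    // After: build a name -> field index once per deserializer/serializer,
    // then each column lookup is a single O(1) map access.
    static Map<String, Field> buildIndex(List<Field> fields) {
        Map<String, Field> index = new HashMap<>();
        for (Field f : fields) {
            index.put(f.name().toLowerCase(Locale.ROOT), f);
        }
        return index;
    }
}
```

With thousands of columns, the scan's quadratic cost dominates each record writer or struct converter construction, which is why building the map once pays off across many splits and output files.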