7e8eb0447b
### What changes were proposed in this pull request?

This PR adds a check to RowReader#hasNextRow such that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow will avoid consuming more than 1 record. This PR also modifies RowReader#nextRow such that consecutive calls will return new rows (previously consecutive calls would return the same row).

### Why are the changes needed?

SPARK-32346 slightly refactored the AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, sometimes hasNextRow is called twice before nextRow is called, resulting in a lost row. For example (which assumes the V1 Avro reader):

```scala
val df = spark.range(0, 25).toDF("index")
df.write.mode("overwrite").format("avro").save("index_avro")
val loaded = spark.read.format("avro").load("index_avro")
// The following will give the expected size
loaded.collect.size
// The following will give the wrong size
loaded.orderBy("index").collect.size
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests, which fail without the fix.

Closes #30221 from bersprockets/avro_iterator_play.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
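The fix described above can be illustrated with a small, self-contained sketch. This is not Spark's actual `AvroUtils#RowReader` (which deserializes Avro records into `InternalRow`s); it is a hypothetical, simplified trait over a plain `Iterator` that shows the two invariants the PR establishes: `hasNextRow` buffers at most one record no matter how many times it is called, and `nextRow` clears the buffer so consecutive calls advance.

```scala
// Hypothetical sketch of the idempotent hasNextRow/nextRow pattern.
// `source` stands in for the underlying Avro record reader.
trait RowReader[T] {
  protected val source: Iterator[T]

  // Holds the one record buffered by hasNextRow, if any.
  private var currentRow: Option[T] = None

  // Consume a record only when nothing is buffered yet, so repeated
  // calls without an intervening nextRow cannot skip records.
  def hasNextRow: Boolean = {
    if (currentRow.isEmpty && source.hasNext) {
      currentRow = Some(source.next())
    }
    currentRow.isDefined
  }

  // Return the buffered row and clear the buffer, so consecutive
  // calls return new rows rather than repeating the same one.
  def nextRow: T = {
    if (!hasNextRow) {
      throw new NoSuchElementException("next on empty iterator")
    }
    val row = currentRow.get
    currentRow = None
    row
  }
}
```

With this shape, a caller such as a sorting operator that probes `hasNextRow` more than once per row (the failure mode in the `orderBy` example above) still sees every record exactly once.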