spark-instrumented-optimizer/external
Bruce Robbins 7e8eb0447b [SPARK-33314][SQL] Avoid dropping rows in Avro reader
### What changes were proposed in this pull request?

This PR adds a check to RowReader#hasNextRow so that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow consume no more than one record.

This PR also modifies RowReader#nextRow such that consecutive calls will return new rows (previously consecutive calls would return the same row).

### Why are the changes needed?

SPARK-32346 slightly refactored AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, hasNextRow is sometimes called twice before nextRow is called; the second call consumes another record and replaces the stored row, so a row is lost.

For example (assuming the V1 Avro reader):
```scala
val df = spark.range(0, 25).toDF("index")
df.write.mode("overwrite").format("avro").save("index_avro")
val loaded = spark.read.format("avro").load("index_avro")
// The following gives the expected size (25)
loaded.collect.size
// The following gives the wrong size (fewer than 25, because rows are dropped)
loaded.orderBy("index").collect.size
```
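The failure mode and the fix can be illustrated outside Spark with a minimal sketch. The names `LossyRowReader`, `BufferedRowReader`, and `drainWithDoubleCheck` are hypothetical and are not Spark's actual `AvroUtils.RowReader`; the sketch only assumes the behavior described above, namely that `hasNextRow` pulls a record from the underlying source and stores it for `nextRow`.

```scala
object RowReaderSketch {

  // Hypothetical reader reproducing the bug: every hasNextRow call
  // consumes a record, so a second call replaces the stored row.
  final class LossyRowReader[T](records: Iterator[T]) {
    private var buffered: Option[T] = None

    def hasNextRow: Boolean =
      if (records.hasNext) { buffered = Some(records.next()); true } else false

    // Consecutive calls return the same stored row.
    def nextRow(): T =
      buffered.getOrElse(throw new NoSuchElementException("no row buffered"))
  }

  // Hypothetical reader with the guard: hasNextRow consumes a record only
  // when nothing is buffered, and nextRow clears the buffer after returning.
  final class BufferedRowReader[T](records: Iterator[T]) {
    private var buffered: Option[T] = None

    def hasNextRow: Boolean = {
      if (buffered.isEmpty && records.hasNext) {
        buffered = Some(records.next())
      }
      buffered.isDefined
    }

    def nextRow(): T = {
      if (!hasNextRow) throw new NoSuchElementException("no more rows")
      val row = buffered.get
      buffered = None // the next call must fetch a new row
      row
    }
  }

  // Simulates a caller that checks hasNextRow twice before each nextRow.
  def drainWithDoubleCheck[T](hasNext: () => Boolean, next: () => T): List[T] = {
    val out = List.newBuilder[T]
    while (hasNext() && hasNext()) out += next()
    out.result()
  }

  def main(args: Array[String]): Unit = {
    val lossy = new LossyRowReader(Iterator.range(0, 25))
    val fixed = new BufferedRowReader(Iterator.range(0, 25))
    println(drainWithDoubleCheck(() => lossy.hasNextRow, () => lossy.nextRow()).size)
    println(drainWithDoubleCheck(() => fixed.hasNextRow, () => fixed.nextRow()).size)
  }
}
```

With the double check, the lossy reader returns fewer rows than it was given, while the buffered reader returns all 25. The actual patch may differ in detail; the point of the sketch is only that `hasNextRow` must be idempotent until `nextRow` is called.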
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests, which fail without the fix.

Closes #30221 from bersprockets/avro_iterator_play.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-05 11:50:11 +09:00
| Name | Last commit | Date |
| --- | --- | --- |
| avro | [SPARK-33314][SQL] Avoid dropping rows in Avro reader | 2020-11-05 11:50:11 +09:00 |
| docker | [SPARK-32353][TEST] Update docker/spark-test and clean up unused stuff | 2020-07-17 12:05:45 -07:00 |
| docker-integration-tests | [SPARK-33265][TEST] Rename classOf[Seq] to classOf[scala.collection.Seq] in PostgresIntegrationSuite for Scala 2.13 | 2020-11-04 17:39:06 +09:00 |
| kafka-0-10 | [SPARK-32873][BUILD] Fix code which causes error when build with sbt and Scala 2.13 | 2020-09-14 15:34:58 +09:00 |
| kafka-0-10-assembly | [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile | 2020-10-22 03:21:34 +00:00 |
| kafka-0-10-sql | [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile | 2020-10-22 03:21:34 +00:00 |
| kafka-0-10-token-provider | [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile | 2020-10-22 03:21:34 +00:00 |
| kinesis-asl | [SPARK-33079][TESTS] Replace the existing Maven job for Scala 2.13 in Github Actions with SBT job | 2020-10-15 20:51:20 +09:00 |
| kinesis-asl-assembly | [SPARK-33212][BUILD] Move to shaded clients for Hadoop 3.x profile | 2020-10-22 03:21:34 +00:00 |
| spark-ganglia-lgpl | [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT | 2020-02-25 19:44:31 -08:00 |