spark-instrumented-optimizer/external
Shixiong Zhu 1149c4efbc
[SPARK-25005][SS] Support non-consecutive offsets for Kafka
## What changes were proposed in this pull request?

As the user uses Kafka transactions to write data, the offsets in Kafka will be non-consecutive. It will contains some transaction (commit or abort) markers. In addition, if the consumer's `isolation.level` is `read_committed`, `poll` will not return aborted messages either. Hence, we will see non-consecutive offsets in the date returned by `poll`. However, as `seekToEnd` may move the offset point to these missing offsets, there are 4 possible corner cases we need to support:

- The whole batch contains no data messages
- The first offset in a batch is not a committed data message
- The last offset in a batch is not a committed data message
- There is a gap in the middle of a batch

They are all covered by the new unit tests.

## How was this patch tested?

The new unit tests.

Closes #22042 from zsxwing/kafka-transaction-read.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2018-08-28 08:38:07 -07:00
..
avro [SPARK-24811][FOLLOWUP][SQL] Revise package of AvroDataToCatalyst and CatalystDataToAvro 2018-08-23 15:08:46 +08:00
docker [SPARK-23038][TEST] Update docker/spark-test (JDK/OS) 2018-01-13 23:26:12 -08:00
docker-integration-tests [SPARK-22814][SQL] Support Date/Timestamp in a JDBC partition column 2018-07-30 07:42:00 -07:00
flume [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT 2018-01-13 00:37:59 +08:00
flume-assembly [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT 2018-01-13 00:37:59 +08:00
flume-sink [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT 2018-01-13 00:37:59 +08:00
kafka-0-8 [SPARK-21168] KafkaRDD should always set kafka clientId. 2018-04-23 13:56:11 -05:00
kafka-0-8-assembly [SPARK-23654][BUILD] remove jets3t as a dependency of spark 2018-08-16 12:34:23 -07:00
kafka-0-10 [SPARK-25116][TESTS] Fix the Kafka cluster leak and clean up cached producers 2018-08-17 14:21:08 -07:00
kafka-0-10-assembly [SPARK-23654][BUILD] remove jets3t as a dependency of spark 2018-08-16 12:34:23 -07:00
kafka-0-10-sql [SPARK-25005][SS] Support non-consecutive offsets for Kafka 2018-08-28 08:38:07 -07:00
kinesis-asl [SPARK-20168][STREAMING KINESIS] Setting the timestamp directly would cause exception on … 2018-07-12 10:04:47 -07:00
kinesis-asl-assembly [SPARK-23654][BUILD] remove jets3t as a dependency of spark 2018-08-16 12:34:23 -07:00
spark-ganglia-lgpl [SPARK-23028] Bump master branch version to 2.4.0-SNAPSHOT 2018-01-13 00:37:59 +08:00