e5bb2937f6
### What changes were proposed in this pull request?

Deprecated `KafkaConsumer.poll(long)` API calls may cause an infinite wait in the driver. In this PR I've added new `AdminClient` based offset fetching, which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to enable the newly added functionality. The Structured Streaming migration guide describes the migration considerations in more detail. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details.

The PR contains the following changes:

* Added `AdminClient` based offset fetching
* GroupId prefix feature removed from the driver, but only in the `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* GroupId override feature removed from the driver, but only in the `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* Additional unit tests
* Code comment changes
* Minor bugfixes here and there
* Removed the Kafka auto topic creation feature, but only in the `AdminClient` based approach (please see the doc for the rationale). In short, it's well hidden, error prone, and it's not clear anybody ever used it in production.
* Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration`

### Why are the changes needed?

Driver may hang forever.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing + additional unit tests.
Cluster test with a simple query from one Kafka topic to another.
Documentation:

```
cd docs/
SKIP_API=1 jekyll build
```

Manual webpage check.

Closes #29729 from gaborgsomogyi/SPARK-32032.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
* src
* pom.xml
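
As a hedged illustration of the flag named in the commit message, the following sketch shows one way to opt in to the `AdminClient` based offset fetching when submitting an application. Only the flag name and its `false` value come from the commit message. The application class `com.example.MyStreamingApp` and the jar `my-streaming-app.jar` are hypothetical placeholders, and `--conf` is the general `spark-submit` mechanism for passing any Spark property.

```shell
# Sketch only: the class name and jar below are placeholders, not part of this commit.
# Setting the flag to false switches the driver to AdminClient based offset fetching;
# leaving it unset keeps the default (true), i.e. the deprecated fetching path.
spark-submit \
  --conf spark.sql.streaming.kafka.useDeprecatedOffsetFetching=false \
  --class com.example.MyStreamingApp \
  my-streaming-app.jar
```

According to the commit message, the GroupId prefix, GroupId override, and Kafka auto topic creation features are not available on the driver in this mode, so the migration guide is worth checking before changing the value.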