### What changes were proposed in this pull request?
This patch introduces a new option to specify the minimum number of offsets to read per trigger i.e. minOffsetsPerTrigger and maxTriggerDelay to avoid the infinite wait for the trigger.
This new option will allow skipping trigger/batch when the number of records available in Kafka is low. This is a very useful feature in cases where we have a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day.
'maxTriggerDelay' option will help to avoid cases of infinite delay in scheduling trigger and the trigger will happen irrespective of records available if the maxTriggerDelay time exceeds the last trigger. It would be an optional parameter with a default value of 15 mins. This option will be only applicable if minOffsetsPerTrigger is set.
minOffsetsPerTrigger option would be optional of course, but once specified it would take precedence over maxOffestsPerTrigger which will be honored only after minOffsetsPerTrigger is satisfied.
### Why are the changes needed?
There are many scenarios where there is a sudden burst of data at certain intervals in a day and data volume is low for the rest of the day. Tunning such jobs is difficult as decreasing trigger processing time increasing the number of batches and hence cluster resource usage and adds to small file issues. Increasing trigger processing time adds consumer lag. This patch tries to address this issue.
### How was this patch tested?
This patch was tested by adding test cases as well as manually on a cluster where the job was running for a full one day with a data burst happening once a day.
Here is the picture of databurst and hence consumer lag:
<img width="1198" alt="Screenshot 2021-04-29 at 11 39 35 PM" src="https://user-images.githubusercontent.com/1044003/116997587-9b2ab180-acfa-11eb-91fd-524802ce3316.png">
This is how the job behaved at burst time running every 4.5 mins (which is the specified trigger time):
<img width="1154" alt="Burst Time" src="https://user-images.githubusercontent.com/1044003/116997919-12f8dc00-acfb-11eb-9b0a-98387fc67560.png">
This is job behavior during the non-burst time where it is skipping 2 to 3 triggers and running once every 9 to 13.5 mins
<img width="1154" alt="Non Burst Time" src="https://user-images.githubusercontent.com/1044003/116998244-8b5f9d00-acfb-11eb-8340-33d47149ef81.png">
Here are some more stats from the two-run i.e. one normal run and the other with minOffsetsperTrigger set:
| Run | Data Size | Number of Batch Runs | Number of Files |
| ------------- | ------------- |------------- |------------- |
| Normal Run | 54.2 GB | 320 | 21968 |
| Run with minOffsetsperTrigger | 54.2 GB | 120 | 12104 |
Closes#32653 from satishgopalani/SPARK-35312.
Authored-by: Satish Gopalani <satish.gopalani@pubmatic.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This patch is a follow-up of SPARK-26848 (#23747). In SPARK-26848, we decided to open possibility to let end users set individual timestamp per partition. But in many cases, specifying timestamp represents the intention that we would want to go back to specific timestamp and reprocess records, which should be applied to all topics and partitions.
This patch proposes to provide a way to set a global timestamp across topic-partitions which the source is subscribing to, so that end users can set all offsets by specific timestamp easily. To provide the way to config the timestamp easier, the new options only receive "a" timestamp for start/end timestamp.
New options introduced in this PR:
* startingTimestamp
* endingTimestamp
All two options receive timestamp as string.
There're priorities for options regarding starting/ending offset as we will have three options for start offsets and another three options for end offsets. Priorities are following:
* starting offsets: startingTimestamp -> startingOffsetsByTimestamp -> startingOffsets
* ending offsets: startingTimestamp -> startingOffsetsByTimestamp -> startingOffsets
### Why are the changes needed?
Existing option to specify timestamp as offset is quite verbose if there're a lot of partitions across topics. Suppose there're 100s of partitions in a topic, the json should contain 100s of times of the same timestamp.
Also, the number of partitions can also change, which requires either:
* fixing the code if the json is statically created
* introducing the dependencies on Kafka client and deal with Kafka API on crafting json programmatically
Both approaches are even not "acceptable" if we're dealing with ad-hoc query; anyone doesn't want to write the code more complicated than the query itself. Flink [provides the option](https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/connectors/datastream/kafka/#kafka-consumers-start-position-configuration) to specify a timestamp for all topic-partitions like this PR, and even doesn't provide the option to specify the timestamp per topic-partition.
With this PR, end users are only required to provide a single timestamp value. No more complicated JSON format end users need to know about the structure.
### Does this PR introduce _any_ user-facing change?
Yes, this PR introduces two new options, described in above section.
Doc changes are following:
![스크린샷 2021-05-21 오후 12 01 02](https://user-images.githubusercontent.com/1317309/119076244-3034e680-ba2d-11eb-8323-0e227932d2e5.png)
![스크린샷 2021-05-21 오후 12 01 12](https://user-images.githubusercontent.com/1317309/119076255-35923100-ba2d-11eb-9d79-538a7f9ee738.png)
![스크린샷 2021-05-21 오후 12 01 24](https://user-images.githubusercontent.com/1317309/119076264-39be4e80-ba2d-11eb-8265-ac158f55c360.png)
![스크린샷 2021-05-21 오후 12 06 01](https://user-images.githubusercontent.com/1317309/119076271-3d51d580-ba2d-11eb-98ea-35fd72b1bbfc.png)
### How was this patch tested?
New UTs covering new functionalities. Also manually tested via simple batch & streaming queries.
Closes#32609 from HeartSaVioR/SPARK-29223-v2.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This patch fixes wrong Python code sample for doc.
### Why are the changes needed?
Sample code is wrong.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Doc only.
Closes#32119 from Hisssy/ss-doc-typo-1.
Authored-by: hissy <aozora@live.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This patch adds a few description to SS doc about offset reset and data loss.
### Why are the changes needed?
During recent SS test, the behavior of gradual reducing input rows are confusing me. Comparing with Flink, I do not see a similar behavior. After looking into the code and doing some tests, I feel it is better to add some more description there in SS doc.
### Does this PR introduce _any_ user-facing change?
No, doc only.
### How was this patch tested?
Doc only.
Closes#31089 from viirya/ss-minor-5.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Kafka delegation token is obtained with `AdminClient` where security settings can be set. Keystore and trustrore type however can't be set. In this PR I've added these new configurations. This can be useful when the type is different. A good example is to make Spark FIPS compliant where the default JKS is not accepted.
### Why are the changes needed?
Missing configurations.
### Does this PR introduce _any_ user-facing change?
Yes, adding 2 additional config parameters.
### How was this patch tested?
Existing + modified unit tests + simple Kafka to Kafka app on cluster.
Closes#31070 from gaborgsomogyi/SPARK-34032.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Update kafka headers documentation, type is not longer a map but an array
[jira](https://issues.apache.org/jira/browse/SPARK-33660)
### Why are the changes needed?
To help users
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
It is only documentation
Closes#30605 from Gschiavon/SPARK-33660-fix-kafka-headers-documentation.
Authored-by: german <germanschiavon@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Deprecated `KafkaConsumer.poll(long)` API calls may cause infinite wait in the driver. In this PR I've added a new `AdminClient` based offset fetching which is turned off by default. There is a new flag named `spark.sql.streaming.kafka.useDeprecatedOffsetFetching` (default: `true`) which can be set to `false` to reach the newly added functionality. The Structured Streaming migration guide contains more information what migration consideration must be done. Please see the following [doc](https://docs.google.com/document/d/1gAh0pKgZUgyqO2Re3sAy-fdYpe_SxpJ6DkeXE8R1P7E/edit?usp=sharing) for further details.
The PR contains the following changes:
* Added `AdminClient` based offset fetching
* GroupId prefix feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* GroupId override feature removed from driver but only in `AdminClient` based approach (`AdminClient` doesn't need any GroupId)
* Additional unit tests
* Code comment changes
* Minor bugfixes here and there
* Removed Kafka auto topic creation feature but only in `AdminClient` based approach (please see doc for rationale). In short, it's super hidden, not sure anybody ever used in production + error prone.
* Added documentation to `ss-migration-guide` and `structured-streaming-kafka-integration`
### Why are the changes needed?
Driver may hang forever.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing + additional unit tests.
Cluster test with simple Kafka topic to another topic query.
Documentation:
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.
Closes#29729 from gaborgsomogyi/SPARK-32032.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Spark uses an old and deprecated API named `KafkaConsumer.poll(long)` which never returns and stays in live lock if metadata is not updated (for instance when broker disappears at consumer creation). Please see [Kafka documentation](https://kafka.apache.org/25/javadoc/org/apache/kafka/clients/consumer/KafkaConsumer.html#poll-long-) and [standalone test application](https://github.com/gaborgsomogyi/kafka-get-assignment) for further details.
In this PR I've applied the new `KafkaConsumer.poll(Duration)` API on executor side. Please note driver side still uses the old API which will be fixed in SPARK-32032.
### Why are the changes needed?
Infinite wait in `KafkaConsumer.poll(long)`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#28871 from gaborgsomogyi/SPARK-32033.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
- Add `minPartitions` support for Kafka Streaming V1 source.
- Add `minPartitions` support for Kafka batch V1 and V2 source.
- There is lots of refactoring (moving codes to KafkaOffsetReader) to reuse codes.
### Why are the changes needed?
Right now, the "minPartitions" option only works in Kafka streaming source v2. It would be great that we can support it in batch and streaming source v1 (v1 is the fallback mode when a user hits a regression in v2) as well.
### Does this PR introduce any user-facing change?
Yep. The `minPartitions` options is supported in Kafka batch and streaming queries for both data source V1 and V2.
### How was this patch tested?
New unit tests are added to test "minPartitions".
Closes#27388 from zsxwing/kafka-min-partitions.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
### What changes were proposed in this pull request?
This patch documents the configuration for the Kafka producer pool, newly revised via SPARK-21869 (#26845)
### Why are the changes needed?
The explanation of new Kafka producer pool configuration is missing, whereas the doc has Kafka
consumer pool configuration.
### Does this PR introduce any user-facing change?
Yes. This is a documentation change.
![Screen Shot 2020-01-12 at 11 16 19 PM](https://user-images.githubusercontent.com/9700541/72238148-c8959e00-3591-11ea-87fc-a8918792017e.png)
### How was this patch tested?
N/A
Closes#27146 from HeartSaVioR/SPARK-21869-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fixed typo in `docs` directory and in other directories
1. Find typo in `docs` and apply fixes to files in all directories
2. Fix `the the` -> `the`
### Why are the changes needed?
Better readability of documents
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test needed
Closes#26976 from kiszk/typo_20191221.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes the availability of `minPartitions` option for Kafka source, as it is only supported by micro-batch for now. There's a WIP PR for batch (#25436) as well but there's no progress on the PR so far, so safer to fix the doc first, and let it be added later when we address it with batch case as well.
### Why are the changes needed?
The doc is wrong and misleading.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Just a doc change.
Closes#26849 from HeartSaVioR/MINOR-FIX-minPartition-availability-doc.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fix and use proper html tag in docs
### Why are the changes needed?
Fix documentation format error.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26302 from uncleGen/minor-doc.
Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-29500
`KafkaRowWriter` now supports setting the Kafka partition by reading a "partition" column in the input dataframe.
Code changes in commit nr. 1.
Test changes in commit nr. 2.
Doc changes in commit nr. 3.
tcondie dongjinleekr srowen
### Why are the changes needed?
While it is possible to configure a custom Kafka Partitioner with
`.option("kafka.partitioner.class", "my.custom.Partitioner")`, this is not enough for certain use cases. See the Jira issue.
### Does this PR introduce any user-facing change?
No, as this behaviour is optional.
### How was this patch tested?
Two new UT were added and one was updated.
Closes#26153 from redsk/feature/SPARK-29500.
Authored-by: redsk <nicola.bova@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This patch introduces new options "startingOffsetsByTimestamp" and "endingOffsetsByTimestamp" to set specific timestamp per topic (since we're unlikely to set the different value per partition) to let source starts reading from offsets which have equal of greater timestamp, and ends reading until offsets which have equal of greater timestamp.
The new option would be optional of course, and take preference over existing offset options.
## How was this patch tested?
New unit tests added. Also manually tested basic functionality with Kafka 2.0.0 server.
Running query below
```
val df = spark.read.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "spark_26848_test_v1,spark_26848_test_2_v1")
.option("startingOffsetsByTimestamp", """{"spark_26848_test_v1": 1549669142193, "spark_26848_test_2_v1": 1549669240965}""")
.option("endingOffsetsByTimestamp", """{"spark_26848_test_v1": 1549669265676, "spark_26848_test_2_v1": 1549699265676}""")
.load().selectExpr("CAST(value AS STRING)")
df.show()
```
with below records (one string which number part remarks when they're put after such timestamp) in
topic `spark_26848_test_v1`
```
hello1 1549669142193
world1 1549669142193
hellow1 1549669240965
world1 1549669240965
hello1 1549669265676
world1 1549669265676
```
topic `spark_26848_test_2_v1`
```
hello2 1549669142193
world2 1549669142193
hello2 1549669240965
world2 1549669240965
hello2 1549669265676
world2 1549669265676
```
the result of `df.show()` follows:
```
+--------------------+
| value|
+--------------------+
|world1 1549669240965|
|world1 1549669142193|
|world2 1549669240965|
|hello2 1549669240965|
|hellow1 154966924...|
|hello2 1549669265676|
|hello1 1549669142193|
|world2 1549669265676|
+--------------------+
```
Note that endingOffsets (as well as endingOffsetsByTimestamp) are exclusive.
Closes#23747 from HeartSaVioR/SPARK-26848.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This update adds support for Kafka Headers functionality in Structured Streaming.
## How was this patch tested?
With following unit tests:
- KafkaRelationSuite: "default starting and ending offsets with headers" (new)
- KafkaSinkSuite: "batch - write to kafka" (updated)
Closes#22282 from dongjinleekr/feature/SPARK-23539.
Lead-authored-by: Lee Dongjin <dongjin@apache.org>
Co-authored-by: Jungtaek Lim <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
At the moment there are 3 places where communication protocol with Kafka cluster has to be set when delegation token used:
* On delegation token
* On source
* On sink
Most of the time users are using the same protocol on all these places (within one Kafka cluster). It would be better to declare it in one place (delegation token side) and Kafka sources/sinks can take this config over.
In this PR I've I've modified the code in a way that Kafka sources/sinks are taking over delegation token side `security.protocol` configuration when the token and the source/sink matches in `bootstrap.servers` configuration. This default configuration can be overwritten on each source/sink independently by using `kafka.security.protocol` configuration.
### Why are the changes needed?
The actual configuration's default behavior represents the minority of the use-cases and inconvenient.
### Does this PR introduce any user-facing change?
Yes, with this change users need to provide less configuration parameters by default.
### How was this patch tested?
Existing + additional unit tests.
Closes#25631 from gaborgsomogyi/SPARK-28928.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
This patch does pooling for both kafka consumers as well as fetched data. The overall benefits of the patch are following:
* Both pools support eviction on idle objects, which will help closing invalid idle objects which topic or partition are no longer be assigned to any tasks.
* It also enables applying different policies on pool, which helps optimization of pooling for each pool.
* We concerned about multiple tasks pointing same topic partition as well as same group id, and existing code can't handle this hence excess seek and fetch could happen. This patch properly handles the case.
* It also makes the code always safe to leverage cache, hence no need to maintain reuseCache parameter.
Moreover, pooling kafka consumers is implemented based on Apache Commons Pool, which also gives couple of benefits:
* We can get rid of synchronization of KafkaDataConsumer object while acquiring and returning InternalKafkaConsumer.
* We can extract the feature of object pool to outside of the class, so that the behaviors of the pool can be tested easily.
* We can get various statistics for the object pool, and also be able to enable JMX for the pool.
FetchedData instances are pooled by custom implementation of pool instead of leveraging Apache Commons Pool, because they have CacheKey as first key and "desired offset" as second key which "desired offset" is changing - I haven't found any general pool implementations supporting this.
This patch brings additional dependency, Apache Commons Pool 2.6.0 into `spark-sql-kafka-0-10` module.
## How was this patch tested?
Existing unit tests as well as new tests for object pool.
Also did some experiment regarding proving concurrent access of consumers for same topic partition.
* Made change on both sides (master and patch) to log when creating Kafka consumer or fetching records from Kafka is happening.
* branches
* master: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-master-ref-debugging
* patch: https://github.com/HeartSaVioR/spark/tree/SPARK-25151-debugging
* Test query (doing self-join)
* https://gist.github.com/HeartSaVioR/d831974c3f25c02846f4b15b8d232cc2
* Ran query from spark-shell, with using `local[*]` to maximize the chance to have concurrent access
* Collected the count of fetch requests on Kafka via command: `grep "creating new Kafka consumer" logfile | wc -l`
* Collected the count of creating Kafka consumers via command: `grep "fetching data from Kafka consumer" logfile | wc -l`
Topic and data distribution is follow:
```
truck_speed_events_stream_spark_25151_v1:0:99440
truck_speed_events_stream_spark_25151_v1:1:99489
truck_speed_events_stream_spark_25151_v1:2:397759
truck_speed_events_stream_spark_25151_v1:3:198917
truck_speed_events_stream_spark_25151_v1:4:99484
truck_speed_events_stream_spark_25151_v1:5:497320
truck_speed_events_stream_spark_25151_v1:6:99430
truck_speed_events_stream_spark_25151_v1:7:397887
truck_speed_events_stream_spark_25151_v1:8:397813
truck_speed_events_stream_spark_25151_v1:9:0
```
The experiment only used smallest 4 partitions (0, 1, 4, 6) from these partitions to finish the query earlier.
The result of experiment is below:
branch | create Kafka consumer | fetch request
-- | -- | --
master | 1986 | 2837
patch | 8 | 1706
Closes#22138 from HeartSaVioR/SPARK-25151.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
`minPartitions` has been used as a hint and relevant method (KafkaOffsetRangeCalculator.getRanges) doesn't guarantee the behavior that partitions will be equal or more than given value.
d67b98ea01/external/kafka-0-10-sql/src/main/scala/org/apache/spark/sql/kafka010/KafkaOffsetRangeCalculator.scala (L32-L46)
This patch makes clear the configuration is a hint, and actual partitions could be less or more.
## How was this patch tested?
Just a documentation change.
Closes#25332 from HeartSaVioR/MINOR-correct-kafka-structured-streaming-doc-minpartition.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
These are what I found during working on #22282.
- Remove unused value: `UnsafeArraySuite#defaultTz`
- Remove redundant new modifier to the case class, `KafkaSourceRDDPartition`
- Remove unused variables from `RDD.scala`
- Remove trailing space from `structured-streaming-kafka-integration.md`
- Remove redundant parameter from `ArrowConvertersSuite`: `nullable` is `true` by default.
- Remove leading empty line: `UnsafeRow`
- Remove trailing empty line: `KafkaTestUtils`
- Remove unthrown exception type: `UnsafeMapData`
- Replace unused declarations: `expressions`
- Remove duplicated default parameter: `AnalysisErrorSuite`
- `ObjectExpressionsSuite`: remove duplicated parameters, conversions and unused variable
Closes#25251 from dongjinleekr/cleanup/201907.
Authored-by: Lee Dongjin <dongjin@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
At the moment Kafka delegation tokens are fetched through `AdminClient` but there is no possibility to add custom configuration parameters. In [options](https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html#kafka-specific-configurations) there is already a possibility to add custom configurations.
In this PR I've added similar this possibility to `AdminClient`.
## How was this patch tested?
Existing + added unit tests.
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.
Closes#24875 from gaborgsomogyi/SPARK-28055.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Kafka related Spark parameters has to start with `spark.kafka.` and not with `spark.sql.`. Because of this I've renamed `spark.sql.kafkaConsumerCache.capacity`.
Since Kafka consumer caching is not documented I've added this also.
## How was this patch tested?
Existing + added unit test.
```
cd docs
SKIP_API=1 jekyll build
```
and manual webpage check.
Closes#24590 from gaborgsomogyi/SPARK-27687.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The actual implementation doesn't support multi-cluster Kafka connection with delegation token. In this PR I've added this functionality.
What this PR contains:
* New way of configuration
* Multiple delegation token obtain/store/use functionality
* Documentation
* The change works on DStreams also
## How was this patch tested?
Existing + additional unit tests.
Additionally tested on cluster.
Test scenario:
* 2 * 4 node clusters
* The 4-4 nodes are in different kerberos realms
* Cross-Realm trust between the 2 realms
* Yarn
* Kafka broker version 2.1.0
* security.protocol = SASL_SSL
* sasl.mechanism = SCRAM-SHA-512
* Artificial exceptions during processing
* Source reads from realm1 sink writes to realm2
Kafka broker settings:
* delegation.token.expiry.time.ms=600000 (10 min)
* delegation.token.max.lifetime.ms=1200000 (20 min)
* delegation.token.expiry.check.interval.ms=300000 (5 min)
Closes#24305 from gaborgsomogyi/SPARK-27294.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.
## How was this patch tested?
Doc build
Closes#24243 from srowen/SPARK-26918.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
Since this caveat added to the DStreams documentation it would be good to add to Structured Streaming as well.
## How was this patch tested?
cd docs/
SKIP_API=1 jekyll build
Manual webpage check.
Closes#23974 from gaborgsomogyi/SPARK-26592_.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
This PR allows the user to override `kafka.group.id` for better monitoring or security. The user needs to make sure there are not multiple queries or sources using the same group id.
It also fixes a bug that the `groupIdPrefix` option cannot be retrieved.
## How was this patch tested?
The new added unit tests.
Closes#23301 from zsxwing/SPARK-26350.
Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
## What changes were proposed in this pull request?
When Kafka delegation token obtained, SCRAM `sasl.mechanism` has to be configured for authentication. This can be configured on the related source/sink which is inconvenient from user perspective. Such granularity is not required and this configuration can be implemented with one central parameter.
In this PR `spark.kafka.sasl.token.mechanism` added to configure this centrally (default: `SCRAM-SHA-512`).
## How was this patch tested?
Existing unit tests + on cluster.
Closes#23274 from gaborgsomogyi/SPARK-26322.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Kafka delegation token support implemented in [PR#22598](https://github.com/apache/spark/pull/22598) but that didn't contain documentation because of rapid changes. Because it has been merged in this PR I've documented it.
## How was this patch tested?
jekyll build + manual html check
Closes#23195 from gaborgsomogyi/SPARK-26236.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
Allow the Spark Structured Streaming user to specify the prefix of the consumer group (group.id), compared to force consumer group ids of the form `spark-kafka-source-*`
## How was this patch tested?
Unit tests provided by Spark (backwards compatible change, i.e., user can optionally use the functionality)
`mvn test -pl external/kafka-0-10`
Closes#23103 from zouzias/SPARK-26121.
Authored-by: Anastasios Zouzias <anastasios@sqooba.io>
Signed-off-by: cody koeninger <cody@koeninger.org>
## What changes were proposed in this pull request?
Easy fix in the documentation.
## How was this patch tested?
N/A
Closes#20948
Author: Daniel Sakuma <dsakuma@gmail.com>
Closes#20928 from dsakuma/fix_typo_configuration_docs.
## What changes were proposed in this pull request?
Fix spelling in quick-start doc.
## How was this patch tested?
Doc only.
Author: Shashwat Anand <me@shashwat.me>
Closes#20336 from ashashwat/SPARK-23165.
## What changes were proposed in this pull request?
In latest structured-streaming-kafka-integration document, Java code example for Kafka integration is using `DataFrame<Row>`, shouldn't it be changed to `DataSet<Row>`?
## How was this patch tested?
manual test has been performed to test the updated example Java code in Spark 2.2.1 with Kafka 1.0
Author: brandonJY <brandonJY@users.noreply.github.com>
Closes#20312 from brandonJY/patch-2.
## What changes were proposed in this pull request?
Minor change to kafka integration document for structured streaming.
## How was this patch tested?
N/A, doc change only.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes#18550 from viirya/minor-ss-kafka-doc.
## What changes were proposed in this pull request?
Add documentation that describes how to write streaming and batch queries to Kafka.
zsxwing tdas
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Tyson Condie <tcondie@gmail.com>
Closes#17246 from tcondie/kafka-write-docs.
## What changes were proposed in this pull request?
Revision to structured-streaming-kafka-integration.md to reflect new Batch query specification and options.
zsxwing tdas
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Tyson Condie <tcondie@gmail.com>
Closes#16918 from tcondie/kafka-docs.
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.
## How was this patch tested?
N/A since only docs or comments were updated.
Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
Closes#16455 from neurons/np.structure_streaming_doc.
## What changes were proposed in this pull request?
This patch uses `{% highlight lang %}...{% endhighlight %}` to highlight code snippets in the `Structured Streaming Kafka010 integration doc` and the `Spark Streaming Kafka010 integration doc`.
This patch consists of two commits:
- the first commit fixes only the leading spaces -- this is large
- the second commit adds the highlight instructions -- this is much simpler and easier to review
## How was this patch tested?
SKIP_API=1 jekyll build
## Screenshots
**Before**
![snip20161101_3](https://cloud.githubusercontent.com/assets/15843379/19894258/47746524-a087-11e6-9a2a-7bff2d428d44.png)
**After**
![snip20161101_1](https://cloud.githubusercontent.com/assets/15843379/19894324/8bebcd1e-a087-11e6-835b-88c4d2979cfa.png)
Author: Liwei Lin <lwlin7@gmail.com>
Closes#15715 from lw-lin/doc-highlight-code-snippet.
## What changes were proposed in this pull request?
maxOffsetsPerTrigger option for rate limiting, proportionally based on volume of different topicpartitions.
## How was this patch tested?
Added unit test
Author: cody koeninger <cody@koeninger.org>
Closes#15527 from koeninger/SPARK-17813.
## What changes were proposed in this pull request?
startingOffsets takes specific per-topicpartition offsets as a json argument, usable with any consumer strategy
assign with specific topicpartitions as a consumer strategy
## How was this patch tested?
Unit tests
Author: cody koeninger <cody@koeninger.org>
Closes#15504 from koeninger/SPARK-17812.
## What changes were proposed in this pull request?
This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.
It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing
tdas did most of work and part of them was inspired by koeninger's work.
### Introduction
The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:
Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int
The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.
### Configuration
The user can use `DataStreamReader.option` to set the following configurations.
Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets
Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`
### Usage
* Subscribe to 1 topic
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "topic1")
.load()
```
* Subscribe to multiple topics
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribe", "topic1,topic2")
.load()
```
* Subscribe to a pattern
```Scala
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "host:port")
.option("subscribePattern", "topic.*")
.load()
```
## How was this patch tested?
The new unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>
Closes#15102 from zsxwing/kafka-source.