Commit graph

24 commits

Author SHA1 Message Date
Gabor Somogyi d47c219f94 [SPARK-28055][SS][DSTREAMS] Add delegation token custom AdminClient configurations.
## What changes were proposed in this pull request?

At the moment Kafka delegation tokens are fetched through `AdminClient` but there is no possibility to add custom configuration parameters. In [options](https://spark.apache.org/docs/2.4.3/structured-streaming-kafka-integration.html#kafka-specific-configurations) there is already a possibility to add custom configurations.

In this PR I've added a similar possibility for `AdminClient`.
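
As a rough illustration, assuming the custom AdminClient parameters follow the same `spark.kafka.clusters.${cluster}.kafka.*` naming scheme as the other per-cluster delegation token settings (the cluster name and values below are made up):

```Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("kafka-delegation-token-admin-config")
  // bootstrap servers used to obtain the delegation token for this cluster
  .config("spark.kafka.clusters.cluster1.auth.bootstrap.servers", "broker1:9093")
  // hypothetical custom parameter forwarded to the token-fetching AdminClient
  .config("spark.kafka.clusters.cluster1.kafka.retries", "3")
  .getOrCreate()
```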

## How was this patch tested?

Existing + added unit tests.

```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #24875 from gaborgsomogyi/SPARK-28055.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-07-11 09:36:24 -07:00
Gabor Somogyi 1a915bf20f [MINOR][SQL][DOCS] failOnDataLoss has effect on batch queries so fix the doc
## What changes were proposed in this pull request?

According to the [Kafka integration document](https://spark.apache.org/docs/2.4.0/structured-streaming-kafka-integration.html) `failOnDataLoss` has effect only on streaming queries. While I was implementing the DSv2 Kafka batch sources I realized this is not true. This feature is covered in [KafkaDontFailOnDataLossSuite](54da3bbfb2/external/kafka-0-10-sql/src/test/scala/org/apache/spark/sql/kafka010/KafkaDontFailOnDataLossSuite.scala (L180)).

In this PR I've updated the doc to reflect this behavior.
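
A minimal sketch of what this means in practice (broker and topic names are placeholders) — the option applies to a batch read as well:

```Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-batch-read").getOrCreate()

// failOnDataLoss applies to this batch query too, not only to streaming ones
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .option("failOnDataLoss", "false") // tolerate deleted topics / out-of-range offsets
  .load()
```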

## How was this patch tested?

```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #24932 from gaborgsomogyi/failOnDataLoss.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-06-23 19:23:57 -05:00
Gabor Somogyi efa303581a [SPARK-27687][SS] Rename Kafka consumer cache capacity conf and document caching
## What changes were proposed in this pull request?

Kafka-related Spark parameters have to start with `spark.kafka.` and not with `spark.sql.`. Because of this I've renamed `spark.sql.kafkaConsumerCache.capacity`.

Since Kafka consumer caching was not documented, I've added documentation for it as well.
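
For illustration, assuming the renamed property is `spark.kafka.consumer.cache.capacity` (the value below is arbitrary):

```Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // replaces the old spark.sql.kafkaConsumerCache.capacity setting
  .config("spark.kafka.consumer.cache.capacity", "128")
  .getOrCreate()
```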

## How was this patch tested?

Existing + added unit test.

```
cd docs
SKIP_API=1 jekyll build
```
and manual webpage check.

Closes #24590 from gaborgsomogyi/SPARK-27687.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-05-15 10:42:09 -07:00
Gabor Somogyi 2f55809425 [SPARK-27294][SS] Add multi-cluster Kafka delegation token
## What changes were proposed in this pull request?

The current implementation doesn't support multi-cluster Kafka connections with delegation tokens. In this PR I've added this functionality.

What this PR contains:
* New way of configuration (see the sketch after this list)
* Multiple delegation token obtain/store/use functionality
* Documentation
* The change works on DStreams also
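
A sketch of the new per-cluster configuration style, assuming property names of the form `spark.kafka.clusters.${cluster}.*` (cluster identifiers and values below are illustrative):

```Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // cluster in realm1, used as the source
  .config("spark.kafka.clusters.realm1.auth.bootstrap.servers", "realm1-broker:9093")
  .config("spark.kafka.clusters.realm1.security.protocol", "SASL_SSL")
  // cluster in realm2, used as the sink
  .config("spark.kafka.clusters.realm2.auth.bootstrap.servers", "realm2-broker:9093")
  .config("spark.kafka.clusters.realm2.security.protocol", "SASL_SSL")
  .getOrCreate()
```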

## How was this patch tested?

Existing + additional unit tests.
Additionally tested on cluster.

Test scenario:

* Two 4-node clusters
* The two sets of 4 nodes are in different Kerberos realms
* Cross-Realm trust between the 2 realms
* Yarn
* Kafka broker version 2.1.0
* security.protocol = SASL_SSL
* sasl.mechanism = SCRAM-SHA-512
* Artificial exceptions during processing
* Source reads from realm1, sink writes to realm2

Kafka broker settings:

* delegation.token.expiry.time.ms=600000 (10 min)
* delegation.token.max.lifetime.ms=1200000 (20 min)
* delegation.token.expiry.check.interval.ms=300000 (5 min)

Closes #24305 from gaborgsomogyi/SPARK-27294.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-05-07 11:40:43 -07:00
Sean Owen 754f820035 [SPARK-26918][DOCS] All .md should have ASF license header
## What changes were proposed in this pull request?

Add AL2 license to metadata of all .md files.
This seemed to be the tidiest way as it will get ignored by .md renderers and other tools. Attempts to write them as markdown comments revealed that there is no such standard thing.

## How was this patch tested?

Doc build

Closes #24243 from srowen/SPARK-26918.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-03-30 19:49:45 -05:00
Gabor Somogyi b99610e9ed [SPARK-26592][SS][DOC] Add Kafka proxy user caveat to documentation
## What changes were proposed in this pull request?

Since this caveat was added to the DStreams documentation, it would be good to add it to Structured Streaming as well.

## How was this patch tested?

cd docs/
SKIP_API=1 jekyll build
Manual webpage check.

Closes #23974 from gaborgsomogyi/SPARK-26592_.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-03-05 09:58:51 -08:00
Gabor Somogyi 59eb34b82c [SPARK-26889][SS][DOCS] Fix timestamp type in Structured Streaming + Kafka Integration Guide
## What changes were proposed in this pull request?

```
$ spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:3.0.0-SNAPSHOT
...
scala> val df = spark.read.format("kafka").option("kafka.bootstrap.servers", "foo").option("subscribe", "bar").load().printSchema()
root
 |-- key: binary (nullable = true)
 |-- value: binary (nullable = true)
 |-- topic: string (nullable = true)
 |-- partition: integer (nullable = true)
 |-- offset: long (nullable = true)
 |-- timestamp: timestamp (nullable = true)
 |-- timestampType: integer (nullable = true)

df: Unit = ()
```
In the doc the timestamp type is `long`; in this PR I've changed it to `timestamp`.

## How was this patch tested?

cd docs/
SKIP_API=1 jekyll build
Manual webpage check.

Closes #23796 from gaborgsomogyi/SPARK-26889.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2019-02-18 17:22:06 +08:00
Shixiong Zhu bafc7ac025
[SPARK-26350][SS] Allow to override group id of the Kafka consumer
## What changes were proposed in this pull request?

This PR allows the user to override `kafka.group.id` for better monitoring or security. The user needs to make sure there are not multiple queries or sources using the same group id.

It also fixes a bug that the `groupIdPrefix` option cannot be retrieved.
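
A short sketch of the override, assuming an existing `spark` session (broker, topic and group id are placeholders):

```Scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  // overrides the automatically generated group id; must not clash with other queries
  .option("kafka.group.id", "my-monitoring-group")
  .load()
```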

## How was this patch tested?

The newly added unit tests.

Closes #23301 from zsxwing/SPARK-26350.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-01-14 13:37:24 -08:00
Gabor Somogyi 6daa783094 [SPARK-26322][SS] Add spark.kafka.sasl.token.mechanism to ease delegation token configuration.
## What changes were proposed in this pull request?

When a Kafka delegation token is obtained, the SCRAM `sasl.mechanism` has to be configured for authentication. This can be configured on the related source/sink, which is inconvenient from the user's perspective. Such granularity is not required and this configuration can be handled with one central parameter.

In this PR `spark.kafka.sasl.token.mechanism` is added to configure this centrally (default: `SCRAM-SHA-512`).
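
For example, the mechanism could be switched centrally like this (a sketch; `SCRAM-SHA-256` is just an illustrative alternative to the default):

```Scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // applies to every Kafka source/sink that authenticates with the delegation token
  .config("spark.kafka.sasl.token.mechanism", "SCRAM-SHA-256")
  .getOrCreate()
```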

## How was this patch tested?

Existing unit tests + on cluster.

Closes #23274 from gaborgsomogyi/SPARK-26322.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2018-12-12 16:45:50 -08:00
Gabor Somogyi b14a26ee57 [SPARK-26236][SS] Add kafka delegation token support documentation.
## What changes were proposed in this pull request?

Kafka delegation token support was implemented in [PR#22598](https://github.com/apache/spark/pull/22598), but that didn't contain documentation because of rapid changes. Now that it has been merged, I've documented it in this PR.

## How was this patch tested?
jekyll build + manual html check

Closes #23195 from gaborgsomogyi/SPARK-26236.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2018-12-06 10:59:20 -08:00
Anastasios Zouzias 2512a1d429 [SPARK-26121][STRUCTURED STREAMING] Allow users to define prefix of Kafka's consumer group (group.id)
## What changes were proposed in this pull request?

Allow the Spark Structured Streaming user to specify the prefix of the consumer group (group.id), instead of forcing consumer group ids of the form `spark-kafka-source-*`.
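
A sketch of how the option might be used, assuming an existing `spark` session (names are placeholders):

```Scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  // generated group ids now start with "my-app" instead of "spark-kafka-source"
  .option("groupIdPrefix", "my-app")
  .load()
```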

## How was this patch tested?

Unit tests provided by Spark (backwards compatible change, i.e., user can optionally use the functionality)

`mvn test -pl external/kafka-0-10`

Closes #23103 from zouzias/SPARK-26121.

Authored-by: Anastasios Zouzias <anastasios@sqooba.io>
Signed-off-by: cody koeninger <cody@koeninger.org>
2018-11-26 11:10:38 -06:00
Daniel Sakuma 6ade5cbb49 [MINOR][DOC] Fix some typos and grammar issues
## What changes were proposed in this pull request?

Easy fix in the documentation.

## How was this patch tested?

N/A

Closes #20948

Author: Daniel Sakuma <dsakuma@gmail.com>

Closes #20928 from dsakuma/fix_typo_configuration_docs.
2018-04-06 13:37:08 +08:00
Shashwat Anand 84a076e0e9 [SPARK-23165][DOC] Spelling mistake fix in quick-start doc.
## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.

Author: Shashwat Anand <me@shashwat.me>

Closes #20336 from ashashwat/SPARK-23165.
2018-01-20 14:34:37 -08:00
brandonJY 6121e91b7f [DOCS] change to dataset for java code in structured-streaming-kafka-integration document
## What changes were proposed in this pull request?

In the latest structured-streaming-kafka-integration document, the Java code example for Kafka integration uses `DataFrame<Row>`; shouldn't it be changed to `Dataset<Row>`?

## How was this patch tested?

A manual test was performed on the updated example Java code in Spark 2.2.1 with Kafka 1.0.

Author: brandonJY <brandonJY@users.noreply.github.com>

Closes #20312 from brandonJY/patch-2.
2018-01-18 18:57:49 -06:00
Liang-Chi Hsieh c57dfaef30 [MINOR][SS][DOCS] Minor doc change for kafka integration
## What changes were proposed in this pull request?

Minor change to kafka integration document for structured streaming.

## How was this patch tested?

N/A, doc change only.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18550 from viirya/minor-ss-kafka-doc.
2017-07-21 08:43:38 +01:00
Tyson Condie c2d1761a57 [SPARK-19906][SS][DOCS] Documentation describing how to write queries to Kafka
## What changes were proposed in this pull request?

Add documentation that describes how to write streaming and batch queries to Kafka.
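
A brief sketch of the streaming case covered by the new documentation, assuming `df` is a streaming Dataset with key/value columns (broker, topic and checkpoint path are placeholders):

```Scala
// the Kafka sink expects a "value" column (and optionally "key" and "topic")
val query = df
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("topic", "output-topic")
  .option("checkpointLocation", "/path/to/checkpoint")
  .start()
```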

zsxwing tdas

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tyson Condie <tcondie@gmail.com>

Closes #17246 from tcondie/kafka-write-docs.
2017-03-20 17:18:59 -07:00
Tyson Condie 447b2b5309 [SPARK-19584][SS][DOCS] update structured streaming documentation around batch mode
## What changes were proposed in this pull request?

Revision to structured-streaming-kafka-integration.md to reflect new Batch query specification and options.
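
As a sketch, a batch query bounded by explicit offset ranges might look like this, assuming an existing `spark` session (placeholders throughout):

```Scala
val batchDf = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
```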

zsxwing tdas

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tyson Condie <tcondie@gmail.com>

Closes #16918 from tcondie/kafka-docs.
2017-02-14 18:50:14 -08:00
manugarri 5a0569ce69 [MINOR][DOC] Remove parenthesis in readStream() on kafka structured streaming doc
There is a typo in http://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#creating-a-kafka-source-stream , where Python example 1 uses `readStream()` instead of `readStream`.

Just removed the parenthesis.

Author: manugarri <manuel.garrido.pena@gmail.com>

Closes #16836 from manugarri/fix_kafka_python_doc.
2017-02-07 21:45:57 -08:00
Niranjan Padmanabhan a1e40b1f5d
[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Liwei Lin 3e01f12828
[DOC][MINOR] Kafka doc: breakup into lines
## Before

![before](https://cloud.githubusercontent.com/assets/15843379/20340231/99b039fe-ac1b-11e6-9ba9-b44582427459.png)

## After

![after](https://cloud.githubusercontent.com/assets/15843379/20340236/9d5796e2-ac1b-11e6-92bb-6da40ba1a383.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15903 from lw-lin/kafka-doc-lines.
2016-11-16 09:51:59 +00:00
Liwei Lin 98ede49496
[SPARK-18198][DOC][STREAMING] Highlight code snippets
## What changes were proposed in this pull request?

This patch uses `{% highlight lang %}...{% endhighlight %}` to highlight code snippets in the `Structured Streaming Kafka010 integration doc` and the `Spark Streaming Kafka010 integration doc`.

This patch consists of two commits:
- the first commit fixes only the leading spaces -- this is large
- the second commit adds the highlight instructions -- this is much simpler and easier to review

## How was this patch tested?

SKIP_API=1 jekyll build

## Screenshots

**Before**

![snip20161101_3](https://cloud.githubusercontent.com/assets/15843379/19894258/47746524-a087-11e6-9a2a-7bff2d428d44.png)

**After**

![snip20161101_1](https://cloud.githubusercontent.com/assets/15843379/19894324/8bebcd1e-a087-11e6-835b-88c4d2979cfa.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #15715 from lw-lin/doc-highlight-code-snippet.
2016-11-02 09:10:34 +00:00
cody koeninger 1042325805 [SPARK-17813][SQL][KAFKA] Maximum data per trigger
## What changes were proposed in this pull request?

maxOffsetsPerTrigger option for rate limiting, proportionally based on volume of different topicpartitions.
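
A minimal sketch of the option in use, assuming an existing `spark` session (values are placeholders):

```Scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  // cap each micro-batch at roughly 10000 offsets, split proportionally across partitions
  .option("maxOffsetsPerTrigger", "10000")
  .load()
```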

## How was this patch tested?

Added unit test

Author: cody koeninger <cody@koeninger.org>

Closes #15527 from koeninger/SPARK-17813.
2016-10-27 10:30:59 -07:00
cody koeninger 268ccb9a48 [SPARK-17812][SQL][KAFKA] Assign and specific startingOffsets for structured stream
## What changes were proposed in this pull request?

startingOffsets takes specific per-topicpartition offsets as a json argument, usable with any consumer strategy

assign with specific topicpartitions as a consumer strategy
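
A sketch combining both pieces — explicit partition assignment plus JSON starting offsets, where -2 means earliest and -1 means latest (topics and offsets are placeholders):

```Scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  // explicit topic-partitions instead of subscribe/subscribePattern
  .option("assign", """{"topic1":[0,1],"topic2":[2]}""")
  // per-partition starting offsets as JSON
  .option("startingOffsets", """{"topic1":{"0":23,"1":-2},"topic2":{"2":-1}}""")
  .load()
```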

## How was this patch tested?

Unit tests

Author: cody koeninger <cody@koeninger.org>

Closes #15504 from koeninger/SPARK-17812.
2016-10-21 15:55:04 -07:00
Shixiong Zhu 9293734d35 [SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?

This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of work and part of them was inspired by koeninger's work.

### Introduction

The Kafka source is a Structured Streaming data source that polls data from Kafka. The schema of the data read is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with topic deletion. However, the user should make sure no Spark job is processing the data when a topic is deleted.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLoss | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe to. Only one of "subscribe" and "subscribePattern" options can be specified for the Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of "subscribe" and "subscribePattern" options can be specified for the Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets

Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
2016-10-05 16:45:45 -07:00