---
layout: global
title: Spark Streaming + Kafka Integration Guide
---
Apache Kafka is publish-subscribe messaging rethought as a distributed, partitioned, replicated commit log service. Here we explain how to configure Spark Streaming to receive data from Kafka.
- **Linking:** In your SBT/Maven project definition, link your streaming application against the following artifact (see the Linking section in the main programming guide for further information).

  ```
  groupId = org.apache.spark
  artifactId = spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}
  version = {{site.SPARK_VERSION_SHORT}}
  ```
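  For SBT, these coordinates reduce to a single dependency line. A minimal sketch, assuming (for illustration only) Spark 1.1.0; substitute the Spark and Scala versions of your own deployment:

  ```scala
  // build.sbt (sketch): %% appends the Scala binary version to the artifact
  // id (e.g. spark-streaming-kafka_2.10); "1.1.0" is an illustrative version.
  libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
  ```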
- **Programming:** In the streaming application code, import `KafkaUtils` and create an input DStream as follows (a fuller sketch appears after the points below).

  ```scala
  import org.apache.spark.streaming.kafka._

  val kafkaStream = KafkaUtils.createStream(streamingContext,
      [zookeeperQuorum], [group id of the consumer], [per-topic number of Kafka partitions to consume])
  ```

  Points to remember:
  - Topic partitions in Kafka do not correlate with the partitions of the RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in `KafkaUtils.createStream()` only increases the number of threads within a single receiver that consume those topics; it does not increase Spark's parallelism in processing the data. Refer to the main programming guide for more information on that.

  - Multiple Kafka input DStreams can be created with different groups and topics, for parallel receiving of data using multiple receivers; the sketch below shows this pattern.
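  As an end-to-end illustration, here is a minimal sketch that counts words arriving on one topic using several receivers; the application name, ZooKeeper quorum `zkhost:2181`, group id `my-consumer-group`, topic `pageviews`, and receiver count are placeholder values chosen for this example, not defaults of any kind.

  ```scala
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  // Brings in the implicit conversions for pair DStream operations
  // such as reduceByKey (needed in Spark 1.x).
  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.kafka._

  val conf = new SparkConf().setAppName("KafkaWordCount")
  val ssc = new StreamingContext(conf, Seconds(2))

  // For higher receiving parallelism, create several input DStreams in the
  // same consumer group and union them into a single DStream. Each call to
  // createStream sets up one receiver; the Map value is the number of
  // consumer threads per topic within that receiver.
  val numReceivers = 3  // illustrative
  val kafkaStreams = (1 to numReceivers).map { _ =>
    KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", Map("pageviews" -> 1))
  }
  val unifiedStream = ssc.union(kafkaStreams)

  // createStream yields (key, message) pairs; count words in the messages.
  unifiedStream.map(_._2)
    .flatMap(_.split(" "))
    .map(word => (word, 1L))
    .reduceByKey(_ + _)
    .print()

  ssc.start()
  ssc.awaitTermination()
  ```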
- **Deploying:** Package `spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}` and its dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}`, which are provided by `spark-submit`) into the application JAR. Then use `spark-submit` to launch your application (see the Deploying section in the main programming guide).
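  One way to satisfy this packaging rule with SBT is the sbt-assembly plugin, which by default leaves dependencies marked `provided` out of the assembly JAR. A sketch, with illustrative version numbers:

  ```scala
  // build.sbt (sketch): spark-core and spark-streaming are marked "provided"
  // because spark-submit supplies them at runtime, while spark-streaming-kafka
  // and its remaining dependencies get bundled into the assembly JAR.
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"            % "1.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
  )
  ```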