---
layout: global
title: Spark Streaming + Kafka Integration Guide
---
Apache Kafka is publish-subscribe messaging rethought as a distributed, partitioned, replicated commit log service. Here we explain how to configure Spark Streaming to receive data from Kafka.
- **Linking:** In your SBT/Maven project definition, link your streaming application against the following artifact (see the Linking section in the main programming guide for further information).

  ```
  groupId = org.apache.spark
  artifactId = spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}
  version = {{site.SPARK_VERSION_SHORT}}
  ```
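  For SBT, these coordinates reduce to a single dependency line. A minimal sketch, assuming (for illustration only) Spark 1.1.0; substitute the Spark and Scala versions of your own deployment:

  ```scala
  // build.sbt (sketch): %% appends the Scala binary version to the artifact
  // id (e.g. spark-streaming-kafka_2.10); "1.1.0" is an illustrative version.
  libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
  ```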
- **Programming:** In the streaming application code, import `KafkaUtils` and create an input DStream as follows (a fuller sketch appears after the points below).

  ```scala
  import org.apache.spark.streaming.kafka._

  val kafkaStream = KafkaUtils.createStream(streamingContext,
      [zookeeperQuorum], [group id of the consumer], [per-topic number of Kafka partitions to consume])
  ```

  Points to remember:
  - Topic partitions in Kafka do not correlate with the partitions of the RDDs generated in Spark Streaming. So increasing the number of topic-specific partitions in `KafkaUtils.createStream()` only increases the number of threads within a single receiver that consume those topics; it does not increase Spark's parallelism in processing the data. Refer to the main programming guide for more information on that.

  - Multiple Kafka input DStreams can be created with different groups and topics, for parallel receiving of data using multiple receivers; the sketch below shows this pattern.
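  As an end-to-end illustration, here is a minimal sketch that counts words arriving on one topic using several receivers; the application name, ZooKeeper quorum `zkhost:2181`, group id `my-consumer-group`, topic `pageviews`, and receiver count are placeholder values chosen for this example, not defaults of any kind.

  ```scala
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  // Brings in the implicit conversions for pair DStream operations
  // such as reduceByKey (needed in Spark 1.x).
  import org.apache.spark.streaming.StreamingContext._
  import org.apache.spark.streaming.kafka._

  val conf = new SparkConf().setAppName("KafkaWordCount")
  val ssc = new StreamingContext(conf, Seconds(2))

  // For higher receiving parallelism, create several input DStreams in the
  // same consumer group and union them into a single DStream. Each call to
  // createStream sets up one receiver; the Map value is the number of
  // consumer threads per topic within that receiver.
  val numReceivers = 3  // illustrative
  val kafkaStreams = (1 to numReceivers).map { _ =>
    KafkaUtils.createStream(ssc, "zkhost:2181", "my-consumer-group", Map("pageviews" -> 1))
  }
  val unifiedStream = ssc.union(kafkaStreams)

  // createStream yields (key, message) pairs; count words in the messages.
  unifiedStream.map(_._2)
    .flatMap(_.split(" "))
    .map(word => (word, 1L))
    .reduceByKey(_ + _)
    .print()

  ssc.start()
  ssc.awaitTermination()
  ```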
- **Deploying:** Package `spark-streaming-kafka_{{site.SCALA_BINARY_VERSION}}` and its dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}`, which are provided by `spark-submit`) into the application JAR. Then use `spark-submit` to launch your application (see the Deploying section in the main programming guide).
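  One way to satisfy this packaging rule with SBT is the sbt-assembly plugin, which by default leaves dependencies marked `provided` out of the assembly JAR. A sketch, with illustrative version numbers:

  ```scala
  // build.sbt (sketch): spark-core and spark-streaming are marked "provided"
  // because spark-submit supplies them at runtime, while spark-streaming-kafka
  // and its remaining dependencies get bundled into the assembly JAR.
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"            % "1.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
  )
  ```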