Shixiong Zhu 9293734d35 [SPARK-17346][SQL] Add Kafka source for Structured Streaming
## What changes were proposed in this pull request?

This PR adds a new project, `external/kafka-0-10-sql`, for the Structured Streaming Kafka source.

It's based on the design doc:

tdas did most of the work, and part of it was inspired by koeninger's work.

### Introduction

The Kafka source is a Structured Streaming data source that polls data from Kafka. The schema of the data it reads is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int
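
Because `key` and `value` arrive as raw bytes, code that consumes these rows usually has to decode them itself. The following is a minimal Java sketch, not part of this PR: `KafkaRecordRow` is a hypothetical holder class that mirrors the columns above, and decoding as UTF-8 assumes the producer wrote UTF-8 strings.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical holder mirroring the source's output columns; not a Spark class.
final class KafkaRecordRow {
    final byte[] key;        // binary; null when the record has no key
    final byte[] value;      // binary
    final String topic;
    final int partition;
    final long offset;
    final long timestamp;
    final int timestampType;

    KafkaRecordRow(byte[] key, byte[] value, String topic, int partition,
                   long offset, long timestamp, int timestampType) {
        this.key = key;
        this.value = value;
        this.topic = topic;
        this.partition = partition;
        this.offset = offset;
        this.timestamp = timestamp;
        this.timestampType = timestampType;
    }

    // Assumption: the payload is UTF-8 text. Other encodings need their own decoder.
    static String decodeUtf8(byte[] bytes) {
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    String keyAsString() {
        return decodeUtf8(key);
    }

    String valueAsString() {
        return decodeUtf8(value);
    }
}
```

Keeping the columns as `byte[]` and decoding at the edge leaves the choice of deserializer to the caller.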

The source can handle topic deletion. However, the user should make sure that no Spark job is processing the data when a topic is deleted.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The starting point when a query is started: either "earliest", which starts from the earliest offset, or "latest", which starts from the latest offset. Note: this applies only when a new streaming query is started; a resumed query always picks up from where it left off.
failOnDataLoss | [true, false] | true | Whether to fail the query when data may have been lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it if it does not work as you expect.
subscribe | A comma-separated list of topics | (none) | The list of topics to subscribe to. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of the "subscribe" and "subscribePattern" options can be specified for the Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors.
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets.
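
The rules in the table (defaults, allowed values, and the mutual exclusion of `subscribe` and `subscribePattern`) can be sketched as a small validator. This is an illustrative Java sketch only: `KafkaSourceOptionsSketch` and its methods are hypothetical names, not Spark's internals, and it looks keys up case-sensitively for simplicity. Rejecting a configuration that sets neither subscription option is an assumption; the table only says that both cannot be set together.

```java
import java.util.Map;

// Hypothetical helper; the names below are illustrative, not Spark's internals.
final class KafkaSourceOptionsSketch {

    // Returns the name of the subscription option that is in use.
    static String subscriptionOption(Map<String, String> options) {
        boolean hasTopics = options.containsKey("subscribe");
        boolean hasPattern = options.containsKey("subscribePattern");
        if (hasTopics && hasPattern) {
            throw new IllegalArgumentException(
                "Only one of 'subscribe' and 'subscribePattern' can be specified");
        }
        if (!hasTopics && !hasPattern) {
            // Assumption: a source with no subscription is treated as an error.
            throw new IllegalArgumentException(
                "One of 'subscribe' or 'subscribePattern' is required");
        }
        return hasTopics ? "subscribe" : "subscribePattern";
    }

    // Falls back to the documented default "latest" and rejects other values.
    static String startingOffset(Map<String, String> options) {
        String value = options.getOrDefault("startingOffset", "latest");
        if (!value.equals("earliest") && !value.equals("latest")) {
            throw new IllegalArgumentException("Unsupported startingOffset: " + value);
        }
        return value;
    }

    // Falls back to the documented default of 512 milliseconds.
    static long pollTimeoutMs(Map<String, String> options) {
        return Long.parseLong(
            options.getOrDefault("kafka.consumer.poll.timeoutMs", "512"));
    }
}
```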

Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`.
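
One way to picture the prefix convention is a function that selects the `kafka.`-prefixed options and strips the prefix before handing them to the Kafka consumer. The Java sketch below is an assumption about the general shape, not the code in this PR; `KafkaPrefixSketch` is a hypothetical name. It also forwards every `kafka.`-prefixed key without special cases, so a source-level option such as `kafka.consumer.poll.timeoutMs` would be passed along too; how the real source treats that key is not stated here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper showing the "kafka." prefix convention.
final class KafkaPrefixSketch {
    private static final String PREFIX = "kafka.";

    // Keeps only keys that start with "kafka." and removes that prefix.
    static Map<String, String> kafkaParams(Map<String, String> options) {
        Map<String, String> params = new HashMap<>();
        for (Map.Entry<String, String> entry : options.entrySet()) {
            String key = entry.getKey();
            if (key.startsWith(PREFIX)) {
                params.put(key.substring(PREFIX.length()), entry.getValue());
            }
        }
        return params;
    }
}
```

With the options `kafka.bootstrap.servers` and `subscribe` set, only `bootstrap.servers` ends up in the returned map.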

### Usage

* Subscribe to one topic
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")

* Subscribe to multiple topics
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")

* Subscribe to a pattern
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
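
To make the difference between the subscription styles above concrete, the Java sketch below shows how a comma-separated `subscribe` value could be split into topic names and which topic names a `subscribePattern` regex would select. `TopicPatternSketch` is a hypothetical helper, not part of this PR. It assumes that a topic is selected only when the whole name matches the regex (`Matcher.matches`), and that whitespace around commas is ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper illustrating the two subscription styles.
final class TopicPatternSketch {

    // Splits a "subscribe" value such as "topic1,topic2" into topic names.
    // Assumption: surrounding whitespace is trimmed and empty entries are dropped.
    static List<String> splitTopics(String csv) {
        List<String> topics = new ArrayList<>();
        for (String part : csv.split(",")) {
            String topic = part.trim();
            if (!topic.isEmpty()) {
                topics.add(topic);
            }
        }
        return topics;
    }

    // Returns the topics whose full name matches the given Java regex.
    static List<String> matchingTopics(String regex, List<String> topics) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String topic : topics) {
            if (pattern.matcher(topic).matches()) {
                matched.add(topic);
            }
        }
        return matched;
    }
}
```

Under the full-match assumption, `topic.*` selects `topic1`, `topic2`, and `topic`, but not `mytopic`.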

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <>
Author: Tathagata Das <>
Author: cody koeninger <>

Closes #15102 from zsxwing/kafka-source.
2016-10-05 16:45:45 -07:00


<?xml version="1.0" encoding="UTF-8"?>
~ Licensed to the Apache Software Foundation (ASF) under one or more
~ contributor license agreements. See the NOTICE file distributed with
~ this work for additional information regarding copyright ownership.
~ The ASF licenses this file to You under the Apache License, Version 2.0
~ (the "License"); you may not use this file except in compliance with
~ the License. You may obtain a copy of the License at
~ Unless required by applicable law or agreed to in writing, software
~ distributed under the License is distributed on an "AS IS" BASIS,
~ See the License for the specific language governing permissions and
~ limitations under the License.
<project xmlns="" xmlns:xsi=""
<name>Spark Project Parent POM</name>
<name>Apache 2.0 License</name>
<name>Matei Zaharia</name>
<organization>Apache Software Foundation</organization>
<name>Dev Mailing List</name>
<name>User Mailing List</name>
<name>Commits Mailing List</name>
<!-- Version used in Maven Hive dependency -->
<!-- Version used for internal directory structure -->
<!-- the producer is used in tests -->
<!-- org.apache.httpcomponents/httpclient-->
<!-- commons-httpclient/commons-httpclient-->
<!-- managed up from 3.2.1 for SPARK-11652 -->
<!-- org.apache.commons/commons-lang/-->
<!-- org.apache.commons/commons-lang3/-->
<!-- When using different JDKs for the build, we can't use Zinc for the jdk8 part. -->
<!-- Package to use when relocating shaded classes. -->
<!-- Modules that copy jars to the build directory should do so under this location. -->
<!-- Allow modules to enable / disable certain build plugins easily. -->
Dependency scopes that can be overridden by enabling certain profiles. These profiles are
declared in the projects that build assemblies.
For other projects the scope should remain as "compile", otherwise they are not available
during compilation if the dependency is transitive (e.g. "graphx/" depending on "core/" and
needing Hadoop classes in the classpath to compile).
Overridable test home. So that you can call individual pom files directly without
things breaking.
<!-- This should be at top, it makes maven try the central repo first and then others and hence faster dep resolution -->
<name>Maven Repository</name>
This is a dummy dependency that is used to trigger the maven-shade plugin so that Spark's
published POMs are flattened and do not contain variables. Without this dependency, some
subprojects' published POMs would contain variables like ${scala.binary.version} that will
be substituted according to the default properties instead of the ones determined by the
profiles that were active during publishing, causing the Scala 2.10 build's POMs to have 2.11
dependencies due to the incorrect substitutions. By ensuring that maven-shade runs for all
subprojects, we eliminate this problem because the substitutions are baked into the final POM.
For more details, see SPARK-3812 and MNG-2971.
This is needed by the scalatest plugin, and so is declared here to be available in
all child modules, just as scalatest is run in all children
<!-- This artifact is a shaded version of ASM 5.0.4. The POM that was used to produce this
is at
For context on why we shade ASM, see SPARK-782 and SPARK-6152. -->
<!-- Shaded deps marked as provided. These are promoted to compile scope
in the modules where we want the shaded classes to appear in the
associated jar. -->
<!-- End of shaded deps -->
<!-- Added for selenium only, and should match its dependent version: -->
<!-- <scope>runtime</scope> --> <!-- more correct, but scalac 2.10.3 doesn't like it -->
<!-- Only HyperLogLogPlus is used, which doesn't depend on fastutil -->
<!-- In theory we need not directly depend on protobuf since Spark does not directly
use it. However, when building with Hadoop/YARN 2.2 Maven doesn't correctly bump
the protobuf version up from the one Mesos gives. For now we include this variable
to explicitly bump the version when building with YARN. It would be nice to figure
out why Maven can't resolve this correctly (like SBT does). -->
<!-- Guava is excluded because of SPARK-6149. The Guava version referenced in this module is
15.0, which causes runtime incompatibility issues. -->
<!-- This is included as a compile-scoped dependency by jtransforms, which is
a dependency of breeze. -->
<version>1.12.5</version> <!-- 1.13.0 appears incompatible with scalatest 2.2.6 -->
<!-- avro-mapred for some reason depends on avro-ipc's test jar, so undo that. -->
<!-- See SPARK-1556 for info on this dependency: -->
<!-- pull this in when needed; the explicit definition culls the surplus -->
<!-- break the loop -->
<!-- excluded dependencies & transitive.
Some may be needed to be explicitly included-->
<!-- this is needed and must be explicitly included later-->
<!-- hive shims pulls in hive 0.23 and a transitive dependency of the Hadoop version
Hive was built against. This dependency cuts out the YARN/hadoop dependency, which
is needed by Hive to submit work to a YARN cluster.-->
<!-- hsqldb interferes with the use of derby as the default db
in hive's use of datanucleus. -->
Akka depends on io.netty:netty, which puts classes under the org.jboss.netty
package. This conflicts with the classes in org.jboss.netty:netty
artifact, so we have to ban that artifact here. In Netty 4.x, the classes
are under the io.netty package, so it's fine for us to depend on both
io.netty:netty and io.netty:netty-all.
<!-- Surefire runs all Java tests -->
<!-- Note config is repeated in scalatest config -->
<argLine>-Xmx3g -Xss4096k -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=512m</argLine>
Setting SPARK_DIST_CLASSPATH is a simple way to make sure any child processes
launched by the tests have access to the correct test-time classpath.
<!-- Needed by sql/hive tests. -->
<!-- Scalatest runs all Scala tests -->
<!-- Note config is repeated in surefire config -->
<argLine>-ea -Xmx3g -XX:MaxPermSize=${MaxPermGen} -XX:ReservedCodeCacheSize=${CodeCacheSize}</argLine>
Setting SPARK_DIST_CLASSPATH is a simple way to make sure any child processes
launched by the tests have access to the correct test-time classpath.
<!-- Needed by sql/hive tests. -->
<!-- This includes dependencies with 'runtime' and 'compile' scopes;
see the docs for includeScope for more details -->
<!-- This plugin's configuration is used to store Eclipse m2e settings only. -->
<!-- It has no influence on the Maven build itself. -->
<!-- This plugin dumps the test classpath into a file -->
The shade plug-in is used here to create effective pom's (see SPARK-3812), and also
remove references from the shaded libraries from artifacts published by Spark.
<mkdir dir="${}/tmp" />
<!-- Enable surefire and scalatest in all children, in one place: -->
<!-- Build test-jar's for all projects, since some projects depend on tests from others -->
This profile is enabled automatically by the sbt build. It changes the scope for the guava
dependency, since we don't shade it in the artifacts generated by the sbt build.
<!-- Ganglia integration is not included by default due to LGPL-licensed code -->
<!-- Kinesis integration is not included by default due to ASL-licensed code -->
<additionalparam>-Xdoclint:all -Xdoclint:-missing</additionalparam>
<!-- A series of build profiles where customizations for particular Hadoop releases can be made -->
<!-- Hadoop-a.b.c dependencies can be found at
<!-- SPARK-7249: Default hadoop profile. Uses global properties. -->
<excludes combine.children="append">
<compilerArgs combine.children="append">
<!-- Note: -javabootclasspath is set on a per-execution basis rather than as a
plugin-wide configuration because doc-jar generation will break if it's
set; see SPARK-15839 for more details -->
<args combine.children="append">
<args combine.children="append">
<excludes combine.children="append">
These empty profiles are available in some sub-modules. Declare them here so that
maven does not complain when they're provided on the command line for a sub-module
that does not have them.