```scala
import org.apache.spark.streaming.flume._

val flumeStream = FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port])
```
See the [API docs](api/scala/index.html#org.apache.spark.streaming.flume.FlumeUtils$)
and the [example]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/streaming/FlumeEventCount.scala).
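For context, here is a more complete sketch built around the Scala snippet above. The application name, hostname (`"receiver-host"`), port (`9999`), and 2-second batch interval are all illustrative assumptions, not values prescribed by this guide.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume._

object FlumeEventCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("FlumeEventCountSketch")
    val ssc = new StreamingContext(conf, Seconds(2))  // 2-second batches (illustrative)

    // Receive events pushed by Flume's Avro sink; "receiver-host" and 9999
    // are placeholders for the chosen machine's hostname and port.
    val flumeStream = FlumeUtils.createStream(ssc, "receiver-host", 9999)

    // Count the events received in each batch and print the count.
    flumeStream.count().map(cnt => s"Received $cnt flume events.").print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```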
```java
import org.apache.spark.streaming.flume.*;

JavaReceiverInputDStream<SparkFlumeEvent> flumeStream =
    FlumeUtils.createStream(streamingContext, [chosen machine's hostname], [chosen port]);
```
See the [API docs](api/java/index.html?org/apache/spark/streaming/flume/FlumeUtils.html)
and the [example]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples/streaming/JavaFlumeEventCount.java).
Note that the hostname should be the same as the one used by the resource manager in the
cluster (Mesos, YARN or Spark Standalone), so that resource allocation can match the names and launch
the receiver on the right machine.
3. **Deploying:** Package `spark-streaming-flume_{{site.SCALA_BINARY_VERSION}}` and its dependencies (except `spark-core_{{site.SCALA_BINARY_VERSION}}` and `spark-streaming_{{site.SCALA_BINARY_VERSION}}` which are provided by `spark-submit`) into the application JAR. Then use `spark-submit` to launch your application (see [Deploying section](streaming-programming-guide.html#deploying-applications) in the main programming guide).
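For instance, if the application is built with sbt, the dependency scoping might look like the following sketch (sbt, sbt-assembly, and the exact coordinates are assumptions of this example, not requirements):

```scala
// build.sbt (sketch): spark-core and spark-streaming are "provided" by
// spark-submit at runtime; spark-streaming-flume must be bundled into
// the application JAR (e.g. via sbt-assembly).
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "{{site.SPARK_VERSION_SHORT}}" % "provided",
  "org.apache.spark" %% "spark-streaming" % "{{site.SPARK_VERSION_SHORT}}" % "provided",
  "org.apache.spark" %% "spark-streaming-flume" % "{{site.SPARK_VERSION_SHORT}}"
)
```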
## Approach 2: Pull-based Approach using a Custom Sink
Instead of Flume pushing data directly to Spark Streaming, this approach runs a custom Flume sink that allows the following.
- Flume pushes data into the sink, and the data stays buffered.
- Spark Streaming uses a [reliable Flume receiver](streaming-programming-guide.html#receiver-reliability)
and transactions to pull data from the sink. Transactions succeed only after data is received and
replicated by Spark Streaming.
This ensures stronger reliability and
[fault-tolerance guarantees](streaming-programming-guide.html#fault-tolerance-semantics)
than the previous approach. However, this requires configuring Flume to run a custom sink.
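On the Spark Streaming side, this approach uses `FlumeUtils.createPollingStream` rather than `createStream` to pull data from the sink. A minimal Scala sketch, where `"sink-host"` and `9999` are placeholders for the chosen machine and configured port:

```scala
import org.apache.spark.streaming.flume._

// Pull events from the custom SparkSink running in the Flume agent;
// "sink-host" and 9999 stand for the sink machine's hostname and port.
val flumeStream = FlumeUtils.createPollingStream(streamingContext, "sink-host", 9999)
```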
Here are the configuration steps.
#### General Requirements
Choose a machine that will run the custom sink in a Flume agent. The rest of the Flume pipeline is configured to send data to that agent. Machines in the Spark cluster should have access to the chosen machine running the custom sink.
#### Configuring Flume
Configuring Flume on the chosen machine requires the following two steps.
1. **Sink JARs**: Add the following JARs to Flume's classpath (see [Flume's documentation](https://flume.apache.org/documentation.html) for how to do this) on the machine designated to run the custom sink.
(i) *Custom sink JAR*: Download the JAR corresponding to the following artifact (or [direct link](http://search.maven.org/remotecontent?filepath=org/apache/spark/spark-streaming-flume-sink_{{site.SCALA_BINARY_VERSION}}/{{site.SPARK_VERSION_SHORT}}/spark-streaming-flume-sink_{{site.SCALA_BINARY_VERSION}}-{{site.SPARK_VERSION_SHORT}}.jar)).
        groupId = org.apache.spark
        artifactId = spark-streaming-flume-sink_{{site.SCALA_BINARY_VERSION}}
        version = {{site.SPARK_VERSION_SHORT}}
(ii) *Scala library JAR*: Download the Scala library JAR for Scala {{site.SCALA_VERSION}}. It can be found with the following artifact detail (or, [direct link](http://search.maven.org/remotecontent?filepath=org/scala-lang/scala-library/{{site.SCALA_VERSION}}/scala-library-{{site.SCALA_VERSION}}.jar)).
        groupId = org.scala-lang
        artifactId = scala-library
        version = {{site.SCALA_VERSION}}
2. **Configuration file**: On that machine, configure the Flume agent to send data to the Spark sink by having the following in the configuration file.
        agent.sinks = spark
        agent.sinks.spark.type = org.apache.spark.streaming.flume.sink.SparkSink
        agent.sinks.spark.hostname = <hostname of the local machine>