spark-instrumented-optimizer/docs/structured-streaming-programming-guide.md

---
layout: global
displayTitle: Structured Streaming Programming Guide
title: Structured Streaming Programming Guide
---

* This will become a table of contents (this text will be scraped).
{:toc}

# Overview
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the [Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, *Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.*

Internally, by default, Structured Streaming queries are processed using a *micro-batch processing* engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. However, since Spark 2.3, we have introduced a new low-latency processing mode called **Continuous Processing**, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you will be able to choose the mode based on your application requirements. 

In this guide, we are going to walk you through the programming model and the APIs. We are going to explain the concepts mostly using the default micro-batch processing model, and then [later](#continuous-processing-experimental) discuss Continuous Processing model. First, let's start with a simple example of a Structured Streaming query - a streaming word count.

# Quick Example
Let’s say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Let’s see how you can express this using Structured Streaming. You can see the full code in
[Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java)/[Python]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming/structured_network_wordcount.py)/[R]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/r/streaming/structured_network_wordcount.R).
And if you [download Spark](https://spark.apache.org/downloads.html), you can directly [run the example](index.html#running-the-examples-and-shell). In any case, let’s walk through the example step-by-step and understand how it works. First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionalities related to Spark.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder
  .appName("StructuredNetworkWordCount")
  .getOrCreate()
  
import spark.implicits._
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.sql.*;
import org.apache.spark.sql.streaming.StreamingQuery;

import java.util.Arrays;
import java.util.Iterator;

SparkSession spark = SparkSession
  .builder()
  .appName("JavaStructuredNetworkWordCount")
  .getOrCreate();
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split

spark = SparkSession \
    .builder \
    .appName("StructuredNetworkWordCount") \
    .getOrCreate()
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
sparkR.session(appName = "StructuredNetworkWordCount")
{% endhighlight %}

</div>
</div>

Next, let’s create a streaming DataFrame that represents text data received from a server listening on localhost:9999, and transform the DataFrame to calculate word counts.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
// Create DataFrame representing the stream of input lines from connection to localhost:9999
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))

// Generate running word count
val wordCounts = words.groupBy("value").count()
{% endhighlight %}

This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a  Dataset of String using `.as[String]`, so that we can apply the `flatMap` operation to split each line into multiple words. The resultant `words` Dataset contains all the words. Finally, we have defined the `wordCounts` DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
// Create DataFrame representing the stream of input lines from connection to localhost:9999
Dataset<Row> lines = spark
  .readStream()
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load();

// Split the lines into words
Dataset<String> words = lines
  .as(Encoders.STRING())
  .flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(" ")).iterator(), Encoders.STRING());

// Generate running word count
Dataset<Row> wordCounts = words.groupBy("value").count();
{% endhighlight %}

This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a  Dataset of String using `.as(Encoders.STRING())`, so that we can apply the `flatMap` operation to split each line into multiple words. The resultant `words` Dataset contains all the words. Finally, we have defined the `wordCounts` DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# Split the lines into words
words = lines.select(
   explode(
       split(lines.value, " ")
   ).alias("word")
)

# Generate running word count
wordCounts = words.groupBy("word").count()
{% endhighlight %}

This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have used two built-in SQL functions - split and explode, to split each line into multiple rows with a word each. In addition, we use the function `alias` to name the new column as "word". Finally, we have defined the `wordCounts` DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
# Create DataFrame representing the stream of input lines from connection to localhost:9999
lines <- read.stream("socket", host = "localhost", port = 9999)

# Split the lines into words
words <- selectExpr(lines, "explode(split(value, ' ')) as word")

# Generate running word count
wordCounts <- count(group_by(words, "word"))
{% endhighlight %}

This `lines` SparkDataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have a SQL expression with two SQL functions - split and explode, to split each line into multiple rows with a word each. In addition, we name the new column as "word". Finally, we have defined the `wordCounts` SparkDataFrame by grouping by the unique values in the SparkDataFrame and counting them. Note that this is a streaming SparkDataFrame which represents the running word counts of the stream.

</div>
</div>

We have now set up the query on the streaming data. All that is left is to actually start receiving data and computing the counts. To do this, we set it up to print the complete set of counts (specified by `outputMode("complete")`) to the console every time they are updated. And then start the streaming computation using `start()`.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
// Start running the query that prints the running counts to the console
val query = wordCounts.writeStream
  .outputMode("complete")
  .format("console")
  .start()

query.awaitTermination()
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
// Start running the query that prints the running counts to the console
StreamingQuery query = wordCounts.writeStream()
  .outputMode("complete")
  .format("console")
  .start();

query.awaitTermination();
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
 # Start running the query that prints the running counts to the console
query = wordCounts \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

query.awaitTermination()
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
# Start running the query that prints the running counts to the console
query <- write.stream(wordCounts, "console", outputMode = "complete")

awaitTermination(query)
{% endhighlight %}

</div>
</div>

After this code is executed, the streaming computation will have started in the background. The `query` object is a handle to that active streaming query, and we have decided to wait for the termination of the query using `awaitTermination()` to prevent the process from exiting while the query is active.

To actually execute this example code, you can either compile the code in your own 
[Spark application](quick-start.html#self-contained-applications), or simply 
[run the example](index.html#running-the-examples-and-shell) once you have downloaded Spark. We are showing the latter. You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by using


    $ nc -lk 9999

Then, in a different terminal, you can start the example by using

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% highlight bash %}
$ ./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999
{% endhighlight %}
</div>
<div data-lang="java"  markdown="1">
{% highlight bash %}
$ ./bin/run-example org.apache.spark.examples.sql.streaming.JavaStructuredNetworkWordCount localhost 9999
{% endhighlight %}
</div>
<div data-lang="python"  markdown="1">
{% highlight bash %}
$ ./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999
{% endhighlight %}
</div>
<div data-lang="r"  markdown="1">
{% highlight bash %}
$ ./bin/spark-submit examples/src/main/r/streaming/structured_network_wordcount.R localhost 9999
{% endhighlight %}
</div>
</div>

Then, any lines typed in the terminal running the netcat server will be counted and printed on screen every second. It will look something like the following.

<table width="100%">
    <td>
{% highlight bash %}
# TERMINAL 1:
# Running Netcat

$ nc -lk 9999
apache spark
apache hadoop


...
{% endhighlight %}
    </td>
    <td width="2%"></td>
    <td>
<div class="codetabs">

<div data-lang="scala" markdown="1">
{% highlight bash %}
# TERMINAL 2: RUNNING StructuredNetworkWordCount

$ ./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    1|
| spark|    1|
+------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    2|
| spark|    1|
|hadoop|    1|
+------+-----+
...
{% endhighlight %}
</div>

<div data-lang="java" markdown="1">
{% highlight bash %}
# TERMINAL 2: RUNNING JavaStructuredNetworkWordCount

$ ./bin/run-example org.apache.spark.examples.sql.streaming.JavaStructuredNetworkWordCount localhost 9999

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    1|
| spark|    1|
+------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    2|
| spark|    1|
|hadoop|    1|
+------+-----+
...
{% endhighlight %}
</div>
<div data-lang="python" markdown="1">
{% highlight bash %}
# TERMINAL 2: RUNNING structured_network_wordcount.py

$ ./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    1|
| spark|    1|
+------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    2|
| spark|    1|
|hadoop|    1|
+------+-----+
...
{% endhighlight %}
</div>
<div data-lang="r" markdown="1">
{% highlight bash %}
# TERMINAL 2: RUNNING structured_network_wordcount.R

$ ./bin/spark-submit examples/src/main/r/streaming/structured_network_wordcount.R localhost 9999

-------------------------------------------
Batch: 0
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    1|
| spark|    1|
+------+-----+

-------------------------------------------
Batch: 1
-------------------------------------------
+------+-----+
| value|count|
+------+-----+
|apache|    2|
| spark|    1|
|hadoop|    1|
+------+-----+
...
{% endhighlight %}
</div>
</div>
    </td>
</table>


# Programming Model

The key idea in Structured Streaming is to treat a live data stream as a 
table that is being continuously appended. This leads to a new stream 
processing model that is very similar to a batch processing model. You will 
express your streaming computation as standard batch-like query as on a static 
table, and Spark runs it as an *incremental* query on the *unbounded* input 
table. Let’s understand this model in more detail.

## Basic Concepts
Consider the input data stream as the "Input Table". Every data item that is 
arriving on the stream is like a new row being appended to the Input Table.

![Stream as a Table](img/structured-streaming-stream-as-a-table.png "Stream as a Table")

A query on the input will generate the "Result Table". Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink. 

![Model](img/structured-streaming-model.png)

The "Output" is defined as what gets written out to the external storage. The output can be defined in a different mode:

  - *Complete Mode* - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table. 

  - *Append Mode* - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
  
  - *Update Mode* - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn't contain aggregations, it will be equivalent to Append mode.

Note that each mode is applicable on certain types of queries. This is discussed in detail [later](#output-modes).

To illustrate the use of this model, let’s understand the model in context of 
the [Quick Example](#quick-example) above. The first `lines` DataFrame is the input table, and 
the final `wordCounts` DataFrame is the result table. Note that the query on 
streaming `lines` DataFrame to generate `wordCounts` is *exactly the same* as 
it would be a static DataFrame. However, when this query is started, Spark 
will continuously check for new data from the socket connection. If there is 
new data, Spark will run an "incremental" query that combines the previous 
running counts with the new data to compute updated counts, as shown below.

![Model](img/structured-streaming-example-model.png)

**Note that Structured Streaming does not materialize the entire table**. It reads the latest
available data from the streaming data source, processes it incrementally to update the result,
and then discards the source data. It only keeps around the minimal intermediate *state* data as
required to update the result (e.g. intermediate counts in the earlier example).

This model is significantly different from many other stream processing
engines. Many streaming systems require the user to maintain running 
aggregations themselves, thus having to reason about fault-tolerance, and 
data consistency (at-least-once, or at-most-once, or exactly-once). In this 
model, Spark is responsible for updating the Result Table when there is new 
data, thus relieving the users from reasoning about it. As an example, let’s 
see how this model handles event-time based processing and late arriving data.

## Handling Event-time and Late Data
Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model -- each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the event-time column -- each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.

Furthermore, this model naturally handles data that has arrived later than 
expected based on its event-time. Since Spark is updating the Result Table, 
it has full control over updating old aggregates when there is late data, 
as well as cleaning up old aggregates to limit the size of intermediate
state data. Since Spark 2.1, we have support for watermarking which 
allows the user to specify the threshold of late data, and allows the engine
to accordingly clean up old state. These are explained later in more 
detail in the [Window Operations](#window-operations-on-event-time) section.

## Fault Tolerance Semantics
Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers)
to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure **end-to-end exactly-once semantics** under any failure.

# API using Datasets and DataFrames
Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point `SparkSession`
([Scala](api/scala/index.html#org.apache.spark.sql.SparkSession)/[Java](api/java/org/apache/spark/sql/SparkSession.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.SparkSession)/[R](api/R/sparkR.session.html) docs)
to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the
[DataFrame/Dataset Programming Guide](sql-programming-guide.html).

## Creating streaming DataFrames and streaming Datasets
Streaming DataFrames can be created through the `DataStreamReader` interface
([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamReader.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader) docs)
returned by `SparkSession.readStream()`. In [R](api/R/read.stream.html), with the `read.stream()` method. Similar to the read interface for creating static DataFrame, you can specify the details of the source – data format, schema, options, etc.

#### Input Sources
There are a few built-in sources.

  - **File source** - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, orc, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.

  - **Kafka source** - Reads data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-0-10-integration.html) for more details.

  - **Socket source (for testing)** - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees. 

  - **Rate source (for testing)** - Generates data at the specified number of rows per second, each output row contains a `timestamp` and `value`. Where `timestamp` is a `Timestamp` type containing the time of message dispatch, and `value` is of `Long` type containing the message count, starting from 0 as the first row. This source is intended for testing and benchmarking.

Some sources are not fault-tolerant because they do not guarantee that data can be replayed using 
checkpointed offsets after a failure. See the earlier section on 
[fault-tolerance semantics](#fault-tolerance-semantics).
Here are the details of all the sources in Spark.

<table class="table">
  <tr>
    <th>Source</th>
    <th>Options</th>
    <th>Fault-tolerant</th>
    <th>Notes</th>
  </tr>
  <tr>
    <td><b>File source</b></td>
    <td>
        <code>path</code>: path to the input directory, and common to all file formats.
        <br/>
        <code>maxFilesPerTrigger</code>: maximum number of new files to be considered in every trigger (default: no max)
        <br/>
        <code>latestFirst</code>: whether to process the latest new files first, useful when there is a large backlog of files (default: false)
        <br/>
        <code>fileNameOnly</code>: whether to check new files based on only the filename instead of on the full path (default: false). With this set to `true`, the following files would be considered as the same file, because their filenames, "dataset.txt", are the same:
        <br/>
        "file:///dataset.txt"<br/>
        "s3://a/dataset.txt"<br/>
        "s3n://a/b/dataset.txt"<br/>
        "s3a://a/b/c/dataset.txt"<br/>
        <br/><br/>
        For file-format-specific options, see the related methods in <code>DataStreamReader</code>
        (<a href="api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader">Scala</a>/<a href="api/java/org/apache/spark/sql/streaming/DataStreamReader.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader">Python</a>/<a
        href="api/R/read.stream.html">R</a>).
        E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code>.
        <br/><br/>
        In addition, there are session configurations that affect certain file-formats. See the <a href="sql-programming-guide.html">SQL Programming Guide</a> for more details. E.g., for "parquet", see <a href="sql-data-sources-parquet.html#configuration">Parquet configuration</a> section.
        </td>
    <td>Yes</td>
    <td>Supports glob paths, but does not support multiple comma-separated paths/globs.</td>
  </tr>
  <tr>
    <td><b>Socket Source</b></td>
    <td>
        <code>host</code>: host to connect to, must be specified<br/>
        <code>port</code>: port to connect to, must be specified
    </td>
    <td>No</td>
    <td></td>
  </tr>
  <tr>
    <td><b>Rate Source</b></td>
    <td>
        <code>rowsPerSecond</code> (e.g. 100, default: 1): How many rows should be generated per second.<br/><br/>
        <code>rampUpTime</code> (e.g. 5s, default: 0s): How long to ramp up before the generating speed becomes <code>rowsPerSecond</code>. Using finer granularities than seconds will be truncated to integer seconds. <br/><br/>
        <code>numPartitions</code> (e.g. 10, default: Spark's default parallelism): The partition number for the generated rows. <br/><br/>
        
        The source will try its best to reach <code>rowsPerSecond</code>, but the query may be resource constrained, and <code>numPartitions</code> can be tweaked to help reach the desired speed.
    </td>
    <td>Yes</td>
    <td></td>
  </tr>

  <tr>
    <td><b>Kafka Source</b></td>
    <td>
        See the <a href="structured-streaming-kafka-0-10-integration.html">Kafka Integration Guide</a>.
    </td>
    <td>Yes</td>
    <td></td>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>

Here are some examples.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
val spark: SparkSession = ...

// Read text from socket
val socketDF = spark
  .readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

socketDF.isStreaming    // Returns True for DataFrames that have streaming sources

socketDF.printSchema

// Read all the csv files written atomically in a directory
val userSchema = new StructType().add("name", "string").add("age", "integer")
val csvDF = spark
  .readStream
  .option("sep", ";")
  .schema(userSchema)      // Specify schema of the csv files
  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
SparkSession spark = ...

// Read text from socket
Dataset<Row> socketDF = spark
  .readStream()
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load();

socketDF.isStreaming();    // Returns True for DataFrames that have streaming sources

socketDF.printSchema();

// Read all the csv files written atomically in a directory
StructType userSchema = new StructType().add("name", "string").add("age", "integer");
Dataset<Row> csvDF = spark
  .readStream()
  .option("sep", ";")
  .schema(userSchema)      // Specify schema of the csv files
  .csv("/path/to/directory");    // Equivalent to format("csv").load("/path/to/directory")
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
spark = SparkSession. ...

# Read text from socket
socketDF = spark \
    .readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

socketDF.isStreaming()    # Returns True for DataFrames that have streaming sources

socketDF.printSchema()

# Read all the csv files written atomically in a directory
userSchema = StructType().add("name", "string").add("age", "integer")
csvDF = spark \
    .readStream \
    .option("sep", ";") \
    .schema(userSchema) \
    .csv("/path/to/directory")  # Equivalent to format("csv").load("/path/to/directory")
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
sparkR.session(...)

# Read text from socket
socketDF <- read.stream("socket", host = hostname, port = port)

isStreaming(socketDF)    # Returns TRUE for SparkDataFrames that have streaming sources

printSchema(socketDF)

# Read all the csv files written atomically in a directory
schema <- structType(structField("name", "string"),
                     structField("age", "integer"))
csvDF <- read.stream("csv", path = "/path/to/directory", schema = schema, sep = ";")
{% endhighlight %}

</div>
</div>

These examples generate streaming DataFrames that are untyped, meaning that the schema of the DataFrame is not checked at compile time, only checked at runtime when the query is submitted. Some operations like `map`, `flatMap`, etc. need the type to be known at compile time. To do those, you can convert these untyped streaming DataFrames to typed streaming Datasets using the same methods as static DataFrame. See the [SQL Programming Guide](sql-programming-guide.html) for more details. Additionally, more details on the supported streaming sources are discussed later in the document.

### Schema inference and partition of streaming DataFrames/Datasets

By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting `spark.sql.streaming.schemaInference` to `true`.

Partition discovery does occur when subdirectories that are named `/key=value/` are present and listing will automatically recurse into these directories. If these columns appear in the user-provided schema, they will be filled in by Spark based on the path of the file being read. The directories that make up the partitioning scheme must be present when the query starts and must remain static. For example, it is okay to add `/data/year=2016/` when `/data/year=2015/` was present, but it is invalid to change the partitioning column (i.e. by creating the directory `/data/date=2016-04-17/`).

## Operations on streaming DataFrames/Datasets
You can apply all kinds of operations on streaming DataFrames/Datasets – ranging from untyped, SQL-like operations (e.g. `select`, `where`, `groupBy`), to typed RDD-like operations (e.g. `map`, `filter`, `flatMap`). See the [SQL programming guide](sql-programming-guide.html) for more details. Let’s take a look at a few example operations that you can use.

### Basic Operations - Selection, Projection, Aggregation
Most of the common operations on DataFrame/Dataset are supported for streaming. The few operations that are not supported are [discussed later](#unsupported-operations) in this section.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
case class DeviceData(device: String, deviceType: String, signal: Double, time: DateTime)

val df: DataFrame = ... // streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: string }
val ds: Dataset[DeviceData] = df.as[DeviceData]    // streaming Dataset with IOT device data

// Select the devices which have signal more than 10
df.select("device").where("signal > 10")      // using untyped APIs   
ds.filter(_.signal > 10).map(_.device)         // using typed APIs

// Running count of the number of updates for each device type
df.groupBy("deviceType").count()                          // using untyped API

// Running average signal for each device type
import org.apache.spark.sql.expressions.scalalang.typed
ds.groupByKey(_.deviceType).agg(typed.avg(_.signal))    // using typed API
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
import org.apache.spark.api.java.function.*;
import org.apache.spark.sql.*;
import org.apache.spark.sql.expressions.javalang.typed;
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;

public class DeviceData {
  private String device;
  private String deviceType;
  private Double signal;
  private java.sql.Date time;
  ...
  // Getter and setter methods for each field
}

Dataset<Row> df = ...;    // streaming DataFrame with IOT device data with schema { device: string, type: string, signal: double, time: DateType }
Dataset<DeviceData> ds = df.as(ExpressionEncoder.javaBean(DeviceData.class)); // streaming Dataset with IOT device data

// Select the devices which have signal more than 10
df.select("device").where("signal > 10"); // using untyped APIs
ds.filter((FilterFunction<DeviceData>) value -> value.getSignal() > 10)
  .map((MapFunction<DeviceData, String>) value -> value.getDevice(), Encoders.STRING());

// Running count of the number of updates for each device type
df.groupBy("deviceType").count(); // using untyped API

// Running average signal for each device type
ds.groupByKey((MapFunction<DeviceData, String>) value -> value.getDeviceType(), Encoders.STRING())
  .agg(typed.avg((MapFunction<DeviceData, Double>) value -> value.getSignal()));
{% endhighlight %}


</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
df = ...  # streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: DateType }

# Select the devices which have signal more than 10
df.select("device").where("signal > 10")

# Running count of the number of updates for each device type
df.groupBy("deviceType").count()
{% endhighlight %}
</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
df <- ...  # streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: DateType }

# Select the devices which have signal more than 10
select(where(df, "signal > 10"), "device")

# Running count of the number of updates for each device type
count(groupBy(df, "deviceType"))
{% endhighlight %}
</div>
</div>

You can also register a streaming DataFrame/Dataset as a temporary view and then apply SQL commands on it.

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% highlight scala %}
df.createOrReplaceTempView("updates")
spark.sql("select count(*) from updates")  // returns another streaming DF
{% endhighlight %}
</div>
<div data-lang="java"  markdown="1">  
{% highlight java %}
df.createOrReplaceTempView("updates");
spark.sql("select count(*) from updates");  // returns another streaming DF
{% endhighlight %}
</div>
<div data-lang="python"  markdown="1">  
{% highlight python %}
df.createOrReplaceTempView("updates")
spark.sql("select count(*) from updates")  # returns another streaming DF
{% endhighlight %}
</div>
<div data-lang="r"  markdown="1">
{% highlight r %}
createOrReplaceTempView(df, "updates")
sql("select count(*) from updates")
{% endhighlight %}
</div>
</div>

Note, you can identify whether a DataFrame/Dataset has streaming data or not by using `df.isStreaming`.

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% highlight scala %}
df.isStreaming
{% endhighlight %}
</div>
<div data-lang="java"  markdown="1">
{% highlight java %}
df.isStreaming()
{% endhighlight %}
</div>
<div data-lang="python"  markdown="1">
{% highlight python %}
df.isStreaming()
{% endhighlight %}
</div>
<div data-lang="r"  markdown="1">
{% highlight r %}
isStreaming(df)
{% endhighlight %}
</div>
</div>

### Window Operations on Event Time
Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration. 

Imagine our [quick example](#quick-example) is modified and the stream now contains lines along with the time when the line was generated. Instead of running word counts, we want to count words within 10 minute windows, updating every 5 minutes. That is, word counts in words received between 10 minute windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 12:10 means data that arrived after 12:00 but before 12:10. Now, consider a word that was received at 12:07. This word should increment the counts corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts will be indexed by both, the grouping key (i.e. the word) and the window (can be calculated from the event-time).

The result tables would look something like the following.

![Window Operations](img/structured-streaming-window.png)

Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
[Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/[Python]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words.groupBy(
  window($"timestamp", "10 minutes", "5 minutes"),
  $"word"
).count()
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
Dataset<Row> words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
Dataset<Row> windowedCounts = words.groupBy(
  functions.window(words.col("timestamp"), "10 minutes", "5 minutes"),
  words.col("word")
).count();
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">
{% highlight python %}
words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words.groupBy(
    window(words.timestamp, "10 minutes", "5 minutes"),
    words.word
).count()
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">
{% highlight r %}
words <- ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts <- count(
                    groupBy(
                      words,
                      window(words$timestamp, "10 minutes", "5 minutes"),
                      words$word))
{% endhighlight %}

</div>
</div>


#### Handling Late Data and Watermarking
Now consider what happens if one of the events arrives late to the application.
For example, say, a word generated at 12:04 (i.e. event time) could be received by 
the application at 12:11. The application should use the time 12:04 instead of 12:11
to update the older counts for the window `12:00 - 12:10`. This occurs 
naturally in our window-based grouping – Structured Streaming can maintain the intermediate state 
for partial aggregates for a long period of time such that late data can update aggregates of 
old windows correctly, as illustrated below.

![Handling Late Data](img/structured-streaming-late-data.png)

However, to run this query for days, it's necessary for the system to bound the amount of 
intermediate in-memory state it accumulates. This means the system needs to know when an old 
aggregate can be dropped from the in-memory state because the application is not going to receive 
late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced 
**watermarking**, which lets the engine automatically track the current event time in the data
and attempt to clean up old state accordingly. You can define the watermark of a query by 
specifying the event time column and the threshold on how late the data is expected to be in terms of 
event time. For a specific window starting at time `T`, the engine will maintain state and allow late
data to update the state until `(max event time seen by the engine - late threshold > T)`. 
In other words, late data within the threshold will be aggregated, 
but data later than the threshold will start getting dropped
(see [later](#semantic-guarantees-of-aggregation-with-watermarking)
in the section for the exact guarantees). Let's understand this with an example. We can
easily define watermarking on the previous example using `withWatermark()` as shown below.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
import spark.implicits._

val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
val windowedCounts = words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        window($"timestamp", "10 minutes", "5 minutes"),
        $"word")
    .count()
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
Dataset<Row> words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }

// Group the data by window and word and compute the count of each group
Dataset<Row> windowedCounts = words
    .withWatermark("timestamp", "10 minutes")
    .groupBy(
        functions.window(words.col("timestamp"), "10 minutes", "5 minutes"),
        words.col("word"))
    .count();
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">
{% highlight python %}
words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group
windowedCounts = words \
    .withWatermark("timestamp", "10 minutes") \
    .groupBy(
        window(words.timestamp, "10 minutes", "5 minutes"),
        words.word) \
    .count()
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">
{% highlight r %}
words <- ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }

# Group the data by window and word and compute the count of each group

words <- withWatermark(words, "timestamp", "10 minutes")
windowedCounts <- count(
                    groupBy(
                      words,
                      window(words$timestamp, "10 minutes", "5 minutes"),
                      words$word))
{% endhighlight %}

</div>
</div>

In this example, we are defining the watermark of the query on the value of the column "timestamp", 
and also defining "10 minutes" as the threshold of how late is the data allowed to be. If this query 
is run in Update output mode (discussed later in [Output Modes](#output-modes) section), 
the engine will keep updating counts of a window in the Result Table until the window is older
than the watermark, which lags behind the current event time in column "timestamp" by 10 minutes.
Here is an illustration. 

![Watermarking in Update Mode](img/structured-streaming-watermark-update-mode.png)

As shown in the illustration, the maximum event time tracked by the engine is the 
*blue dashed line*, and the watermark set as `(max event time - '10 mins')`
at the beginning of every trigger is the red line. For example, when the engine observes the data 
`(12:14, dog)`, it sets the watermark for the next trigger as `12:04`.
This watermark lets the engine maintain intermediate state for additional 10 minutes to allow late
data to be counted. For example, the data `(12:09, cat)` is out of order and late, and it falls in
windows `12:00 - 12:10` and `12:05 - 12:15`. Since, it is still ahead of the watermark `12:04` in 
the trigger, the engine still maintains the intermediate counts as state and correctly updates the 
counts of the related windows. However, when the watermark is updated to `12:11`, the intermediate 
state for window `(12:00 - 12:10)` is cleared, and all subsequent data (e.g. `(12:04, donkey)`) 
is considered "too late" and therefore ignored. Note that after every trigger, 
the updated counts (i.e. purple rows) are written to sink as the trigger output, as dictated by 
the Update mode.

Some sinks (e.g. files) may not supported fine-grained updates that Update Mode requires. To work
with them, we have also support Append Mode, where only the *final counts* are written to sink.
This is illustrated below.

Note that using `withWatermark` on a non-streaming Dataset is no-op. As the watermark should not affect 
any batch query in any way, we will ignore it directly.

![Watermarking in Append Mode](img/structured-streaming-watermark-append-mode.png)

Similar to the Update Mode earlier, the engine maintains intermediate counts for each window. 
However, the partial counts are not updated to the Result Table and not written to sink. The engine
waits for "10 mins" for late date to be counted, 
then drops intermediate state of a window < watermark, and appends the final
counts to the Result Table/sink. For example, the final counts of window `12:00 - 12:10` is 
appended to the Result Table only after the watermark is updated to `12:11`. 

##### Conditions for watermarking to clean aggregation state
{:.no_toc}

It is important to note that the following conditions must be satisfied for the watermarking to 
clean the state in aggregation queries *(as of Spark 2.1.1, subject to change in the future)*.

- **Output mode must be Append or Update.** Complete mode requires all aggregate data to be preserved, 
and hence cannot use watermarking to drop intermediate state. See the [Output Modes](#output-modes) 
section for detailed explanation of the semantics of each output mode.

- The aggregation must have either the event-time column, or a `window` on the event-time column. 

- `withWatermark` must be called on the 
same column as the timestamp column used in the aggregate. For example, 
`df.withWatermark("time", "1 min").groupBy("time2").count()` is invalid 
in Append output mode, as watermark is defined on a different column
from the aggregation column.

- `withWatermark` must be called before the aggregation for the watermark details to be used. 
For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append 
output mode.

##### Semantic Guarantees of Aggregation with Watermarking
{:.no_toc}

- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.

- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
not guaranteed to be dropped; it may or may not get aggregated. More delayed is the data, less
likely is the engine going to process it.

### Join Operations
Structured Streaming supports joining a streaming Dataset/DataFrame with a static Dataset/DataFrame
as well as another streaming Dataset/DataFrame. The result of the streaming join is generated
incrementally, similar to the results of streaming aggregations in the previous section. In this
section we will explore what type of joins (i.e. inner, outer, etc.) are supported in the above
cases. Note that in all the supported join types, the result of the join with a streaming
Dataset/DataFrame will be the exactly the same as if it was with a static Dataset/DataFrame
containing the same data in the stream.


#### Stream-static Joins

Since the introduction in Spark 2.0, Structured Streaming has supported joins (inner join and some
type of outer joins) between a streaming and a static DataFrame/Dataset. Here is a simple example.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
val staticDf = spark.read. ...
val streamingDf = spark.readStream. ...

streamingDf.join(staticDf, "type")          // inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_join")  // right outer join with a static DF  

{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
Dataset<Row> staticDf = spark.read(). ...;
Dataset<Row> streamingDf = spark.readStream(). ...;
streamingDf.join(staticDf, "type");         // inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_join");  // right outer join with a static DF
{% endhighlight %}


</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
staticDf = spark.read. ...
streamingDf = spark.readStream. ...
streamingDf.join(staticDf, "type")  # inner equi-join with a static DF
streamingDf.join(staticDf, "type", "right_join")  # right outer join with a static DF
{% endhighlight %}

</div>

<div data-lang="r"  markdown="1">

{% highlight r %}
staticDf <- read.df(...)
streamingDf <- read.stream(...)
joined <- merge(streamingDf, staticDf, sort = FALSE)  # inner equi-join with a static DF
joined <- join(
            staticDf,
            streamingDf, 
            streamingDf$value == staticDf$value,
            "right_outer")  # right outer join with a static DF
{% endhighlight %}

</div>
</div>

Note that stream-static joins are not stateful, so no state management is necessary.
However, a few types of stream-static outer joins are not yet supported.
These are listed at the [end of this Join section](#support-matrix-for-joins-in-streaming-queries).

#### Stream-stream Joins
In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming
Datasets/DataFrames. The challenge of generating join results between two data streams is that,
at any point of time, the view of the dataset is incomplete for both sides of the join making
it much harder to find matches between inputs. Any row received from one input stream can match
with any future, yet-to-be-received row from the other input stream. Hence, for both the input
streams, we buffer past input as streaming state, so that we can match every future input with
past input and accordingly generate joined results. Furthermore, similar to streaming aggregations,
we automatically handle late, out-of-order data and can limit the state using watermarks.
Let’s discuss the different types of supported stream-stream joins and how to use them.

##### Inner Joins with optional Watermarking
Inner joins on any kind of columns along with any kind of join conditions are supported.
However, as the stream runs, the size of streaming state will keep growing indefinitely as
*all* past input must be saved as any new input can match with any input from the past.
To avoid unbounded state, you have to define additional join conditions such that indefinitely
old inputs cannot match with future inputs and therefore can be cleared from the state.
In other words, you will have to do the following additional steps in the join.

1. Define watermark delays on both inputs such that the engine knows how delayed the input can be
(similar to streaming aggregations)

1. Define a constraint on event-time across the two inputs such that the engine can figure out when
old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for
matches with the other input. This constraint can be defined in one of the two ways.

    1. Time range join conditions (e.g. `...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR`),

    1. Join on event-time windows (e.g. `...JOIN ON leftTimeWindow = rightTimeWindow`).

Let’s understand this with an example.

Let’s say we want to join a stream of advertisement impressions (when an ad was shown) with
another stream of user clicks on advertisements to correlate when impressions led to
monetizable clicks. To allow the state cleanup in this stream-stream join, you will have to
specify the watermarking delays and the time constraints as follows.

1. Watermark delays: Say, the impressions and the corresponding clicks can be late/out-of-order
in event-time by at most 2 and 3 hours, respectively.

1. Event-time range condition: Say, a click can occur within a time range of 0 seconds to 1 hour
after the corresponding impression.

The code would look like this.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
import org.apache.spark.sql.functions.expr

val impressions = spark.readStream. ...
val clicks = spark.readStream. ...

// Apply watermarks on event-time columns
val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """)
)

{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
import static org.apache.spark.sql.functions.expr

Dataset<Row> impressions = spark.readStream(). ...
Dataset<Row> clicks = spark.readStream(). ...

// Apply watermarks on event-time columns
Dataset<Row> impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours");
Dataset<Row> clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours");

// Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr(
    "clickAdId = impressionAdId AND " +
    "clickTime >= impressionTime AND " +
    "clickTime <= impressionTime + interval 1 hour ")
);

{% endhighlight %}


</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
from pyspark.sql.functions import expr

impressions = spark.readStream. ...
clicks = spark.readStream. ...

# Apply watermarks on event-time columns
impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")

# Join with event-time constraints
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """)
)

{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
impressions <- read.stream(...)
clicks <- read.stream(...)

# Apply watermarks on event-time columns
impressionsWithWatermark <- withWatermark(impressions, "impressionTime", "2 hours")
clicksWithWatermark <- withWatermark(clicks, "clickTime", "3 hours")

# Join with event-time constraints
joined <- join(
  impressionsWithWatermark,
  clicksWithWatermark,
  expr(
    paste(
      "clickAdId = impressionAdId AND",
      "clickTime >= impressionTime AND",
      "clickTime <= impressionTime + interval 1 hour"
)))

{% endhighlight %}

</div>
</div>

###### Semantic Guarantees of Stream-stream Inner Joins with Watermarking
{:.no_toc}
This is similar to the [guarantees provided by watermarking on aggregations](#semantic-guarantees-of-aggregation-with-watermarking).
A watermark delay of "2 hours" guarantees that the engine will never drop any data that is less than
 2 hours delayed. But data delayed by more than 2 hours may or may not get processed.

##### Outer Joins with Watermarking
While the watermark + event-time constraints is optional for inner joins, for left and right outer
joins they must be specified. This is because for generating the NULL results in outer join, the
engine must know when an input row is not going to match with anything in future. Hence, the
watermark + event-time constraints must be specified for generating correct results. Therefore,
a query with outer-join will look quite like the ad-monetization example earlier, except that
there will be an additional parameter specifying it to be an outer-join.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}

impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """),
  joinType = "leftOuter"      // can be "inner", "leftOuter", "rightOuter"
 )

{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr(
    "clickAdId = impressionAdId AND " +
    "clickTime >= impressionTime AND " +
    "clickTime <= impressionTime + interval 1 hour "),
  "leftOuter"                 // can be "inner", "leftOuter", "rightOuter"
);

{% endhighlight %}


</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
impressionsWithWatermark.join(
  clicksWithWatermark,
  expr("""
    clickAdId = impressionAdId AND
    clickTime >= impressionTime AND
    clickTime <= impressionTime + interval 1 hour
    """),
  "leftOuter"                 # can be "inner", "leftOuter", "rightOuter"
)

{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
joined <- join(
  impressionsWithWatermark,
  clicksWithWatermark,
  expr(
    paste(
      "clickAdId = impressionAdId AND",
      "clickTime >= impressionTime AND",
      "clickTime <= impressionTime + interval 1 hour"),
  "left_outer"                 # can be "inner", "left_outer", "right_outer"
))

{% endhighlight %}

</div>
</div>


###### Semantic Guarantees of Stream-stream Outer Joins with Watermarking
{:.no_toc}
Outer joins have the same guarantees as [inner joins](#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking)
regarding watermark delays and whether data will be dropped or not.

###### Caveats
{:.no_toc}
There are a few important characteristics to note regarding how the outer results are generated.

- *The outer NULL results will be generated with a delay that depends on the specified watermark
delay and the time range condition.* This is because the engine has to wait for that long to ensure
there were no matches and there will be no more matches in future.

- In the current implementation in the micro-batch engine, watermarks are advanced at the end of a
micro-batch, and the next micro-batch uses the updated watermark to clean up state and output
outer results. Since we trigger a micro-batch only when there is new data to be processed, the
generation of the outer result may get delayed if there no new data being received in the stream.
*In short, if any of the two input streams being joined does not receive data for a while, the
outer (both cases, left or right) output may get delayed.*

##### Support matrix for joins in streaming queries

<table class ="table">
  <tr>
    <th>Left Input</th>
    <th>Right Input</th>
    <th>Join Type</th>
    <th></th>
  </tr>
  <tr>
      <td style="vertical-align: middle;">Static</td>
      <td style="vertical-align: middle;">Static</td>
      <td style="vertical-align: middle;">All types</td>
      <td style="vertical-align: middle;">
        Supported, since its not on streaming data even though it
        can be present in a streaming query
      </td>
  </tr>
  <tr>
    <td rowspan="4" style="vertical-align: middle;">Stream</td>
    <td rowspan="4" style="vertical-align: middle;">Static</td>
    <td style="vertical-align: middle;">Inner</td>
    <td style="vertical-align: middle;">Supported, not stateful</td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Left Outer</td>
    <td style="vertical-align: middle;">Supported, not stateful</td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Right Outer</td>
    <td style="vertical-align: middle;">Not supported</td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Full Outer</td>
    <td style="vertical-align: middle;">Not supported</td>
  </tr>
  <tr>
    <td rowspan="4" style="vertical-align: middle;">Static</td>
    <td rowspan="4" style="vertical-align: middle;">Stream</td>
    <td style="vertical-align: middle;">Inner</td>
    <td style="vertical-align: middle;">Supported, not stateful</td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Left Outer</td>
    <td style="vertical-align: middle;">Not supported</td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Right Outer</td>
    <td style="vertical-align: middle;">Supported, not stateful</td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Full Outer</td>
    <td style="vertical-align: middle;">Not supported</td>
  </tr>
  <tr>
    <td rowspan="4" style="vertical-align: middle;">Stream</td>
    <td rowspan="4" style="vertical-align: middle;">Stream</td>
    <td style="vertical-align: middle;">Inner</td>
    <td style="vertical-align: middle;">
      Supported, optionally specify watermark on both sides +
      time constraints for state cleanup
    </td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Left Outer</td>
    <td style="vertical-align: middle;">
      Conditionally supported, must specify watermark on right + time constraints for correct
      results, optionally specify watermark on left for all state cleanup
    </td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Right Outer</td>
    <td style="vertical-align: middle;">
      Conditionally supported, must specify watermark on left + time constraints for correct
      results, optionally specify watermark on right for all state cleanup
    </td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Full Outer</td>
    <td style="vertical-align: middle;">Not supported</td>
  </tr>
 <tr>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>

Additional details on supported joins:

- Joins can be cascaded, that is, you can do `df1.join(df2, ...).join(df3, ...).join(df4, ....)`.

- As of Spark 2.3, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.

- As of Spark 2.3, you cannot use other non-map-like operations before joins. Here are a few examples of
  what cannot be used.

  - Cannot use streaming aggregations before joins.

  - Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.


### Streaming Deduplication
You can deduplicate records in data streams using a unique identifier in the events. This is exactly same as deduplication on static using a unique identifier column. The query will store the necessary amount of data from previous records such that it can filter duplicate records. Similar to aggregations, you can use deduplication with or without watermarking.

- *With watermark* - If there is an upper bound on how late a duplicate record may arrive, then you can define a watermark on an event time column and deduplicate using both the guid and the event time columns. The query will use the watermark to remove old state data from past records that are not expected to get any duplicates any more. This bounds the amount of the state the query has to maintain.

- *Without watermark* - Since there are no bounds on when a duplicate record may arrive, the query stores the data from all the past records as state.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
val streamingDf = spark.readStream. ...  // columns: guid, eventTime, ...

// Without watermark using guid column
streamingDf.dropDuplicates("guid")

// With watermark using guid and eventTime columns
streamingDf
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("guid", "eventTime")
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
Dataset<Row> streamingDf = spark.readStream(). ...;  // columns: guid, eventTime, ...

// Without watermark using guid column
streamingDf.dropDuplicates("guid");

// With watermark using guid and eventTime columns
streamingDf
  .withWatermark("eventTime", "10 seconds")
  .dropDuplicates("guid", "eventTime");
{% endhighlight %}


</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
streamingDf = spark.readStream. ...

# Without watermark using guid column
streamingDf.dropDuplicates("guid")

# With watermark using guid and eventTime columns
streamingDf \
  .withWatermark("eventTime", "10 seconds") \
  .dropDuplicates("guid", "eventTime")
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
streamingDf <- read.stream(...)

# Without watermark using guid column
streamingDf <- dropDuplicates(streamingDf, "guid")

# With watermark using guid and eventTime columns
streamingDf <- withWatermark(streamingDf, "eventTime", "10 seconds")
streamingDf <- dropDuplicates(streamingDf, "guid", "eventTime")
{% endhighlight %}

</div>
</div>

### Policy for handling multiple watermarks
A streaming query can have multiple input streams that are unioned or joined together.
Each of the input streams can have a different threshold of late data that needs to
be tolerated for stateful operations. You specify these thresholds using
``withWatermarks("eventTime", delay)`` on each of the input streams. For example, consider
a query with stream-stream joins between `inputStream1` and `inputStream2`.
    
  inputStream1.withWatermark("eventTime1", "1 hour")
    .join(
      inputStream2.withWatermark("eventTime2", "2 hours"),
      joinCondition)

While executing the query, Structured Streaming individually tracks the maximum
event time seen in each input stream, calculates watermarks based on the corresponding delay,
and chooses a single global watermark with them to be used for stateful operations. By default,
the minimum is chosen as the global watermark because it ensures that no data is
accidentally dropped as too late if one of the streams falls behind the others
(for example, one of the streams stops receiving data due to upstream failures). In other words,
the global watermark will safely move at the pace of the slowest stream and the query output will
be delayed accordingly.

However, in some cases, you may want to get faster results even if it means dropping data from the
slowest stream. Since Spark 2.4, you can set the multiple watermark policy to choose
the maximum value as the global watermark by setting the SQL configuration
``spark.sql.streaming.multipleWatermarkPolicy`` to ``max`` (default is ``min``). 
This lets the global watermark move at the pace of the fastest stream.
However, as a side effect, data from the slower streams will be aggressively dropped. Hence, use
this configuration judiciously.

### Arbitrary Stateful Operations
Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/index.html#org.apache.spark.sql.streaming.GroupState)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)).

### Unsupported Operations
There are a few DataFrame/Dataset operations that are not supported with streaming DataFrames/Datasets. 
Some of them are as follows.
 
- Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF) are not yet supported on streaming Datasets.

- Limit and take the first N rows are not supported on streaming Datasets.

- Distinct operations on streaming Datasets are not supported.

- Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode.

- Few types of outer joins on streaming Datasets are not supported. See the
  <a href="#support-matrix-for-joins-in-streaming-queries">support matrix in the Join Operations section</a>
  for more details.

In addition, there are some Dataset methods that will not work on streaming Datasets. They are actions that will immediately run queries and return results, which does not make sense on a streaming Dataset. Rather, those functionalities can be done by explicitly starting a streaming query (see the next section regarding that).

- `count()` - Cannot return a single count from a streaming Dataset. Instead, use `ds.groupBy().count()` which returns a streaming Dataset containing a running count. 

- `foreach()` - Instead use `ds.writeStream.foreach(...)` (see next section).

- `show()` - Instead use the console sink (see next section).

If you try any of these operations, you will see an `AnalysisException` like "operation XYZ is not supported with streaming DataFrames/Datasets".
While some of them may be supported in future releases of Spark, 
there are others which are fundamentally hard to implement on streaming data efficiently. 
For example, sorting on the input stream is not supported, as it requires keeping 
track of all the data received in the stream. This is therefore fundamentally hard to execute 
efficiently.

## Starting Streaming Queries
Once you have defined the final result DataFrame/Dataset, all that is left is for you to start the streaming computation. To do that, you have to use the `DataStreamWriter`
([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamWriter.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamWriter) docs)
returned through `Dataset.writeStream()`. You will have to specify one or more of the following in this interface.

- *Details of the output sink:* Data format, location, etc.

- *Output mode:* Specify what gets written to the output sink.

- *Query name:* Optionally, specify a unique name of the query for identification.

- *Trigger interval:* Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has been completed. If a trigger time is missed because the previous processing has not been completed, then the system will trigger processing immediately.

- *Checkpoint location:* For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section.

#### Output Modes
There are a few types of output modes.

- **Append mode (default)** - This is the default mode, where only the 
new rows added to the Result Table since the last trigger will be 
outputted to the sink. This is supported for only those queries where 
rows added to the Result Table is never going to change. Hence, this mode 
guarantees that each row will be output only once (assuming 
fault-tolerant sink). For example, queries with only `select`, 
`where`, `map`, `flatMap`, `filter`, `join`, etc. will support Append mode.

- **Complete mode** - The whole Result Table will be outputted to the sink after every trigger.
 This is supported for aggregation queries.

- **Update mode** - (*Available since Spark 2.1.1*) Only the rows in the Result Table that were 
updated since the last trigger will be outputted to the sink. 
More information to be added in future releases.

Different types of streaming queries support different output modes.
Here is the compatibility matrix.

<table class="table">
  <tr>
    <th>Query Type</th>
    <th></th>
    <th>Supported Output Modes</th>
    <th>Notes</th>        
  </tr>
  <tr>
    <td rowspan="2" style="vertical-align: middle;">Queries with aggregation</td>
    <td style="vertical-align: middle;">Aggregation on event-time with watermark</td>
    <td style="vertical-align: middle;">Append, Update, Complete</td>
    <td>
        Append mode uses watermark to drop old aggregation state. But the output of a 
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are 
        finalized (i.e. after watermark is crossed). See the
        <a href="#handling-late-data-and-watermarking">Late Data</a> section for more details.
        <br/><br/>
        Update mode uses watermark to drop old aggregation state.
        <br/><br/>
        Complete mode does not drop old aggregation state since by definition this mode
        preserves all data in the Result Table.
    </td>    
  </tr>
  <tr>
    <td style="vertical-align: middle;">Other aggregations</td>
    <td style="vertical-align: middle;">Complete, Update</td>
    <td>
        Since no watermark is defined (only defined in other category), 
        old aggregation state is not dropped.
        <br/><br/>
        Append mode is not supported as aggregates can update thus violating the semantics of 
        this mode.
    </td>  
  </tr>
  <tr>
    <td colspan="2" style="vertical-align: middle;">Queries with <code>mapGroupsWithState</code></td>
    <td style="vertical-align: middle;">Update</td>
    <td style="vertical-align: middle;"></td>
  </tr>
  <tr>
    <td rowspan="2" style="vertical-align: middle;">Queries with <code>flatMapGroupsWithState</code></td>
    <td style="vertical-align: middle;">Append operation mode</td>
    <td style="vertical-align: middle;">Append</td>
    <td style="vertical-align: middle;">
      Aggregations are allowed after <code>flatMapGroupsWithState</code>.
    </td>
  </tr>
  <tr>
    <td style="vertical-align: middle;">Update operation mode</td>
    <td style="vertical-align: middle;">Update</td>
    <td style="vertical-align: middle;">
      Aggregations not allowed after <code>flatMapGroupsWithState</code>.
    </td>
  </tr>
  <tr>
      <td colspan="2" style="vertical-align: middle;">Queries with <code>joins</code></td>
      <td style="vertical-align: middle;">Append</td>
      <td style="vertical-align: middle;">
        Update and Complete mode not supported yet. See the
        <a href="#support-matrix-for-joins-in-streaming-queries">support matrix in the Join Operations section</a>
         for more details on what types of joins are supported.
      </td>
    </tr>
  <tr>
    <td colspan="2" style="vertical-align: middle;">Other queries</td>
    <td style="vertical-align: middle;">Append, Update</td>
    <td style="vertical-align: middle;">
      Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table.
    </td>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>


#### Output Sinks
There are a few types of built-in output sinks.

- **File sink** - Stores the output to a directory.

{% highlight scala %}
writeStream
    .format("parquet")        // can be "orc", "json", "csv", etc.
    .option("path", "path/to/destination/dir")
    .start()
{% endhighlight %}

- **Kafka sink** - Stores the output to one or more topics in Kafka.

{% highlight scala %}
writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
    .option("topic", "updates")
    .start()
{% endhighlight %}

- **Foreach sink** - Runs arbitrary computation on the records in the output. See later in the section for more details.

{% highlight scala %}
writeStream
    .foreach(...)
    .start()
{% endhighlight %}

- **Console sink (for debugging)** - Prints the output to the console/stdout every time there is a trigger. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.

{% highlight scala %}
writeStream
    .format("console")
    .start()
{% endhighlight %}

- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.
Both, Append and Complete output modes, are supported. This should be used for debugging purposes
on low data volumes as the entire output is collected and stored in the driver's memory.
Hence, use it with caution.

{% highlight scala %}
writeStream
    .format("memory")
    .queryName("tableName")
    .start()
{% endhighlight %}

Some sinks are not fault-tolerant because they do not guarantee persistence of the output and are 
meant for debugging purposes only. See the earlier section on 
[fault-tolerance semantics](#fault-tolerance-semantics). 
Here are the details of all the sinks in Spark.

<table class="table">
  <tr>
    <th>Sink</th>
    <th>Supported Output Modes</th>
    <th>Options</th>
    <th>Fault-tolerant</th>
    <th>Notes</th>
  </tr>
  <tr>
    <td><b>File Sink</b></td>
    <td>Append</td>
    <td>
        <code>path</code>: path to the output directory, must be specified.
        <br/><br/>
        For file-format-specific options, see the related methods in DataFrameWriter
        (<a href="api/scala/index.html#org.apache.spark.sql.DataFrameWriter">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>/<a
        href="api/R/write.stream.html">R</a>).
        E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>
    </td>
    <td>Yes (exactly-once)</td>
    <td>Supports writes to partitioned tables. Partitioning by time may be useful.</td>
  </tr>
  <tr>
    <td><b>Kafka Sink</b></td>
    <td>Append, Update, Complete</td>
    <td>See the <a href="structured-streaming-kafka-0-10-integration.html">Kafka Integration Guide</a></td>
    <td>Yes (at-least-once)</td>
    <td>More details in the <a href="structured-streaming-kafka-integration.html">Kafka Integration Guide</a></td>
  </tr>
  <tr>
    <td><b>Foreach Sink</b></td>
    <td>Append, Update, Complete</td>
    <td>None</td>
    <td>Depends on ForeachWriter implementation</td>
    <td>More details in the <a href="#using-foreach-and-foreachbatch">next section</a></td>
  </tr>
  <tr>
      <td><b>ForeachBatch Sink</b></td>
      <td>Append, Update, Complete</td>
      <td>None</td>
      <td>Depends on the implementation</td>
      <td>More details in the <a href="#using-foreach-and-foreachbatch">next section</a></td>
    </tr>
    
  <tr>
    <td><b>Console Sink</b></td>
    <td>Append, Update, Complete</td>
    <td>
        <code>numRows</code>: Number of rows to print every trigger (default: 20)
        <br/>
        <code>truncate</code>: Whether to truncate the output if too long (default: true)
    </td>
    <td>No</td>
    <td></td>
  </tr>
  <tr>
    <td><b>Memory Sink</b></td>
    <td>Append, Complete</td>
    <td>None</td>
    <td>No. But in Complete Mode, restarted query will recreate the full table.</td>
    <td>Table name is the query name.</td>
  </tr>
  <tr>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
    <td></td>
  </tr>
</table>

Note that you have to call `start()` to actually start the execution of the query. This returns a StreamingQuery object which is a handle to the continuously running execution. You can use this object to manage the query, which we will discuss in the next subsection. For now, let’s understand all this with a few examples.


<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
// ========== DF with no aggregations ==========
val noAggDF = deviceDataDf.select("device").where("signal > 10")   

// Print new data to console
noAggDF
  .writeStream
  .format("console")
  .start()

// Write new data to Parquet files
noAggDF
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .start()

// ========== DF with aggregation ==========
val aggDF = df.groupBy("device").count()

// Print updated aggregations to console
aggDF
  .writeStream
  .outputMode("complete")
  .format("console")
  .start()

// Have all the aggregates in an in-memory table
aggDF
  .writeStream
  .queryName("aggregates")    // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start()

spark.sql("select * from aggregates").show()   // interactively query in-memory table
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
// ========== DF with no aggregations ==========
Dataset<Row> noAggDF = deviceDataDf.select("device").where("signal > 10");

// Print new data to console
noAggDF
  .writeStream()
  .format("console")
  .start();

// Write new data to Parquet files
noAggDF
  .writeStream()
  .format("parquet")
  .option("checkpointLocation", "path/to/checkpoint/dir")
  .option("path", "path/to/destination/dir")
  .start();

// ========== DF with aggregation ==========
Dataset<Row> aggDF = df.groupBy("device").count();

// Print updated aggregations to console
aggDF
  .writeStream()
  .outputMode("complete")
  .format("console")
  .start();

// Have all the aggregates in an in-memory table
aggDF
  .writeStream()
  .queryName("aggregates")    // this query name will be the table name
  .outputMode("complete")
  .format("memory")
  .start();

spark.sql("select * from aggregates").show();   // interactively query in-memory table
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
# ========== DF with no aggregations ==========
noAggDF = deviceDataDf.select("device").where("signal > 10")   

# Print new data to console
noAggDF \
    .writeStream \
    .format("console") \
    .start()

# Write new data to Parquet files
noAggDF \
    .writeStream \
    .format("parquet") \
    .option("checkpointLocation", "path/to/checkpoint/dir") \
    .option("path", "path/to/destination/dir") \
    .start()

# ========== DF with aggregation ==========
aggDF = df.groupBy("device").count()

# Print updated aggregations to console
aggDF \
    .writeStream \
    .outputMode("complete") \
    .format("console") \
    .start()

# Have all the aggregates in an in-memory table. The query name will be the table name
aggDF \
    .writeStream \
    .queryName("aggregates") \
    .outputMode("complete") \
    .format("memory") \
    .start()

spark.sql("select * from aggregates").show()   # interactively query in-memory table
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
# ========== DF with no aggregations ==========
noAggDF <- select(where(deviceDataDf, "signal > 10"), "device")

# Print new data to console
write.stream(noAggDF, "console")

# Write new data to Parquet files
write.stream(noAggDF,
             "parquet",
             path = "path/to/destination/dir",
             checkpointLocation = "path/to/checkpoint/dir")

# ========== DF with aggregation ==========
aggDF <- count(groupBy(df, "device"))

# Print updated aggregations to console
write.stream(aggDF, "console", outputMode = "complete")

# Have all the aggregates in an in memory table. The query name will be the table name
write.stream(aggDF, "memory", queryName = "aggregates", outputMode = "complete")

# Interactively query in-memory table
head(sql("select * from aggregates"))
{% endhighlight %}

</div>
</div>

##### Using Foreach and ForeachBatch
The `foreach` and `foreachBatch` operations allow you to apply arbitrary operations and writing 
logic on the output of a streaming query. They have slightly different use cases - while `foreach` 
allows custom write logic on every row, `foreachBatch` allows arbitrary operations 
and custom logic on the output of each micro-batch. Let's understand their usages in more detail.  

###### ForeachBatch
`foreachBatch(...)` allows you to specify a function that is executed on 
the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. 
It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
  // Transform and write batchDF 
}.start()
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
streamingDatasetOfString.writeStream().foreachBatch(
  new VoidFunction2<Dataset<String>, Long> {
    public void call(Dataset<String> dataset, Long batchId) {
      // Transform and write batchDF
    }    
  }
).start();
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
def foreach_batch_function(df, epoch_id):
    # Transform and write batchDF
    pass
  
streamingDF.writeStream.foreachBatch(foreach_batch_function).start()   
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">
R is not yet supported.
</div>
</div>

With `foreachBatch`, you can do the following.

- **Reuse existing batch data sources** - For many storage systems, there may not be a streaming sink available yet, 
  but there may already exist a data writer for batch queries. Using `foreachBatch`, you can use the batch
  data writers on the output of each micro-batch.
- **Write to multiple locations** - If you want to write the output of a streaming query to multiple locations, 
  then you can simply write the output DataFrame/Dataset multiple times. However, each attempt to write can 
  cause the output data to be recomputed (including possible re-reading of the input data). To avoid recomputations,
  you should cache the output DataFrame/Dataset, write it to multiple locations, and then uncache it. Here is an outline.  

    streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
      batchDF.persist()
      batchDF.write.format(...).save(...)  // location 1
      batchDF.write.format(...).save(...)  // location 2
      batchDF.unpersist()
    }

- **Apply additional DataFrame operations** - Many DataFrame and Dataset operations are not supported 
  in streaming DataFrames because Spark does not support generating incremental plans in those cases. 
  Using `foreachBatch`, you can apply some of these operations on each micro-batch output. However, you will have to reason about the end-to-end semantics of doing that operation yourself.

**Note:**
- By default, `foreachBatch` provides only at-least-once write guarantees. However, you can use the 
  batchId provided to the function as way to deduplicate the output and get an exactly-once guarantee.  
- `foreachBatch` does not work with the continuous processing mode as it fundamentally relies on the
  micro-batch execution of a streaming query. If you write data in the continuous mode, use `foreach` instead.


###### Foreach
If `foreachBatch` is not an option (for example, corresponding batch data writer does not exist, or 
continuous processing mode), then you can express your custom writer logic using `foreach`. 
Specifically, you can express the data writing logic by dividing it into three methods: `open`, `process`, and `close`.
Since Spark 2.4, `foreach` is available in Scala, Java and Python. 

<div class="codetabs">
<div data-lang="scala"  markdown="1">

In Scala, you have to extend the class `ForeachWriter` ([docs](api/scala/index.html#org.apache.spark.sql.ForeachWriter)).

{% highlight scala %}
streamingDatasetOfString.writeStream.foreach(
  new ForeachWriter[String] {

    def open(partitionId: Long, version: Long): Boolean = {
      // Open connection
    }

    def process(record: String): Unit = {
      // Write string to connection
    }

    def close(errorOrNull: Throwable): Unit = {
      // Close the connection
    }
  }
).start()
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

In Java, you have to extend the class `ForeachWriter` ([docs](api/java/org/apache/spark/sql/ForeachWriter.html)).
{% highlight java %}
streamingDatasetOfString.writeStream().foreach(
  new ForeachWriter[String] {

    @Override public boolean open(long partitionId, long version) {
      // Open connection
    }

    @Override public void process(String record) {
      // Write string to connection
    }

    @Override public void close(Throwable errorOrNull) {
      // Close the connection
    }
  }
).start();

{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

In Python, you can invoke foreach in two ways: in a function or in an object. 
The function offers a simple way to express your processing logic but does not allow you to 
deduplicate generated data when failures cause reprocessing of some input data. 
For that situation you must specify the processing logic in an object.

1. The function takes a row as input.

  {% highlight python %}
      def process_row(row):
          # Write row to storage
          pass
      
      query = streamingDF.writeStream.foreach(process_row).start()  
  {% endhighlight %}

2. The object has a process method and optional open and close methods: 

  {% highlight python %}
      class ForeachWriter:
          def open(self, partition_id, epoch_id):
              # Open connection. This method is optional in Python.
              pass
      
          def process(self, row):
              # Write row to connection. This method is NOT optional in Python.
              pass
      
          def close(self, error):
              # Close the connection. This method in optional in Python.
              pass
      
      query = streamingDF.writeStream.foreach(ForeachWriter()).start()
  {% endhighlight %}

</div>
<div data-lang="r"  markdown="1">
R is not yet supported.
</div>
</div>


**Execution semantics**
When the streaming query is started, Spark calls the function or the object’s methods in the following way:

- A single copy of this object is responsible for all the data generated by a single task in a query. 
  In other words, one instance is responsible for processing one partition of the data generated in a distributed manner.

- This object must be serializable, because each task will get a fresh serialized-deserialized copy 
  of the provided object. Hence, it is strongly recommended that any initialization for writing data 
  (for example. opening a connection or starting a transaction) is done after the open() method has 
  been called, which signifies that the task is ready to generate data.

- The lifecycle of the methods are as follows:

  - For each partition with partition_id:

    - For each batch/epoch of streaming data with epoch_id:

      - Method open(partitionId, epochId) is called.

      - If open(...) returns true, for each row in the partition and batch/epoch, method process(row) is called.

      - Method close(error) is called with error (if any) seen while processing rows.

- The close() method (if it exists) is called if an open() method exists and returns successfully (irrespective of the return value), except if the JVM or Python process crashes in the middle.

- **Note:** The partitionId and epochId in the open() method can be used to deduplicate generated data 
  when failures cause reprocessing of some input data. This depends on the execution mode of the query. 
  If the streaming query is being executed in the micro-batch mode, then every partition represented 
  by a unique tuple (partition_id, epoch_id) is guaranteed to have the same data. 
  Hence, (partition_id, epoch_id) can be used to deduplicate and/or transactionally commit 
  data and achieve exactly-once guarantees. However, if the streaming query is being executed 
  in the continuous mode, then this guarantee does not hold and therefore should not be used for deduplication.

#### Triggers
The trigger settings of a streaming query define the timing of streaming data processing, whether
the query is going to be executed as micro-batch query with a fixed batch interval or as a continuous processing query.
Here are the different kinds of triggers that are supported.

<table class="table">
  <tr>
    <th>Trigger Type</th>
    <th>Description</th>
  </tr>
  <tr>
    <td><i>unspecified (default)</i></td>
    <td>
        If no trigger setting is explicitly specified, then by default, the query will be
        executed in micro-batch mode, where micro-batches will be generated as soon as
        the previous micro-batch has completed processing.
    </td>
  </tr>
  <tr>
    <td><b>Fixed interval micro-batches</b></td>
    <td>
        The query will be executed with micro-batches mode, where micro-batches will be kicked off
        at the user-specified intervals.
        <ul>
          <li>If the previous micro-batch completes within the interval, then the engine will wait until
          the interval is over before kicking off the next micro-batch.</li>

          <li>If the previous micro-batch takes longer than the interval to complete (i.e. if an
          interval boundary is missed), then the next micro-batch will start as soon as the
          previous one completes (i.e., it will not wait for the next interval boundary).</li>

          <li>If no new data is available, then no micro-batch will be kicked off.</li>
        </ul>
    </td>
  </tr>
  <tr>
    <td><b>One-time micro-batch</b></td>
    <td>
        The query will execute *only one* micro-batch to process all the available data and then
        stop on its own. This is useful in scenarios you want to periodically spin up a cluster,
        process everything that is available since the last period, and then shutdown the
        cluster. In some case, this may lead to significant cost savings.
    </td>
  </tr>
  <tr>
    <td><b>Continuous with fixed checkpoint interval</b><br/><i>(experimental)</i></td>
    <td>
        The query will be executed in the new low-latency, continuous processing mode. Read more
        about this in the <a href="#continuous-processing-experimental">Continuous Processing section</a> below.
    </td>
  </tr>
</table>

Here are a few code examples.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  .start()

// ProcessingTime trigger with two-seconds micro-batch interval
df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start()

// One-time trigger
df.writeStream
  .format("console")
  .trigger(Trigger.Once())
  .start()

// Continuous trigger with one-second checkpointing interval
df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start()

{% endhighlight %}


</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
import org.apache.spark.sql.streaming.Trigger

// Default trigger (runs micro-batch as soon as it can)
df.writeStream
  .format("console")
  .start();

// ProcessingTime trigger with two-seconds micro-batch interval
df.writeStream
  .format("console")
  .trigger(Trigger.ProcessingTime("2 seconds"))
  .start();

// One-time trigger
df.writeStream
  .format("console")
  .trigger(Trigger.Once())
  .start();

// Continuous trigger with one-second checkpointing interval
df.writeStream
  .format("console")
  .trigger(Trigger.Continuous("1 second"))
  .start();

{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}

# Default trigger (runs micro-batch as soon as it can)
df.writeStream \
  .format("console") \
  .start()

# ProcessingTime trigger with two-seconds micro-batch interval
df.writeStream \
  .format("console") \
  .trigger(processingTime='2 seconds') \
  .start()

# One-time trigger
df.writeStream \
  .format("console") \
  .trigger(once=True) \
  .start()

# Continuous trigger with one-second checkpointing interval
df.writeStream
  .format("console")
  .trigger(continuous='1 second')
  .start()

{% endhighlight %}
</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
# Default trigger (runs micro-batch as soon as it can)
write.stream(df, "console")

# ProcessingTime trigger with two-seconds micro-batch interval
write.stream(df, "console", trigger.processingTime = "2 seconds")

# One-time trigger
write.stream(df, "console", trigger.once = TRUE)

# Continuous trigger is not yet supported
{% endhighlight %}
</div>
</div>


## Managing Streaming Queries
The `StreamingQuery` object created when a query is started can be used to monitor and manage the query. 

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
val query = df.writeStream.format("console").start()   // get the query object

query.id          // get the unique identifier of the running query that persists across restarts from checkpoint data

query.runId       // get the unique id of this run of the query, which will be generated at every start/restart

query.name        // get the name of the auto-generated or user-specified name

query.explain()   // print detailed explanations of the query

query.stop()      // stop the query

query.awaitTermination()   // block until query is terminated, with stop() or with error

query.exception       // the exception if the query has been terminated with error

query.recentProgress  // an array of the most recent progress updates for this query

query.lastProgress    // the most recent progress update of this streaming query
{% endhighlight %}


</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
StreamingQuery query = df.writeStream().format("console").start();   // get the query object

query.id();          // get the unique identifier of the running query that persists across restarts from checkpoint data

query.runId();       // get the unique id of this run of the query, which will be generated at every start/restart

query.name();        // get the name of the auto-generated or user-specified name

query.explain();   // print detailed explanations of the query

query.stop();      // stop the query

query.awaitTermination();   // block until query is terminated, with stop() or with error

query.exception();       // the exception if the query has been terminated with error

query.recentProgress();  // an array of the most recent progress updates for this query

query.lastProgress();    // the most recent progress update of this streaming query

{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
query = df.writeStream.format("console").start()   # get the query object

query.id()          # get the unique identifier of the running query that persists across restarts from checkpoint data

query.runId()       # get the unique id of this run of the query, which will be generated at every start/restart

query.name()        # get the name of the auto-generated or user-specified name

query.explain()   # print detailed explanations of the query

query.stop()      # stop the query

query.awaitTermination()   # block until query is terminated, with stop() or with error

query.exception()       # the exception if the query has been terminated with error

query.recentProgress()  # an array of the most recent progress updates for this query

query.lastProgress()    # the most recent progress update of this streaming query

{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
query <- write.stream(df, "console")  # get the query object

queryName(query)          # get the name of the auto-generated or user-specified name

explain(query)            # print detailed explanations of the query

stopQuery(query)          # stop the query

awaitTermination(query)   # block until query is terminated, with stop() or with error

lastProgress(query)       # the most recent progress update of this streaming query

{% endhighlight %}

</div>
</div>

You can start any number of queries in a single SparkSession. They will all be running concurrently sharing the cluster resources. You can use `sparkSession.streams()` to get the `StreamingQueryManager`
([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryManager)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryManager.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryManager) docs)
that can be used to manage the currently active queries.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
val spark: SparkSession = ...

spark.streams.active    // get the list of currently active streaming queries

spark.streams.get(id)   // get a query object by its unique id

spark.streams.awaitAnyTermination()   // block until any one of them terminates
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
SparkSession spark = ...

spark.streams().active();    // get the list of currently active streaming queries

spark.streams().get(id);   // get a query object by its unique id

spark.streams().awaitAnyTermination();   // block until any one of them terminates
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
spark = ...  # spark session

spark.streams().active  # get the list of currently active streaming queries

spark.streams().get(id)  # get a query object by its unique id

spark.streams().awaitAnyTermination()  # block until any one of them terminates
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">
{% highlight bash %}
Not available in R.
{% endhighlight %}

</div>
</div>


## Monitoring Streaming Queries
There are multiple ways to monitor active streaming queries. You can either push metrics to external systems using Spark's Dropwizard Metrics support, or access them programmatically.

### Reading Metrics Interactively

You can directly get the current status and metrics of an active query using 
`streamingQuery.lastProgress()` and `streamingQuery.status()`. 
`lastProgress()` returns a `StreamingQueryProgress` object 
in [Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryProgress) 
and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryProgress.html)
and a dictionary with the same fields in Python. It has all the information about
the progress made in the last trigger of the stream - what data was processed, 
what were the processing rates, latencies, etc. There is also 
`streamingQuery.recentProgress` which returns an array of last few progresses.  

In addition, `streamingQuery.status()` returns a `StreamingQueryStatus` object 
in [Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryStatus) 
and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryStatus.html)
and a dictionary with the same fields in Python. It gives information about
what the query is immediately doing - is a trigger active, is data being processed, etc.

Here are a few examples.

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
val query: StreamingQuery = ...

println(query.lastProgress)

/* Will print something like the following.

{
  "id" : "ce011fdc-8762-4dcb-84eb-a77333e28109",
  "runId" : "88e2ff94-ede0-45a8-b687-6316fbef529a",
  "name" : "MyQuery",
  "timestamp" : "2016-12-14T18:45:24.873Z",
  "numInputRows" : 10,
  "inputRowsPerSecond" : 120.0,
  "processedRowsPerSecond" : 200.0,
  "durationMs" : {
    "triggerExecution" : 3,
    "getOffset" : 2
  },
  "eventTime" : {
    "watermark" : "2016-12-14T18:45:24.873Z"
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topic-0]]",
    "startOffset" : {
      "topic-0" : {
        "2" : 0,
        "4" : 1,
        "1" : 1,
        "3" : 1,
        "0" : 1
      }
    },
    "endOffset" : {
      "topic-0" : {
        "2" : 0,
        "4" : 115,
        "1" : 134,
        "3" : 21,
        "0" : 534
      }
    },
    "numInputRows" : 10,
    "inputRowsPerSecond" : 120.0,
    "processedRowsPerSecond" : 200.0
  } ],
  "sink" : {
    "description" : "MemorySink"
  }
}
*/


println(query.status)

/*  Will print something like the following.
{
  "message" : "Waiting for data to arrive",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}
*/
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
StreamingQuery query = ...

System.out.println(query.lastProgress());
/* Will print something like the following.

{
  "id" : "ce011fdc-8762-4dcb-84eb-a77333e28109",
  "runId" : "88e2ff94-ede0-45a8-b687-6316fbef529a",
  "name" : "MyQuery",
  "timestamp" : "2016-12-14T18:45:24.873Z",
  "numInputRows" : 10,
  "inputRowsPerSecond" : 120.0,
  "processedRowsPerSecond" : 200.0,
  "durationMs" : {
    "triggerExecution" : 3,
    "getOffset" : 2
  },
  "eventTime" : {
    "watermark" : "2016-12-14T18:45:24.873Z"
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaSource[Subscribe[topic-0]]",
    "startOffset" : {
      "topic-0" : {
        "2" : 0,
        "4" : 1,
        "1" : 1,
        "3" : 1,
        "0" : 1
      }
    },
    "endOffset" : {
      "topic-0" : {
        "2" : 0,
        "4" : 115,
        "1" : 134,
        "3" : 21,
        "0" : 534
      }
    },
    "numInputRows" : 10,
    "inputRowsPerSecond" : 120.0,
    "processedRowsPerSecond" : 200.0
  } ],
  "sink" : {
    "description" : "MemorySink"
  }
}
*/


System.out.println(query.status());
/*  Will print something like the following.
{
  "message" : "Waiting for data to arrive",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}
*/
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
query = ...  # a StreamingQuery
print(query.lastProgress)

'''
Will print something like the following.

{u'stateOperators': [], u'eventTime': {u'watermark': u'2016-12-14T18:45:24.873Z'}, u'name': u'MyQuery', u'timestamp': u'2016-12-14T18:45:24.873Z', u'processedRowsPerSecond': 200.0, u'inputRowsPerSecond': 120.0, u'numInputRows': 10, u'sources': [{u'description': u'KafkaSource[Subscribe[topic-0]]', u'endOffset': {u'topic-0': {u'1': 134, u'0': 534, u'3': 21, u'2': 0, u'4': 115}}, u'processedRowsPerSecond': 200.0, u'inputRowsPerSecond': 120.0, u'numInputRows': 10, u'startOffset': {u'topic-0': {u'1': 1, u'0': 1, u'3': 1, u'2': 0, u'4': 1}}}], u'durationMs': {u'getOffset': 2, u'triggerExecution': 3}, u'runId': u'88e2ff94-ede0-45a8-b687-6316fbef529a', u'id': u'ce011fdc-8762-4dcb-84eb-a77333e28109', u'sink': {u'description': u'MemorySink'}}
'''

print(query.status)
''' 
Will print something like the following.

{u'message': u'Waiting for data to arrive', u'isTriggerActive': False, u'isDataAvailable': False}
'''
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
query <- ...  # a StreamingQuery
lastProgress(query)

'''
Will print something like the following.

{
  "id" : "8c57e1ec-94b5-4c99-b100-f694162df0b9",
  "runId" : "ae505c5a-a64e-4896-8c28-c7cbaf926f16",
  "name" : null,
  "timestamp" : "2017-04-26T08:27:28.835Z",
  "numInputRows" : 0,
  "inputRowsPerSecond" : 0.0,
  "processedRowsPerSecond" : 0.0,
  "durationMs" : {
    "getOffset" : 0,
    "triggerExecution" : 1
  },
  "stateOperators" : [ {
    "numRowsTotal" : 4,
    "numRowsUpdated" : 0
  } ],
  "sources" : [ {
    "description" : "TextSocketSource[host: localhost, port: 9999]",
    "startOffset" : 1,
    "endOffset" : 1,
    "numInputRows" : 0,
    "inputRowsPerSecond" : 0.0,
    "processedRowsPerSecond" : 0.0
  } ],
  "sink" : {
    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@76b37531"
  }
}
'''

status(query)
'''
Will print something like the following.

{
  "message" : "Waiting for data to arrive",
  "isDataAvailable" : false,
  "isTriggerActive" : false
}
'''
{% endhighlight %}

</div>
</div>

### Reporting Metrics programmatically using Asynchronous APIs

You can also asynchronously monitor all queries associated with a
`SparkSession` by attaching a `StreamingQueryListener`
([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryListener)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html) docs).
Once you attach your custom `StreamingQueryListener` object with
`sparkSession.streams.attachListener()`, you will get callbacks when a query is started and
stopped and when there is progress made in an active query. Here is an example,

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
val spark: SparkSession = ...

spark.streams.addListener(new StreamingQueryListener() {
    override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
        println("Query started: " + queryStarted.id)
    }
    override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
        println("Query terminated: " + queryTerminated.id)
    }
    override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
        println("Query made progress: " + queryProgress.progress)
    }
})
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
SparkSession spark = ...

spark.streams().addListener(new StreamingQueryListener() {
    @Override
    public void onQueryStarted(QueryStartedEvent queryStarted) {
        System.out.println("Query started: " + queryStarted.id());
    }
    @Override
    public void onQueryTerminated(QueryTerminatedEvent queryTerminated) {
        System.out.println("Query terminated: " + queryTerminated.id());
    }
    @Override
    public void onQueryProgress(QueryProgressEvent queryProgress) {
        System.out.println("Query made progress: " + queryProgress.progress());
    }
});
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">
{% highlight bash %}
Not available in Python.
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">
{% highlight bash %}
Not available in R.
{% endhighlight %}

</div>
</div>

### Reporting Metrics using Dropwizard 
Spark supports reporting metrics using the [Dropwizard Library](monitoring.html#metrics). To enable metrics of Structured Streaming queries to be reported as well, you have to explicitly enable the configuration `spark.sql.streaming.metricsEnabled` in the SparkSession. 

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% highlight scala %}
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
// or
spark.sql("SET spark.sql.streaming.metricsEnabled=true")
{% endhighlight %}
</div>
<div data-lang="java"  markdown="1">  
{% highlight java %}
spark.conf().set("spark.sql.streaming.metricsEnabled", "true");
// or
spark.sql("SET spark.sql.streaming.metricsEnabled=true");
{% endhighlight %}
</div>
<div data-lang="python"  markdown="1">  
{% highlight python %}
spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
# or
spark.sql("SET spark.sql.streaming.metricsEnabled=true")
{% endhighlight %}
</div>
<div data-lang="r"  markdown="1">
{% highlight r %}
sql("SET spark.sql.streaming.metricsEnabled=true")
{% endhighlight %}
</div>
</div>


All queries started in the SparkSession after this configuration has been enabled will report metrics through Dropwizard to whatever [sinks](monitoring.html#metrics) have been configured (e.g. Ganglia, Graphite, JMX, etc.).

## Recovering from Failures with Checkpointing 
In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the [quick example](#quick-example)) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when [starting a query](#starting-streaming-queries).

<div class="codetabs">
<div data-lang="scala"  markdown="1">

{% highlight scala %}
aggDF
  .writeStream
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start()
{% endhighlight %}

</div>
<div data-lang="java"  markdown="1">

{% highlight java %}
aggDF
  .writeStream()
  .outputMode("complete")
  .option("checkpointLocation", "path/to/HDFS/dir")
  .format("memory")
  .start();
{% endhighlight %}

</div>
<div data-lang="python"  markdown="1">

{% highlight python %}
aggDF \
    .writeStream \
    .outputMode("complete") \
    .option("checkpointLocation", "path/to/HDFS/dir") \
    .format("memory") \
    .start()
{% endhighlight %}

</div>
<div data-lang="r"  markdown="1">

{% highlight r %}
write.stream(aggDF, "memory", outputMode = "complete", checkpointLocation = "path/to/HDFS/dir")
{% endhighlight %}

</div>
</div>


## Recovery Semantics after Changes in a Streaming Query
There are limitations on what changes in a streaming query are allowed between restarts from the 
same checkpoint location. Here are a few kinds of changes that are either not allowed, or 
the effect of the change is not well-defined. For all of them:

- The term *allowed* means you can do the specified change but whether the semantics of its effect 
  is well-defined depends on the query and the change.

- The term *not allowed* means you should not do the specified change as the restarted query is likely 
  to fail with unpredictable errors. `sdf` represents a streaming DataFrame/Dataset 
  generated with sparkSession.readStream.
  
**Types of changes**

- *Changes in the number or type (i.e. different source) of input sources*: This is not allowed.

- *Changes in the parameters of input sources*: Whether this is allowed and whether the semantics 
  of the change are well-defined depends on the source and the query. Here are a few examples.

  - Addition/deletion/modification of rate limits is allowed: `spark.readStream.format("kafka").option("subscribe", "topic")` to `spark.readStream.format("kafka").option("subscribe", "topic").option("maxOffsetsPerTrigger", ...)`

  - Changes to subscribed topics/files are generally not allowed as the results are unpredictable: `spark.readStream.format("kafka").option("subscribe", "topic")` to `spark.readStream.format("kafka").option("subscribe", "newTopic")`

- *Changes in the type of output sink*: Changes between a few specific combinations of sinks 
  are allowed. This needs to be verified on a case-by-case basis. Here are a few examples.

  - File sink to Kafka sink is allowed. Kafka will see only the new data.

  - Kafka sink to file sink is not allowed.

  - Kafka sink changed to foreach, or vice versa is allowed.

- *Changes in the parameters of output sink*: Whether this is allowed and whether the semantics of 
  the change are well-defined depends on the sink and the query. Here are a few examples.

  - Changes to output directory of a file sink are not allowed: `sdf.writeStream.format("parquet").option("path", "/somePath")` to `sdf.writeStream.format("parquet").option("path", "/anotherPath")`

  - Changes to output topic are allowed: `sdf.writeStream.format("kafka").option("topic", "someTopic")` to `sdf.writeStream.format("kafka").option("topic", "anotherTopic")`

  - Changes to the user-defined foreach sink (that is, the `ForeachWriter` code) are allowed, but the semantics of the change depends on the code.

- *Changes in projection / filter / map-like operations**: Some cases are allowed. For example:

  - Addition / deletion of filters is allowed: `sdf.selectExpr("a")` to `sdf.where(...).selectExpr("a").filter(...)`.

  - Changes in projections with same output schema are allowed: `sdf.selectExpr("stringColumn AS json").writeStream` to `sdf.selectExpr("anotherStringColumn AS json").writeStream`

  - Changes in projections with different output schema are conditionally allowed: `sdf.selectExpr("a").writeStream` to `sdf.selectExpr("b").writeStream` is allowed only if the output sink allows the schema change from `"a"` to `"b"`.

- *Changes in stateful operations*: Some operations in streaming queries need to maintain
  state data in order to continuously update the result. Structured Streaming automatically checkpoints
  the state data to fault-tolerant storage (for example, HDFS, AWS S3, Azure Blob storage) and restores it after restart.
  However, this assumes that the schema of the state data remains same across restarts. This means that
  *any changes (that is, additions, deletions, or schema modifications) to the stateful operations of a streaming query are not allowed between restarts*.
  Here is the list of stateful operations whose schema should not be changed between restarts in order to ensure state recovery:

  - *Streaming aggregation*: For example, `sdf.groupBy("a").agg(...)`. Any change in number or type of grouping keys or aggregates is not allowed.

  - *Streaming deduplication*: For example, `sdf.dropDuplicates("a")`. Any change in number or type of grouping keys or aggregates is not allowed.

  - *Stream-stream join*: For example, `sdf1.join(sdf2, ...)` (i.e. both inputs are generated with `sparkSession.readStream`). Changes
    in the schema or equi-joining columns are not allowed. Changes in join type (outer or inner) are not allowed. Other changes in the join condition are ill-defined.

  - *Arbitrary stateful operation*: For example, `sdf.groupByKey(...).mapGroupsWithState(...)` or `sdf.groupByKey(...).flatMapGroupsWithState(...)`.
    Any change to the schema of the user-defined state and the type of timeout is not allowed.
    Any change within the user-defined state-mapping function are allowed, but the semantic effect of the change depends on the user-defined logic.
    If you really want to support state schema changes, then you can explicitly encode/decode your complex state data
    structures into bytes using an encoding/decoding scheme that supports schema migration. For example,
    if you save your state as Avro-encoded bytes, then you are free to change the Avro-state-schema between query
    restarts as the binary state will always be restored successfully.

# Continuous Processing
## [Experimental]
{:.no_toc}

**Continuous processing** is a new, experimental streaming execution mode introduced in Spark 2.3 that enables low (~1 ms) end-to-end latency with at-least-once fault-tolerance guarantees. Compare this with the default *micro-batch processing* engine which can achieve exactly-once guarantees but achieve latencies of ~100ms at best. For some types of queries (discussed below), you can choose which mode to execute them in without modifying the application logic (i.e. without changing the DataFrame/Dataset operations). 

To run a supported query in continuous processing mode, all you need to do is specify a **continuous trigger** with the desired checkpoint interval as a parameter. For example, 

<div class="codetabs">
<div data-lang="scala"  markdown="1">
{% highlight scala %}
import org.apache.spark.sql.streaming.Trigger

spark
  .readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .option("")

spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .trigger(Trigger.Continuous("1 second"))  // only change in query
  .start()
{% endhighlight %}
</div>
<div data-lang="java"  markdown="1">  
{% highlight java %}
import org.apache.spark.sql.streaming.Trigger;

spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("subscribe", "topic1")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
  .option("topic", "topic1")
  .trigger(Trigger.Continuous("1 second"))  // only change in query
  .start();
{% endhighlight %}
</div>
<div data-lang="python"  markdown="1">  
{% highlight python %}
spark \
  .readStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("subscribe", "topic1") \
  .load() \
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
  .writeStream \
  .format("kafka") \
  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
  .option("topic", "topic1") \
  .trigger(continuous="1 second") \     # only change in query
  .start()

{% endhighlight %}
</div>
</div>

A checkpoint interval of 1 second means that the continuous processing engine will record the progress of the query every second. The resulting checkpoints are in a format compatible with the micro-batch engine, hence any query can be restarted with any trigger. For example, a supported query started with the micro-batch mode can be restarted in continuous mode, and vice versa. Note that any time you switch to continuous mode, you will get at-least-once fault-tolerance guarantees.

## Supported Queries
{:.no_toc}

As of Spark 2.3, only the following type of queries are supported in the continuous processing mode.

- *Operations*: Only map-like Dataset/DataFrame operations are supported in continuous mode, that is, only projections (`select`, `map`, `flatMap`, `mapPartitions`, etc.) and selections (`where`, `filter`, etc.).
  + All SQL functions are supported except aggregation functions (since aggregations are not yet supported), `current_timestamp()` and `current_date()` (deterministic computations using time is challenging).

- *Sources*:
  + Kafka source: All options are supported.
  + Rate source: Good for testing. Only options that are supported in the continuous mode are `numPartitions` and `rowsPerSecond`.

- *Sinks*: 
  + Kafka sink: All options are supported.
  + Memory sink: Good for debugging.
  + Console sink: Good for debugging. All options are supported. Note that the console will print every checkpoint interval that you have specified in the continuous trigger. 

See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections for more details on them. While the console sink is good for testing, the end-to-end low-latency processing can be best observed with Kafka as the source and sink, as this allows the engine to process the data and make the results available in the output topic within milliseconds of the input data being available in the input topic.

## Caveats
{:.no_toc}

- Continuous processing engine launches multiple long-running tasks that continuously read data from sources, process it and continuously write to sinks. The number of tasks required by the query depends on how many partitions the query can read from the sources in parallel. Therefore, before starting a continuous processing query, you must ensure there are enough cores in the cluster to all the tasks in parallel. For example, if you are reading from a Kafka topic that has 10 partitions, then the cluster must have at least 10 cores for the query to make progress.
- Stopping a continuous processing stream may produce spurious task termination warnings. These can be safely ignored.
- There are currently no automatic retries of failed tasks. Any failure will lead to the query being stopped and it needs to be manually restarted from the checkpoint.

# Additional Information

**Further Reading**

- See and run the
  [Scala]({{site.SPARK_GITHUB_URL}}/tree/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming)/[Java]({{site.SPARK_GITHUB_URL}}/tree/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming)/[Python]({{site.SPARK_GITHUB_URL}}/tree/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming)/[R]({{site.SPARK_GITHUB_URL}}/tree/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/r/streaming)
  examples.
    - [Instructions](index.html#running-the-examples-and-shell) on how to run Spark examples
- Read about integrating with Kafka in the [Structured Streaming Kafka Integration Guide](structured-streaming-kafka-integration.html)
- Read more details about using DataFrames/Datasets in the [Spark SQL Programming Guide](sql-programming-guide.html)
- Third-party Blog Posts
    - [Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1 (Databricks Blog)](https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html)
    - [Real-Time End-to-End Integration with Apache Kafka in Apache Spark’s Structured Streaming (Databricks Blog)](https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html)
    - [Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming (Databricks Blog)](https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html)

**Talks**

- Spark Summit Europe 2017
  - Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming in Apache Spark -
    [Part 1 slides/video](https://databricks.com/session/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-in-apache-spark), [Part 2 slides/video](https://databricks.com/session/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-in-apache-spark-continues)
  - Deep Dive into Stateful Stream Processing in Structured Streaming - [slides/video](https://databricks.com/session/deep-dive-into-stateful-stream-processing-in-structured-streaming)
- Spark Summit 2016
  - A Deep Dive into Structured Streaming - [slides/video](https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/)
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								---
 								layout: global
-												[SPARK-20844] Remove experimental from Structured Streaming APIs

Now that Structured Streaming has been out for several Spark release and has large production use cases, the `Experimental` label is no longer appropriate.  I've left `InterfaceStability.Evolving` however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.

Author: Michael Armbrust <michael@databricks.com>

Closes #18065 from marmbrus/streamingGA.

											
										
										
											2017-05-26 16:33:23 -04:00
+								displayTitle: Structured Streaming Programming Guide
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								title: Structured Streaming Programming Guide
 								---
 								* This will become a table of contents (this text will be scraped).
 								{:toc}
 								# Overview
-												[MINOR][DOC] Fix some typos and grammar issues

## What changes were proposed in this pull request?

Easy fix in the documentation.

## How was this patch tested?

N/A

Closes #20948

Author: Daniel Sakuma <dsakuma@gmail.com>

Closes #20928 from dsakuma/fix_typo_configuration_docs.

											
										
										
											2018-04-06 01:37:08 -04:00
+								Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive. You can use the [Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java, Python or R to express streaming aggregations, event-time windows, stream-to-batch joins, etc. The computation is executed on the same optimized Spark SQL engine. Finally, the system ensures end-to-end exactly-once fault-tolerance guarantees through checkpointing and Write-Ahead Logs. In short, *Structured Streaming provides fast, scalable, fault-tolerant, end-to-end exactly-once stream processing without the user having to reason about streaming.*
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-23165][DOC] Spelling mistake fix in quick-start doc.

## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.

Author: Shashwat Anand <me@shashwat.me>

Closes #20336 from ashashwat/SPARK-23165.

											
										
										
											2018-01-20 17:34:37 -05:00
+								Internally, by default, Structured Streaming queries are processed using a *micro-batch processing* engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees. However, since Spark 2.3, we have introduced a new low-latency processing mode called **Continuous Processing**, which can achieve end-to-end latencies as low as 1 millisecond with at-least-once guarantees. Without changing the Dataset/DataFrame operations in your queries, you will be able to choose the mode based on your application requirements.
-												[SPARK-23142][SS][DOCS] Added docs for continuous processing

## What changes were proposed in this pull request?

Added documentation for continuous processing. Modified two locations.
- Modified the overview to have a mention of Continuous Processing.
- Added a new section on Continuous Processing at the end.

![image](https://user-images.githubusercontent.com/663212/35083551-a3dd23f6-fbd4-11e7-9e7e-90866f131ca9.png)
![image](https://user-images.githubusercontent.com/663212/35083618-d844027c-fbd4-11e7-9fde-75992cc517bd.png)

## How was this patch tested?
N/A

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20308 from tdas/SPARK-23142.

											
										
										
											2018-01-18 19:29:45 -05:00
 								In this guide, we are going to walk you through the programming model and the APIs. We are going to explain the concepts mostly using the default micro-batch processing model, and then [later](#continuous-processing-experimental) discuss Continuous Processing model. First, let's start with a simple example of a Structured Streaming query - a streaming word count.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								# Quick Example
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								Let’s say you want to maintain a running word count of text data received from a data server listening on a TCP socket. Let’s see how you can express this using Structured Streaming. You can see the full code in
 								[Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCount.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java)/[Python]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming/structured_network_wordcount.py)/[R]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/r/streaming/structured_network_wordcount.R).
-												[DOCS][MINOR] Fix a few broken links and typos, and, nit, use HTTPS more consistently

## What changes were proposed in this pull request?

Fix a few broken links and typos, and, nit, use HTTPS more consistently esp. on scripts and Apache links

## How was this patch tested?

Doc build

Closes #22172 from srowen/DocTypo.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

											
										
										
											2018-08-21 13:02:17 -04:00
+								And if you [download Spark](https://spark.apache.org/downloads.html), you can directly [run the example](index.html#running-the-examples-and-shell). In any case, let’s walk through the example step-by-step and understand how it works. First, we have to import the necessary classes and create a local SparkSession, the starting point of all functionalities related to Spark.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								{% highlight scala %}
 								import org.apache.spark.sql.functions._
 								import org.apache.spark.sql.SparkSession
 								val spark = SparkSession
 								  .builder
 								  .appName("StructuredNetworkWordCount")
 								  .getOrCreate()
 								import spark.implicits._
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								</div>
 								<div data-lang="java"  markdown="1">
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								{% highlight java %}
 								import org.apache.spark.api.java.function.FlatMapFunction;
 								import org.apache.spark.sql.*;
 								import org.apache.spark.sql.streaming.StreamingQuery;
 								import java.util.Arrays;
 								import java.util.Iterator;
 								SparkSession spark = SparkSession
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .builder()
 								  .appName("JavaStructuredNetworkWordCount")
 								  .getOrCreate();
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								</div>
 								<div data-lang="python"  markdown="1">
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								{% highlight python %}
 								from pyspark.sql import SparkSession
 								from pyspark.sql.functions import explode
 								from pyspark.sql.functions import split
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								spark = SparkSession \
-												[SPARK-18426][STRUCTURED STREAMING] Python Documentation Fix for Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Update the python section of the Structured Streaming Guide from .builder() to .builder

## How was this patch tested?

Validated documentation and successfully running the test example.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.

'Builder' object is not callable object hence changed .builder() to
.builder

Author: Denny Lee <dennylee@gallifrey.local>

Closes #15872 from dennyglee/master.

											
										
										
											2016-11-13 21:10:06 -05:00
+								    .builder \
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								    .appName("StructuredNetworkWordCount") \
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								    .getOrCreate()
 								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								sparkR.session(appName = "StructuredNetworkWordCount")
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
 								Next, let’s create a streaming DataFrame that represents text data received from a server listening on localhost:9999, and transform the DataFrame to calculate word counts.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								// Create DataFrame representing the stream of input lines from connection to localhost:9999
 								val lines = spark.readStream
 								  .format("socket")
 								  .option("host", "localhost")
 								  .option("port", 9999)
 								  .load()
 								// Split the lines into words
 								val words = lines.as[String].flatMap(_.split(" "))
 								// Generate running word count
 								val wordCounts = words.groupBy("value").count()
 								{% endhighlight %}
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a  Dataset of String using `.as[String]`, so that we can apply the `flatMap` operation to split each line into multiple words. The resultant `words` Dataset contains all the words. Finally, we have defined the `wordCounts` DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								// Create DataFrame representing the stream of input lines from connection to localhost:9999
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								Dataset<Row> lines = spark
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								  .readStream()
 								  .format("socket")
 								  .option("host", "localhost")
 								  .option("port", 9999)
 								  .load();
 								// Split the lines into words
 								Dataset<String> words = lines
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .as(Encoders.STRING())
-												[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support

- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings

For the future:

- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16871 from srowen/SPARK-19493.

											
										
										
											2017-02-16 07:32:45 -05:00
+								  .flatMap((FlatMapFunction<String, String>) x -> Arrays.asList(x.split(" ")).iterator(), Encoders.STRING());
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Generate running word count
 								Dataset<Row> wordCounts = words.groupBy("value").count();
 								{% endhighlight %}
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have converted the DataFrame to a  Dataset of String using `.as(Encoders.STRING())`, so that we can apply the `flatMap` operation to split each line into multiple words. The resultant `words` Dataset contains all the words. Finally, we have defined the `wordCounts` DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								# Create DataFrame representing the stream of input lines from connection to localhost:9999
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								lines = spark \
 								    .readStream \
 								    .format("socket") \
 								    .option("host", "localhost") \
 								    .option("port", 9999) \
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								    .load()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								# Split the lines into words
 								words = lines.select(
 								   explode(
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								       split(lines.value, " ")
 								   ).alias("word")
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								)
 								# Generate running word count
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								wordCounts = words.groupBy("word").count()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								This `lines` DataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have used two built-in SQL functions - split and explode, to split each line into multiple rows with a word each. In addition, we use the function `alias` to name the new column as "word". Finally, we have defined the `wordCounts` DataFrame by grouping by the unique values in the Dataset and counting them. Note that this is a streaming DataFrame which represents the running word counts of the stream.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								# Create DataFrame representing the stream of input lines from connection to localhost:9999
 								lines <- read.stream("socket", host = "localhost", port = 9999)
 								# Split the lines into words
 								words <- selectExpr(lines, "explode(split(value, ' ')) as word")
 								# Generate running word count
 								wordCounts <- count(group_by(words, "word"))
 								{% endhighlight %}
 								This `lines` SparkDataFrame represents an unbounded table containing the streaming text data. This table contains one column of strings named "value", and each line in the streaming text data becomes a row in the table. Note, that this is not currently receiving any data as we are just setting up the transformation, and have not yet started it. Next, we have a SQL expression with two SQL functions - split and explode, to split each line into multiple rows with a word each. In addition, we name the new column as "word". Finally, we have defined the `wordCounts` SparkDataFrame by grouping by the unique values in the SparkDataFrame and counting them. Note that this is a streaming SparkDataFrame which represents the running word counts of the stream.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								We have now set up the query on the streaming data. All that is left is to actually start receiving data and computing the counts. To do this, we set it up to print the complete set of counts (specified by `outputMode("complete")`) to the console every time they are updated. And then start the streaming computation using `start()`.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								// Start running the query that prints the running counts to the console
 								val query = wordCounts.writeStream
 								  .outputMode("complete")
 								  .format("console")
 								  .start()
 								query.awaitTermination()
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								// Start running the query that prints the running counts to the console
 								StreamingQuery query = wordCounts.writeStream()
 								  .outputMode("complete")
 								  .format("console")
 								  .start();
 								query.awaitTermination();
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								 # Start running the query that prints the running counts to the console
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								query = wordCounts \
 								    .writeStream \
 								    .outputMode("complete") \
 								    .format("console") \
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								    .start()
 								query.awaitTermination()
 								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								# Start running the query that prints the running counts to the console
 								query <- write.stream(wordCounts, "console", outputMode = "complete")
 								awaitTermination(query)
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								After this code is executed, the streaming computation will have started in the background. The `query` object is a handle to that active streaming query, and we have decided to wait for the termination of the query using `awaitTermination()` to prevent the process from exiting while the query is active.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								To actually execute this example code, you can either compile the code in your own
 								[Spark application](quick-start.html#self-contained-applications), or simply
 								[run the example](index.html#running-the-examples-and-shell) once you have downloaded Spark. We are showing the latter. You will first need to run Netcat (a small utility found in most Unix-like systems) as a data server by using
 								    $ nc -lk 9999
 								Then, in a different terminal, you can start the example by using
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight bash %}
 								$ ./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight bash %}
 								$ ./bin/run-example org.apache.spark.examples.sql.streaming.JavaStructuredNetworkWordCount localhost 9999
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								{% highlight bash %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								$ ./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999
 								{% endhighlight %}
 								</div>
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								<div data-lang="r"  markdown="1">
 								{% highlight bash %}
 								$ ./bin/spark-submit examples/src/main/r/streaming/structured_network_wordcount.R localhost 9999
 								{% endhighlight %}
 								</div>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								Then, any lines typed in the terminal running the netcat server will be counted and printed on screen every second. It will look something like the following.
 								<table width="100%">
 								    <td>
 								{% highlight bash %}
 								# TERMINAL 1:
 								# Running Netcat
 								$ nc -lk 9999
 								apache spark
 								apache hadoop
 								...
 								{% endhighlight %}
 								    </td>
 								    <td width="2%"></td>
 								    <td>
 								<div class="codetabs">
 								<div data-lang="scala" markdown="1">
 								{% highlight bash %}
 								# TERMINAL 2: RUNNING StructuredNetworkWordCount
 								$ ./bin/run-example org.apache.spark.examples.sql.streaming.StructuredNetworkWordCount localhost 9999
 								-------------------------------------------
 								Batch: 0
 								-------------------------------------------
 								+------+-----+
 								| value|count|
 								+------+-----+
 								|apache|    1|
 								| spark|    1|
 								+------+-----+
 								-------------------------------------------
 								Batch: 1
 								-------------------------------------------
 								+------+-----+
 								| value|count|
 								+------+-----+
 								|apache|    2|
 								| spark|    1|
 								|hadoop|    1|
 								+------+-----+
 								...
 								{% endhighlight %}
 								</div>
 								<div data-lang="java" markdown="1">
 								{% highlight bash %}
 								# TERMINAL 2: RUNNING JavaStructuredNetworkWordCount
 								$ ./bin/run-example org.apache.spark.examples.sql.streaming.JavaStructuredNetworkWordCount localhost 9999
 								-------------------------------------------
 								Batch: 0
 								-------------------------------------------
 								+------+-----+
 								| value|count|
 								+------+-----+
 								|apache|    1|
 								| spark|    1|
 								+------+-----+
 								-------------------------------------------
 								Batch: 1
 								-------------------------------------------
 								+------+-----+
 								| value|count|
 								+------+-----+
 								|apache|    2|
 								| spark|    1|
 								|hadoop|    1|
 								+------+-----+
 								...
 								{% endhighlight %}
 								</div>
 								<div data-lang="python" markdown="1">
 								{% highlight bash %}
 								# TERMINAL 2: RUNNING structured_network_wordcount.py
 								$ ./bin/spark-submit examples/src/main/python/sql/streaming/structured_network_wordcount.py localhost 9999
 								-------------------------------------------
 								Batch: 0
 								-------------------------------------------
 								+------+-----+
 								| value|count|
 								+------+-----+
 								|apache|    1|
 								| spark|    1|
 								+------+-----+
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								-------------------------------------------
 								Batch: 1
 								-------------------------------------------
 								+------+-----+
 								| value|count|
 								+------+-----+
 								|apache|    2|
 								| spark|    1|
 								|hadoop|    1|
 								+------+-----+
 								...
 								{% endhighlight %}
 								</div>
 								<div data-lang="r" markdown="1">
 								{% highlight bash %}
 								# TERMINAL 2: RUNNING structured_network_wordcount.R
 								$ ./bin/spark-submit examples/src/main/r/streaming/structured_network_wordcount.R localhost 9999
 								-------------------------------------------
 								Batch: 0
 								-------------------------------------------
 								+------+-----+
 								| value|count|
 								+------+-----+
 								|apache|    1|
 								| spark|    1|
 								+------+-----+
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								-------------------------------------------
 								Batch: 1
 								-------------------------------------------
 								+------+-----+
 								| value|count|
 								+------+-----+
 								|apache|    2|
 								| spark|    1|
 								|hadoop|    1|
 								+------+-----+
 								...
 								{% endhighlight %}
 								</div>
 								</div>
 								    </td>
 								</table>
 								# Programming Model
 								The key idea in Structured Streaming is to treat a live data stream as a
 								table that is being continuously appended. This leads to a new stream
 								processing model that is very similar to a batch processing model. You will
 								express your streaming computation as standard batch-like query as on a static
 								table, and Spark runs it as an *incremental* query on the *unbounded* input
 								table. Let’s understand this model in more detail.
 								## Basic Concepts
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								Consider the input data stream as the "Input Table". Every data item that is
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								arriving on the stream is like a new row being appended to the Input Table.
 								![Stream as a Table](img/structured-streaming-stream-as-a-table.png "Stream as a Table")
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								A query on the input will generate the "Result Table". Every trigger interval (say, every 1 second), new rows get appended to the Input Table, which eventually updates the Result Table. Whenever the result table gets updated, we would want to write the changed result rows to an external sink.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								![Model](img/structured-streaming-model.png)
-												[MINOR][DOCS] Fix spacings in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

1. Omitted space between the sentences: `... on static data.The Spark SQL engine will ...` -> `... on static data. The Spark SQL engine will ...`
2. Omitted colon in Output Model section.

## How was this patch tested?

None.

Author: Lee Dongjin <dongjin@apache.org>

Closes #17564 from dongjinleekr/feature/fix-programming-guide.

											
										
										
											2017-04-12 04:12:14 -04:00
+								The "Output" is defined as what gets written out to the external storage. The output can be defined in a different mode:
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								  - *Complete Mode* - The entire updated Result Table will be written to the external storage. It is up to the storage connector to decide how to handle writing of the entire table.
 								  - *Append Mode* - Only the new rows appended in the Result Table since the last trigger will be written to the external storage. This is applicable only on the queries where existing rows in the Result Table are not expected to change.
-												[SPARK-19140][SS] Allow update mode for non-aggregation streaming queries

## What changes were proposed in this pull request?

This PR allow update mode for non-aggregation streaming queries. It will be same as the append mode if a query has no aggregations.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16520 from zsxwing/update-without-agg.

											
										
										
											2017-01-10 20:58:11 -05:00
+								  - *Update Mode* - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn't contain aggregations, it will be equivalent to Append mode.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								Note that each mode is applicable on certain types of queries. This is discussed in detail [later](#output-modes).
 								To illustrate the use of this model, let’s understand the model in context of
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								the [Quick Example](#quick-example) above. The first `lines` DataFrame is the input table, and
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								the final `wordCounts` DataFrame is the result table. Note that the query on
 								streaming `lines` DataFrame to generate `wordCounts` is *exactly the same* as
 								it would be a static DataFrame. However, when this query is started, Spark
 								will continuously check for new data from the socket connection. If there is
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								new data, Spark will run an "incremental" query that combines the previous
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								running counts with the new data to compute updated counts, as shown below.
 								![Model](img/structured-streaming-example-model.png)
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								**Note that Structured Streaming does not materialize the entire table**. It reads the latest
 								available data from the streaming data source, processes it incrementally to update the result,
 								and then discards the source data. It only keeps around the minimal intermediate *state* data as
 								required to update the result (e.g. intermediate counts in the earlier example).
 								This model is significantly different from many other stream processing
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								engines. Many streaming systems require the user to maintain running
 								aggregations themselves, thus having to reason about fault-tolerance, and
 								data consistency (at-least-once, or at-most-once, or exactly-once). In this
 								model, Spark is responsible for updating the Result Table when there is new
 								data, thus relieving the users from reasoning about it. As an example, let’s
 								see how this model handles event-time based processing and late arriving data.
 								## Handling Event-time and Late Data
-												[MINOR][DOCS] Fix few typos in structured streaming doc

## What changes were proposed in this pull request?

Minor typo in `even-time`, which is changed to `event-time` and a couple of grammatical errors fix.

## How was this patch tested?

N/A - since this is a doc fix. I did a jekyll build locally though.

Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>

Closes #17037 from ramkumarvenkat/doc-fix.

											
										
										
											2017-02-24 21:18:22 -05:00
+								Event-time is the time embedded in the data itself. For many applications, you may want to operate on this event-time. For example, if you want to get the number of events generated by IoT devices every minute, then you probably want to use the time when the data was generated (that is, event-time in the data), rather than the time Spark receives them. This event-time is very naturally expressed in this model -- each event from the devices is a row in the table, and event-time is a column value in the row. This allows window-based aggregations (e.g. number of events every minute) to be just a special type of grouping and aggregation on the event-time column -- each time window is a group and each row can belong to multiple windows/groups. Therefore, such event-time-window-based aggregation queries can be defined consistently on both a static dataset (e.g. from collected device events logs) as well as on a data stream, making the life of the user much easier.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								Furthermore, this model naturally handles data that has arrived later than
 								expected based on its event-time. Since Spark is updating the Result Table,
 								it has full control over updating old aggregates when there is late data,
 								as well as cleaning up old aggregates to limit the size of intermediate
 								state data. Since Spark 2.1, we have support for watermarking which
 								allows the user to specify the threshold of late data, and allows the engine
 								to accordingly clean up old state. These are explained later in more
-												[MINOR][DOCS] Fix few typos in structured streaming doc

## What changes were proposed in this pull request?

Minor typo in `even-time`, which is changed to `event-time` and a couple of grammatical errors fix.

## How was this patch tested?

N/A - since this is a doc fix. I did a jekyll build locally though.

Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>

Closes #17037 from ramkumarvenkat/doc-fix.

											
										
										
											2017-02-24 21:18:22 -05:00
+								detail in the [Window Operations](#window-operations-on-event-time) section.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								## Fault Tolerance Semantics
 								Delivering end-to-end exactly-once semantics was one of key goals behind the design of Structured Streaming. To achieve that, we have designed the Structured Streaming sources, the sinks and the execution engine to reliably track the exact progress of the processing so that it can handle any kind of failure by restarting and/or reprocessing. Every streaming source is assumed to have offsets (similar to Kafka offsets, or Kinesis sequence numbers)
-												[MINOR][DOC] Fix some typos and grammar issues

## What changes were proposed in this pull request?

Easy fix in the documentation.

## How was this patch tested?

N/A

Closes #20948

Author: Daniel Sakuma <dsakuma@gmail.com>

Closes #20928 from dsakuma/fix_typo_configuration_docs.

											
										
										
											2018-04-06 01:37:08 -04:00
+								to track the read position in the stream. The engine uses checkpointing and write-ahead logs to record the offset range of the data being processed in each trigger. The streaming sinks are designed to be idempotent for handling reprocessing. Together, using replayable sources and idempotent sinks, Structured Streaming can ensure **end-to-end exactly-once semantics** under any failure.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								# API using Datasets and DataFrames
-												[MINOR][DOC] Remove spaces following slashs

## What changes were proposed in this pull request?

This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?

Manual by the followings because this is documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.

											
										
										
											2016-11-01 09:08:49 -04:00
+								Since Spark 2.0, DataFrames and Datasets can represent static, bounded data, as well as streaming, unbounded data. Similar to static Datasets/DataFrames, you can use the common entry point `SparkSession`
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								([Scala](api/scala/index.html#org.apache.spark.sql.SparkSession)/[Java](api/java/org/apache/spark/sql/SparkSession.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.SparkSession)/[R](api/R/sparkR.session.html) docs)
-												[MINOR][DOC] Remove spaces following slashs

## What changes were proposed in this pull request?

This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?

Manual by the followings because this is documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.

											
										
										
											2016-11-01 09:08:49 -04:00
+								to create streaming DataFrames/Datasets from streaming sources, and apply the same operations on them as static DataFrames/Datasets. If you are not familiar with Datasets/DataFrames, you are strongly advised to familiarize yourself with them using the
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								[DataFrame/Dataset Programming Guide](sql-programming-guide.html).
 								## Creating streaming DataFrames and streaming Datasets
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								Streaming DataFrames can be created through the `DataStreamReader` interface
-												[MINOR][DOC] Remove spaces following slashs

## What changes were proposed in this pull request?

This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?

Manual by the followings because this is documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.

											
										
										
											2016-11-01 09:08:49 -04:00
+								([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamReader.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader) docs)
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								returned by `SparkSession.readStream()`. In [R](api/R/read.stream.html), with the `read.stream()` method. Similar to the read interface for creating static DataFrame, you can specify the details of the source – data format, schema, options, etc.
-												[SPARK-17346][SQL] Add Kafka source for Structured Streaming

## What changes were proposed in this pull request?

This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of work and part of them was inspired by koeninger's work.

### Introduction

The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets

Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.

											
										
										
											2016-10-05 19:45:45 -04:00
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								#### Input Sources
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								There are a few built-in sources.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-22781][SS] Support creating streaming dataset with ORC files

## What changes were proposed in this pull request?

Like `Parquet`, users can use `ORC` with Apache Spark structured streaming. This PR adds `orc()` to `DataStreamReader`(Scala/Python) in order to support creating streaming dataset with ORC file format more easily like the other file formats. Also, this adds a test coverage for ORC data source and updates the document.

**BEFORE**

```scala
scala> spark.readStream.schema("a int").orc("/tmp/orc_ss").writeStream.format("console").start()
<console>:24: error: value orc is not a member of org.apache.spark.sql.streaming.DataStreamReader
       spark.readStream.schema("a int").orc("/tmp/orc_ss").writeStream.format("console").start()
```

**AFTER**
```scala
scala> spark.readStream.schema("a int").orc("/tmp/orc_ss").writeStream.format("console").start()
res0: org.apache.spark.sql.streaming.StreamingQuery = org.apache.spark.sql.execution.streaming.StreamingQueryWrapper678b3746

scala>
-------------------------------------------
Batch: 0
-------------------------------------------
+---+
|  a|
+---+
|  1|
+---+
```

## How was this patch tested?

Pass the newly added test cases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #19975 from dongjoon-hyun/SPARK-22781.

											
										
										
											2017-12-20 02:50:06 -05:00
+								  - **File source** - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, orc, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-25705][BUILD][STREAMING][TEST-MAVEN] Remove Kafka 0.8 integration

## What changes were proposed in this pull request?

Remove Kafka 0.8 integration

## How was this patch tested?

Existing tests, build scripts

Closes #22703 from srowen/SPARK-25705.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-10-16 10:10:24 -04:00
+								  - **Kafka source** - Reads data from Kafka. It's compatible with Kafka broker versions 0.10.0 or higher. See the [Kafka Integration Guide](structured-streaming-kafka-0-10-integration.html) for more details.
-												[SPARK-17346][SQL] Add Kafka source for Structured Streaming

## What changes were proposed in this pull request?

This PR adds a new project ` external/kafka-0-10-sql` for Structured Streaming Kafka source.

It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of work and part of them was inspired by koeninger's work.

### Introduction

The Kafka source is a structured streaming data source to poll data from Kafka. The schema of reading data is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleting topics. However, the user should make sure there is no Spark job processing the data when deleting a topic.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------ | -----
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest" which is from the earliest offset, or "latest" which is just from the latest offset. Note: This only applies when a new Streaming query is started, and that resuming will always pick up from where the query left off.
failOnDataLost | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe the topic. Only one of "subscribe" and "subscribeParttern" options can be specified for Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fatch Kafka latest offsets.
fetchOffset.retryIntervalMs | long | 10 | milliseconds to wait before retrying to fetch Kafka offsets

Kafka's own configurations can be set via `DataStreamReader.option` with `kafka.` prefix, e.g, `stream.option("kafka.bootstrap.servers", "host:port")`

### Usage

* Subscribe to 1 topic
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern
```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.

											
										
										
											2016-10-05 19:45:45 -04:00
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								  - **Socket source (for testing)** - Reads UTF8 text data from a socket connection. The listening server socket is at the driver. Note that this should be used only for testing as this does not provide end-to-end fault-tolerance guarantees.
-												[SPARK-21069][SS][DOCS] Add rate source to programming guide.

## What changes were proposed in this pull request?

SPARK-20979 added a new structured streaming source: Rate source. This patch adds the corresponding documentation to programming guide.

## How was this patch tested?

Tested by running jekyll locally.

Author: Prashant Sharma <prashant@apache.org>
Author: Prashant Sharma <prashsh1@in.ibm.com>

Closes #18562 from ScrapCodes/spark-21069/rate-source-docs.

											
										
										
											2017-07-08 02:33:12 -04:00
+								  - **Rate source (for testing)** - Generates data at the specified number of rows per second, each output row contains a `timestamp` and `value`. Where `timestamp` is a `Timestamp` type containing the time of message dispatch, and `value` is of `Long` type containing the message count, starting from 0 as the first row. This source is intended for testing and benchmarking.
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								Some sources are not fault-tolerant because they do not guarantee that data can be replayed using
 								checkpointed offsets after a failure. See the earlier section on
 								[fault-tolerance semantics](#fault-tolerance-semantics).
 								Here are the details of all the sources in Spark.
 								<table class="table">
 								  <tr>
 								    <th>Source</th>
 								    <th>Options</th>
 								    <th>Fault-tolerant</th>
 								    <th>Notes</th>
 								  </tr>
 								  <tr>
 								    <td><b>File source</b></td>
 								    <td>
 								        <code>path</code>: path to the input directory, and common to all file formats.
-												[SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table

## What changes were proposed in this pull request?

The description for several options of File Source for structured streaming appeared in the File Sink description instead.

This pull request has two commits: The first includes changes to the version as it appeared in spark 2.1 and the second handled an additional option added for spark 2.2

## How was this patch tested?

Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide.

The original documentation was written by tdas and lw-lin

Author: assafmendelson <assaf.mendelson@gmail.com>

Closes #18342 from assafmendelson/spark-21123.

											
										
										
											2017-06-19 13:58:58 -04:00
+								        <br/>
 								        <code>maxFilesPerTrigger</code>: maximum number of new files to be considered in every trigger (default: no max)
 								        <br/>
-												Fix typos detected by github.com/client9/misspell

## What changes were proposed in this pull request?

Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell).

This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know.

## How was this patch tested?

### before

```
$ misspell . | grep -v '.js'
R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition"
NOTICE-binary:454:16: "containd" is a misspelling of "contained"
R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition"
R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition"
R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence"
R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred"
R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output"
R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent"
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin"
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden"
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments"
dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual"
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across"
dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across"
dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments"
docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden"
docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes"
docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN"
docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior"
examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract"
examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions"
python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment"
python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability"
python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter"
python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability"
python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter"
python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns"
python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization"
python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary"
resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints"
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter"
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when"
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp"
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage"
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred"
sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing"
sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with"
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring"
```

### after

```
$ misspell . | grep -v '.js'
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
```

Closes #22070 from seratch/fix-typo.

Authored-by: Kazuhiro Sera <seratch@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

											
										
										
											2018-08-11 22:23:36 -04:00
+								        <code>latestFirst</code>: whether to process the latest new files first, useful when there is a large backlog of files (default: false)
-												[SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table

## What changes were proposed in this pull request?

The description for several options of File Source for structured streaming appeared in the File Sink description instead.

This pull request has two commits: The first includes changes to the version as it appeared in spark 2.1 and the second handled an additional option added for spark 2.2

## How was this patch tested?

Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide.

The original documentation was written by tdas and lw-lin

Author: assafmendelson <assaf.mendelson@gmail.com>

Closes #18342 from assafmendelson/spark-21123.

											
										
										
											2017-06-19 13:58:58 -04:00
+								        <br/>
 								        <code>fileNameOnly</code>: whether to check new files based on only the filename instead of on the full path (default: false). With this set to `true`, the following files would be considered as the same file, because their filenames, "dataset.txt", are the same:
 								        <br/>
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								        "file:///dataset.txt"<br/>
 								        "s3://a/dataset.txt"<br/>
 								        "s3n://a/b/dataset.txt"<br/>
 								        "s3a://a/b/c/dataset.txt"<br/>
 								        <br/><br/>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								        For file-format-specific options, see the related methods in <code>DataStreamReader</code>
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								        (<a href="api/scala/index.html#org.apache.spark.sql.streaming.DataStreamReader">Scala</a>/<a href="api/java/org/apache/spark/sql/streaming/DataStreamReader.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamReader">Python</a>/<a
 								        href="api/R/read.stream.html">R</a>).
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								        E.g. for "parquet" format options see <code>DataStreamReader.parquet()</code>.
 								        <br/><br/>
-												[SPARK-24499][SQL][DOC] Split the page of sql-programming-guide.html to multiple separate pages

## What changes were proposed in this pull request?

1. Split the main page of sql-programming-guide into 7 parts:

- Getting Started
- Data Sources
- Performance Turing
- Distributed SQL Engine
- PySpark Usage Guide for Pandas with Apache Arrow
- Migration Guide
- Reference

2. Add left menu for sql-programming-guide, keep first level index for each part in the menu.
![image](https://user-images.githubusercontent.com/4833765/47016859-6332e180-d183-11e8-92e8-ce62518a83c4.png)

## How was this patch tested?

Local test with jekyll build/serve.

Closes #22746 from xuanyuanking/SPARK-24499.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

											
										
										
											2018-10-18 14:59:06 -04:00
+								        In addition, there are session configurations that affect certain file-formats. See the <a href="sql-programming-guide.html">SQL Programming Guide</a> for more details. E.g., for "parquet", see <a href="sql-data-sources-parquet.html#configuration">Parquet configuration</a> section.
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								        </td>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    <td>Yes</td>
 								    <td>Supports glob paths, but does not support multiple comma-separated paths/globs.</td>
 								  </tr>
 								  <tr>
 								    <td><b>Socket Source</b></td>
 								    <td>
 								        <code>host</code>: host to connect to, must be specified<br/>
 								        <code>port</code>: port to connect to, must be specified
 								    </td>
 								    <td>No</td>
 								    <td></td>
 								  </tr>
-												[SPARK-21069][SS][DOCS] Add rate source to programming guide.

## What changes were proposed in this pull request?

SPARK-20979 added a new structured streaming source: Rate source. This patch adds the corresponding documentation to programming guide.

## How was this patch tested?

Tested by running jekyll locally.

Author: Prashant Sharma <prashant@apache.org>
Author: Prashant Sharma <prashsh1@in.ibm.com>

Closes #18562 from ScrapCodes/spark-21069/rate-source-docs.

											
										
										
											2017-07-08 02:33:12 -04:00
+								  <tr>
 								    <td><b>Rate Source</b></td>
 								    <td>
 								        <code>rowsPerSecond</code> (e.g. 100, default: 1): How many rows should be generated per second.<br/><br/>
 								        <code>rampUpTime</code> (e.g. 5s, default: 0s): How long to ramp up before the generating speed becomes <code>rowsPerSecond</code>. Using finer granularities than seconds will be truncated to integer seconds. <br/><br/>
 								        <code>numPartitions</code> (e.g. 10, default: Spark's default parallelism): The partition number for the generated rows. <br/><br/>
 								        The source will try its best to reach <code>rowsPerSecond</code>, but the query may be resource constrained, and <code>numPartitions</code> can be tweaked to help reach the desired speed.
 								    </td>
 								    <td>Yes</td>
 								    <td></td>
 								  </tr>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								  <tr>
 								    <td><b>Kafka Source</b></td>
 								    <td>
-												[SPARK-25705][BUILD][STREAMING][TEST-MAVEN] Remove Kafka 0.8 integration

## What changes were proposed in this pull request?

Remove Kafka 0.8 integration

## How was this patch tested?

Existing tests, build scripts

Closes #22703 from srowen/SPARK-25705.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-10-16 10:10:24 -04:00
+								        See the <a href="structured-streaming-kafka-0-10-integration.html">Kafka Integration Guide</a>.
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    </td>
 								    <td>Yes</td>
 								    <td></td>
 								  </tr>
 								  <tr>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								  </tr>
 								</table>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								Here are some examples.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								val spark: SparkSession = ...
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								// Read text from socket
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								val socketDF = spark
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .readStream
 								  .format("socket")
 								  .option("host", "localhost")
 								  .option("port", 9999)
 								  .load()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								socketDF.isStreaming    // Returns True for DataFrames that have streaming sources
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								socketDF.printSchema
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Read all the csv files written atomically in a directory
 								val userSchema = new StructType().add("name", "string").add("age", "integer")
 								val csvDF = spark
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .readStream
 								  .option("sep", ";")
 								  .schema(userSchema)      // Specify schema of the csv files
 								  .csv("/path/to/directory")    // Equivalent to format("csv").load("/path/to/directory")
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								SparkSession spark = ...
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								// Read text from socket
-												[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support

- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings

For the future:

- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16871 from srowen/SPARK-19493.

											
										
										
											2017-02-16 07:32:45 -05:00
+								Dataset<Row> socketDF = spark
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .readStream()
 								  .format("socket")
 								  .option("host", "localhost")
 								  .option("port", 9999)
 								  .load();
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								socketDF.isStreaming();    // Returns True for DataFrames that have streaming sources
 								socketDF.printSchema();
 								// Read all the csv files written atomically in a directory
 								StructType userSchema = new StructType().add("name", "string").add("age", "integer");
-												[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support

- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings

For the future:

- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16871 from srowen/SPARK-19493.

											
										
										
											2017-02-16 07:32:45 -05:00
+								Dataset<Row> csvDF = spark
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .readStream()
 								  .option("sep", ";")
 								  .schema(userSchema)      // Specify schema of the csv files
 								  .csv("/path/to/directory");    // Equivalent to format("csv").load("/path/to/directory")
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								spark = SparkSession. ...
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								# Read text from socket
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								socketDF = spark \
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								    .readStream \
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								    .format("socket") \
 								    .option("host", "localhost") \
 								    .option("port", 9999) \
 								    .load()
 								socketDF.isStreaming()    # Returns True for DataFrames that have streaming sources
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								socketDF.printSchema()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								# Read all the csv files written atomically in a directory
 								userSchema = StructType().add("name", "string").add("age", "integer")
 								csvDF = spark \
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								    .readStream \
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								    .option("sep", ";") \
 								    .schema(userSchema) \
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								    .csv("/path/to/directory")  # Equivalent to format("csv").load("/path/to/directory")
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								sparkR.session(...)
 								# Read text from socket
 								socketDF <- read.stream("socket", host = hostname, port = port)
 								isStreaming(socketDF)    # Returns TRUE for SparkDataFrames that have streaming sources
 								printSchema(socketDF)
 								# Read all the csv files written atomically in a directory
 								schema <- structType(structField("name", "string"),
 								                     structField("age", "integer"))
 								csvDF <- read.stream("csv", path = "/path/to/directory", schema = schema, sep = ";")
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								These examples generate streaming DataFrames that are untyped, meaning that the schema of the DataFrame is not checked at compile time, only checked at runtime when the query is submitted. Some operations like `map`, `flatMap`, etc. need the type to be known at compile time. To do those, you can convert these untyped streaming DataFrames to typed streaming Datasets using the same methods as static DataFrame. See the [SQL Programming Guide](sql-programming-guide.html) for more details. Additionally, more details on the supported streaming sources are discussed later in the document.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-17153][SQL] Should read partition data when reading new files in filestream without globbing

## What changes were proposed in this pull request?

When reading file stream with non-globbing path, the results return data with all `null`s for the
partitioned columns. E.g.,

    case class A(id: Int, value: Int)
    val data = spark.createDataset(Seq(
      A(1, 1),
      A(2, 2),
      A(2, 3))
    )
    val url = "/tmp/test"
    data.write.partitionBy("id").parquet(url)
    spark.read.parquet(url).show

    +-----+---+
    |value| id|
    +-----+---+
    |    2|  2|
    |    3|  2|
    |    1|  1|
    +-----+---+

    val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url)
    s.writeStream.queryName("test").format("memory").start()

    sql("SELECT * FROM test").show

    +-----+----+
    |value|  id|
    +-----+----+
    |    2|null|
    |    3|null|
    |    1|null|
    +-----+----+

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #14803 from viirya/filestreamsource-option.

											
										
										
											2016-09-26 16:07:11 -04:00
+								### Schema inference and partition of streaming DataFrames/Datasets
 								By default, Structured Streaming from file based sources requires you to specify the schema, rather than rely on Spark to infer it automatically. This restriction ensures a consistent schema will be used for the streaming query, even in the case of failures. For ad-hoc use cases, you can reenable schema inference by setting `spark.sql.streaming.schemaInference` to `true`.
-												[MINOR][DOC] Fix some typos and grammar issues

## What changes were proposed in this pull request?

Easy fix in the documentation.

## How was this patch tested?

N/A

Closes #20948

Author: Daniel Sakuma <dsakuma@gmail.com>

Closes #20928 from dsakuma/fix_typo_configuration_docs.

											
										
										
											2018-04-06 01:37:08 -04:00
+								Partition discovery does occur when subdirectories that are named `/key=value/` are present and listing will automatically recurse into these directories. If these columns appear in the user-provided schema, they will be filled in by Spark based on the path of the file being read. The directories that make up the partitioning scheme must be present when the query starts and must remain static. For example, it is okay to add `/data/year=2016/` when `/data/year=2015/` was present, but it is invalid to change the partitioning column (i.e. by creating the directory `/data/date=2016-04-17/`).
-												[SPARK-17153][SQL] Should read partition data when reading new files in filestream without globbing

## What changes were proposed in this pull request?

When reading file stream with non-globbing path, the results return data with all `null`s for the
partitioned columns. E.g.,

    case class A(id: Int, value: Int)
    val data = spark.createDataset(Seq(
      A(1, 1),
      A(2, 2),
      A(2, 3))
    )
    val url = "/tmp/test"
    data.write.partitionBy("id").parquet(url)
    spark.read.parquet(url).show

    +-----+---+
    |value| id|
    +-----+---+
    |    2|  2|
    |    3|  2|
    |    1|  1|
    +-----+---+

    val s = spark.readStream.schema(spark.read.load(url).schema).parquet(url)
    s.writeStream.queryName("test").format("memory").start()

    sql("SELECT * FROM test").show

    +-----+----+
    |value|  id|
    +-----+----+
    |    2|null|
    |    3|null|
    |    1|null|
    +-----+----+

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #14803 from viirya/filestreamsource-option.

											
										
										
											2016-09-26 16:07:11 -04:00
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								## Operations on streaming DataFrames/Datasets
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								You can apply all kinds of operations on streaming DataFrames/Datasets – ranging from untyped, SQL-like operations (e.g. `select`, `where`, `groupBy`), to typed RDD-like operations (e.g. `map`, `filter`, `flatMap`). See the [SQL programming guide](sql-programming-guide.html) for more details. Let’s take a look at a few example operations that you can use.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								### Basic Operations - Selection, Projection, Aggregation
-												[SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.

											
										
										
											2016-06-30 02:38:19 -04:00
+								Most of the common operations on DataFrame/Dataset are supported for streaming. The few operations that are not supported are [discussed later](#unsupported-operations) in this section.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								case class DeviceData(device: String, deviceType: String, signal: Double, time: DateTime)
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								val df: DataFrame = ... // streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: string }
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								val ds: Dataset[DeviceData] = df.as[DeviceData]    // streaming Dataset with IOT device data
 								// Select the devices which have signal more than 10
 								df.select("device").where("signal > 10")      // using untyped APIs
 								ds.filter(_.signal > 10).map(_.device)         // using typed APIs
 								// Running count of the number of updates for each device type
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								df.groupBy("deviceType").count()                          // using untyped API
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Running average signal for each device type
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								import org.apache.spark.sql.expressions.scalalang.typed
 								ds.groupByKey(_.deviceType).agg(typed.avg(_.signal))    // using typed API
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								import org.apache.spark.api.java.function.*;
 								import org.apache.spark.sql.*;
 								import org.apache.spark.sql.expressions.javalang.typed;
 								import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder;
 								public class DeviceData {
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  private String device;
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								  private String deviceType;
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  private Double signal;
 								  private java.sql.Date time;
 								  ...
 								  // Getter and setter methods for each field
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								}
 								Dataset<Row> df = ...;    // streaming DataFrame with IOT device data with schema { device: string, type: string, signal: double, time: DateType }
 								Dataset<DeviceData> ds = df.as(ExpressionEncoder.javaBean(DeviceData.class)); // streaming Dataset with IOT device data
 								// Select the devices which have signal more than 10
 								df.select("device").where("signal > 10"); // using untyped APIs
-												[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support

- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings

For the future:

- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16871 from srowen/SPARK-19493.

											
										
										
											2017-02-16 07:32:45 -05:00
+								ds.filter((FilterFunction<DeviceData>) value -> value.getSignal() > 10)
 								  .map((MapFunction<DeviceData, String>) value -> value.getDevice(), Encoders.STRING());
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Running count of the number of updates for each device type
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								df.groupBy("deviceType").count(); // using untyped API
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Running average signal for each device type
-												[SPARK-19550][BUILD][CORE][WIP] Remove Java 7 support

- Move external/java8-tests tests into core, streaming, sql and remove
- Remove MaxPermGen and related options
- Fix some reflection / TODOs around Java 8+ methods
- Update doc references to 1.7/1.8 differences
- Remove Java 7/8 related build profiles
- Update some plugins for better Java 8 compatibility
- Fix a few Java-related warnings

For the future:

- Update Java 8 examples to fully use Java 8
- Update Java tests to use lambdas for simplicity
- Update Java internal implementations to use lambdas

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16871 from srowen/SPARK-19493.

											
										
										
											2017-02-16 07:32:45 -05:00
+								ds.groupByKey((MapFunction<DeviceData, String>) value -> value.getDeviceType(), Encoders.STRING())
 								  .agg(typed.avg((MapFunction<DeviceData, Double>) value -> value.getSignal()));
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								df = ...  # streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: DateType }
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								# Select the devices which have signal more than 10
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								df.select("device").where("signal > 10")
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								# Running count of the number of updates for each device type
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								df.groupBy("deviceType").count()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								df <- ...  # streaming DataFrame with IOT device data with schema { device: string, deviceType: string, signal: double, time: DateType }
 								# Select the devices which have signal more than 10
 								select(where(df, "signal > 10"), "device")
 								# Running count of the number of updates for each device type
 								count(groupBy(df, "deviceType"))
 								{% endhighlight %}
 								</div>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								You can also register a streaming DataFrame/Dataset as a temporary view and then apply SQL commands on it.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								df.createOrReplaceTempView("updates")
 								spark.sql("select count(*) from updates")  // returns another streaming DF
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								df.createOrReplaceTempView("updates");
 								spark.sql("select count(*) from updates");  // returns another streaming DF
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								df.createOrReplaceTempView("updates")
 								spark.sql("select count(*) from updates")  # returns another streaming DF
 								{% endhighlight %}
 								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								createOrReplaceTempView(df, "updates")
 								sql("select count(*) from updates")
 								{% endhighlight %}
 								</div>
 								</div>
 								Note, you can identify whether a DataFrame/Dataset has streaming data or not by using `df.isStreaming`.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								df.isStreaming
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								df.isStreaming()
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								df.isStreaming()
 								{% endhighlight %}
 								</div>
 								<div data-lang="r"  markdown="1">
-												[SPARK-21293][SPARKR][DOCS] structured streaming doc update

## What changes were proposed in this pull request?

doc update

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20197 from felixcheung/rwadoc.

											
										
										
											2018-01-09 01:08:19 -05:00
+								{% highlight r %}
 								isStreaming(df)
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								{% endhighlight %}
 								</div>
 								</div>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								### Window Operations on Event Time
-												[MINOR][DOCS] Fix few typos in structured streaming doc

## What changes were proposed in this pull request?

Minor typo in `even-time`, which is changed to `event-time` and a couple of grammatical errors fix.

## How was this patch tested?

N/A - since this is a doc fix. I did a jekyll build locally though.

Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>

Closes #17037 from ramkumarvenkat/doc-fix.

											
										
										
											2017-02-24 21:18:22 -05:00
+								Aggregations over a sliding event-time window are straightforward with Structured Streaming and are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								Imagine our [quick example](#quick-example) is modified and the stream now contains lines along with the time when the line was generated. Instead of running word counts, we want to count words within 10 minute windows, updating every 5 minutes. That is, word counts in words received between 10 minute windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 12:10 means data that arrived after 12:00 but before 12:10. Now, consider a word that was received at 12:07. This word should increment the counts corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts will be indexed by both, the grouping key (i.e. the word) and the window (can be calculated from the event-time).
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								The result tables would look something like the following.
 								![Window Operations](img/structured-streaming-window.png)
-												[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

											
										
										
											2016-07-13 16:26:23 -04:00
+								Since this windowing is similar to grouping, in code, you can use `groupBy()` and `window()` operations to express windowed aggregations. You can see the full code for the below examples in
-												[MINOR][DOC] Remove spaces following slashs

## What changes were proposed in this pull request?

This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?

Manual by the followings because this is documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.

											
										
										
											2016-11-01 09:08:49 -04:00
+								[Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCountWindowed.java)/[Python]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming/structured_network_wordcount_windowed.py).
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
-												[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

											
										
										
											2016-07-13 16:26:23 -04:00
+								import spark.implicits._
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

											
										
										
											2016-07-13 16:26:23 -04:00
+								val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

											
										
										
											2016-07-13 16:26:23 -04:00
+								// Group the data by window and word and compute the count of each group
 								val windowedCounts = words.groupBy(
 								  window($"timestamp", "10 minutes", "5 minutes"),
 								  $"word"
 								).count()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
-												[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

											
										
										
											2016-07-13 16:26:23 -04:00
+								Dataset<Row> words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

											
										
										
											2016-07-13 16:26:23 -04:00
+								// Group the data by window and word and compute the count of each group
 								Dataset<Row> windowedCounts = words.groupBy(
 								  functions.window(words.col("timestamp"), "10 minutes", "5 minutes"),
 								  words.col("word")
 								).count();
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

											
										
										
											2016-07-13 16:26:23 -04:00
+								# Group the data by window and word and compute the count of each group
 								windowedCounts = words.groupBy(
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								    window(words.timestamp, "10 minutes", "5 minutes"),
-												[SPARK-16114][SQL] updated structured streaming guide

## What changes were proposed in this pull request?

Updated structured streaming programming guide with new windowed example.

## How was this patch tested?

Docs

Author: James Thomas <jamesjoethomas@gmail.com>

Closes #14183 from jjthomas/ss_docs_update.

											
										
										
											2016-07-13 16:26:23 -04:00
+								    words.word
 								).count()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
-												[SPARK-21293][SPARKR][DOCS] structured streaming doc update

## What changes were proposed in this pull request?

doc update

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20197 from felixcheung/rwadoc.

											
										
										
											2018-01-09 01:08:19 -05:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								words <- ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }
 								# Group the data by window and word and compute the count of each group
 								windowedCounts <- count(
 								                    groupBy(
 								                      words,
 								                      window(words$timestamp, "10 minutes", "5 minutes"),
 								                      words$word))
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								#### Handling Late Data and Watermarking
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								Now consider what happens if one of the events arrives late to the application.
-												[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo

## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.

											
										
										
											2017-01-04 10:07:29 -05:00
+								For example, say, a word generated at 12:04 (i.e. event time) could be received by
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								the application at 12:11. The application should use the time 12:04 instead of 12:11
 								to update the older counts for the window `12:00 - 12:10`. This occurs
 								naturally in our window-based grouping – Structured Streaming can maintain the intermediate state
 								for partial aggregates for a long period of time such that late data can update aggregates of
 								old windows correctly, as illustrated below.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								![Handling Late Data](img/structured-streaming-late-data.png)
-												[MINOR][DOCS] Fix few typos in structured streaming doc

## What changes were proposed in this pull request?

Minor typo in `even-time`, which is changed to `event-time` and a couple of grammatical errors fix.

## How was this patch tested?

N/A - since this is a doc fix. I did a jekyll build locally though.

Author: Ramkumar Venkataraman <rvenkataraman@paypal.com>

Closes #17037 from ramkumarvenkat/doc-fix.

											
										
										
											2017-02-24 21:18:22 -05:00
+								However, to run this query for days, it's necessary for the system to bound the amount of
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								intermediate in-memory state it accumulates. This means the system needs to know when an old
 								aggregate can be dropped from the in-memory state because the application is not going to receive
 								late data for that aggregate any more. To enable this, in Spark 2.1, we have introduced
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								**watermarking**, which lets the engine automatically track the current event time in the data
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								and attempt to clean up old state accordingly. You can define the watermark of a query by
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								specifying the event time column and the threshold on how late the data is expected to be in terms of
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								event time. For a specific window starting at time `T`, the engine will maintain state and allow late
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								data to update the state until `(max event time seen by the engine - late threshold > T)`.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								In other words, late data within the threshold will be aggregated,
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								but data later than the threshold will start getting dropped
-												[SPARK-24520] Double braces in documentations

There are double braces in the markdown, which break the link.

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Fokko Driesprong <fokkodriesprong@godatadriven.com>

Closes #21528 from Fokko/patch-1.

											
										
										
											2018-06-11 18:12:33 -04:00
+								(see [later](#semantic-guarantees-of-aggregation-with-watermarking)
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								in the section for the exact guarantees). Let's understand this with an example. We can
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								easily define watermarking on the previous example using `withWatermark()` as shown below.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								import spark.implicits._
 								val words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 								// Group the data by window and word and compute the count of each group
 								val windowedCounts = words
 								    .withWatermark("timestamp", "10 minutes")
 								    .groupBy(
 								        window($"timestamp", "10 minutes", "5 minutes"),
 								        $"word")
 								    .count()
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								Dataset<Row> words = ... // streaming DataFrame of schema { timestamp: Timestamp, word: String }
 								// Group the data by window and word and compute the count of each group
 								Dataset<Row> windowedCounts = words
 								    .withWatermark("timestamp", "10 minutes")
 								    .groupBy(
 								        functions.window(words.col("timestamp"), "10 minutes", "5 minutes"),
 								        words.col("word"))
 								    .count();
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								words = ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }
 								# Group the data by window and word and compute the count of each group
-												[SPARK-20021][PYSPARK] Miss backslash in python code

## What changes were proposed in this pull request?

Add backslash for line continuation in python code.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>
Author: dylon <hustyugm@gmail.com>

Closes #17352 from uncleGen/python-example-doc.

											
										
										
											2017-03-22 07:10:08 -04:00
+								windowedCounts = words \
 								    .withWatermark("timestamp", "10 minutes") \
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								    .groupBy(
 								        window(words.timestamp, "10 minutes", "5 minutes"),
-												[SPARK-20021][PYSPARK] Miss backslash in python code

## What changes were proposed in this pull request?

Add backslash for line continuation in python code.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>
Author: dylon <hustyugm@gmail.com>

Closes #17352 from uncleGen/python-example-doc.

											
										
										
											2017-03-22 07:10:08 -04:00
+								        words.word) \
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								    .count()
 								{% endhighlight %}
-												[SPARK-21293][SPARKR][DOCS] structured streaming doc update

## What changes were proposed in this pull request?

doc update

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20197 from felixcheung/rwadoc.

											
										
										
											2018-01-09 01:08:19 -05:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								words <- ...  # streaming DataFrame of schema { timestamp: Timestamp, word: String }
 								# Group the data by window and word and compute the count of each group
 								words <- withWatermark(words, "timestamp", "10 minutes")
 								windowedCounts <- count(
 								                    groupBy(
 								                      words,
 								                      window(words$timestamp, "10 minutes", "5 minutes"),
 								                      words$word))
 								{% endhighlight %}
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								</div>
 								</div>
 								In this example, we are defining the watermark of the query on the value of the column "timestamp",
 								and also defining "10 minutes" as the threshold of how late is the data allowed to be. If this query
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								is run in Update output mode (discussed later in [Output Modes](#output-modes) section),
-												Fixed typos in docs

## What changes were proposed in this pull request?

Typos at a couple of place in the docs.

## How was this patch tested?

build including docs

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: ymahajan <ymahajan@snappydata.io>

Closes #17690 from ymahajan/master.

											
										
										
											2017-04-19 23:08:31 -04:00
+								the engine will keep updating counts of a window in the Result Table until the window is older
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								than the watermark, which lags behind the current event time in column "timestamp" by 10 minutes.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								Here is an illustration.
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								![Watermarking in Update Mode](img/structured-streaming-watermark-update-mode.png)
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
 								As shown in the illustration, the maximum event time tracked by the engine is the
 								*blue dashed line*, and the watermark set as `(max event time - '10 mins')`
-												Correct missing punctuation in the documentation

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes #22189 from movrsprbp/patch-1.

Authored-by: jaroslav chládek <mastermism@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>

											
										
										
											2018-08-25 00:49:48 -04:00
+								at the beginning of every trigger is the red line. For example, when the engine observes the data
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								`(12:14, dog)`, it sets the watermark for the next trigger as `12:04`.
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								This watermark lets the engine maintain intermediate state for additional 10 minutes to allow late
 								data to be counted. For example, the data `(12:09, cat)` is out of order and late, and it falls in
-												[SPARK-21924][DOCS] Update structured streaming programming guide doc

## What changes were proposed in this pull request?

Update the line "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:05 - 12:15 and 12:10 - 12:20." as follow "For example, the data (12:09, cat) is out of order and late, and it falls in windows 12:00 - 12:10 and 12:05 - 12:15." under the programming structured streaming programming guide.

Author: Riccardo Corbella <r.corbella@reply.it>

Closes #19137 from riccardocorbella/bugfix.

											
										
										
											2017-09-06 03:22:57 -04:00
+								windows `12:00 - 12:10` and `12:05 - 12:15`. Since, it is still ahead of the watermark `12:04` in
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								the trigger, the engine still maintains the intermediate counts as state and correctly updates the
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								counts of the related windows. However, when the watermark is updated to `12:11`, the intermediate
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								state for window `(12:00 - 12:10)` is cleared, and all subsequent data (e.g. `(12:04, donkey)`)
 								is considered "too late" and therefore ignored. Note that after every trigger,
 								the updated counts (i.e. purple rows) are written to sink as the trigger output, as dictated by
 								the Update mode.
 								Some sinks (e.g. files) may not supported fine-grained updates that Update Mode requires. To work
 								with them, we have also support Append Mode, where only the *final counts* are written to sink.
 								This is illustrated below.
-												[SPARK-20373][SQL][SS] Batch queries with 'Dataset/DataFrame.withWatermark()` does not execute

## What changes were proposed in this pull request?

Any Dataset/DataFrame batch query with the operation `withWatermark` does not execute because the batch planner does not have any rule to explicitly handle the EventTimeWatermark logical plan.
The right solution is to simply remove the plan node, as the watermark should not affect any batch query in any way.

Changes:
- In this PR, we add a new rule `EliminateEventTimeWatermark` to check if we need to ignore the event time watermark. We will ignore watermark in any batch query.

Depends upon:
- [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We can not add this rule into analyzer directly, because streaming query will be copied to `triggerLogicalPlan ` in every trigger, and the rule will be applied to `triggerLogicalPlan` mistakenly.

Others:
- A typo fix in example.

## How was this patch tested?

add new unit test.

Author: uncleGen <hustyugm@gmail.com>

Closes #17896 from uncleGen/SPARK-20373.

											
										
										
											2017-05-09 18:08:09 -04:00
+								Note that using `withWatermark` on a non-streaming Dataset is no-op. As the watermark should not affect
 								any batch query in any way, we will ignore it directly.
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								![Watermarking in Append Mode](img/structured-streaming-watermark-append-mode.png)
 								Similar to the Update Mode earlier, the engine maintains intermediate counts for each window.
 								However, the partial counts are not updated to the Result Table and not written to sink. The engine
 								waits for "10 mins" for late date to be counted,
 								then drops intermediate state of a window < watermark, and appends the final
 								counts to the Result Table/sink. For example, the final counts of window `12:00 - 12:10` is
 								appended to the Result Table only after the watermark is updated to `12:11`.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								##### Conditions for watermarking to clean aggregation state
 								{:.no_toc}
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								It is important to note that the following conditions must be satisfied for the watermarking to
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								clean the state in aggregation queries *(as of Spark 2.1.1, subject to change in the future)*.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								- **Output mode must be Append or Update.** Complete mode requires all aggregate data to be preserved,
 								and hence cannot use watermarking to drop intermediate state. See the [Output Modes](#output-modes)
 								section for detailed explanation of the semantics of each output mode.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
 								- The aggregation must have either the event-time column, or a `window` on the event-time column.
 								- `withWatermark` must be called on the
 								same column as the timestamp column used in the aggregate. For example,
 								`df.withWatermark("time", "1 min").groupBy("time2").count()` is invalid
 								in Append output mode, as watermark is defined on a different column
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								from the aggregation column.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
 								- `withWatermark` must be called before the aggregation for the watermark details to be used.
 								For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append
 								output mode.
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								##### Semantic Guarantees of Aggregation with Watermarking
 								{:.no_toc}
 								- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
 								drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
 								(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.
 								- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
 								not guaranteed to be dropped; it may or may not get aggregated. More delayed is the data, less
 								likely is the engine going to process it.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								### Join Operations
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								Structured Streaming supports joining a streaming Dataset/DataFrame with a static Dataset/DataFrame
 								as well as another streaming Dataset/DataFrame. The result of the streaming join is generated
 								incrementally, similar to the results of streaming aggregations in the previous section. In this
 								section we will explore what type of joins (i.e. inner, outer, etc.) are supported in the above
 								cases. Note that in all the supported join types, the result of the join with a streaming
 								Dataset/DataFrame will be the exactly the same as if it was with a static Dataset/DataFrame
 								containing the same data in the stream.
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								#### Stream-static Joins
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
 								Since the introduction in Spark 2.0, Structured Streaming has supported joins (inner join and some
 								type of outer joins) between a streaming and a static DataFrame/Dataset. Here is a simple example.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								val staticDf = spark.read. ...
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								val streamingDf = spark.readStream. ...
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								streamingDf.join(staticDf, "type")          // inner equi-join with a static DF
 								streamingDf.join(staticDf, "type", "right_join")  // right outer join with a static DF
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
-												Fix Java SimpleApp spark application

## What changes were proposed in this pull request?

Add missing import and missing parentheses to invoke `SparkSession::text()`.

## How was this patch tested?

Built and the code for this application, ran jekyll locally per docs/README.md.

Author: Christiam Camacho <camacho@ncbi.nlm.nih.gov>

Closes #18795 from christiam/master.

											
										
										
											2017-08-03 18:40:25 -04:00
+								Dataset<Row> staticDf = spark.read(). ...;
 								Dataset<Row> streamingDf = spark.readStream(). ...;
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								streamingDf.join(staticDf, "type");         // inner equi-join with a static DF
 								streamingDf.join(staticDf, "type", "right_join");  // right outer join with a static DF
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								staticDf = spark.read. ...
 								streamingDf = spark.readStream. ...
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								streamingDf.join(staticDf, "type")  # inner equi-join with a static DF
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								streamingDf.join(staticDf, "type", "right_join")  # right outer join with a static DF
 								{% endhighlight %}
-												[SPARK-21293][SS][SPARKR] Add doc example for streaming join, dedup

## What changes were proposed in this pull request?

streaming programming guide changes

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20340 from felixcheung/rstreamdoc.

											
										
										
											2018-01-21 14:23:51 -05:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								staticDf <- read.df(...)
 								streamingDf <- read.stream(...)
 								joined <- merge(streamingDf, staticDf, sort = FALSE)  # inner equi-join with a static DF
 								joined <- join(
 								            staticDf,
 								            streamingDf,
 								            streamingDf$value == staticDf$value,
 								            "right_outer")  # right outer join with a static DF
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								Note that stream-static joins are not stateful, so no state management is necessary.
 								However, a few types of stream-static outer joins are not yet supported.
 								These are listed at the [end of this Join section](#support-matrix-for-joins-in-streaming-queries).
 								#### Stream-stream Joins
 								In Spark 2.3, we have added support for stream-stream joins, that is, you can join two streaming
 								Datasets/DataFrames. The challenge of generating join results between two data streams is that,
 								at any point of time, the view of the dataset is incomplete for both sides of the join making
 								it much harder to find matches between inputs. Any row received from one input stream can match
 								with any future, yet-to-be-received row from the other input stream. Hence, for both the input
 								streams, we buffer past input as streaming state, so that we can match every future input with
 								past input and accordingly generate joined results. Furthermore, similar to streaming aggregations,
 								we automatically handle late, out-of-order data and can limit the state using watermarks.
 								Let’s discuss the different types of supported stream-stream joins and how to use them.
 								##### Inner Joins with optional Watermarking
 								Inner joins on any kind of columns along with any kind of join conditions are supported.
 								However, as the stream runs, the size of streaming state will keep growing indefinitely as
-												[SPARK-23165][DOC] Spelling mistake fix in quick-start doc.

## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.

Author: Shashwat Anand <me@shashwat.me>

Closes #20336 from ashashwat/SPARK-23165.

											
										
										
											2018-01-20 17:34:37 -05:00
+								*all* past input must be saved as any new input can match with any input from the past.
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								To avoid unbounded state, you have to define additional join conditions such that indefinitely
 								old inputs cannot match with future inputs and therefore can be cleared from the state.
 								In other words, you will have to do the following additional steps in the join.
 . Define watermark delays on both inputs such that the engine knows how delayed the input can be
 								(similar to streaming aggregations)
 . Define a constraint on event-time across the two inputs such that the engine can figure out when
 								old rows of one input is not going to be required (i.e. will not satisfy the time constraint) for
 								matches with the other input. This constraint can be defined in one of the two ways.
-												Fix typos detected by github.com/client9/misspell

## What changes were proposed in this pull request?

Fixing typos is sometimes very hard. It's not so easy to visually review them. Recently, I discovered a very useful tool for it, [misspell](https://github.com/client9/misspell).

This pull request fixes minor typos detected by [misspell](https://github.com/client9/misspell) except for the false positives. If you would like me to work on other files as well, let me know.

## How was this patch tested?

### before

```
$ misspell . | grep -v '.js'
R/pkg/R/SQLContext.R:354:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:424:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:445:43: "definiton" is a misspelling of "definition"
R/pkg/R/SQLContext.R:495:43: "definiton" is a misspelling of "definition"
NOTICE-binary:454:16: "containd" is a misspelling of "contained"
R/pkg/R/context.R:46:43: "definiton" is a misspelling of "definition"
R/pkg/R/context.R:74:43: "definiton" is a misspelling of "definition"
R/pkg/R/DataFrame.R:591:48: "persistance" is a misspelling of "persistence"
R/pkg/R/streaming.R:166:44: "occured" is a misspelling of "occurred"
R/pkg/inst/worker/worker.R:65:22: "ouput" is a misspelling of "output"
R/pkg/tests/fulltests/test_utils.R:106:25: "environemnt" is a misspelling of "environment"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/InMemoryStoreSuite.java:38:39: "existant" is a misspelling of "existent"
common/kvstore/src/test/java/org/apache/spark/util/kvstore/LevelDBSuite.java:83:39: "existant" is a misspelling of "existent"
common/network-common/src/main/java/org/apache/spark/network/crypto/TransportCipher.java:243:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:234:19: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:238:63: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:244:46: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/sasl/SaslEncryption.java:276:39: "transfered" is a misspelling of "transferred"
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
common/unsafe/src/test/scala/org/apache/spark/unsafe/types/UTF8StringPropertyCheckSuite.scala:195:15: "orgin" is a misspelling of "origin"
core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala:621:39: "gauranteed" is a misspelling of "guaranteed"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/main/scala/org/apache/spark/storage/DiskStore.scala:282:18: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/util/ListenerBus.scala:64:17: "overriden" is a misspelling of "overridden"
core/src/test/scala/org/apache/spark/ShuffleSuite.scala:211:7: "substracted" is a misspelling of "subtracted"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:2468:84: "truely" is a misspelling of "truly"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:25:18: "persistance" is a misspelling of "persistence"
core/src/test/scala/org/apache/spark/storage/FlatmapIteratorSuite.scala:26:69: "persistance" is a misspelling of "persistence"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
dev/run-pip-tests:55:28: "enviroments" is a misspelling of "environments"
dev/run-pip-tests:91:37: "virutal" is a misspelling of "virtual"
dev/merge_spark_pr.py:377:72: "accross" is a misspelling of "across"
dev/merge_spark_pr.py:378:66: "accross" is a misspelling of "across"
dev/run-pip-tests:126:25: "enviroments" is a misspelling of "environments"
docs/configuration.md:1830:82: "overriden" is a misspelling of "overridden"
docs/structured-streaming-programming-guide.md:525:45: "processs" is a misspelling of "processes"
docs/structured-streaming-programming-guide.md:1165:61: "BETWEN" is a misspelling of "BETWEEN"
docs/sql-programming-guide.md:1891:810: "behaivor" is a misspelling of "behavior"
examples/src/main/python/sql/arrow.py:98:8: "substract" is a misspelling of "subtract"
examples/src/main/python/sql/arrow.py:103:27: "substract" is a misspelling of "subtract"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/clustering/StreamingKMeans.scala:230:24: "inital" is a misspelling of "initial"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
mllib/src/test/scala/org/apache/spark/ml/clustering/KMeansSuite.scala:237:26: "descripiton" is a misspelling of "descriptions"
python/pyspark/find_spark_home.py:30:13: "enviroment" is a misspelling of "environment"
python/pyspark/context.py:937:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:938:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:939:12: "supress" is a misspelling of "suppress"
python/pyspark/context.py:940:12: "supress" is a misspelling of "suppress"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:713:8: "probabilty" is a misspelling of "probability"
python/pyspark/ml/clustering.py:1038:8: "Currenlty" is a misspelling of "Currently"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/ml/regression.py:1378:20: "paramter" is a misspelling of "parameter"
python/pyspark/mllib/stat/_statistics.py:262:8: "probabilty" is a misspelling of "probability"
python/pyspark/rdd.py:1363:32: "paramter" is a misspelling of "parameter"
python/pyspark/streaming/tests.py:825:42: "retuns" is a misspelling of "returns"
python/pyspark/sql/tests.py:768:29: "initalization" is a misspelling of "initialization"
python/pyspark/sql/tests.py:3616:31: "initalize" is a misspelling of "initialize"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerBackendUtil.scala:120:39: "arbitary" is a misspelling of "arbitrary"
resource-managers/mesos/src/test/scala/org/apache/spark/deploy/mesos/MesosClusterDispatcherArgumentsSuite.scala:26:45: "sucessfully" is a misspelling of "successfully"
resource-managers/mesos/src/main/scala/org/apache/spark/scheduler/cluster/mesos/MesosSchedulerUtils.scala:358:27: "constaints" is a misspelling of "constraints"
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala:111:24: "senstive" is a misspelling of "sensitive"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala:1063:5: "overwirte" is a misspelling of "overwrite"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/datetimeExpressions.scala:1348:17: "compatability" is a misspelling of "compatibility"
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:77:36: "paramter" is a misspelling of "parameter"
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:1374:22: "precendence" is a misspelling of "precedence"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:238:27: "unnecassary" is a misspelling of "unnecessary"
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/ConditionalExpressionSuite.scala:212:17: "whn" is a misspelling of "when"
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:147:60: "timestmap" is a misspelling of "timestamp"
sql/core/src/test/scala/org/apache/spark/sql/TPCDSQuerySuite.scala:150:45: "precentage" is a misspelling of "percentage"
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVInferSchemaSuite.scala:135:29: "infered" is a misspelling of "inferred"
sql/hive/src/test/resources/golden/udf_instr-1-2e76f819563dbaba4beb51e3a130b922:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_instr-2-32da357fc754badd6e3898dcc8989182:1:52: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-1-6e41693c9c6dceea4d7fab4c02884e4e:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_locate-2-d9b5934457931447874d6bb7c13de478:1:63: "occurance" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:9:79: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/golden/udf_translate-2-f7aa38a33ca0df73b7a1e6b6da4b7fe8:13:110: "occurence" is a misspelling of "occurrence"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/annotate_stats_join.q:46:105: "distint" is a misspelling of "distinct"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/auto_sortmerge_join_11.q:29:3: "Currenly" is a misspelling of "Currently"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/avro_partitioned.q:72:15: "existant" is a misspelling of "existent"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/decimal_udf.q:25:3: "substraction" is a misspelling of "subtraction"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby2_map_multi_distinct.q:16:51: "funtion" is a misspelling of "function"
sql/hive/src/test/resources/ql/src/test/queries/clientpositive/groupby_sort_8.q:15:30: "issueing" is a misspelling of "issuing"
sql/hive/src/test/scala/org/apache/spark/sql/sources/HadoopFsRelationTest.scala:669:52: "wiht" is a misspelling of "with"
sql/hive-thriftserver/src/main/java/org/apache/hive/service/cli/session/HiveSessionImpl.java:474:9: "Refering" is a misspelling of "Referring"
```

### after

```
$ misspell . | grep -v '.js'
common/network-common/src/main/java/org/apache/spark/network/util/AbstractFileRegion.java:27:20: "transfered" is a misspelling of "transferred"
core/src/main/scala/org/apache/spark/status/storeTypes.scala:113:29: "ect" is a misspelling of "etc"
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:1922:49: "agriculteur" is a misspelling of "agriculture"
data/streaming/AFINN-111.txt:1219:0: "humerous" is a misspelling of "humorous"
licenses/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:5:63: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:6:2: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:262:29: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:262:39: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:269:49: "Stichting" is a misspelling of "Stitching"
licenses-binary/LICENSE-heapq.txt:269:59: "Mathematisch" is a misspelling of "Mathematics"
licenses-binary/LICENSE-heapq.txt:274:2: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:274:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
licenses-binary/LICENSE-heapq.txt:276:29: "STICHTING" is a misspelling of "STITCHING"
licenses-binary/LICENSE-heapq.txt:276:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/hungarian.txt:170:0: "teh" is a misspelling of "the"
mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/portuguese.txt:53:0: "eles" is a misspelling of "eels"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:99:20: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala:539:11: "Euclidian" is a misspelling of "Euclidean"
mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala:77:36: "Teh" is a misspelling of "The"
mllib/src/main/scala/org/apache/spark/mllib/stat/MultivariateOnlineSummarizer.scala:276:9: "Euclidian" is a misspelling of "Euclidean"
python/pyspark/heapq3.py:6:63: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:7:2: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:263:29: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:263:39: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:270:49: "Stichting" is a misspelling of "Stitching"
python/pyspark/heapq3.py:270:59: "Mathematisch" is a misspelling of "Mathematics"
python/pyspark/heapq3.py:275:2: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:275:12: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/heapq3.py:277:29: "STICHTING" is a misspelling of "STITCHING"
python/pyspark/heapq3.py:277:39: "MATHEMATISCH" is a misspelling of "MATHEMATICS"
python/pyspark/ml/stat.py:339:23: "Euclidian" is a misspelling of "Euclidean"
```

Closes #22070 from seratch/fix-typo.

Authored-by: Kazuhiro Sera <seratch@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>

											
										
										
											2018-08-11 22:23:36 -04:00
+. Time range join conditions (e.g. `...JOIN ON leftTime BETWEEN rightTime AND rightTime + INTERVAL 1 HOUR`),
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
 . Join on event-time windows (e.g. `...JOIN ON leftTimeWindow = rightTimeWindow`).
 								Let’s understand this with an example.
 								Let’s say we want to join a stream of advertisement impressions (when an ad was shown) with
 								another stream of user clicks on advertisements to correlate when impressions led to
 								monetizable clicks. To allow the state cleanup in this stream-stream join, you will have to
 								specify the watermarking delays and the time constraints as follows.
 . Watermark delays: Say, the impressions and the corresponding clicks can be late/out-of-order
 								in event-time by at most 2 and 3 hours, respectively.
 . Event-time range condition: Say, a click can occur within a time range of 0 seconds to 1 hour
 								after the corresponding impression.
 								The code would look like this.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								import org.apache.spark.sql.functions.expr
 								val impressions = spark.readStream. ...
 								val clicks = spark.readStream. ...
 								// Apply watermarks on event-time columns
 								val impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
 								val clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")
 								// Join with event-time constraints
 								impressionsWithWatermark.join(
 								  clicksWithWatermark,
 								  expr("""
 								    clickAdId = impressionAdId AND
 								    clickTime >= impressionTime AND
 								    clickTime <= impressionTime + interval 1 hour
 								    """)
 								)
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								import static org.apache.spark.sql.functions.expr
 								Dataset<Row> impressions = spark.readStream(). ...
 								Dataset<Row> clicks = spark.readStream(). ...
 								// Apply watermarks on event-time columns
 								Dataset<Row> impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours");
 								Dataset<Row> clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours");
 								// Join with event-time constraints
 								impressionsWithWatermark.join(
 								  clicksWithWatermark,
 								  expr(
 								    "clickAdId = impressionAdId AND " +
 								    "clickTime >= impressionTime AND " +
 								    "clickTime <= impressionTime + interval 1 hour ")
 								);
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								from pyspark.sql.functions import expr
 								impressions = spark.readStream. ...
 								clicks = spark.readStream. ...
 								# Apply watermarks on event-time columns
 								impressionsWithWatermark = impressions.withWatermark("impressionTime", "2 hours")
 								clicksWithWatermark = clicks.withWatermark("clickTime", "3 hours")
 								# Join with event-time constraints
 								impressionsWithWatermark.join(
 								  clicksWithWatermark,
 								  expr("""
 								    clickAdId = impressionAdId AND
 								    clickTime >= impressionTime AND
 								    clickTime <= impressionTime + interval 1 hour
 								    """)
 								)
 								{% endhighlight %}
-												[SPARK-21293][SS][SPARKR] Add doc example for streaming join, dedup

## What changes were proposed in this pull request?

streaming programming guide changes

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20340 from felixcheung/rstreamdoc.

											
										
										
											2018-01-21 14:23:51 -05:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								impressions <- read.stream(...)
 								clicks <- read.stream(...)
 								# Apply watermarks on event-time columns
 								impressionsWithWatermark <- withWatermark(impressions, "impressionTime", "2 hours")
 								clicksWithWatermark <- withWatermark(clicks, "clickTime", "3 hours")
 								# Join with event-time constraints
 								joined <- join(
 								  impressionsWithWatermark,
 								  clicksWithWatermark,
 								  expr(
 								    paste(
 								      "clickAdId = impressionAdId AND",
 								      "clickTime >= impressionTime AND",
 								      "clickTime <= impressionTime + interval 1 hour"
 								)))
 								{% endhighlight %}
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								</div>
 								</div>
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								###### Semantic Guarantees of Stream-stream Inner Joins with Watermarking
 								{:.no_toc}
 								This is similar to the [guarantees provided by watermarking on aggregations](#semantic-guarantees-of-aggregation-with-watermarking).
 								A watermark delay of "2 hours" guarantees that the engine will never drop any data that is less than
 hours delayed. But data delayed by more than 2 hours may or may not get processed.
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								##### Outer Joins with Watermarking
 								While the watermark + event-time constraints is optional for inner joins, for left and right outer
 								joins they must be specified. This is because for generating the NULL results in outer join, the
 								engine must know when an input row is not going to match with anything in future. Hence, the
 								watermark + event-time constraints must be specified for generating correct results. Therefore,
 								a query with outer-join will look quite like the ad-monetization example earlier, except that
 								there will be an additional parameter specifying it to be an outer-join.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								impressionsWithWatermark.join(
 								  clicksWithWatermark,
 								  expr("""
 								    clickAdId = impressionAdId AND
 								    clickTime >= impressionTime AND
 								    clickTime <= impressionTime + interval 1 hour
 								    """),
 								  joinType = "leftOuter"      // can be "inner", "leftOuter", "rightOuter"
 								 )
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								impressionsWithWatermark.join(
 								  clicksWithWatermark,
 								  expr(
 								    "clickAdId = impressionAdId AND " +
 								    "clickTime >= impressionTime AND " +
 								    "clickTime <= impressionTime + interval 1 hour "),
 								  "leftOuter"                 // can be "inner", "leftOuter", "rightOuter"
 								);
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								impressionsWithWatermark.join(
 								  clicksWithWatermark,
 								  expr("""
 								    clickAdId = impressionAdId AND
 								    clickTime >= impressionTime AND
 								    clickTime <= impressionTime + interval 1 hour
 								    """),
 								  "leftOuter"                 # can be "inner", "leftOuter", "rightOuter"
 								)
 								{% endhighlight %}
-												[SPARK-21293][SS][SPARKR] Add doc example for streaming join, dedup

## What changes were proposed in this pull request?

streaming programming guide changes

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20340 from felixcheung/rstreamdoc.

											
										
										
											2018-01-21 14:23:51 -05:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								joined <- join(
 								  impressionsWithWatermark,
 								  clicksWithWatermark,
 								  expr(
 								    paste(
 								      "clickAdId = impressionAdId AND",
 								      "clickTime >= impressionTime AND",
 								      "clickTime <= impressionTime + interval 1 hour"),
 								  "left_outer"                 # can be "inner", "left_outer", "right_outer"
 								))
 								{% endhighlight %}
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								</div>
 								</div>
-												[SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

## What changes were proposed in this pull request?
Further clarification of caveats in using stream-stream outer joins.

## How was this patch tested?
N/A

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20494 from tdas/SPARK-23064-2.

											
										
										
											2018-02-02 20:37:51 -05:00
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								###### Semantic Guarantees of Stream-stream Outer Joins with Watermarking
 								{:.no_toc}
 								Outer joins have the same guarantees as [inner joins](#semantic-guarantees-of-stream-stream-inner-joins-with-watermarking)
 								regarding watermark delays and whether data will be dropped or not.
 								###### Caveats
 								{:.no_toc}
 								There are a few important characteristics to note regarding how the outer results are generated.
-												[SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

## What changes were proposed in this pull request?
Further clarification of caveats in using stream-stream outer joins.

## How was this patch tested?
N/A

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20494 from tdas/SPARK-23064-2.

											
										
										
											2018-02-02 20:37:51 -05:00
 								- *The outer NULL results will be generated with a delay that depends on the specified watermark
 								delay and the time range condition.* This is because the engine has to wait for that long to ensure
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								there were no matches and there will be no more matches in future.
-												[SPARK-23064][SS][DOCS] Stream-stream joins Documentation - follow up

## What changes were proposed in this pull request?
Further clarification of caveats in using stream-stream outer joins.

## How was this patch tested?
N/A

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20494 from tdas/SPARK-23064-2.

											
										
										
											2018-02-02 20:37:51 -05:00
+								- In the current implementation in the micro-batch engine, watermarks are advanced at the end of a
 								micro-batch, and the next micro-batch uses the updated watermark to clean up state and output
 								outer results. Since we trigger a micro-batch only when there is new data to be processed, the
 								generation of the outer result may get delayed if there no new data being received in the stream.
 								*In short, if any of the two input streams being joined does not receive data for a while, the
 								outer (both cases, left or right) output may get delayed.*
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								##### Support matrix for joins in streaming queries
 								<table class ="table">
 								  <tr>
 								    <th>Left Input</th>
 								    <th>Right Input</th>
 								    <th>Join Type</th>
 								    <th></th>
 								  </tr>
 								  <tr>
 								      <td style="vertical-align: middle;">Static</td>
 								      <td style="vertical-align: middle;">Static</td>
 								      <td style="vertical-align: middle;">All types</td>
 								      <td style="vertical-align: middle;">
 								        Supported, since its not on streaming data even though it
 								        can be present in a streaming query
 								      </td>
 								  </tr>
 								  <tr>
 								    <td rowspan="4" style="vertical-align: middle;">Stream</td>
 								    <td rowspan="4" style="vertical-align: middle;">Static</td>
 								    <td style="vertical-align: middle;">Inner</td>
 								    <td style="vertical-align: middle;">Supported, not stateful</td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Left Outer</td>
 								    <td style="vertical-align: middle;">Supported, not stateful</td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Right Outer</td>
 								    <td style="vertical-align: middle;">Not supported</td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Full Outer</td>
 								    <td style="vertical-align: middle;">Not supported</td>
 								  </tr>
 								  <tr>
 								    <td rowspan="4" style="vertical-align: middle;">Static</td>
 								    <td rowspan="4" style="vertical-align: middle;">Stream</td>
 								    <td style="vertical-align: middle;">Inner</td>
 								    <td style="vertical-align: middle;">Supported, not stateful</td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Left Outer</td>
 								    <td style="vertical-align: middle;">Not supported</td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Right Outer</td>
 								    <td style="vertical-align: middle;">Supported, not stateful</td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Full Outer</td>
 								    <td style="vertical-align: middle;">Not supported</td>
 								  </tr>
 								  <tr>
 								    <td rowspan="4" style="vertical-align: middle;">Stream</td>
 								    <td rowspan="4" style="vertical-align: middle;">Stream</td>
 								    <td style="vertical-align: middle;">Inner</td>
 								    <td style="vertical-align: middle;">
 								      Supported, optionally specify watermark on both sides +
 								      time constraints for state cleanup
 								    </td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Left Outer</td>
 								    <td style="vertical-align: middle;">
 								      Conditionally supported, must specify watermark on right + time constraints for correct
 								      results, optionally specify watermark on left for all state cleanup
 								    </td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Right Outer</td>
 								    <td style="vertical-align: middle;">
 								      Conditionally supported, must specify watermark on left + time constraints for correct
 								      results, optionally specify watermark on right for all state cleanup
 								    </td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Full Outer</td>
 								    <td style="vertical-align: middle;">Not supported</td>
 								  </tr>
 								 <tr>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								  </tr>
 								</table>
 								Additional details on supported joins:
 								- Joins can be cascaded, that is, you can do `df1.join(df2, ...).join(df3, ...).join(df4, ....)`.
 								- As of Spark 2.3, you can use joins only when the query is in Append output mode. Other output modes are not yet supported.
 								- As of Spark 2.3, you cannot use other non-map-like operations before joins. Here are a few examples of
 								  what cannot be used.
 								  - Cannot use streaming aggregations before joins.
 								  - Cannot use mapGroupsWithState and flatMapGroupsWithState in Update mode before joins.
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
+								### Streaming Deduplication
 								You can deduplicate records in data streams using a unique identifier in the events. This is exactly same as deduplication on static using a unique identifier column. The query will store the necessary amount of data from previous records such that it can filter duplicate records. Similar to aggregations, you can use deduplication with or without watermarking.
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								- *With watermark* - If there is an upper bound on how late a duplicate record may arrive, then you can define a watermark on an event time column and deduplicate using both the guid and the event time columns. The query will use the watermark to remove old state data from past records that are not expected to get any duplicates any more. This bounds the amount of the state the query has to maintain.
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
 								- *Without watermark* - Since there are no bounds on when a duplicate record may arrive, the query stores the data from all the past records as state.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								val streamingDf = spark.readStream. ...  // columns: guid, eventTime, ...
 								// Without watermark using guid column
 								streamingDf.dropDuplicates("guid")
 								// With watermark using guid and eventTime columns
 								streamingDf
 								  .withWatermark("eventTime", "10 seconds")
 								  .dropDuplicates("guid", "eventTime")
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
-												Fix Java SimpleApp spark application

## What changes were proposed in this pull request?

Add missing import and missing parentheses to invoke `SparkSession::text()`.

## How was this patch tested?

Built and the code for this application, ran jekyll locally per docs/README.md.

Author: Christiam Camacho <camacho@ncbi.nlm.nih.gov>

Closes #18795 from christiam/master.

											
										
										
											2017-08-03 18:40:25 -04:00
+								Dataset<Row> streamingDf = spark.readStream(). ...;  // columns: guid, eventTime, ...
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
 								// Without watermark using guid column
 								streamingDf.dropDuplicates("guid");
 								// With watermark using guid and eventTime columns
 								streamingDf
 								  .withWatermark("eventTime", "10 seconds")
 								  .dropDuplicates("guid", "eventTime");
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								streamingDf = spark.readStream. ...
-												[SPARK-21293][SS][SPARKR] Add doc example for streaming join, dedup

## What changes were proposed in this pull request?

streaming programming guide changes

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20340 from felixcheung/rstreamdoc.

											
										
										
											2018-01-21 14:23:51 -05:00
+								# Without watermark using guid column
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
+								streamingDf.dropDuplicates("guid")
-												[SPARK-21293][SS][SPARKR] Add doc example for streaming join, dedup

## What changes were proposed in this pull request?

streaming programming guide changes

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20340 from felixcheung/rstreamdoc.

											
										
										
											2018-01-21 14:23:51 -05:00
+								# With watermark using guid and eventTime columns
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
+								streamingDf \
 								  .withWatermark("eventTime", "10 seconds") \
 								  .dropDuplicates("guid", "eventTime")
 								{% endhighlight %}
-												[SPARK-21293][SS][SPARKR] Add doc example for streaming join, dedup

## What changes were proposed in this pull request?

streaming programming guide changes

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #20340 from felixcheung/rstreamdoc.

											
										
										
											2018-01-21 14:23:51 -05:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								streamingDf <- read.stream(...)
 								# Without watermark using guid column
 								streamingDf <- dropDuplicates(streamingDf, "guid")
 								# With watermark using guid and eventTime columns
 								streamingDf <- withWatermark(streamingDf, "eventTime", "10 seconds")
 								streamingDf <- dropDuplicates(streamingDf, "guid", "eventTime")
 								{% endhighlight %}
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
+								</div>
 								</div>
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								### Policy for handling multiple watermarks
 								A streaming query can have multiple input streams that are unioned or joined together.
 								Each of the input streams can have a different threshold of late data that needs to
 								be tolerated for stateful operations. You specify these thresholds using
 								``withWatermarks("eventTime", delay)`` on each of the input streams. For example, consider
 								a query with stream-stream joins between `inputStream1` and `inputStream2`.
 								  inputStream1.withWatermark("eventTime1", "1 hour")
 								    .join(
 								      inputStream2.withWatermark("eventTime2", "2 hours"),
 								      joinCondition)
 								While executing the query, Structured Streaming individually tracks the maximum
 								event time seen in each input stream, calculates watermarks based on the corresponding delay,
 								and chooses a single global watermark with them to be used for stateful operations. By default,
 								the minimum is chosen as the global watermark because it ensures that no data is
 								accidentally dropped as too late if one of the streams falls behind the others
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								(for example, one of the streams stops receiving data due to upstream failures). In other words,
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								the global watermark will safely move at the pace of the slowest stream and the query output will
 								be delayed accordingly.
 								However, in some cases, you may want to get faster results even if it means dropping data from the
 								slowest stream. Since Spark 2.4, you can set the multiple watermark policy to choose
 								the maximum value as the global watermark by setting the SQL configuration
 								``spark.sql.streaming.multipleWatermarkPolicy`` to ``max`` (default is ``min``).
 								This lets the global watermark move at the pace of the fastest stream.
 								However, as a side effect, data from the slower streams will be aggressively dropped. Hence, use
 								this configuration judiciously.
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
+								### Arbitrary Stateful Operations
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								Many usecases require more advanced stateful operations than aggregations. For example, in many usecases, you have to track sessions from data streams of events. For doing such sessionization, you will have to save arbitrary types of data as state, and perform arbitrary operations on the state using the data stream events in every trigger. Since Spark 2.2, this can be done using the operation `mapGroupsWithState` and the more powerful operation `flatMapGroupsWithState`. Both operations allow you to apply user-defined code on grouped Datasets to update user-defined state. For more concrete details, take a look at the API documentation ([Scala](api/scala/index.html#org.apache.spark.sql.streaming.GroupState)/[Java](api/java/org/apache/spark/sql/streaming/GroupState.html)) and the examples ([Scala]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala)/[Java]({{site.SPARK_GITHUB_URL}}/blob/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredSessionization.java)).
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								### Unsupported Operations
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								There are a few DataFrame/Dataset operations that are not supported with streaming DataFrames/Datasets.
 								Some of them are as follows.
-												[SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.

											
										
										
											2016-06-30 02:38:19 -04:00
+								- Multiple streaming aggregations (i.e. a chain of aggregations on a streaming DF) are not yet supported on streaming Datasets.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								- Limit and take the first N rows are not supported on streaming Datasets.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.

											
										
										
											2016-06-30 02:38:19 -04:00
+								- Distinct operations on streaming Datasets are not supported.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.

											
										
										
											2016-06-30 02:38:19 -04:00
+								- Sorting operations are supported on streaming Datasets only after an aggregation and in Complete Output Mode.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								- Few types of outer joins on streaming Datasets are not supported. See the
 								  <a href="#support-matrix-for-joins-in-streaming-queries">support matrix in the Join Operations section</a>
 								  for more details.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								In addition, there are some Dataset methods that will not work on streaming Datasets. They are actions that will immediately run queries and return results, which does not make sense on a streaming Dataset. Rather, those functionalities can be done by explicitly starting a streaming query (see the next section regarding that).
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[DOCS] Fix error: ambiguous reference to overloaded definition

## What changes were proposed in this pull request?

`df.groupBy.count()` should be `df.groupBy().count()` , otherwise there is an error :

ambiguous reference to overloaded definition, both method groupBy in class Dataset of type (col1: String, cols: String*) and method groupBy in class Dataset of type (cols: org.apache.spark.sql.Column*)

## How was this patch tested?

```scala
val df = spark.readStream.schema(...).json(...)
val dfCounts = df.groupBy().count()
```

Author: Ziyue Huang <zyhuang94@gmail.com>

Closes #18272 from ZiyueHuang/master.

											
										
										
											2017-06-12 05:59:33 -04:00
+								- `count()` - Cannot return a single count from a streaming Dataset. Instead, use `ds.groupBy().count()` which returns a streaming Dataset containing a running count.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								- `foreach()` - Instead use `ds.writeStream.foreach(...)` (see next section).
 								- `show()` - Instead use the console sink (see next section).
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								If you try any of these operations, you will see an `AnalysisException` like "operation XYZ is not supported with streaming DataFrames/Datasets".
 								While some of them may be supported in future releases of Spark,
 								there are others which are fundamentally hard to implement on streaming data efficiently.
 								For example, sorting on the input stream is not supported, as it requires keeping
 								track of all the data received in the stream. This is therefore fundamentally hard to execute
 								efficiently.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								## Starting Streaming Queries
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								Once you have defined the final result DataFrame/Dataset, all that is left is for you to start the streaming computation. To do that, you have to use the `DataStreamWriter`
-												[MINOR][DOC] Remove spaces following slashs

## What changes were proposed in this pull request?

This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?

Manual by the followings because this is documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.

											
										
										
											2016-11-01 09:08:49 -04:00
+								([Scala](api/scala/index.html#org.apache.spark.sql.streaming.DataStreamWriter)/[Java](api/java/org/apache/spark/sql/streaming/DataStreamWriter.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.DataStreamWriter) docs)
 								returned through `Dataset.writeStream()`. You will have to specify one or more of the following in this interface.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								- *Details of the output sink:* Data format, location, etc.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								- *Output mode:* Specify what gets written to the output sink.
 								- *Query name:* Optionally, specify a unique name of the query for identification.
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								- *Trigger interval:* Optionally, specify the trigger interval. If it is not specified, the system will check for availability of new data as soon as the previous processing has been completed. If a trigger time is missed because the previous processing has not been completed, then the system will trigger processing immediately.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								- *Checkpoint location:* For some output sinks where the end-to-end fault-tolerance can be guaranteed, specify the location where the system will write all the checkpoint information. This should be a directory in an HDFS-compatible fault-tolerant file system. The semantics of checkpointing is discussed in more detail in the next section.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								#### Output Modes
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								There are a few types of output modes.
 								- **Append mode (default)** - This is the default mode, where only the
 								new rows added to the Result Table since the last trigger will be
 								outputted to the sink. This is supported for only those queries where
 								rows added to the Result Table is never going to change. Hence, this mode
 								guarantees that each row will be output only once (assuming
 								fault-tolerant sink). For example, queries with only `select`,
 								`where`, `map`, `flatMap`, `filter`, `join`, etc. will support Append mode.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								- **Complete mode** - The whole Result Table will be outputted to the sink after every trigger.
 								 This is supported for aggregation queries.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								- **Update mode** - (*Available since Spark 2.1.1*) Only the rows in the Result Table that were
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								updated since the last trigger will be outputted to the sink.
 								More information to be added in future releases.
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								Different types of streaming queries support different output modes.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								Here is the compatibility matrix.
 								<table class="table">
 								  <tr>
 								    <th>Query Type</th>
 								    <th></th>
 								    <th>Supported Output Modes</th>
 								    <th>Notes</th>
 								  </tr>
 								  <tr>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    <td rowspan="2" style="vertical-align: middle;">Queries with aggregation</td>
 								    <td style="vertical-align: middle;">Aggregation on event-time with watermark</td>
 								    <td style="vertical-align: middle;">Append, Update, Complete</td>
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								    <td>
 								        Append mode uses watermark to drop old aggregation state. But the output of a
 								        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
 								        the modes semantics, rows can be added to the Result Table only once after they are
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								        finalized (i.e. after watermark is crossed). See the
 								        <a href="#handling-late-data-and-watermarking">Late Data</a> section for more details.
 								        <br/><br/>
 								        Update mode uses watermark to drop old aggregation state.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        <br/><br/>
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								        Complete mode does not drop old aggregation state since by definition this mode
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        preserves all data in the Result Table.
 								    </td>
 								  </tr>
 								  <tr>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    <td style="vertical-align: middle;">Other aggregations</td>
 								    <td style="vertical-align: middle;">Complete, Update</td>
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								    <td>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								        Since no watermark is defined (only defined in other category),
 								        old aggregation state is not dropped.
 								        <br/><br/>
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        Append mode is not supported as aggregates can update thus violating the semantics of
 								        this mode.
 								    </td>
 								  </tr>
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
+								  <tr>
 								    <td colspan="2" style="vertical-align: middle;">Queries with <code>mapGroupsWithState</code></td>
 								    <td style="vertical-align: middle;">Update</td>
 								    <td style="vertical-align: middle;"></td>
 								  </tr>
 								  <tr>
 								    <td rowspan="2" style="vertical-align: middle;">Queries with <code>flatMapGroupsWithState</code></td>
 								    <td style="vertical-align: middle;">Append operation mode</td>
 								    <td style="vertical-align: middle;">Append</td>
 								    <td style="vertical-align: middle;">
 								      Aggregations are allowed after <code>flatMapGroupsWithState</code>.
 								    </td>
 								  </tr>
 								  <tr>
 								    <td style="vertical-align: middle;">Update operation mode</td>
 								    <td style="vertical-align: middle;">Update</td>
 								    <td style="vertical-align: middle;">
 								      Aggregations not allowed after <code>flatMapGroupsWithState</code>.
 								    </td>
 								  </tr>
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								  <tr>
 								      <td colspan="2" style="vertical-align: middle;">Queries with <code>joins</code></td>
 								      <td style="vertical-align: middle;">Append</td>
 								      <td style="vertical-align: middle;">
 								        Update and Complete mode not supported yet. See the
 								        <a href="#support-matrix-for-joins-in-streaming-queries">support matrix in the Join Operations section</a>
 								         for more details on what types of joins are supported.
 								      </td>
 								    </tr>
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
+								  <tr>
 								    <td colspan="2" style="vertical-align: middle;">Other queries</td>
 								    <td style="vertical-align: middle;">Append, Update</td>
 								    <td style="vertical-align: middle;">
 								      Complete mode not supported as it is infeasible to keep all unaggregated data in the Result Table.
 								    </td>
 								  </tr>
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								  <tr>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								  </tr>
 								</table>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-20224][SS] Updated docs for streaming dropDuplicates and mapGroupsWithState

## What changes were proposed in this pull request?

- Fixed bug in Java API not passing timeout conf to scala API
- Updated markdown docs
- Updated scala docs
- Added scala and Java example

## How was this patch tested?
Manually ran examples.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #17539 from tdas/SPARK-20224.

											
										
										
											2017-04-05 19:03:04 -04:00
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								#### Output Sinks
 								There are a few types of built-in output sinks.
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								- **File sink** - Stores the output to a directory.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								{% highlight scala %}
 								writeStream
 								    .format("parquet")        // can be "orc", "json", "csv", etc.
 								    .option("path", "path/to/destination/dir")
 								    .start()
 								{% endhighlight %}
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								- **Kafka sink** - Stores the output to one or more topics in Kafka.
 								{% highlight scala %}
 								writeStream
 								    .format("kafka")
 								    .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
 								    .option("topic", "updates")
 								    .start()
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								- **Foreach sink** - Runs arbitrary computation on the records in the output. See later in the section for more details.
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								{% highlight scala %}
 								writeStream
 								    .foreach(...)
 								    .start()
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								- **Console sink (for debugging)** - Prints the output to the console/stdout every time there is a trigger. Both, Append and Complete output modes, are supported. This should be used for debugging purposes on low data volumes as the entire output is collected and stored in the driver's memory after every trigger.
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								{% highlight scala %}
 								writeStream
 								    .format("console")
 								    .start()
 								{% endhighlight %}
 								- **Memory sink (for debugging)** - The output is stored in memory as an in-memory table.
 								Both, Append and Complete output modes, are supported. This should be used for debugging purposes
 								on low data volumes as the entire output is collected and stored in the driver's memory.
 								Hence, use it with caution.
 								{% highlight scala %}
 								writeStream
 								    .format("memory")
 								    .queryName("tableName")
 								    .start()
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								Some sinks are not fault-tolerant because they do not guarantee persistence of the output and are
 								meant for debugging purposes only. See the earlier section on
 								[fault-tolerance semantics](#fault-tolerance-semantics).
 								Here are the details of all the sinks in Spark.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<table class="table">
 								  <tr>
 								    <th>Sink</th>
 								    <th>Supported Output Modes</th>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    <th>Options</th>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								    <th>Fault-tolerant</th>
 								    <th>Notes</th>
 								  </tr>
 								  <tr>
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								    <td><b>File Sink</b></td>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								    <td>Append</td>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    <td>
 								        <code>path</code>: path to the output directory, must be specified.
-												[SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table

## What changes were proposed in this pull request?

The description for several options of File Source for structured streaming appeared in the File Sink description instead.

This pull request has two commits: The first includes changes to the version as it appeared in spark 2.1 and the second handled an additional option added for spark 2.2

## How was this patch tested?

Built the documentation by SKIP_API=1 jekyll build and visually inspected the structured streaming programming guide.

The original documentation was written by tdas and lw-lin

Author: assafmendelson <assaf.mendelson@gmail.com>

Closes #18342 from assafmendelson/spark-21123.

											
										
										
											2017-06-19 13:58:58 -04:00
+								        <br/><br/>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								        For file-format-specific options, see the related methods in DataFrameWriter
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								        (<a href="api/scala/index.html#org.apache.spark.sql.DataFrameWriter">Scala</a>/<a href="api/java/org/apache/spark/sql/DataFrameWriter.html">Java</a>/<a href="api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter">Python</a>/<a
 								        href="api/R/write.stream.html">R</a>).
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								        E.g. for "parquet" format options see <code>DataFrameWriter.parquet()</code>
 								    </td>
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								    <td>Yes (exactly-once)</td>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								    <td>Supports writes to partitioned tables. Partitioning by time may be useful.</td>
 								  </tr>
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								  <tr>
 								    <td><b>Kafka Sink</b></td>
 								    <td>Append, Update, Complete</td>
-												[SPARK-25705][BUILD][STREAMING][TEST-MAVEN] Remove Kafka 0.8 integration

## What changes were proposed in this pull request?

Remove Kafka 0.8 integration

## How was this patch tested?

Existing tests, build scripts

Closes #22703 from srowen/SPARK-25705.

Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-10-16 10:10:24 -04:00
+								    <td>See the <a href="structured-streaming-kafka-0-10-integration.html">Kafka Integration Guide</a></td>
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								    <td>Yes (at-least-once)</td>
 								    <td>More details in the <a href="structured-streaming-kafka-integration.html">Kafka Integration Guide</a></td>
 								  </tr>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								  <tr>
 								    <td><b>Foreach Sink</b></td>
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								    <td>Append, Update, Complete</td>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    <td>None</td>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								    <td>Depends on ForeachWriter implementation</td>
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								    <td>More details in the <a href="#using-foreach-and-foreachbatch">next section</a></td>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								  </tr>
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								  <tr>
 								      <td><b>ForeachBatch Sink</b></td>
 								      <td>Append, Update, Complete</td>
 								      <td>None</td>
 								      <td>Depends on the implementation</td>
 								      <td>More details in the <a href="#using-foreach-and-foreachbatch">next section</a></td>
 								    </tr>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								  <tr>
 								    <td><b>Console Sink</b></td>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    <td>Append, Update, Complete</td>
 								    <td>
 								        <code>numRows</code>: Number of rows to print every trigger (default: 20)
 								        <br/>
 								        <code>truncate</code>: Whether to truncate the output if too long (default: true)
 								    </td>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								    <td>No</td>
 								    <td></td>
 								  </tr>
 								  <tr>
 								    <td><b>Memory Sink</b></td>
 								    <td>Append, Complete</td>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								    <td>None</td>
 								    <td>No. But in Complete Mode, restarted query will recreate the full table.</td>
 								    <td>Table name is the query name.</td>
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								  </tr>
 								  <tr>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								    <td></td>
 								  </tr>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</table>
-												[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

											
										
										
											2017-01-06 14:29:01 -05:00
+								Note that you have to call `start()` to actually start the execution of the query. This returns a StreamingQuery object which is a handle to the continuously running execution. You can use this object to manage the query, which we will discuss in the next subsection. For now, let’s understand all this with a few examples.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								// ========== DF with no aggregations ==========
 								val noAggDF = deviceDataDf.select("device").where("signal > 10")
 								// Print new data to console
 								noAggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream
 								  .format("console")
 								  .start()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Write new data to Parquet files
 								noAggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								  .format("parquet")
 								  .option("checkpointLocation", "path/to/checkpoint/dir")
 								  .option("path", "path/to/destination/dir")
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .start()
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								// ========== DF with aggregation ==========
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								val aggDF = df.groupBy("device").count()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Print updated aggregations to console
 								aggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream
 								  .outputMode("complete")
 								  .format("console")
 								  .start()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								// Have all the aggregates in an in-memory table
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								aggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream
 								  .queryName("aggregates")    // this query name will be the table name
 								  .outputMode("complete")
 								  .format("memory")
 								  .start()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.

											
										
										
											2016-06-30 02:38:19 -04:00
+								spark.sql("select * from aggregates").show()   // interactively query in-memory table
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								// ========== DF with no aggregations ==========
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								Dataset<Row> noAggDF = deviceDataDf.select("device").where("signal > 10");
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Print new data to console
 								noAggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream()
 								  .format("console")
 								  .start();
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Write new data to Parquet files
 								noAggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream()
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								  .format("parquet")
 								  .option("checkpointLocation", "path/to/checkpoint/dir")
 								  .option("path", "path/to/destination/dir")
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .start();
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								// ========== DF with aggregation ==========
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								Dataset<Row> aggDF = df.groupBy("device").count();
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								// Print updated aggregations to console
 								aggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream()
 								  .outputMode("complete")
 								  .format("console")
 								  .start();
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								// Have all the aggregates in an in-memory table
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								aggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream()
 								  .queryName("aggregates")    // this query name will be the table name
 								  .outputMode("complete")
 								  .format("memory")
 								  .start();
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.

											
										
										
											2016-06-30 02:38:19 -04:00
+								spark.sql("select * from aggregates").show();   // interactively query in-memory table
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								# ========== DF with no aggregations ==========
 								noAggDF = deviceDataDf.select("device").where("signal > 10")
 								# Print new data to console
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								noAggDF \
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								    .writeStream \
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								    .format("console") \
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								    .start()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								# Write new data to Parquet files
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								noAggDF \
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								    .writeStream \
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								    .format("parquet") \
 								    .option("checkpointLocation", "path/to/checkpoint/dir") \
 								    .option("path", "path/to/destination/dir") \
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								    .start()
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								# ========== DF with aggregation ==========
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								aggDF = df.groupBy("device").count()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								# Print updated aggregations to console
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								aggDF \
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								    .writeStream \
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								    .outputMode("complete") \
 								    .format("console") \
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								    .start()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-23165][DOC] Spelling mistake fix in quick-start doc.

## What changes were proposed in this pull request?

Fix spelling in quick-start doc.

## How was this patch tested?

Doc only.

Author: Shashwat Anand <me@shashwat.me>

Closes #20336 from ashashwat/SPARK-23165.

											
										
										
											2018-01-20 17:34:37 -05:00
+								# Have all the aggregates in an in-memory table. The query name will be the table name
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								aggDF \
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								    .writeStream \
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								    .queryName("aggregates") \
 								    .outputMode("complete") \
 								    .format("memory") \
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								    .start()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-16256][DOCS] Minor fixes on the Structured Streaming Programming Guide

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13978 from tdas/SPARK-16256-1.

											
										
										
											2016-06-30 02:38:19 -04:00
+								spark.sql("select * from aggregates").show()   # interactively query in-memory table
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								# ========== DF with no aggregations ==========
 								noAggDF <- select(where(deviceDataDf, "signal > 10"), "device")
 								# Print new data to console
 								write.stream(noAggDF, "console")
 								# Write new data to Parquet files
 								write.stream(noAggDF,
 								             "parquet",
 								             path = "path/to/destination/dir",
 								             checkpointLocation = "path/to/checkpoint/dir")
 								# ========== DF with aggregation ==========
 								aggDF <- count(groupBy(df, "device"))
 								# Print updated aggregations to console
 								write.stream(aggDF, "console", outputMode = "complete")
 								# Have all the aggregates in an in memory table. The query name will be the table name
 								write.stream(aggDF, "memory", queryName = "aggregates", outputMode = "complete")
 								# Interactively query in-memory table
 								head(sql("select * from aggregates"))
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								##### Using Foreach and ForeachBatch
 								The `foreach` and `foreachBatch` operations allow you to apply arbitrary operations and writing
 								logic on the output of a streaming query. They have slightly different use cases - while `foreach`
 								allows custom write logic on every row, `foreachBatch` allows arbitrary operations
 								and custom logic on the output of each micro-batch. Let's understand their usages in more detail.
 								###### ForeachBatch
 								`foreachBatch(...)` allows you to specify a function that is executed on
 								the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python.
 								It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
 								  // Transform and write batchDF
 								}.start()
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								streamingDatasetOfString.writeStream().foreachBatch(
 								  new VoidFunction2<Dataset<String>, Long> {
 								    public void call(Dataset<String> dataset, Long batchId) {
 								      // Transform and write batchDF
 								    }
 								  }
 								).start();
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								def foreach_batch_function(df, epoch_id):
 								    # Transform and write batchDF
 								    pass
 								streamingDF.writeStream.foreachBatch(foreach_batch_function).start()
 								{% endhighlight %}
 								</div>
 								<div data-lang="r"  markdown="1">
 								R is not yet supported.
 								</div>
 								</div>
 								With `foreachBatch`, you can do the following.
 								- **Reuse existing batch data sources** - For many storage systems, there may not be a streaming sink available yet,
 								  but there may already exist a data writer for batch queries. Using `foreachBatch`, you can use the batch
 								  data writers on the output of each micro-batch.
 								- **Write to multiple locations** - If you want to write the output of a streaming query to multiple locations,
 								  then you can simply write the output DataFrame/Dataset multiple times. However, each attempt to write can
 								  cause the output data to be recomputed (including possible re-reading of the input data). To avoid recomputations,
 								  you should cache the output DataFrame/Dataset, write it to multiple locations, and then uncache it. Here is an outline.
 								    streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) =>
 								      batchDF.persist()
 								      batchDF.write.format(...).save(...)  // location 1
 								      batchDF.write.format(...).save(...)  // location 2
 								      batchDF.unpersist()
 								    }
 								- **Apply additional DataFrame operations** - Many DataFrame and Dataset operations are not supported
 								  in streaming DataFrames because Spark does not support generating incremental plans in those cases.
 								  Using `foreachBatch`, you can apply some of these operations on each micro-batch output. However, you will have to reason about the end-to-end semantics of doing that operation yourself.
 								**Note:**
 								- By default, `foreachBatch` provides only at-least-once write guarantees. However, you can use the
 								  batchId provided to the function as way to deduplicate the output and get an exactly-once guarantee.
 								- `foreachBatch` does not work with the continuous processing mode as it fundamentally relies on the
 								  micro-batch execution of a streaming query. If you write data in the continuous mode, use `foreach` instead.
 								###### Foreach
 								If `foreachBatch` is not an option (for example, corresponding batch data writer does not exist, or
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								continuous processing mode), then you can express your custom writer logic using `foreach`.
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								Specifically, you can express the data writing logic by dividing it into three methods: `open`, `process`, and `close`.
 								Since Spark 2.4, `foreach` is available in Scala, Java and Python.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								In Scala, you have to extend the class `ForeachWriter` ([docs](api/scala/index.html#org.apache.spark.sql.ForeachWriter)).
 								{% highlight scala %}
 								streamingDatasetOfString.writeStream.foreach(
 								  new ForeachWriter[String] {
 								    def open(partitionId: Long, version: Long): Boolean = {
 								      // Open connection
 								    }
 								    def process(record: String): Unit = {
 								      // Write string to connection
 								    }
 								    def close(errorOrNull: Throwable): Unit = {
 								      // Close the connection
 								    }
 								  }
 								).start()
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								In Java, you have to extend the class `ForeachWriter` ([docs](api/java/org/apache/spark/sql/ForeachWriter.html)).
 								{% highlight java %}
 								streamingDatasetOfString.writeStream().foreach(
 								  new ForeachWriter[String] {
 								    @Override public boolean open(long partitionId, long version) {
 								      // Open connection
 								    }
 								    @Override public void process(String record) {
 								      // Write string to connection
 								    }
 								    @Override public void close(Throwable errorOrNull) {
 								      // Close the connection
 								    }
 								  }
 								).start();
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								In Python, you can invoke foreach in two ways: in a function or in an object.
 								The function offers a simple way to express your processing logic but does not allow you to
 								deduplicate generated data when failures cause reprocessing of some input data.
 								For that situation you must specify the processing logic in an object.
 . The function takes a row as input.
 								  {% highlight python %}
 								      def process_row(row):
 								          # Write row to storage
 								          pass
 								      query = streamingDF.writeStream.foreach(process_row).start()
 								  {% endhighlight %}
 . The object has a process method and optional open and close methods:
 								  {% highlight python %}
 								      class ForeachWriter:
 								          def open(self, partition_id, epoch_id):
 								              # Open connection. This method is optional in Python.
 								              pass
 								          def process(self, row):
 								              # Write row to connection. This method is NOT optional in Python.
 								              pass
 								          def close(self, error):
 								              # Close the connection. This method in optional in Python.
 								              pass
 								      query = streamingDF.writeStream.foreach(ForeachWriter()).start()
 								  {% endhighlight %}
 								</div>
 								<div data-lang="r"  markdown="1">
 								R is not yet supported.
 								</div>
 								</div>
 								**Execution semantics**
 								When the streaming query is started, Spark calls the function or the object’s methods in the following way:
 								- A single copy of this object is responsible for all the data generated by a single task in a query.
 								  In other words, one instance is responsible for processing one partition of the data generated in a distributed manner.
 								- This object must be serializable, because each task will get a fresh serialized-deserialized copy
 								  of the provided object. Hence, it is strongly recommended that any initialization for writing data
 								  (for example. opening a connection or starting a transaction) is done after the open() method has
 								  been called, which signifies that the task is ready to generate data.
 								- The lifecycle of the methods are as follows:
 								  - For each partition with partition_id:
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								    - For each batch/epoch of streaming data with epoch_id:
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								      - Method open(partitionId, epochId) is called.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								      - If open(...) returns true, for each row in the partition and batch/epoch, method process(row) is called.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								      - Method close(error) is called with error (if any) seen while processing rows.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								- The close() method (if it exists) is called if an open() method exists and returns successfully (irrespective of the return value), except if the JVM or Python process crashes in the middle.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
+								- **Note:** The partitionId and epochId in the open() method can be used to deduplicate generated data
 								  when failures cause reprocessing of some input data. This depends on the execution mode of the query.
 								  If the streaming query is being executed in the micro-batch mode, then every partition represented
 								  by a unique tuple (partition_id, epoch_id) is guaranteed to have the same data.
 								  Hence, (partition_id, epoch_id) can be used to deduplicate and/or transactionally commit
 								  data and achieve exactly-once guarantees. However, if the streaming query is being executed
 								  in the continuous mode, then this guarantee does not hold and therefore should not be used for deduplication.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								#### Triggers
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								The trigger settings of a streaming query define the timing of streaming data processing, whether
 								the query is going to be executed as micro-batch query with a fixed batch interval or as a continuous processing query.
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								Here are the different kinds of triggers that are supported.
 								<table class="table">
 								  <tr>
 								    <th>Trigger Type</th>
 								    <th>Description</th>
 								  </tr>
 								  <tr>
 								    <td><i>unspecified (default)</i></td>
 								    <td>
 								        If no trigger setting is explicitly specified, then by default, the query will be
 								        executed in micro-batch mode, where micro-batches will be generated as soon as
 								        the previous micro-batch has completed processing.
 								    </td>
 								  </tr>
 								  <tr>
 								    <td><b>Fixed interval micro-batches</b></td>
 								    <td>
 								        The query will be executed with micro-batches mode, where micro-batches will be kicked off
 								        at the user-specified intervals.
 								        <ul>
 								          <li>If the previous micro-batch completes within the interval, then the engine will wait until
 								          the interval is over before kicking off the next micro-batch.</li>
 								          <li>If the previous micro-batch takes longer than the interval to complete (i.e. if an
 								          interval boundary is missed), then the next micro-batch will start as soon as the
 								          previous one completes (i.e., it will not wait for the next interval boundary).</li>
 								          <li>If no new data is available, then no micro-batch will be kicked off.</li>
 								        </ul>
 								    </td>
 								  </tr>
 								  <tr>
 								    <td><b>One-time micro-batch</b></td>
 								    <td>
 								        The query will execute *only one* micro-batch to process all the available data and then
 								        stop on its own. This is useful in scenarios you want to periodically spin up a cluster,
 								        process everything that is available since the last period, and then shutdown the
 								        cluster. In some case, this may lead to significant cost savings.
 								    </td>
 								  </tr>
 								  <tr>
 								    <td><b>Continuous with fixed checkpoint interval</b><br/><i>(experimental)</i></td>
 								    <td>
 								        The query will be executed in the new low-latency, continuous processing mode. Read more
 								        about this in the <a href="#continuous-processing-experimental">Continuous Processing section</a> below.
 								    </td>
 								  </tr>
 								</table>
 								Here are a few code examples.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								import org.apache.spark.sql.streaming.Trigger
 								// Default trigger (runs micro-batch as soon as it can)
 								df.writeStream
 								  .format("console")
 								  .start()
 								// ProcessingTime trigger with two-seconds micro-batch interval
 								df.writeStream
 								  .format("console")
 								  .trigger(Trigger.ProcessingTime("2 seconds"))
 								  .start()
 								// One-time trigger
 								df.writeStream
 								  .format("console")
 								  .trigger(Trigger.Once())
 								  .start()
 								// Continuous trigger with one-second checkpointing interval
 								df.writeStream
 								  .format("console")
 								  .trigger(Trigger.Continuous("1 second"))
 								  .start()
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								import org.apache.spark.sql.streaming.Trigger
 								// Default trigger (runs micro-batch as soon as it can)
 								df.writeStream
 								  .format("console")
 								  .start();
 								// ProcessingTime trigger with two-seconds micro-batch interval
 								df.writeStream
 								  .format("console")
 								  .trigger(Trigger.ProcessingTime("2 seconds"))
 								  .start();
 								// One-time trigger
 								df.writeStream
 								  .format("console")
 								  .trigger(Trigger.Once())
 								  .start();
 								// Continuous trigger with one-second checkpointing interval
 								df.writeStream
 								  .format("console")
 								  .trigger(Trigger.Continuous("1 second"))
 								  .start();
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								# Default trigger (runs micro-batch as soon as it can)
 								df.writeStream \
 								  .format("console") \
 								  .start()
 								# ProcessingTime trigger with two-seconds micro-batch interval
 								df.writeStream \
 								  .format("console") \
 								  .trigger(processingTime='2 seconds') \
 								  .start()
 								# One-time trigger
 								df.writeStream \
 								  .format("console") \
 								  .trigger(once=True) \
 								  .start()
 								# Continuous trigger with one-second checkpointing interval
 								df.writeStream
 								  .format("console")
 								  .trigger(continuous='1 second')
 								  .start()
 								{% endhighlight %}
 								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								# Default trigger (runs micro-batch as soon as it can)
 								write.stream(df, "console")
 								# ProcessingTime trigger with two-seconds micro-batch interval
 								write.stream(df, "console", trigger.processingTime = "2 seconds")
 								# One-time trigger
 								write.stream(df, "console", trigger.once = TRUE)
 								# Continuous trigger is not yet supported
 								{% endhighlight %}
 								</div>
 								</div>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								## Managing Streaming Queries
 								The `StreamingQuery` object created when a query is started can be used to monitor and manage the query.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								val query = df.writeStream.format("console").start()   // get the query object
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.id          // get the unique identifier of the running query that persists across restarts from checkpoint data
 								query.runId       // get the unique id of this run of the query, which will be generated at every start/restart
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								query.name        // get the name of the auto-generated or user-specified name
 								query.explain()   // print detailed explanations of the query
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								query.stop()      // stop the query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								query.awaitTermination()   // block until query is terminated, with stop() or with error
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.exception       // the exception if the query has been terminated with error
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.recentProgress  // an array of the most recent progress updates for this query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.lastProgress    // the most recent progress update of this streaming query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								StreamingQuery query = df.writeStream().format("console").start();   // get the query object
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.id();          // get the unique identifier of the running query that persists across restarts from checkpoint data
 								query.runId();       // get the unique id of this run of the query, which will be generated at every start/restart
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								query.name();        // get the name of the auto-generated or user-specified name
 								query.explain();   // print detailed explanations of the query
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.stop();      // stop the query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								query.awaitTermination();   // block until query is terminated, with stop() or with error
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.exception();       // the exception if the query has been terminated with error
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.recentProgress();  // an array of the most recent progress updates for this query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.lastProgress();    // the most recent progress update of this streaming query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								query = df.writeStream.format("console").start()   # get the query object
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.id()          # get the unique identifier of the running query that persists across restarts from checkpoint data
 								query.runId()       # get the unique id of this run of the query, which will be generated at every start/restart
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								query.name()        # get the name of the auto-generated or user-specified name
 								query.explain()   # print detailed explanations of the query
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								query.stop()      # stop the query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								query.awaitTermination()   # block until query is terminated, with stop() or with error
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.exception()       # the exception if the query has been terminated with error
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.recentProgress()  # an array of the most recent progress updates for this query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								query.lastProgress()    # the most recent progress update of this streaming query
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								query <- write.stream(df, "console")  # get the query object
 								queryName(query)          # get the name of the auto-generated or user-specified name
 								explain(query)            # print detailed explanations of the query
 								stopQuery(query)          # stop the query
 								awaitTermination(query)   # block until query is terminated, with stop() or with error
 								lastProgress(query)       # the most recent progress update of this streaming query
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[MINOR][DOC] Remove spaces following slashs

## What changes were proposed in this pull request?

This PR merges multiple lines enumerating items in order to remove the redundant spaces following slashes in [Structured Streaming Programming Guide in 2.0.2-rc1](http://people.apache.org/~pwendell/spark-releases/spark-2.0.2-rc1-docs/structured-streaming-programming-guide.html).
- Before: `Scala/ Java/ Python`
- After: `Scala/Java/Python`
## How was this patch tested?

Manual by the followings because this is documentation update.

```
cd docs
SKIP_API=1 jekyll build
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15686 from dongjoon-hyun/minor_doc_space.

											
										
										
											2016-11-01 09:08:49 -04:00
+								You can start any number of queries in a single SparkSession. They will all be running concurrently sharing the cluster resources. You can use `sparkSession.streams()` to get the `StreamingQueryManager`
 								([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryManager)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryManager.html)/[Python](api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryManager) docs)
 								that can be used to manage the currently active queries.
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								val spark: SparkSession = ...
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								spark.streams.active    // get the list of currently active streaming queries
 								spark.streams.get(id)   // get a query object by its unique id
 								spark.streams.awaitAnyTermination()   // block until any one of them terminates
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								SparkSession spark = ...
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								spark.streams().active();    // get the list of currently active streaming queries
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								spark.streams().get(id);   // get a query object by its unique id
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][SQL][STREAMING][DOCS] Fix minor typos, punctuations and grammar

## What changes were proposed in this pull request?

Minor fixes correcting some typos, punctuations, grammar.
Adding more anchors for easy navigation.
Fixing minor issues with code snippets.

## How was this patch tested?

`jekyll serve`

Author: Ahmed Mahran <ahmed.mahran@mashin.io>

Closes #14234 from ahmed-mahran/b-struct-streaming-docs.

											
										
										
											2016-07-19 07:01:54 -04:00
+								spark.streams().awaitAnyTermination();   // block until any one of them terminates
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								spark = ...  # spark session
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								spark.streams().active  # get the list of currently active streaming queries
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								spark.streams().get(id)  # get a query object by its unique id
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								spark.streams().awaitAnyTermination()  # block until any one of them terminates
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight bash %}
 								Not available in R.
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
 								## Monitoring Streaming Queries
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								There are multiple ways to monitor active streaming queries. You can either push metrics to external systems using Spark's Dropwizard Metrics support, or access them programmatically.
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								### Reading Metrics Interactively
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
 								You can directly get the current status and metrics of an active query using
 								`streamingQuery.lastProgress()` and `streamingQuery.status()`.
 								`lastProgress()` returns a `StreamingQueryProgress` object
 								in [Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryProgress)
 								and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryProgress.html)
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								and a dictionary with the same fields in Python. It has all the information about
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								the progress made in the last trigger of the stream - what data was processed,
 								what were the processing rates, latencies, etc. There is also
 								`streamingQuery.recentProgress` which returns an array of last few progresses.
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								In addition, `streamingQuery.status()` returns a `StreamingQueryStatus` object
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								in [Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryStatus)
 								and [Java](api/java/org/apache/spark/sql/streaming/StreamingQueryStatus.html)
-												[DOCS][MINOR] Fixed a few typos in the Structured Streaming documentation

Fixed a few typos.

There is one more I'm not sure of:

```
        Append mode uses watermark to drop old aggregation state. But the output of a
        windowed aggregation is delayed the late threshold specified in `withWatermark()` as by
        the modes semantics, rows can be added to the Result Table only once after they are
```

Not sure how to change `is delayed the late threshold`.

Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>

Closes #17443 from aseigneurin/typos.

											
										
										
											2017-03-30 11:12:17 -04:00
+								and a dictionary with the same fields in Python. It gives information about
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								what the query is immediately doing - is a trigger active, is data being processed, etc.
 								Here are a few examples.
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								val query: StreamingQuery = ...
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								println(query.lastProgress)
 								/* Will print something like the following.
 								{
 								  "id" : "ce011fdc-8762-4dcb-84eb-a77333e28109",
 								  "runId" : "88e2ff94-ede0-45a8-b687-6316fbef529a",
 								  "name" : "MyQuery",
 								  "timestamp" : "2016-12-14T18:45:24.873Z",
 								  "numInputRows" : 10,
 								  "inputRowsPerSecond" : 120.0,
 								  "processedRowsPerSecond" : 200.0,
 								  "durationMs" : {
 								    "triggerExecution" : 3,
 								    "getOffset" : 2
 								  },
 								  "eventTime" : {
 								    "watermark" : "2016-12-14T18:45:24.873Z"
 								  },
 								  "stateOperators" : [ ],
 								  "sources" : [ {
 								    "description" : "KafkaSource[Subscribe[topic-0]]",
 								    "startOffset" : {
 								      "topic-0" : {
 								        "2" : 0,
 								        "4" : 1,
 								        "1" : 1,
 								        "3" : 1,
 								        "0" : 1
 								      }
 								    },
 								    "endOffset" : {
 								      "topic-0" : {
 								        "2" : 0,
 								        "4" : 115,
 								        "1" : 134,
 								        "3" : 21,
 								        "0" : 534
 								      }
 								    },
 								    "numInputRows" : 10,
 								    "inputRowsPerSecond" : 120.0,
 								    "processedRowsPerSecond" : 200.0
 								  } ],
 								  "sink" : {
 								    "description" : "MemorySink"
 								  }
 								}
 								*/
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								println(query.status)
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								/*  Will print something like the following.
 								{
 								  "message" : "Waiting for data to arrive",
 								  "isDataAvailable" : false,
 								  "isTriggerActive" : false
 								}
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								*/
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								StreamingQuery query = ...
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								System.out.println(query.lastProgress());
 								/* Will print something like the following.
 								{
 								  "id" : "ce011fdc-8762-4dcb-84eb-a77333e28109",
 								  "runId" : "88e2ff94-ede0-45a8-b687-6316fbef529a",
 								  "name" : "MyQuery",
 								  "timestamp" : "2016-12-14T18:45:24.873Z",
 								  "numInputRows" : 10,
 								  "inputRowsPerSecond" : 120.0,
 								  "processedRowsPerSecond" : 200.0,
 								  "durationMs" : {
 								    "triggerExecution" : 3,
 								    "getOffset" : 2
 								  },
 								  "eventTime" : {
 								    "watermark" : "2016-12-14T18:45:24.873Z"
 								  },
 								  "stateOperators" : [ ],
 								  "sources" : [ {
 								    "description" : "KafkaSource[Subscribe[topic-0]]",
 								    "startOffset" : {
 								      "topic-0" : {
 								        "2" : 0,
 								        "4" : 1,
 								        "1" : 1,
 								        "3" : 1,
 								        "0" : 1
 								      }
 								    },
 								    "endOffset" : {
 								      "topic-0" : {
 								        "2" : 0,
 								        "4" : 115,
 								        "1" : 134,
 								        "3" : 21,
 								        "0" : 534
 								      }
 								    },
 								    "numInputRows" : 10,
 								    "inputRowsPerSecond" : 120.0,
 								    "processedRowsPerSecond" : 200.0
 								  } ],
 								  "sink" : {
 								    "description" : "MemorySink"
 								  }
 								}
 								*/
 								System.out.println(query.status());
 								/*  Will print something like the following.
 								{
 								  "message" : "Waiting for data to arrive",
 								  "isDataAvailable" : false,
 								  "isTriggerActive" : false
 								}
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								*/
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								query = ...  # a StreamingQuery
 								print(query.lastProgress)
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								'''
 								Will print something like the following.
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								{u'stateOperators': [], u'eventTime': {u'watermark': u'2016-12-14T18:45:24.873Z'}, u'name': u'MyQuery', u'timestamp': u'2016-12-14T18:45:24.873Z', u'processedRowsPerSecond': 200.0, u'inputRowsPerSecond': 120.0, u'numInputRows': 10, u'sources': [{u'description': u'KafkaSource[Subscribe[topic-0]]', u'endOffset': {u'topic-0': {u'1': 134, u'0': 534, u'3': 21, u'2': 0, u'4': 115}}, u'processedRowsPerSecond': 200.0, u'inputRowsPerSecond': 120.0, u'numInputRows': 10, u'startOffset': {u'topic-0': {u'1': 1, u'0': 1, u'3': 1, u'2': 0, u'4': 1}}}], u'durationMs': {u'getOffset': 2, u'triggerExecution': 3}, u'runId': u'88e2ff94-ede0-45a8-b687-6316fbef529a', u'id': u'ce011fdc-8762-4dcb-84eb-a77333e28109', u'sink': {u'description': u'MemorySink'}}
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								'''
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
 								print(query.status)
 								'''
 								Will print something like the following.
 								{u'message': u'Waiting for data to arrive', u'isTriggerActive': False, u'isDataAvailable': False}
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								'''
 								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								query <- ...  # a StreamingQuery
 								lastProgress(query)
 								'''
 								Will print something like the following.
 								{
 								  "id" : "8c57e1ec-94b5-4c99-b100-f694162df0b9",
 								  "runId" : "ae505c5a-a64e-4896-8c28-c7cbaf926f16",
 								  "name" : null,
 								  "timestamp" : "2017-04-26T08:27:28.835Z",
 								  "numInputRows" : 0,
 								  "inputRowsPerSecond" : 0.0,
 								  "processedRowsPerSecond" : 0.0,
 								  "durationMs" : {
 								    "getOffset" : 0,
 								    "triggerExecution" : 1
 								  },
 								  "stateOperators" : [ {
 								    "numRowsTotal" : 4,
 								    "numRowsUpdated" : 0
 								  } ],
 								  "sources" : [ {
 								    "description" : "TextSocketSource[host: localhost, port: 9999]",
 								    "startOffset" : 1,
 								    "endOffset" : 1,
 								    "numInputRows" : 0,
 								    "inputRowsPerSecond" : 0.0,
 								    "processedRowsPerSecond" : 0.0
 								  } ],
 								  "sink" : {
 								    "description" : "org.apache.spark.sql.execution.streaming.ConsoleSink@76b37531"
 								  }
 								}
 								'''
 								status(query)
 								'''
 								Will print something like the following.
 								{
 								  "message" : "Waiting for data to arrive",
 								  "isDataAvailable" : false,
 								  "isTriggerActive" : false
 								}
 								'''
 								{% endhighlight %}
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								</div>
 								</div>
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								### Reporting Metrics programmatically using Asynchronous APIs
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
 								You can also asynchronously monitor all queries associated with a
 								`SparkSession` by attaching a `StreamingQueryListener`
 								([Scala](api/scala/index.html#org.apache.spark.sql.streaming.StreamingQueryListener)/[Java](api/java/org/apache/spark/sql/streaming/StreamingQueryListener.html) docs).
 								Once you attach your custom `StreamingQueryListener` object with
 								`sparkSession.streams.attachListener()`, you will get callbacks when a query is started and
 								stopped and when there is progress made in an active query. Here is an example,
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								val spark: SparkSession = ...
 								spark.streams.addListener(new StreamingQueryListener() {
 								    override def onQueryStarted(queryStarted: QueryStartedEvent): Unit = {
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        println("Query started: " + queryStarted.id)
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								    }
 								    override def onQueryTerminated(queryTerminated: QueryTerminatedEvent): Unit = {
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        println("Query terminated: " + queryTerminated.id)
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								    }
 								    override def onQueryProgress(queryProgress: QueryProgressEvent): Unit = {
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        println("Query made progress: " + queryProgress.progress)
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								    }
 								})
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								SparkSession spark = ...
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								spark.streams().addListener(new StreamingQueryListener() {
 								    @Override
 								    public void onQueryStarted(QueryStartedEvent queryStarted) {
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        System.out.println("Query started: " + queryStarted.id());
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								    }
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								    @Override
 								    public void onQueryTerminated(QueryTerminatedEvent queryTerminated) {
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        System.out.println("Query terminated: " + queryTerminated.id());
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								    }
-												[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

											
										
										
											2017-01-02 09:40:06 -05:00
+								    @Override
 								    public void onQueryProgress(QueryProgressEvent queryProgress) {
-												[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

											
										
										
											2016-12-28 15:11:25 -05:00
+								        System.out.println("Query made progress: " + queryProgress.progress());
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								    }
 								});
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight bash %}
 								Not available in Python.
 								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight bash %}
 								Not available in R.
 								{% endhighlight %}
-												[SPARK-18461][DOCS][STRUCTUREDSTREAMING] Added more information about monitoring streaming queries

## What changes were proposed in this pull request?
<img width="941" alt="screen shot 2016-11-15 at 6 27 32 pm" src="https://cloud.githubusercontent.com/assets/663212/20332521/4190b858-ab61-11e6-93a6-4bdc05105ed9.png">
<img width="940" alt="screen shot 2016-11-15 at 6 27 45 pm" src="https://cloud.githubusercontent.com/assets/663212/20332525/44a0d01e-ab61-11e6-8668-47f925490d4f.png">

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15897 from tdas/SPARK-18461.

											
										
										
											2016-11-16 14:03:10 -05:00
+								</div>
 								</div>
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								### Reporting Metrics using Dropwizard
 								Spark supports reporting metrics using the [Dropwizard Library](monitoring.html#metrics). To enable metrics of Structured Streaming queries to be reported as well, you have to explicitly enable the configuration `spark.sql.streaming.metricsEnabled` in the SparkSession.
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
 								// or
 								spark.sql("SET spark.sql.streaming.metricsEnabled=true")
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								spark.conf().set("spark.sql.streaming.metricsEnabled", "true");
 								// or
 								spark.sql("SET spark.sql.streaming.metricsEnabled=true");
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								spark.conf.set("spark.sql.streaming.metricsEnabled", "true")
 								# or
 								spark.sql("SET spark.sql.streaming.metricsEnabled=true")
 								{% endhighlight %}
 								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								sql("SET spark.sql.streaming.metricsEnabled=true")
 								{% endhighlight %}
 								</div>
 								</div>
 								All queries started in the SparkSession after this configuration has been enabled will report metrics through Dropwizard to whatever [sinks](monitoring.html#metrics) have been configured (e.g. Ganglia, Graphite, JMX, etc.).
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								## Recovering from Failures with Checkpointing
-												[MINOR][DOC] Fix some typos and grammar issues

## What changes were proposed in this pull request?

Easy fix in the documentation.

## How was this patch tested?

N/A

Closes #20948

Author: Daniel Sakuma <dsakuma@gmail.com>

Closes #20928 from dsakuma/fix_typo_configuration_docs.

											
										
										
											2018-04-06 01:37:08 -04:00
+								In case of a failure or intentional shutdown, you can recover the previous progress and state of a previous query, and continue where it left off. This is done using checkpointing and write-ahead logs. You can configure a query with a checkpoint location, and the query will save all the progress information (i.e. range of offsets processed in each trigger) and the running aggregates (e.g. word counts in the [quick example](#quick-example)) to the checkpoint location. This checkpoint location has to be a path in an HDFS compatible file system, and can be set as an option in the DataStreamWriter when [starting a query](#starting-streaming-queries).
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								aggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream
 								  .outputMode("complete")
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								  .option("checkpointLocation", "path/to/HDFS/dir")
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .format("memory")
 								  .start()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								aggDF
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .writeStream()
 								  .outputMode("complete")
-												[MINOR][DOC] Use standard quotes instead of "curly quote" marks from Mac in structured streaming programming guides

## What changes were proposed in this pull request?

This PR fixes curly quotes (`“` and `”` ) to standard quotes (`"`).

This will be a actual problem when users copy and paste the examples. This would not work.

This seems only happening in `structured-streaming-programming-guide.md`.

## How was this patch tested?

Manually built.

This will change some examples to be correctly marked down as below:

![2016-08-23 3 24 13](https://cloud.githubusercontent.com/assets/6477701/17882878/2a38332e-694a-11e6-8e84-76bdb89151e0.png)

to

![2016-08-23 3 26 06](https://cloud.githubusercontent.com/assets/6477701/17882888/376eaa28-694a-11e6-8b88-32ea83997037.png)

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14770 from HyukjinKwon/minor-quotes.

											
										
										
											2016-08-23 16:21:43 -04:00
+								  .option("checkpointLocation", "path/to/HDFS/dir")
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								  .format("memory")
 								  .start();
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								aggDF \
-												[DOCS][SS] fix structured streaming python example

## What changes were proposed in this pull request?

- SS python example: `TypeError: 'xxx' object is not callable`
- some other doc issue.

## How was this patch tested?

Jenkins.

Author: uncleGen <hustyugm@gmail.com>

Closes #17257 from uncleGen/docs-ss-python.

											
										
										
											2017-03-12 04:29:37 -04:00
+								    .writeStream \
-												[MINOR][DOCS] Fix minor typos in python example code

## What changes were proposed in this pull request?

Fix minor typos python example code in streaming programming guide

## How was this patch tested?

N/A

Author: Dmitriy Sokolov <silentsokolov@gmail.com>

Closes #14805 from silentsokolov/fix-typos.

											
										
										
											2016-08-30 06:23:37 -04:00
+								    .outputMode("complete") \
 								    .option("checkpointLocation", "path/to/HDFS/dir") \
 								    .format("memory") \
-												[SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation

## What changes were proposed in this pull request?

Originally this PR was based on #14491 but I realised that fixing examples are more sensible rather than comments.

This PR fixes three things below:

 - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.

- Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.

- Fix `StructuredNetworkWordCountWindowed` and  `StructuredNetworkWordCount` in Java and Scala to initially load `DataFrame` and `Dataset<Row>` to be consistent with the comments and some examples in `structured-streaming-programming-guide.md` and to match Scala and Java to Python one (Python one loads it as `DataFrame` initially).

## How was this patch tested?

N/A

Closes https://github.com/apache/spark/pull/14491

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>

Closes #14564 from HyukjinKwon/SPARK-16886.

											
										
										
											2016-08-11 06:31:52 -04:00
+								    .start()
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								{% endhighlight %}
-												[SPARK-20015][SPARKR][SS][DOC][EXAMPLE] Document R Structured Streaming (experimental) in R vignettes and R & SS programming guide, R example

## What changes were proposed in this pull request?

Add
- R vignettes
- R programming guide
- SS programming guide
- R example

Also disable spark.als in vignettes for now since it's failing (SPARK-20402)

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #17814 from felixcheung/rdocss.

											
										
										
											2017-05-04 03:27:10 -04:00
+								</div>
 								<div data-lang="r"  markdown="1">
 								{% highlight r %}
 								write.stream(aggDF, "memory", outputMode = "complete", checkpointLocation = "path/to/HDFS/dir")
 								{% endhighlight %}
-												[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide

Title defines all.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13945 from tdas/SPARK-16256.

											
										
										
											2016-06-29 14:45:57 -04:00
+								</div>
 								</div>
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
 								## Recovery Semantics after Changes in a Streaming Query
 								There are limitations on what changes in a streaming query are allowed between restarts from the
 								same checkpoint location. Here are a few kinds of changes that are either not allowed, or
 								the effect of the change is not well-defined. For all of them:
 								- The term *allowed* means you can do the specified change but whether the semantics of its effect
 								  is well-defined depends on the query and the change.
 								- The term *not allowed* means you should not do the specified change as the restarted query is likely
 								  to fail with unpredictable errors. `sdf` represents a streaming DataFrame/Dataset
 								  generated with sparkSession.readStream.
 								**Types of changes**
 								- *Changes in the number or type (i.e. different source) of input sources*: This is not allowed.
 								- *Changes in the parameters of input sources*: Whether this is allowed and whether the semantics
 								  of the change are well-defined depends on the source and the query. Here are a few examples.
 								  - Addition/deletion/modification of rate limits is allowed: `spark.readStream.format("kafka").option("subscribe", "topic")` to `spark.readStream.format("kafka").option("subscribe", "topic").option("maxOffsetsPerTrigger", ...)`
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								  - Changes to subscribed topics/files are generally not allowed as the results are unpredictable: `spark.readStream.format("kafka").option("subscribe", "topic")` to `spark.readStream.format("kafka").option("subscribe", "newTopic")`
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
 								- *Changes in the type of output sink*: Changes between a few specific combinations of sinks
 								  are allowed. This needs to be verified on a case-by-case basis. Here are a few examples.
 								  - File sink to Kafka sink is allowed. Kafka will see only the new data.
 								  - Kafka sink to file sink is not allowed.
 								  - Kafka sink changed to foreach, or vice versa is allowed.
 								- *Changes in the parameters of output sink*: Whether this is allowed and whether the semantics of
 								  the change are well-defined depends on the sink and the query. Here are a few examples.
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								  - Changes to output directory of a file sink are not allowed: `sdf.writeStream.format("parquet").option("path", "/somePath")` to `sdf.writeStream.format("parquet").option("path", "/anotherPath")`
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								  - Changes to output topic are allowed: `sdf.writeStream.format("kafka").option("topic", "someTopic")` to `sdf.writeStream.format("kafka").option("topic", "anotherTopic")`
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								  - Changes to the user-defined foreach sink (that is, the `ForeachWriter` code) are allowed, but the semantics of the change depends on the code.
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
 								- *Changes in projection / filter / map-like operations**: Some cases are allowed. For example:
 								  - Addition / deletion of filters is allowed: `sdf.selectExpr("a")` to `sdf.where(...).selectExpr("a").filter(...)`.
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								  - Changes in projections with same output schema are allowed: `sdf.selectExpr("stringColumn AS json").writeStream` to `sdf.selectExpr("anotherStringColumn AS json").writeStream`
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
 								  - Changes in projections with different output schema are conditionally allowed: `sdf.selectExpr("a").writeStream` to `sdf.selectExpr("b").writeStream` is allowed only if the output sink allows the schema change from `"a"` to `"b"`.
 								- *Changes in stateful operations*: Some operations in streaming queries need to maintain
 								  state data in order to continuously update the result. Structured Streaming automatically checkpoints
 								  the state data to fault-tolerant storage (for example, HDFS, AWS S3, Azure Blob storage) and restores it after restart.
 								  However, this assumes that the schema of the state data remains same across restarts. This means that
 								  *any changes (that is, additions, deletions, or schema modifications) to the stateful operations of a streaming query are not allowed between restarts*.
 								  Here is the list of stateful operations whose schema should not be changed between restarts in order to ensure state recovery:
 								  - *Streaming aggregation*: For example, `sdf.groupBy("a").agg(...)`. Any change in number or type of grouping keys or aggregates is not allowed.
 								  - *Streaming deduplication*: For example, `sdf.dropDuplicates("a")`. Any change in number or type of grouping keys or aggregates is not allowed.
 								  - *Stream-stream join*: For example, `sdf1.join(sdf2, ...)` (i.e. both inputs are generated with `sparkSession.readStream`). Changes
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								    in the schema or equi-joining columns are not allowed. Changes in join type (outer or inner) are not allowed. Other changes in the join condition are ill-defined.
-												[SPARK-25639][DOCS] Added docs for foreachBatch, python foreach and multiple watermarks

## What changes were proposed in this pull request?

Added
- Python foreach
- Scala, Java and Python foreachBatch
- Multiple watermark policy
- The semantics of what changes are allowed to the streaming between restarts.

## How was this patch tested?
No tests

Closes #22627 from tdas/SPARK-25639.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

											
										
										
											2018-10-08 17:32:04 -04:00
 								  - *Arbitrary stateful operation*: For example, `sdf.groupByKey(...).mapGroupsWithState(...)` or `sdf.groupByKey(...).flatMapGroupsWithState(...)`.
 								    Any change to the schema of the user-defined state and the type of timeout is not allowed.
 								    Any change within the user-defined state-mapping function are allowed, but the semantic effect of the change depends on the user-defined logic.
 								    If you really want to support state schema changes, then you can explicitly encode/decode your complex state data
 								    structures into bytes using an encoding/decoding scheme that supports schema migration. For example,
 								    if you save your state as Avro-encoded bytes, then you are free to change the Avro-state-schema between query
 								    restarts as the binary state will always be restored successfully.
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								# Continuous Processing
 								## [Experimental]
 								{:.no_toc}
-												[SPARK-23142][SS][DOCS] Added docs for continuous processing

## What changes were proposed in this pull request?

Added documentation for continuous processing. Modified two locations.
- Modified the overview to have a mention of Continuous Processing.
- Added a new section on Continuous Processing at the end.

![image](https://user-images.githubusercontent.com/663212/35083551-a3dd23f6-fbd4-11e7-9e7e-90866f131ca9.png)
![image](https://user-images.githubusercontent.com/663212/35083618-d844027c-fbd4-11e7-9fde-75992cc517bd.png)

## How was this patch tested?
N/A

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20308 from tdas/SPARK-23142.

											
										
										
											2018-01-18 19:29:45 -05:00
+								**Continuous processing** is a new, experimental streaming execution mode introduced in Spark 2.3 that enables low (~1 ms) end-to-end latency with at-least-once fault-tolerance guarantees. Compare this with the default *micro-batch processing* engine which can achieve exactly-once guarantees but achieve latencies of ~100ms at best. For some types of queries (discussed below), you can choose which mode to execute them in without modifying the application logic (i.e. without changing the DataFrame/Dataset operations).
 								To run a supported query in continuous processing mode, all you need to do is specify a **continuous trigger** with the desired checkpoint interval as a parameter. For example,
 								<div class="codetabs">
 								<div data-lang="scala"  markdown="1">
 								{% highlight scala %}
 								import org.apache.spark.sql.streaming.Trigger
 								spark
 								  .readStream
 								  .format("rate")
 								  .option("rowsPerSecond", "10")
 								  .option("")
 								spark
 								  .readStream
 								  .format("kafka")
 								  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
 								  .option("subscribe", "topic1")
 								  .load()
 								  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
 								  .writeStream
 								  .format("kafka")
 								  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
 								  .option("topic", "topic1")
 								  .trigger(Trigger.Continuous("1 second"))  // only change in query
 								  .start()
 								{% endhighlight %}
 								</div>
 								<div data-lang="java"  markdown="1">
 								{% highlight java %}
 								import org.apache.spark.sql.streaming.Trigger;
 								spark
 								  .readStream
 								  .format("kafka")
 								  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
 								  .option("subscribe", "topic1")
 								  .load()
 								  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
 								  .writeStream
 								  .format("kafka")
 								  .option("kafka.bootstrap.servers", "host1:port1,host2:port2")
 								  .option("topic", "topic1")
 								  .trigger(Trigger.Continuous("1 second"))  // only change in query
 								  .start();
 								{% endhighlight %}
 								</div>
 								<div data-lang="python"  markdown="1">
 								{% highlight python %}
 								spark \
 								  .readStream \
 								  .format("kafka") \
 								  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
 								  .option("subscribe", "topic1") \
 								  .load() \
 								  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)") \
 								  .writeStream \
 								  .format("kafka") \
 								  .option("kafka.bootstrap.servers", "host1:port1,host2:port2") \
 								  .option("topic", "topic1") \
 								  .trigger(continuous="1 second") \     # only change in query
 								  .start()
 								{% endhighlight %}
 								</div>
 								</div>
-												[MINOR][DOCS][WIP] Fix Typos

## What changes were proposed in this pull request?
Fix Typos.

## How was this patch tested?
NA

Closes #23145 from kjmrknsn/docUpdate.

Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>

											
										
										
											2018-11-29 11:39:00 -05:00
+								A checkpoint interval of 1 second means that the continuous processing engine will record the progress of the query every second. The resulting checkpoints are in a format compatible with the micro-batch engine, hence any query can be restarted with any trigger. For example, a supported query started with the micro-batch mode can be restarted in continuous mode, and vice versa. Note that any time you switch to continuous mode, you will get at-least-once fault-tolerance guarantees.
-												[SPARK-23142][SS][DOCS] Added docs for continuous processing

## What changes were proposed in this pull request?

Added documentation for continuous processing. Modified two locations.
- Modified the overview to have a mention of Continuous Processing.
- Added a new section on Continuous Processing at the end.

![image](https://user-images.githubusercontent.com/663212/35083551-a3dd23f6-fbd4-11e7-9e7e-90866f131ca9.png)
![image](https://user-images.githubusercontent.com/663212/35083618-d844027c-fbd4-11e7-9fde-75992cc517bd.png)

## How was this patch tested?
N/A

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20308 from tdas/SPARK-23142.

											
										
										
											2018-01-18 19:29:45 -05:00
 								## Supported Queries
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								{:.no_toc}
-												[SPARK-23142][SS][DOCS] Added docs for continuous processing

## What changes were proposed in this pull request?

Added documentation for continuous processing. Modified two locations.
- Modified the overview to have a mention of Continuous Processing.
- Added a new section on Continuous Processing at the end.

![image](https://user-images.githubusercontent.com/663212/35083551-a3dd23f6-fbd4-11e7-9e7e-90866f131ca9.png)
![image](https://user-images.githubusercontent.com/663212/35083618-d844027c-fbd4-11e7-9fde-75992cc517bd.png)

## How was this patch tested?
N/A

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20308 from tdas/SPARK-23142.

											
										
										
											2018-01-18 19:29:45 -05:00
+								As of Spark 2.3, only the following type of queries are supported in the continuous processing mode.
 								- *Operations*: Only map-like Dataset/DataFrame operations are supported in continuous mode, that is, only projections (`select`, `map`, `flatMap`, `mapPartitions`, etc.) and selections (`where`, `filter`, etc.).
 								  + All SQL functions are supported except aggregation functions (since aggregations are not yet supported), `current_timestamp()` and `current_date()` (deterministic computations using time is challenging).
 								- *Sources*:
 								  + Kafka source: All options are supported.
 								  + Rate source: Good for testing. Only options that are supported in the continuous mode are `numPartitions` and `rowsPerSecond`.
 								- *Sinks*:
 								  + Kafka sink: All options are supported.
 								  + Memory sink: Good for debugging.
 								  + Console sink: Good for debugging. All options are supported. Note that the console will print every checkpoint interval that you have specified in the continuous trigger.
 								See [Input Sources](#input-sources) and [Output Sinks](#output-sinks) sections for more details on them. While the console sink is good for testing, the end-to-end low-latency processing can be best observed with Kafka as the source and sink, as this allows the engine to process the data and make the results available in the output topic within milliseconds of the input data being available in the input topic.
 								## Caveats
-												[SPARK-23454][SS][DOCS] Added trigger information to the Structured Streaming programming guide

## What changes were proposed in this pull request?

- Added clear information about triggers
- Made the semantics guarantees of watermarks more clear for streaming aggregations and stream-stream joins.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20631 from tdas/SPARK-23454.

											
										
										
											2018-02-20 21:16:10 -05:00
+								{:.no_toc}
-												[SPARK-23142][SS][DOCS] Added docs for continuous processing

## What changes were proposed in this pull request?

Added documentation for continuous processing. Modified two locations.
- Modified the overview to have a mention of Continuous Processing.
- Added a new section on Continuous Processing at the end.

![image](https://user-images.githubusercontent.com/663212/35083551-a3dd23f6-fbd4-11e7-9e7e-90866f131ca9.png)
![image](https://user-images.githubusercontent.com/663212/35083618-d844027c-fbd4-11e7-9fde-75992cc517bd.png)

## How was this patch tested?
N/A

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20308 from tdas/SPARK-23142.

											
										
										
											2018-01-18 19:29:45 -05:00
+								- Continuous processing engine launches multiple long-running tasks that continuously read data from sources, process it and continuously write to sinks. The number of tasks required by the query depends on how many partitions the query can read from the sources in parallel. Therefore, before starting a continuous processing query, you must ensure there are enough cores in the cluster to all the tasks in parallel. For example, if you are reading from a Kafka topic that has 10 partitions, then the cluster must have at least 10 cores for the query to make progress.
 								- Stopping a continuous processing stream may produce spurious task termination warnings. These can be safely ignored.
 								- There are currently no automatic retries of failed tasks. Any failure will lead to the query being stopped and it needs to be manually restarted from the checkpoint.
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00
+								# Additional Information
 								**Further Reading**
 								- See and run the
 								  [Scala]({{site.SPARK_GITHUB_URL}}/tree/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/scala/org/apache/spark/examples/sql/streaming)/[Java]({{site.SPARK_GITHUB_URL}}/tree/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/java/org/apache/spark/examples/sql/streaming)/[Python]({{site.SPARK_GITHUB_URL}}/tree/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/python/sql/streaming)/[R]({{site.SPARK_GITHUB_URL}}/tree/v{{site.SPARK_VERSION_SHORT}}/examples/src/main/r/streaming)
 								  examples.
 								    - [Instructions](index.html#running-the-examples-and-shell) on how to run Spark examples
 								- Read about integrating with Kafka in the [Structured Streaming Kafka Integration Guide](structured-streaming-kafka-integration.html)
 								- Read more details about using DataFrames/Datasets in the [Spark SQL Programming Guide](sql-programming-guide.html)
 								- Third-party Blog Posts
 								    - [Real-time Streaming ETL with Structured Streaming in Apache Spark 2.1 (Databricks Blog)](https://databricks.com/blog/2017/01/19/real-time-streaming-etl-structured-streaming-apache-spark-2-1.html)
 								    - [Real-Time End-to-End Integration with Apache Kafka in Apache Spark’s Structured Streaming (Databricks Blog)](https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html)
 								    - [Event-time Aggregation and Watermarking in Apache Spark’s Structured Streaming (Databricks Blog)](https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html)
 								**Talks**
-												[SPARK-23064][DOCS][SS] Added documentation for stream-stream joins

## What changes were proposed in this pull request?
Added documentation for stream-stream joins

![image](https://user-images.githubusercontent.com/663212/35018744-e999895a-fad7-11e7-9d6a-8c7a73e6eb9c.png)

![image](https://user-images.githubusercontent.com/663212/35018775-157eb464-fad8-11e7-879e-47a2fcbd8690.png)

![image](https://user-images.githubusercontent.com/663212/35018784-27791a24-fad8-11e7-98f4-7ff246f62a74.png)

![image](https://user-images.githubusercontent.com/663212/35018791-36a80334-fad8-11e7-9791-f85efa7c6ba2.png)

## How was this patch tested?

N/a

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #20255 from tdas/join-docs.

											
										
										
											2018-01-17 19:41:43 -05:00
+								- Spark Summit Europe 2017
 								  - Easy, Scalable, Fault-tolerant Stream Processing with Structured Streaming in Apache Spark -
 								    [Part 1 slides/video](https://databricks.com/session/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-in-apache-spark), [Part 2 slides/video](https://databricks.com/session/easy-scalable-fault-tolerant-stream-processing-with-structured-streaming-in-apache-spark-continues)
 								  - Deep Dive into Stateful Stream Processing in Structured Streaming - [slides/video](https://databricks.com/session/deep-dive-into-stateful-stream-processing-in-structured-streaming)
 								- Spark Summit 2016
 								  - A Deep Dive into Structured Streaming - [slides/video](https://spark-summit.org/2016/events/a-deep-dive-into-structured-streaming/)
-												[SPARK-21267][SS][DOCS] Update Structured Streaming Documentation

## What changes were proposed in this pull request?

Few changes to the Structured Streaming documentation
- Clarify that the entire stream input table is not materialized
- Add information for Ganglia
- Add Kafka Sink to the main docs
- Removed a couple of leftover experimental tags
- Added more associated reading material and talk videos.

In addition, https://github.com/apache/spark/pull/16856 broke the link to the RDD programming guide in several places while renaming the page. This PR fixes those sameeragarwal cloud-fan.
- Added a redirection to avoid breaking internal and possible external links.
- Removed unnecessary redirection pages that were there since the separate scala, java, and python programming guides were merged together in 2013 or 2014.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #18485 from tdas/SPARK-21267.

											
										
										
											2017-07-06 20:28:20 -04:00