spark-instrumented-optimizer/python/pyspark/streaming
Tathagata Das 2cb976355c [SPARK-24565][SS] Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame
## What changes were proposed in this pull request?

Currently, the micro-batches in MicroBatchExecution are not exposed to the user through any public API. We avoided exposing them so that every API we offer could eventually be supported in the Continuous engine as well. But now that we have a better sense of building a ContinuousExecution, I am considering adding APIs that run only in MicroBatchExecution. I have quite a few use cases where exposing the micro-batch output as a DataFrame is useful.
- Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() while learning).
- Reuse batch data sources for output when a streaming version does not exist (e.g. the Redshift data source).
- Write the output rows to multiple places by writing twice for each batch. This is not the most elegant approach to multiple-output streaming queries, but it is likely to be better than running two streaming queries that process the same data twice.

The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to Scala/Java/Python `DataStreamWriter`.
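
As a rough illustration of the multiple-output use case, here is a minimal Python sketch. It assumes the callback shape used by released PySpark, where the function passed to `foreachBatch` receives the batch DataFrame and a batch id (the signature in the proposal above shows only the Dataset). The helper names `make_multi_sink_handler` and `start_query` are hypothetical and not part of this patch; persisting the batch between writes is one possible design choice, not a requirement of the API.

```python
def make_multi_sink_handler(sinks):
    """Build a foreachBatch callback that sends each micro-batch to several sinks.

    `sinks` is a list of callables taking (batch_df, batch_id). Both the helper
    and the sink callables are hypothetical names used only for illustration.
    """
    def handle_batch(batch_df, batch_id):
        # Cache the batch so the second write does not recompute it (assumed
        # to be worthwhile here; it costs memory on the executors).
        batch_df.persist()
        try:
            for sink in sinks:
                sink(batch_df, batch_id)
        finally:
            # Release the cached batch even if one of the sinks fails.
            batch_df.unpersist()
    return handle_batch


def start_query(streaming_df, handler):
    """Attach the handler to a streaming DataFrame and start the query."""
    return streaming_df.writeStream.foreachBatch(handler).start()
```

A sink in this sketch could be as small as `lambda df, batch_id: df.write.mode("append").parquet(path)` for some output `path`, which is also how a batch-only data source can be reused for streaming output.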

## How was this patch tested?
New unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #21571 from tdas/foreachBatch.
2018-06-19 13:56:51 -07:00

| File | Last commit | Date |
| --- | --- | --- |
| `__init__.py` | [SPARK-6328][PYTHON] Python API for StreamingListener | 2015-11-16 11:29:27 -08:00 |
| `context.py` | [SPARK-24565][SS] Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame | 2018-06-19 13:56:51 -07:00 |
| `dstream.py` | [MINOR] Fix Typos 'an -> a' | 2016-06-06 09:35:47 +01:00 |
| `flume.py` | [SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnings namespace in flume.py | 2017-12-29 14:46:03 +09:00 |
| `kafka.py` | [SPARK-24014][PYSPARK] Add onStreamingStarted method to StreamingListener | 2018-04-19 10:00:57 +08:00 |
| `kinesis.py` | [SPARK-19405][STREAMING] Support for cross-account Kinesis reads via STS | 2017-02-22 11:32:36 -05:00 |
| `listener.py` | [SPARK-24014][PYSPARK] Add onStreamingStarted method to StreamingListener | 2018-04-19 10:00:57 +08:00 |
| `tests.py` | [SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch in PythonTransformFunction | 2018-06-09 01:27:51 +07:00 |
| `util.py` | [SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch in PythonTransformFunction | 2018-06-09 01:27:51 +07:00 |