spark-instrumented-optimizer

Tathagata Das 2cb976355c [SPARK-24565][SS] Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame		2018-06-19 13:56:51 -07:00

## What changes were proposed in this pull request?

Currently, the micro-batches in MicroBatchExecution are not exposed to the user through any public API. We did not want to expose the micro-batches, so that every API we expose could eventually be supported in the Continuous engine. But now that we have a better sense of building a ContinuousExecution, I am considering adding APIs which will run only in MicroBatchExecution. I have quite a few use cases where exposing the microbatch output as a DataFrame is useful.

- Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not exist (e.g. the Redshift data source).
- Write the output rows to multiple places by writing twice for each batch. This is not the most elegant thing to do for multiple-output streaming queries, but it is likely to be better than running two streaming queries processing the same data twice.

The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to the Scala/Java/Python `DataStreamWriter`.

## How was this patch tested?

New unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #21571 from tdas/foreachBatch.
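
The callback shape proposed in the commit message above, `foreachBatch(f: Dataset[T] => Unit)`, can be pictured with a minimal pure-Python sketch. It does not use Spark: `run_micro_batches`, `write_twice`, `sink_a` and `sink_b` are hypothetical names invented for illustration, and each micro-batch is modelled as a plain list of dict rows rather than a DataFrame. The sketch only shows the third use case, writing each batch to two places from one callback.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, int]  # assumption: a row is modelled as a plain dict


def run_micro_batches(batches: Iterable[List[Row]],
                      f: Callable[[List[Row]], None]) -> int:
    """Hypothetical driver loop: call f once per micro-batch, in order.

    Returns the number of batches handed to f.
    """
    processed = 0
    for batch in batches:
        f(batch)  # the user callback sees the whole batch at once
        processed += 1
    return processed


# Two in-memory lists stand in for two output locations.
sink_a: List[Row] = []
sink_b: List[Row] = []


def write_twice(batch: List[Row]) -> None:
    """Write every row to sink_a and only even-valued rows to sink_b."""
    sink_a.extend(batch)
    sink_b.extend(row for row in batch if row["value"] % 2 == 0)


batches = [
    [{"value": 1}, {"value": 2}],
    [{"value": 3}, {"value": 4}],
]
processed = run_micro_batches(batches, write_twice)
```

Because the callback receives a complete batch, it can reuse any batch-only writer or library inside its body, which is the point of the first two use cases listed in the commit message.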
__init__.py	[SPARK-6328][PYTHON] Python API for StreamingListener	2015-11-16 11:29:27 -08:00
context.py	[SPARK-24565][SS] Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame	2018-06-19 13:56:51 -07:00
dstream.py	[MINOR] Fix Typos 'an -> a'	2016-06-06 09:35:47 +01:00
flume.py	[SPARK-22313][PYTHON][FOLLOWUP] Explicitly import warnings namespace in flume.py	2017-12-29 14:46:03 +09:00
kafka.py	[SPARK-24014][PYSPARK] Add onStreamingStarted method to StreamingListener	2018-04-19 10:00:57 +08:00
kinesis.py	[SPARK-19405][STREAMING] Support for cross-account Kinesis reads via STS	2017-02-22 11:32:36 -05:00
listener.py	[SPARK-24014][PYSPARK] Add onStreamingStarted method to StreamingListener	2018-04-19 10:00:57 +08:00
tests.py	[SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch in PythonTransformFunction	2018-06-09 01:27:51 +07:00
util.py	[SPARK-17756][PYTHON][STREAMING] Workaround to avoid return type mismatch in PythonTransformFunction	2018-06-09 01:27:51 +07:00