Tathagata Das 2cb976355c [SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame
## What changes were proposed in this pull request?

Currently, the micro-batches in the MicroBatchExecution are not exposed to the user through any public API. This was deliberate: by not exposing micro-batches, every API we expose can eventually be supported in the Continuous engine as well. But now that we have a better sense of building a ContinuousExecution, I am considering adding APIs that will run only in the MicroBatchExecution. I have quite a few use cases where exposing the micro-batch output as a DataFrame is useful.
- Pass the output rows of each batch to a library that is designed only for batch jobs (for example, many ML libraries need to collect() while learning).
- Reuse batch data sources for output whose streaming version does not exist (e.g., the Redshift data source).
- Write the output rows to multiple places by writing twice for each batch (see the sketch below). This is not the most elegant way to handle multiple-output streaming queries, but it is likely better than running two streaming queries that process the same data twice.

The proposal is to add a method `foreachBatch(f: Dataset[T] => Unit)` to the Scala/Java/Python `DataStreamWriter`.
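
For illustration, here is a minimal PySpark sketch of what using this API for use case 3 (writing each batch to two places) could look like. The `rate` source, sink paths, and checkpoint location are placeholders, and the Python callback is assumed to receive the micro-batch as a plain DataFrame along with a batch id:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("foreachBatchSketch").getOrCreate()

# A toy streaming source: the built-in "rate" source emits (timestamp, value) rows.
stream_df = spark.readStream.format("rate").load()

def write_twice(batch_df, batch_id):
    # batch_df is a plain (batch) DataFrame, so any existing batch writer
    # can be reused here, including sources with no streaming version.
    batch_df.persist()  # avoid recomputing the micro-batch for the two writes
    batch_df.write.mode("append").format("parquet").save("/tmp/sink-a")  # placeholder paths
    batch_df.write.mode("append").format("json").save("/tmp/sink-b")
    batch_df.unpersist()

query = (
    stream_df.writeStream
        .option("checkpointLocation", "/tmp/foreach-batch-checkpoint")
        .foreachBatch(write_twice)
        .start()
)
query.awaitTermination()
```

Because the callback sees an ordinary batch DataFrame, a query using `foreachBatch` necessarily runs on the MicroBatchExecution engine, as noted above.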

## How was this patch tested?
New unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #21571 from tdas/foreachBatch.
2018-06-19 13:56:51 -07:00
__init__.py [SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark 2017-11-02 15:22:52 +01:00
catalog.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
column.py [SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark 2018-04-08 12:09:06 +08:00
conf.py [SPARK-23700][PYTHON] Cleanup imports in pyspark.sql 2018-03-26 12:42:32 +09:00
context.py [SPARK-23706][PYTHON] spark.conf.get(value, default=None) should produce None in PySpark 2018-03-18 20:24:14 +09:00
dataframe.py [SPARK-24215][PYSPARK] Implement _repr_html_ for dataframes in PySpark 2018-06-05 08:23:08 +07:00
functions.py [SPARK-24543][SQL] Support any type as DDL string for from_json's schema 2018-06-14 13:27:27 -07:00
group.py [SPARK-24392][PYTHON] Label pandas_udf as Experimental 2018-05-28 12:56:05 +08:00
readwriter.py [SPARK-23772][SQL] Provide an option to ignore column of all null values or empty array during JSON schema inference 2018-06-19 00:24:54 +08:00
session.py [SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf when creating pysp… 2018-06-14 13:16:20 -07:00
streaming.py [SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame 2018-06-19 13:56:51 -07:00
tests.py [SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame 2018-06-19 13:56:51 -07:00
types.py [SPARK-24057][PYTHON] put the real data type in the AssertionError message 2018-04-26 14:21:22 -07:00
udf.py [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor 2018-06-11 10:15:42 +08:00
utils.py [SPARK-24565][SS] Add API in Structured Streaming for exposing output rows of each microbatch as a DataFrame 2018-06-19 13:56:51 -07:00
window.py [SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause 2018-04-07 00:15:54 +08:00