spark-instrumented-optimizer/python/pyspark/sql
Shixiong Zhu 9bf4e2baad [SPARK-19497][SS] Implement streaming deduplication
## What changes were proposed in this pull request?

This PR adds a special streaming deduplication operator to support `dropDuplicates` with aggregation and watermark. It reuses the `dropDuplicates` API but creates a new logical plan, `Deduplication`, and a new physical plan, `DeduplicationExec`.

The following cases are supported:

- one or multiple `dropDuplicates()` without aggregation (with or without watermark)
- `dropDuplicates` before aggregation

Unsupported cases:

- `dropDuplicates` after aggregation

Breaking changes:
- `dropDuplicates` without aggregation doesn't work with `complete` or `update` mode.
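
To illustrate why a watermark matters for streaming deduplication, here is a minimal pure-Python sketch of a stateful deduplication step. It is a toy model, not the `DeduplicationExec` implementation: the class `StreamingDeduplicator`, its `delay_threshold` parameter, and the eviction rule are all assumptions made for illustration. In PySpark the corresponding query would presumably be written with `withWatermark(...)` followed by `dropDuplicates(...)`.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# An event is (key, event_time_in_seconds). Both fields are hypothetical.
Event = Tuple[str, int]


class StreamingDeduplicator:
    """Toy model of stateful deduplication with an optional watermark.

    Hypothetical helper for illustration only; it does not mirror
    Spark's internal operator.
    """

    def __init__(self, delay_threshold: Optional[int] = None) -> None:
        # None means "no watermark": state is never evicted.
        self.delay_threshold = delay_threshold
        # key -> event time of the first occurrence that was emitted
        self.seen: Dict[str, int] = {}
        self.max_event_time = 0
        self.watermark = 0

    def process_batch(self, batch: Iterable[Event]) -> List[Event]:
        output: List[Event] = []
        for key, event_time in batch:
            self.max_event_time = max(self.max_event_time, event_time)
            # With a watermark, rows older than it are treated as too late.
            if self.delay_threshold is not None and event_time < self.watermark:
                continue
            # Drop rows whose key is still remembered in state.
            if key in self.seen:
                continue
            self.seen[key] = event_time
            output.append((key, event_time))

        if self.delay_threshold is not None:
            # Advance the watermark after the batch, then evict old state.
            self.watermark = max(
                self.watermark, self.max_event_time - self.delay_threshold
            )
            self.seen = {
                k: t for k, t in self.seen.items() if t >= self.watermark
            }
        return output


dedup = StreamingDeduplicator(delay_threshold=10)
first = dedup.process_batch([("a", 1), ("b", 2), ("a", 3)])
second = dedup.process_batch([("c", 25), ("a", 26)])
third = dedup.process_batch([("b", 5), ("c", 30), ("a", 40)])
```

Without a watermark (`delay_threshold=None`) the `seen` dictionary in this sketch grows without bound, which is the trade-off the watermark is meant to address: state stays bounded, at the cost of dropping sufficiently late rows and of forgetting keys once they fall behind the watermark.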

## How was this patch tested?

New unit tests were added.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16970 from zsxwing/dedup.
2017-02-23 11:25:39 -08:00
__init__.py [SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes" 2016-08-06 05:02:59 +01:00
catalog.py [SPARK-19148][SQL] do not expose the external table concept in Catalog 2017-01-17 12:54:50 +08:00
column.py [SPARK-18541][PYTHON] Add metadata parameter to pyspark.sql.Column.alias() 2017-02-14 09:57:43 -08:00
conf.py [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code 2016-05-23 18:14:48 -07:00
context.py [SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error 2017-01-13 18:35:51 +08:00
dataframe.py [SPARK-19497][SS] Implement streaming deduplication 2017-02-23 11:25:39 -08:00
functions.py [SPARK-19160][PYTHON][SQL] Add udf decorator 2017-02-15 10:16:34 -08:00
group.py [MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation 2016-07-06 10:45:51 -07:00
readwriter.py [SPARK-18352][SQL] Support parsing multiline json files 2017-02-16 20:51:19 -08:00
session.py [SPARK-19055][SQL][PYSPARK] Fix SparkSession initialization when SparkContext is stopped 2017-01-12 20:53:31 +08:00
streaming.py [SPARK-18352][SQL] Support parsing multiline json files 2017-02-16 20:51:19 -08:00
tests.py [SPARK-18352][SQL] Support parsing multiline json files 2017-02-16 20:51:19 -08:00
types.py [SPARK-13748][PYSPARK][DOC] Add the description for explictly setting None for a named argument for a Row 2017-01-07 12:52:41 +00:00
utils.py [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo 2017-01-04 15:07:29 +00:00
window.py [SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames 2016-12-02 17:39:28 -08:00