spark-instrumented-optimizer/python/pyspark/sql
Dilip Biswal 10f1f19659 [SPARK-21274][SQL] Implement EXCEPT ALL clause.
## What changes were proposed in this pull request?
Implements the EXCEPT ALL clause through query rewrites that use existing Spark operators. This PR adds an internal UDTF (replicate_rows) to help preserve duplicate rows. See the [design document](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE) for details.

**Note** The proposed UDTF is kept as an internal function used purely to support this particular rewrite, which gives us the flexibility to change to a more generalized UDTF in the future.

Input Query
```SQL
SELECT c1 FROM ut1 EXCEPT ALL SELECT c1 FROM ut2
```
Rewritten Query
```SQL
SELECT c1
  FROM (
    SELECT replicate_rows(sum_val, c1)
      FROM (
        SELECT c1, sum_val
          FROM (
            SELECT c1, sum(vcol) AS sum_val
              FROM (
                SELECT 1L AS vcol, c1 FROM ut1
                UNION ALL
                SELECT -1L AS vcol, c1 FROM ut2
              ) AS union_all
             GROUP BY union_all.c1
          )
         WHERE sum_val > 0
      )
  )
```
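
The rewrite above can be traced step by step outside of Spark. The following is a minimal pure-Python sketch of the same idea, not the Spark implementation: the function name `except_all` and the sample data are hypothetical, and plain lists stand in for the tables `ut1` and `ut2`. Rows from the left input are tagged with `+1`, rows from the right input with `-1`, the tags are summed per distinct row, and each row with a positive sum is emitted that many times (the role played by `replicate_rows`).

```python
from collections import defaultdict


def except_all(left, right):
    """Multiset difference of two row lists, mirroring the SQL rewrite.

    Hypothetical helper for illustration only; rows must be hashable.
    """
    # UNION ALL step: tag left rows with +1 and right rows with -1 (vcol).
    tagged = [(1, row) for row in left] + [(-1, row) for row in right]

    # GROUP BY step: sum the tags per distinct row (sum_val).
    sum_val = defaultdict(int)
    for vcol, row in tagged:
        sum_val[row] += vcol

    # WHERE sum_val > 0, then replicate each surviving row sum_val times.
    result = []
    for row, count in sum_val.items():
        if count > 0:
            result.extend([row] * count)
    return result


ut1 = [1, 1, 1, 2, 3]   # stand-in for SELECT c1 FROM ut1
ut2 = [1, 2, 2, 4]      # stand-in for SELECT c1 FROM ut2
print(sorted(except_all(ut1, ut2)))  # [1, 1, 3]
```

Value `1` appears three times on the left and once on the right, so two copies survive. Value `2` has a negative sum and is dropped. Rows that appear only on the right (`4`) never reach the output.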

## How was this patch tested?
Added test cases in SQLQueryTestSuite, DataFrameSuite and SetOperationSuite

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #21857 from dilipbiswal/dkb_except_all_final.
2018-07-27 13:47:33 -07:00
__init__.py [SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark 2017-11-02 15:22:52 +01:00
catalog.py [SPARK-23522][PYTHON] always use sys.exit over builtin exit 2018-03-08 20:38:34 +09:00
column.py [SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark 2018-04-08 12:09:06 +08:00
conf.py [SPARK-24761][SQL] Adding of isModifiable() to RuntimeConfig 2018-07-11 17:38:43 -07:00
context.py [SPARK-24665][PYSPARK] Use SQLConf in PySpark to manage all sql configs 2018-07-02 14:35:37 +08:00
dataframe.py [SPARK-21274][SQL] Implement EXCEPT ALL clause. 2018-07-27 13:47:33 -07:00
functions.py [SPARK-23928][SQL] Add shuffle collection function. 2018-07-27 23:02:48 +09:00
group.py [SPARK-24392][PYTHON] Label pandas_udf as Experimental 2018-05-28 12:56:05 +08:00
readwriter.py [SPARK-19018][SQL] Add support for custom encoding on csv writer 2018-07-25 14:17:20 +08:00
session.py [SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf when creating pysp… 2018-06-14 13:16:20 -07:00
streaming.py [SPARK-24565][SS] Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame 2018-06-19 13:56:51 -07:00
tests.py [PYSPARK][TEST][MINOR] Fix UDFInitializationTests 2018-07-20 19:48:32 -07:00
types.py [SPARK-24057][PYTHON] put the real data type in the AssertionError message 2018-04-26 14:21:22 -07:00
udf.py [SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor 2018-06-11 10:15:42 +08:00
utils.py [SPARK-24565][SS] Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame 2018-06-19 13:56:51 -07:00
window.py [SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause 2018-04-07 00:15:54 +08:00