spark-instrumented-optimizer

Dilip Biswal 10f1f19659 [SPARK-21274][SQL] Implement EXCEPT ALL clause.	2018-07-27 13:47:33 -07:00

## What changes were proposed in this pull request?

Implements the EXCEPT ALL clause through query rewrites using existing operators in Spark. In this PR, an internal UDTF (replicate_rows) is added to aid in preserving duplicate rows. Please refer to [Link](https://drive.google.com/open?id=1nyW0T0b_ajUduQoPgZLAsyHK8s3_dko3ulQuxaLpUXE) for the design.

Note: the proposed UDTF is kept as an internal function that is used purely to aid this particular rewrite, which leaves the flexibility to change to a more generalized UDTF in the future.

Input Query

```SQL
SELECT c1 FROM ut1 EXCEPT ALL SELECT c1 FROM ut2
```

Rewritten Query

```SQL
SELECT c1
FROM (
  SELECT replicate_rows(sum_val, c1)
  FROM (
    SELECT c1, sum_val
    FROM (
      SELECT c1, sum(vcol) AS sum_val
      FROM (
        SELECT 1L as vcol, c1 FROM ut1
        UNION ALL
        SELECT -1L as vcol, c1 FROM ut2
      ) AS union_all
      GROUP BY union_all.c1
    )
    WHERE sum_val > 0
  )
)
```

## How was this patch tested?

Added test cases in SQLQueryTestSuite, DataFrameSuite and SetOperationSuite.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #21857 from dilipbiswal/dkb_except_all_final.
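
The rewrite described in the commit message above can be illustrated outside Spark. The following is a minimal pure-Python sketch of the same idea, not Spark's implementation: rows from the left input are tagged with +1, rows from the right input with -1, the tags are summed per distinct row, and each row with a positive sum is emitted that many times. The function name `except_all` and the use of plain lists in place of tables are assumptions made for illustration only.

```python
from collections import defaultdict


def except_all(left, right):
    """Emulate EXCEPT ALL on two lists of hashable rows (illustrative only)."""
    # UNION ALL step: tag left rows with vcol = +1 and right rows with vcol = -1.
    tagged = [(1, row) for row in left] + [(-1, row) for row in right]

    # GROUP BY step: sum(vcol) per distinct row.
    sums = defaultdict(int)
    for vcol, row in tagged:
        sums[row] += vcol

    # WHERE sum_val > 0, then the replicate_rows step: repeat each
    # surviving row sum_val times so duplicates are preserved.
    result = []
    for row, sum_val in sums.items():
        if sum_val > 0:
            result.extend([row] * sum_val)
    return result
```

For example, with `left = [1, 1, 2, 3]` and `right = [1, 3, 3]`, one copy of `1` and the single `2` survive, while `3` is removed entirely because it appears more often on the right than on the left.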
__init__.py	[SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark	2017-11-02 15:22:52 +01:00
catalog.py	[SPARK-23522][PYTHON] always use sys.exit over builtin exit	2018-03-08 20:38:34 +09:00
column.py	[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark	2018-04-08 12:09:06 +08:00
conf.py	[SPARK-24761][SQL] Adding of isModifiable() to RuntimeConfig	2018-07-11 17:38:43 -07:00
context.py	[SPARK-24665][PYSPARK] Use SQLConf in PySpark to manage all sql configs	2018-07-02 14:35:37 +08:00
dataframe.py	[SPARK-21274][SQL] Implement EXCEPT ALL clause.	2018-07-27 13:47:33 -07:00
functions.py	[SPARK-23928][SQL] Add shuffle collection function.	2018-07-27 23:02:48 +09:00
group.py	[SPARK-24392][PYTHON] Label pandas_udf as Experimental	2018-05-28 12:56:05 +08:00
readwriter.py	[SPARK-19018][SQL] Add support for custom encoding on csv writer	2018-07-25 14:17:20 +08:00
session.py	[SPARK-24563][PYTHON] Catch TypeError when testing existence of HiveConf when creating pysp…	2018-06-14 13:16:20 -07:00
streaming.py	[SPARK-24565][SS] Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame	2018-06-19 13:56:51 -07:00
tests.py	[PYSPARK][TEST][MINOR] Fix UDFInitializationTests	2018-07-20 19:48:32 -07:00
types.py	[SPARK-24057][PYTHON] put the real data type in the AssertionError message	2018-04-26 14:21:22 -07:00
udf.py	[SPARK-23754][PYTHON][FOLLOWUP] Move UDF stop iteration wrapping from driver to executor	2018-06-11 10:15:42 +08:00
utils.py	[SPARK-24565][SS] Add API for in Structured Streaming for exposing output rows of each microbatch as a DataFrame	2018-06-19 13:56:51 -07:00
window.py	[SPARK-23861][SQL][DOC] Clarify default window frame with and without orderBy clause	2018-04-07 00:15:54 +08:00