spark-instrumented-optimizer

Li Jin 86100df54b [SPARK-24561][SQL][PYTHON] User-defined window aggregation functions with Pandas UDF (bounded window)

## What changes were proposed in this pull request?

This PR implements a new feature - window aggregation Pandas UDF for bounded window.

#### Doc:

https://docs.google.com/document/d/14EjeY5z4-NC27-SmIP9CsMPCANeTcvxN44a7SIJtZPc/edit#heading=h.c87w44wcj3wj

#### Example:

```
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.window import Window

df = spark.range(0, 10, 2).toDF('v')
w1 = Window.partitionBy().orderBy('v').rangeBetween(-2, 4)
w2 = Window.partitionBy().orderBy('v').rowsBetween(-2, 2)

@pandas_udf('double', PandasUDFType.GROUPED_AGG)
def avg(v):
    return v.mean()

df.withColumn('v_mean', avg(df['v']).over(w1)).show()
# +---+------+
# |  v|v_mean|
# +---+------+
# |  0|   1.0|
# |  2|   2.0|
# |  4|   4.0|
# |  6|   6.0|
# |  8|   7.0|
# +---+------+

df.withColumn('v_mean', avg(df['v']).over(w2)).show()
# +---+------+
# |  v|v_mean|
# +---+------+
# |  0|   2.0|
# |  2|   3.0|
# |  4|   4.0|
# |  6|   5.0|
# |  8|   6.0|
# +---+------+
```

#### High level changes:

This PR modifies the existing `WindowInPandasExec` physical node to deal with bounded (growing, shrinking and sliding) windows.

* `WindowInPandasExec` now shares the same base class as `WindowExec` and shares utility functions. See `WindowExecBase`.
* `WindowFunctionFrame` now has two new functions, `currentLowerBound` and `currentUpperBound`, which return the lower and upper window bounds for the current output row. It is also modified to allow `AggregateProcessor` == null. A null aggregate processor is used for `WindowInPandasExec`, which has no aggregator and only uses the lower and upper bound functions from `WindowFunctionFrame`.
* The biggest change is in `WindowInPandasExec`, which is modified to take `currentLowerBound` and `currentUpperBound` and write those values together with the input data to the Python process for rolling window aggregation. See `WindowInPandasExec` for more details.

#### Discussion

In benchmarking, I found that the numpy variant of the rolling window UDF is much faster than the pandas version:

* Spark SQL window function: 20s
* Pandas variant: ~80s
* Numpy variant: 10s
* Numpy variant with numba: 4s

Allowing a numpy variant of the vectorized UDFs is something I want to discuss because of the performance improvement, but it doesn't have to be in this PR.

## How was this patch tested?

New tests

Closes #22305 from icexelloss/SPARK-24561-bounded-window-udf.

Authored-by: Li Jin <ice.xelloss@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

2018-12-18 09:15:21 +08:00
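
The commit message says that a lower and an upper bound per output row are sent to the Python process together with the input data. As a rough illustration of that idea only, and not Spark's actual implementation, the sketch below computes per-row bounds for a `rowsBetween(-2, 2)` frame in plain Python and aggregates each slice. The helper names `row_frame_bounds` and `windowed_agg` are hypothetical, and the exclusive upper bound is an assumption made for convenient slicing.

```python
from statistics import mean


def row_frame_bounds(n_rows, start, end):
    """Return one (lower, upper) index pair per row for a row-based frame.

    Hypothetical helper: `upper` is exclusive here (an assumption), and both
    bounds are clamped to the partition so edge frames simply shrink.
    """
    bounds = []
    for i in range(n_rows):
        lower = max(0, i + start)
        upper = min(n_rows, i + end + 1)
        bounds.append((lower, upper))
    return bounds


def windowed_agg(values, bounds, func):
    """Apply `func` to the slice of `values` selected by each row's bounds."""
    return [func(values[lower:upper]) for lower, upper in bounds]


values = [0, 2, 4, 6, 8]  # same data as spark.range(0, 10, 2)
bounds = row_frame_bounds(len(values), -2, 2)
v_mean = [float(m) for m in windowed_agg(values, bounds, mean)]
print(v_mean)  # [2.0, 3.0, 4.0, 5.0, 6.0]
```

The printed values agree with the `w2` (`rowsBetween(-2, 2)`) output shown in the commit message. A range-based frame would need bounds derived from the ordering column's values rather than from row positions.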
tests	[SPARK-24561][SQL][PYTHON] User-defined window aggregation functions with Pandas UDF (bounded window)	2018-12-18 09:15:21 +08:00
__init__.py	[SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark	2017-11-02 15:22:52 +01:00
catalog.py	[SPARK-24665][PYSPARK][FOLLOWUP] Use SQLConf in PySpark to manage all sql configs	2018-08-17 10:18:08 +08:00
column.py	[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark	2018-04-08 12:09:06 +08:00
conf.py	[SPARK-23698][PYTHON] Resolve undefined names in Python 3	2018-08-22 10:06:59 -07:00
context.py	[SPARK-25540][SQL][PYSPARK] Make HiveContext in PySpark behave as the same as Scala.	2018-09-27 09:51:20 +08:00
dataframe.py	[SPARK-25274][PYTHON][SQL] In toPandas with Arrow send un-ordered record batches to improve performance	2018-12-06 10:07:28 -08:00
functions.py	[SPARK-24561][SQL][PYTHON] User-defined window aggregation functions with Pandas UDF (bounded window)	2018-12-18 09:15:21 +08:00
group.py	[SPARK-24722][SQL] pivot() with Column type argument	2018-08-04 14:17:32 +08:00
readwriter.py	[SPARK-26303][SQL] Return partial results for bad JSON records	2018-12-11 16:06:57 +08:00
session.py	[SPARK-25255][PYTHON] Add getActiveSession to SparkSession in PySpark	2018-10-26 09:40:13 -07:00
streaming.py	[SPARK-26303][SQL] Return partial results for bad JSON records	2018-12-11 16:06:57 +08:00
types.py	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4	2018-09-13 11:19:43 +08:00
udf.py	[SPARK-25601][PYTHON] Register Grouped aggregate UDF Vectorized UDFs for SQL Statement	2018-10-04 09:36:23 +08:00
utils.py	[SPARK-24721][SQL] Exclude Python UDFs filters in FileSourceStrategy	2018-08-28 10:57:13 +08:00
window.py	[SPARK-25842][SQL] Deprecate rangeBetween APIs introduced in SPARK-21608	2018-10-26 13:17:24 +08:00