spark-instrumented-optimizer

History

Chris Martin 05988b256e [SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs ### What changes were proposed in this pull request? Adds a new cogroup Pandas UDF. This allows two grouped dataframes to be cogrouped together and apply a (pandas.DataFrame, pandas.DataFrame) -> pandas.DataFrame UDF to each cogroup. Example usage ``` from pyspark.sql.functions import pandas_udf, PandasUDFType df1 = spark.createDataFrame( [(20000101, 1, 1.0), (20000101, 2, 2.0), (20000102, 1, 3.0), (20000102, 2, 4.0)], ("time", "id", "v1")) df2 = spark.createDataFrame( [(20000101, 1, "x"), (20000101, 2, "y")], ("time", "id", "v2")) pandas_udf("time int, id int, v1 double, v2 string", PandasUDFType.COGROUPED_MAP) def asof_join(l, r): return pd.merge_asof(l, r, on="time", by="id") df1.groupby("id").cogroup(df2.groupby("id")).apply(asof_join).show() ``` +--------+---+---+---+ \| time\| id\| v1\| v2\| +--------+---+---+---+ \|20000101\| 1\|1.0\| x\| \|20000102\| 1\|3.0\| x\| \|20000101\| 2\|2.0\| y\| \|20000102\| 2\|4.0\| y\| +--------+---+---+---+ ### How was this patch tested? Added unit test test_pandas_udf_cogrouped_map Closes #24981 from d80tb7/SPARK-27463-poc-arrow-stream. Authored-by: Chris Martin <chris@cmartinit.co.uk> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>		2019-09-17 17:13:50 -07:00
..
ml	[SPARK-22797][ML][PYTHON] Bucketizer support multi-column	2019-09-17 11:52:20 +08:00
mllib	[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3	2019-09-09 10:19:40 -05:00
sql	[SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs	2019-09-17 17:13:50 -07:00
streaming	[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3	2019-09-09 10:19:40 -05:00
testing	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
tests	[SPARK-27371][CORE] Support GPU-aware resources scheduling in Standalone	2019-08-09 07:49:03 -05:00
__init__.py	[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3	2019-09-09 10:19:40 -05:00
_globals.py	[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary	2018-02-09 14:21:10 +08:00
accumulators.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
broadcast.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
cloudpickle.py	[SPARK-27000][PYTHON] Upgrades cloudpickle to v0.8.0	2019-02-28 02:33:10 +09:00
conf.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
context.py	[MINOR][DOCS] Fix few typos in the java docs	2019-09-12 09:30:03 +09:00
daemon.py	[SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon	2019-07-31 09:10:24 +09:00
files.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
find_spark_home.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
heapq3.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
java_gateway.py	[SPARK-27870][SQL][PYTHON] Add a runtime buffer size configuration for Pandas UDFs	2019-06-15 20:56:22 +09:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
rdd.py	[SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs	2019-09-17 17:13:50 -07:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resourceinformation.py	[SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources	2019-07-11 09:32:58 +09:00
resultiterable.py	[SPARK-3074] [PySpark] support groupByKey() with single huge key	2015-04-09 17:07:23 -07:00
serializers.py	[SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs	2019-09-17 17:13:50 -07:00
shell.py	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4	2018-09-13 11:19:43 +08:00
shuffle.py	[SPARK-25696] The storage memory displayed on spark Application UI is…	2018-12-10 18:27:01 -06:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3	2018-11-07 22:48:50 -06:00
taskcontext.py	[SPARK-28855][CORE][ML][SQL][STREAMING] Remove outdated usages of Experimental, Evolving annotations	2019-09-01 10:15:00 -05:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs	2019-03-11 10:15:07 +09:00
version.py	[SPARK-25592] Setting version to 3.0.0-SNAPSHOT	2018-10-02 08:48:24 -07:00
worker.py	[SPARK-27463][PYTHON] Support Dataframe Cogroup via Pandas UDFs	2019-09-17 17:13:50 -07:00