spark-instrumented-optimizer

History

Bryan Cutler be08b415da [SPARK-27163][PYTHON] Cleanup and consolidate Pandas UDF functionality ## What changes were proposed in this pull request? This change is a cleanup and consolidation of 3 areas related to Pandas UDFs: 1) `ArrowStreamPandasSerializer` now inherits from `ArrowStreamSerializer` and uses the base class `dump_stream`, `load_stream` to create Arrow reader/writer and send Arrow record batches. `ArrowStreamPandasSerializer` makes the conversions to/from Pandas and converts to Arrow record batch iterators. This change removed duplicated creation of Arrow readers/writers. 2) `createDataFrame` with Arrow now uses `ArrowStreamPandasSerializer` instead of doing its own conversions from Pandas to Arrow and sending record batches through `ArrowStreamSerializer`. 3) Grouped Map UDFs now reuse existing logic in `ArrowStreamPandasSerializer` to send Pandas DataFrame results as a `StructType` instead of separating each column from the DataFrame. This makes the code a little more consistent with the Python worker, but does require that the returned StructType column is flattened out in `FlatMapGroupsInPandasExec` in Scala. ## How was this patch tested? Existing tests and ran tests with pyarrow 0.12.0 Closes #24095 from BryanCutler/arrow-refactor-cleanup-UDFs. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>		2019-03-21 17:44:51 +09:00
..
avro	[SPARK-26856][PYSPARK] Python support for from_avro and to_avro APIs	2019-03-11 10:15:07 +09:00
tests	[SPARK-26979][PYTHON][FOLLOW-UP] Make binary math/string functions take string as columns as well	2019-03-20 08:06:10 +09:00
__init__.py	[SPARK-22369][PYTHON][DOCS] Exposes catalog API documentation in PySpark	2017-11-02 15:22:52 +01:00
catalog.py	[SPARK-24665][PYSPARK][FOLLOWUP] Use SQLConf in PySpark to manage all sql configs	2018-08-17 10:18:08 +08:00
column.py	[SPARK-23847][PYTHON][SQL] Add asc_nulls_first, asc_nulls_last to PySpark	2018-04-08 12:09:06 +08:00
conf.py	[SPARK-23698][PYTHON] Resolve undefined names in Python 3	2018-08-22 10:06:59 -07:00
context.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
dataframe.py	[SPARK-27096][SQL][FOLLOWUP] Do the correct validation of join types in R side and fix join docs for scala, python and r	2019-03-16 13:04:54 +09:00
functions.py	[SPARK-27099][SQL] Add 'xxhash64' for hashing arbitrary columns to Long	2019-03-20 16:34:34 +08:00
group.py	[SPARK-24722][SQL] pivot() with Column type argument	2018-08-04 14:17:32 +08:00
readwriter.py	[SPARK-26016][DOCS] Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8	2019-03-05 08:03:39 +09:00
session.py	[SPARK-27163][PYTHON] Cleanup and consolidate Pandas UDF functionality	2019-03-21 17:44:51 +09:00
streaming.py	[SPARK-26016][DOCS] Clarify that text DataSource read/write, and RDD methods that read text, always use UTF-8	2019-03-05 08:03:39 +09:00
types.py	[SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF	2019-03-07 08:52:24 -08:00
udf.py	[SPARK-23836][PYTHON] Add support for StructType return in Scalar Pandas UDF	2019-03-07 08:52:24 -08:00
utils.py	[SPARK-24721][SQL] Exclude Python UDFs filters in FileSourceStrategy	2018-08-28 10:57:13 +08:00
window.py	[SPARK-26860][PYSPARK][SPARKR] Fix for RangeBetween and RowsBetween docs to be in sync with spark documentation	2019-03-11 08:53:09 -05:00