spark-instrumented-optimizer

zero323 7de33f56e8 [SPARK-30681][PYSPARK][SQL] Add higher order functions API to PySpark

### What changes were proposed in this pull request?

This PR adds a Python API to `pyspark.sql` for invoking the following higher order functions:

- `transform`
- `exists`
- `forall`
- `filter`
- `aggregate`
- `zip_with`
- `transform_keys`
- `transform_values`
- `map_filter`
- `map_zip_with`

Each of these accepts a plain Python function of one of the following types:

- `(Column) -> Column: ...`
- `(Column, Column) -> Column: ...`
- `(Column, Column, Column) -> Column: ...`

Internally this proposal piggybacks on the objects supporting the Scala implementation ([SPARK-27297](https://issues.apache.org/jira/browse/SPARK-27297)) by:

1. Creating the required `UnresolvedNamedLambdaVariables` and exposing these as PySpark `Columns`.
2. Invoking the Python function with these columns as arguments.
3. Using the result, and the underlying JVM objects from 1., to create an `expressions.LambdaFunction`, which is passed to the desired expression and repacked as a Python `Column`.

### Why are the changes needed?

Currently, higher order functions are available only through the SQL and Scala APIs and can use only SQL expressions:

```python
df.selectExpr("transform(values, x -> x + 1)")
```

This works reasonably well for simple functions, but can get really ugly with complex functions or casts; the resulting objects are somewhat verbose, and we don't get any IDE support. Additionally, the DSL used, though very simple, is not documented.

With the changes proposed here, the above query could be rewritten as:

```python
df.select(transform("values", lambda x: x + 1))
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

- For positive cases, this PR adds doctest strings covering possible usage patterns.
- For negative cases (unsupported function types), this PR adds unit tests.

### Notes

If approved, the same approach can be used in SparkR.

Closes #27406 from zero323/SPARK-30681.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 12:59:39 +09:00
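
The three internal steps described in the commit message (create placeholder lambda variables, call the Python function with them, wrap the result as a lambda expression) can be illustrated with a small pure-Python sketch. This is not PySpark's implementation: `Col`, `make_lambda`, and the `transform` and `zip_with` helpers below are hypothetical stand-ins that only build SQL text, and the placeholder names `x`, `y`, `z` are an assumption made for readability.

```python
import inspect


class Col:
    """Hypothetical stand-in for a PySpark Column; it only records SQL text."""

    def __init__(self, sql):
        self.sql = sql

    def _binary(self, op, other):
        # Render the right-hand side as SQL text, whether it is a Col or a literal.
        rhs = other.sql if isinstance(other, Col) else repr(other)
        return Col(f"({self.sql} {op} {rhs})")

    def __add__(self, other):
        return self._binary("+", other)

    def __mul__(self, other):
        return self._binary("*", other)


def make_lambda(f):
    """Turn a Python function of 1 to 3 Col arguments into 'params -> body' text."""
    arity = len(inspect.signature(f).parameters)
    if not 1 <= arity <= 3:
        raise ValueError("f should take between 1 and 3 arguments")
    # Step 1: create placeholder lambda variables and expose them as columns.
    names = ["x", "y", "z"][:arity]
    variables = [Col(name) for name in names]
    # Step 2: invoke the Python function with the placeholders as arguments.
    body = f(*variables)
    if not isinstance(body, Col):
        raise ValueError("f should return a Col")
    # Step 3: combine the placeholders and the result into a lambda expression.
    params = names[0] if arity == 1 else "(" + ", ".join(names) + ")"
    return f"{params} -> {body.sql}"


def transform(column_name, f):
    # Hypothetical helper mirroring the shape of the proposed API.
    return Col(f"transform({column_name}, {make_lambda(f)})")


def zip_with(left, right, f):
    # Hypothetical helper for a two-argument higher order function.
    return Col(f"zip_with({left}, {right}, {make_lambda(f)})")


print(transform("values", lambda x: x + 1).sql)
print(zip_with("xs", "ys", lambda x, y: x * y).sql)
```

The sketch rejects functions outside the one-to-three-argument range, which corresponds to the "unsupported function types" that the commit message says are covered by unit tests. In PySpark itself the placeholders and the resulting lambda are JVM objects rather than strings, as the commit message describes.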
__init__.py	[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files	2018-11-14 14:51:11 +08:00
test_arrow.py	[SPARK-30777][PYTHON][TESTS] Fix test failures for Pandas >= 1.0.0	2020-02-11 10:03:01 +09:00
test_catalog.py	[SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark	2019-06-24 09:58:17 +09:00
test_column.py	[SPARK-29664][PYTHON][SQL] Column.getItem behavior is not consistent with Scala	2019-11-01 12:25:48 +09:00
test_conf.py	[SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark	2019-06-24 09:58:17 +09:00
test_context.py	[SPARK-30856][SQL][PYSPARK] Fix SQLContext.getOrCreate() when SparkContext is restarted	2020-02-20 12:21:24 +09:00
test_dataframe.py	[SPARK-30791][SQL][PYTHON] Add 'sameSemantics' and 'sementicHash' methods in Dataset	2020-02-18 09:22:26 +08:00
test_datasources.py	[SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark	2019-06-24 09:58:17 +09:00
test_functions.py	[SPARK-30681][PYSPARK][SQL] Add higher order functions API to PySpark	2020-02-28 12:59:39 +09:00
test_group.py	[SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark	2019-06-24 09:58:17 +09:00
test_pandas_cogrouped_map.py	[SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types	2020-01-22 15:32:58 +09:00
test_pandas_grouped_map.py	[SPARK-30777][PYTHON][TESTS] Fix test failures for Pandas >= 1.0.0	2020-02-11 10:03:01 +09:00
test_pandas_map.py	[SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types	2020-01-22 15:32:58 +09:00
test_pandas_udf.py	[SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy	2020-02-18 20:39:50 +08:00
test_pandas_udf_grouped_agg.py	[SPARK-30777][PYTHON][TESTS] Fix test failures for Pandas >= 1.0.0	2020-02-11 10:03:01 +09:00
test_pandas_udf_scalar.py	[SPARK-27870][PYTHON][FOLLOW-UP] Rename spark.sql.pandas.udf.buffer.size to spark.sql.execution.pandas.udf.buffer.size	2020-02-05 11:38:33 +09:00
test_pandas_udf_typehints.py	[SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types	2020-01-22 15:32:58 +09:00
test_pandas_udf_window.py	[SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark	2019-06-24 09:58:17 +09:00
test_readwriter.py	[SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored	2019-07-18 13:37:59 +09:00
test_serde.py	[SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type	2019-09-12 08:52:25 +09:00
test_session.py	[SPARK-30856][SQL][PYSPARK] Fix SQLContext.getOrCreate() when SparkContext is restarted	2020-02-20 12:21:24 +09:00
test_streaming.py	[SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark	2019-06-24 09:58:17 +09:00
test_types.py	[SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy	2020-02-18 20:39:50 +08:00
test_udf.py	[SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types	2020-01-22 15:32:58 +09:00
test_utils.py	[SPARK-19926][PYSPARK] make captured exception from JVM side user friendly	2019-09-18 23:32:10 +09:00