spark-instrumented-optimizer

HyukjinKwon 3165a95a04 [SPARK-31287][PYTHON][SQL] Ignore type hints in groupby.(cogroup.)applyInPandas and mapInPandas	2020-03-29 13:59:18 +09:00

### What changes were proposed in this pull request?

This PR proposes to make pandas function APIs (`groupby.(cogroup.)applyInPandas` and `mapInPandas`) to ignore Python type hints.

### Why are the changes needed?

Python type hints are optional. It shouldn't affect where pandas UDFs are not used.
This is also a future work for them to support other type hints. We shouldn't at least throw an exception at this moment.

### Does this PR introduce any user-facing change?

No, it's master-only change.

```python
import pandas as pd

def pandas_plus_one(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf + 1

spark.range(10).groupby('id').applyInPandas(pandas_plus_one, schema="id long").show()
```

```python
import pandas as pd

def pandas_plus_one(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    return left + 1

spark.range(10).groupby('id').cogroup(spark.range(10).groupby("id")).applyInPandas(pandas_plus_one, schema="id long").show()
```

```python
from typing import Iterator
import pandas as pd

def pandas_plus_one(iter: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    return map(lambda v: v + 1, iter)

spark.range(10).mapInPandas(pandas_plus_one, schema="id long").show()
```

Before:

Exception

After:

```
+---+
| id|
+---+
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
+---+
```

### How was this patch tested?

Closes #28052 from HyukjinKwon/SPARK-31287.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
..
ml	[SPARK-31243][ML][PYSPARK] Add ANOVATest and FValueTest to PySpark	2020-03-27 14:05:49 +08:00
mllib	[SPARK-30930][ML] Remove ML/MLLIB DeveloperApi annotations	2020-03-16 12:41:22 -05:00
sql	[SPARK-31287][PYTHON][SQL] Ignore type hints in groupby.(cogroup.)applyInPandas and mapInPandas	2020-03-29 13:59:18 +09:00
streaming	[SPARK-28980][CORE][SQL][STREAMING][MLLIB] Remove most items deprecated in Spark 2.2.0 or earlier, for Spark 3	2019-09-09 10:19:40 -05:00
testing	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package	2020-01-09 10:22:50 +09:00
tests	[SPARK-30969][CORE] Remove resource coordination support from Standalone	2020-03-02 11:23:07 -08:00
__init__.py	[SPARK-31088][SQL] Add back HiveContext and createExternalTable	2020-03-26 23:51:15 -07:00
_globals.py	[SPARK-23328][PYTHON] Disallow default value None in na.replace/replace when 'to_replace' is not a dictionary	2018-02-09 14:21:10 +08:00
accumulators.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
broadcast.py	[SPARK-29341][PYTHON] Upgrade cloudpickle to 1.0.0	2019-10-03 19:20:51 +09:00
cloudpickle.py	[SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8	2019-10-22 16:18:34 +09:00
conf.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
context.py	[SPARK-22340][PYTHON][FOLLOW-UP] Add a better message and improve documentation for pinned thread mode	2019-11-21 10:54:01 +09:00
daemon.py	[SPARK-26175][PYTHON] Redirect the standard input of the forked child to devnull in daemon	2019-07-31 09:10:24 +09:00
files.py	[SPARK-28206][PYTHON] Remove the legacy Epydoc in PySpark API documentation	2019-07-05 10:08:22 -07:00
find_spark_home.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
heapq3.py	Fix typos detected by github.com/client9/misspell	2018-08-11 21:23:36 -05:00
java_gateway.py	[SPARK-22340][PYTHON] Add a mode to pin Python thread into JVM's	2019-11-08 06:44:58 +09:00
join.py	[SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…	2016-03-28 14:51:36 -07:00
profiler.py	[SPARK-26640][CORE][ML][SQL][STREAMING][PYSPARK] Code cleanup from lgtm.com analysis	2019-01-17 19:40:39 -06:00
rdd.py	[SPARK-29499][CORE][PYSPARK] Add mapPartitionsWithIndex for RDDBarrier	2019-10-23 13:46:09 +02:00
rddsampler.py	[SPARK-4897] [PySpark] Python 3 support	2015-04-16 16:20:57 -07:00
resourceinformation.py	[SPARK-28234][CORE][PYTHON] Add python and JavaSparkContext support to get resources	2019-07-11 09:32:58 +09:00
resultiterable.py	[SPARK-30205][PYSPARK] Import ABCs from collections.abc to remove deprecation warnings	2019-12-10 11:08:13 -08:00
serializers.py	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package	2020-01-09 10:22:50 +09:00
shell.py	[SPARK-25238][PYTHON] lint-python: Fix W605 warnings for pycodestyle 2.4	2018-09-13 11:19:43 +08:00
shuffle.py	[SPARK-25696] The storage memory displayed on spark Application UI is…	2018-12-10 18:27:01 -06:00
statcounter.py	[SPARK-6919] [PYSPARK] Add asDict method to StatCounter	2015-09-29 13:38:15 -07:00
status.py	[SPARK-4172] [PySpark] Progress API in Python	2015-02-17 13:36:43 -08:00
storagelevel.py	[SPARK-25908][CORE][SQL] Remove old deprecated items in Spark 3	2018-11-07 22:48:50 -06:00
taskcontext.py	[SPARK-30667][CORE] Add all gather method to BarrierTaskContext	2020-02-21 11:40:28 -08:00
traceback_utils.py	[SPARK-1087] Move python traceback utilities into new traceback_utils.py file.	2014-09-15 19:28:17 -07:00
util.py	[SPARK-22340][PYTHON][FOLLOW-UP] Add a better message and improve documentation for pinned thread mode	2019-11-21 10:54:01 +09:00
version.py	[SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT	2020-02-25 19:44:31 -08:00
worker.py	[SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy	2020-02-18 20:39:50 +08:00