spark-instrumented-optimizer


Takuya UESHIN b8a440f098 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends

### What changes were proposed in this pull request?

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.

### Why are the changes needed?

Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

E.g.,:

```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and manually.

Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

2020-11-01 20:28:12 +09:00
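The idea behind the commit above can be illustrated with a small, self-contained sketch: an iterator wrapper that checks whether its task has completed before pulling anything more from the parent. This is not Spark's implementation. `FakeTaskContext` and `ContextAwareIteratorSketch` are hypothetical names invented for this example, and the sketch only models the "stop consuming after the task ends" behaviour described in the commit message.

```python
import threading


class FakeTaskContext:
    """Hypothetical stand-in for a task context (not Spark's TaskContext)."""

    def __init__(self):
        self._done = threading.Event()

    def mark_completed(self):
        # Called when the task ends; safe to call from another thread.
        self._done.set()

    def is_completed(self):
        return self._done.is_set()


class ContextAwareIteratorSketch:
    """Wraps a parent iterator and stops yielding once the task has ended."""

    def __init__(self, context, delegate):
        self._context = context
        self._delegate = iter(delegate)

    def __iter__(self):
        return self

    def __next__(self):
        # Check the task state *before* touching the parent iterator, so a
        # closed parent is never read after the task has completed.
        if self._context.is_completed():
            raise StopIteration
        return next(self._delegate)


ctx = FakeTaskContext()
it = ContextAwareIteratorSketch(ctx, range(5))

first_two = [next(it), next(it)]  # consumed while the task is still running
ctx.mark_completed()              # the task ends
rest = list(it)                   # nothing more is pulled from the parent
```

In this sketch the consumer thread sees the wrapped iterator as exhausted as soon as the task is marked completed, even though the parent still has elements left, which is the property the commit message relies on to avoid reading from a closed parent.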
..
__init__.py	[SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files	2018-11-14 14:51:11 +08:00
test_arrow.py	[SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures	2020-10-06 18:11:24 +09:00
test_catalog.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_column.py	[SPARK-32511][FOLLOW-UP][SQL][R][PYTHON] Add dropFields to SparkR and PySpark	2020-10-08 10:37:42 +09:00
test_conf.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_context.py	[SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py	2020-09-28 21:54:00 -07:00
test_dataframe.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_datasources.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_functions.py	[SPARK-32084][PYTHON][SQL] Expand dictionary functions	2020-10-27 11:05:53 +09:00
test_group.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_pandas_cogrouped_map.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_pandas_grouped_map.py	[SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures	2020-10-06 18:11:24 +09:00
test_pandas_map.py	[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends	2020-11-01 20:28:12 +09:00
test_pandas_udf.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_pandas_udf_grouped_agg.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_pandas_udf_scalar.py	[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends	2020-11-01 20:28:12 +09:00
test_pandas_udf_typehints.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_pandas_udf_window.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_readwriter.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_serde.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_session.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_streaming.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00
test_types.py	[SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType	2020-10-28 08:33:02 -07:00
test_udf.py	[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends	2020-11-01 20:28:12 +09:00
test_utils.py	[SPARK-32714][PYTHON] Initial pyspark-stubs port	2020-09-24 14:15:36 +09:00