spark-instrumented-optimizer/python/pyspark/sql/tests
Takuya UESHIN b8a440f098 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends
### What changes were proposed in this pull request?

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
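The idea of a context-aware iterator can be sketched in plain Python. This is only an illustration under stated assumptions, not the Spark implementation: `FakeTaskContext` and `context_aware` are hypothetical names invented here, and the real `ContextAwareIterator` lives on the JVM side. The sketch shows a wrapper that re-checks the task state before every pull from the parent.

```python
from typing import Iterator, TypeVar

T = TypeVar("T")


class FakeTaskContext:
    """Hypothetical stand-in for a task context; not a PySpark class."""

    def __init__(self) -> None:
        self._completed = False

    def mark_completed(self) -> None:
        self._completed = True

    def is_completed(self) -> bool:
        return self._completed


def context_aware(context: FakeTaskContext, delegate: Iterator[T]) -> Iterator[T]:
    """Yield from `delegate` only while the task is still running."""
    # The task state is checked before every pull from the parent iterator.
    while not context.is_completed():
        try:
            item = next(delegate)
        except StopIteration:
            return
        yield item


ctx = FakeTaskContext()
rows = context_aware(ctx, iter(range(5)))
first_two = [next(rows), next(rows)]  # consumed while the task is running
ctx.mark_completed()                  # the task ends here
remaining = list(rows)                # the wrapper stops; the parent is not touched again
```

Because the check happens on every step, a consumer running in another thread stops at its next pull once the task is marked complete, rather than draining the rest of the parent.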

### Why are the changes needed?

A Python/Pandas UDF placed right after an off-heap vectorized reader could crash the executor.

For example:

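```py
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

# Assumes an active SparkSession `spark` and a writable output location `path`.
spark.range(0, 100000, 1, 1).write.parquet(path)

# Enable off-heap column vectors for the vectorized Parquet reader.
spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

# head() needs only the first row, so the task can end before the input is fully read.
spark.read.parquet(path).select(fUdf('id')).head()
```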

This happens because the Python evaluation consumes the parent iterator in a separate thread, so it can keep consuming data from the parent even after the task ends and the parent is closed. If the parent iterator holds an off-heap column vector, reading it after it has been closed could cause a segmentation fault, which crashes the executor.
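The hazard can be imitated without Spark. The following is a simplified model, not the actual Spark code path: `ClosableSource` and `run` are hypothetical names, and a `RuntimeError` stands in for the segmentation fault that a read of freed off-heap memory could trigger. Events make the interleaving deterministic, so the background consumer always resumes after the source has been closed.

```python
import threading


class ClosableSource:
    """Hypothetical parent iterator whose buffer is freed on close()."""

    def __init__(self, n):
        self._data = list(range(n))
        self._pos = 0
        self.closed = False

    def close(self):
        self.closed = True
        self._data = None  # models freeing the off-heap buffer

    def __iter__(self):
        return self

    def __next__(self):
        if self.closed:
            # In the JVM this would be a read of freed memory.
            raise RuntimeError("read after close")
        if self._pos >= len(self._data):
            raise StopIteration
        value = self._data[self._pos]
        self._pos += 1
        return value


def run(guarded):
    """Consume `source` in a background thread; return the errors it hit."""
    source = ClosableSource(10)
    first_read = threading.Event()
    task_ended = threading.Event()
    resume = threading.Event()
    errors = []

    def consumer():
        try:
            next(source)
            first_read.set()
            resume.wait()  # paused while the task finishes
            if guarded and task_ended.is_set():
                return     # guarded: stop instead of touching the parent
            next(source)   # unguarded: reads from a closed parent
        except RuntimeError as exc:
            errors.append(str(exc))

    worker = threading.Thread(target=consumer)
    worker.start()
    first_read.wait()
    source.close()     # the task ends and the parent is closed
    task_ended.set()
    resume.set()
    worker.join()
    return errors


unguarded_errors = run(guarded=False)
guarded_errors = run(guarded=True)
```

In this model, the unguarded consumer records a read-after-close error, while the guarded consumer checks the task state first and exits cleanly.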

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and also verified manually.

Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-01 20:28:12 +09:00
| File | Last commit | Date |
| --- | --- | --- |
| __init__.py | [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files | 2018-11-14 14:51:11 +08:00 |
| test_arrow.py | [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures | 2020-10-06 18:11:24 +09:00 |
| test_catalog.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_column.py | [SPARK-32511][FOLLOW-UP][SQL][R][PYTHON] Add dropFields to SparkR and PySpark | 2020-10-08 10:37:42 +09:00 |
| test_conf.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_context.py | [SPARK-33021][PYTHON][TESTS] Move functions related test cases into test_functions.py | 2020-09-28 21:54:00 -07:00 |
| test_dataframe.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_datasources.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_functions.py | [SPARK-32084][PYTHON][SQL] Expand dictionary functions | 2020-10-27 11:05:53 +09:00 |
| test_group.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_pandas_cogrouped_map.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_pandas_grouped_map.py | [SPARK-33073][PYTHON] Improve error handling on Pandas to Arrow conversion failures | 2020-10-06 18:11:24 +09:00 |
| test_pandas_map.py | [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends | 2020-11-01 20:28:12 +09:00 |
| test_pandas_udf.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_pandas_udf_grouped_agg.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_pandas_udf_scalar.py | [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends | 2020-11-01 20:28:12 +09:00 |
| test_pandas_udf_typehints.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_pandas_udf_window.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_readwriter.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_serde.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_session.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_streaming.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |
| test_types.py | [SPARK-33268][SQL][PYTHON] Fix bugs for casting data from/to PythonUserDefinedType | 2020-10-28 08:33:02 -07:00 |
| test_udf.py | [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends | 2020-11-01 20:28:12 +09:00 |
| test_utils.py | [SPARK-32714][PYTHON] Initial pyspark-stubs port | 2020-09-24 14:15:36 +09:00 |