spark-instrumented-optimizer/python/pyspark/sql/tests
Bryan Cutler 5ad1053f3e [SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs skip empty partitions
## What changes were proposed in this pull request?

When running FlatMapGroupsInPandasExec or AggregateInPandasExec the shuffle uses a default number of partitions of 200 in "spark.sql.shuffle.partitions". If the data is small, e.g. in testing, many of the partitions will be empty but are treated just the same.

This PR checks the `mapPartitionsInternal` iterator to be non-empty before calling `ArrowPythonRunner` to start computation on the iterator.

## How was this patch tested?

Existing tests. Ran the following benchmarks a simple example where most partitions are empty:

```python
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import *

df = spark.createDataFrame(
     [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
     ("id", "v"))

pandas_udf("id long, v double", PandasUDFType.GROUPED_MAP)
def normalize(pdf):
    v = pdf.v
    return pdf.assign(v=(v - v.mean()) / v.std())

df.groupby("id").apply(normalize).count()
```

**Before**
```
In [4]: %timeit df.groupby("id").apply(normalize).count()
1.58 s ± 62.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [5]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 29.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: %timeit df.groupby("id").apply(normalize).count()
1.52 s ± 37.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

**After this Change**
```
In [2]: %timeit df.groupby("id").apply(normalize).count()
646 ms ± 89.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [3]: %timeit df.groupby("id").apply(normalize).count()
408 ms ± 84.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [4]: %timeit df.groupby("id").apply(normalize).count()
381 ms ± 29.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Closes #24926 from BryanCutler/pyspark-pandas_udf-map-agg-skip-empty-parts-SPARK-28128.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-06-22 11:20:35 +09:00
..
__init__.py [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files 2018-11-14 14:51:11 +08:00
test_appsubmit.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00
test_arrow.py [SPARK-28041][PYTHON] Increase minimum supported Pandas to 0.23.2 2019-06-18 09:10:58 +09:00
test_catalog.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00
test_column.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00
test_conf.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00
test_context.py [SPARK-26676][PYTHON] Make HiveContextSQLTests.test_unbounded_frames test compatible with Python 2 and PyPy 2019-01-21 14:27:17 -08:00
test_dataframe.py [SPARK-27834][SQL][R][PYTHON] Make separate PySpark/SparkR vectorization configurations 2019-06-03 10:01:37 +09:00
test_datasources.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00
test_functions.py [SPARK-27711][CORE] Unset InputFileBlockHolder at the end of tasks 2019-05-22 18:35:50 -07:00
test_group.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00
test_pandas_udf.py [SPARK-27276][PYTHON][SQL] Increase minimum version of pyarrow to 0.12.1 and remove prior workarounds 2019-04-22 19:30:31 +09:00
test_pandas_udf_grouped_agg.py [SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs skip empty partitions 2019-06-22 11:20:35 +09:00
test_pandas_udf_grouped_map.py [SPARK-28128][PYTHON][SQL] Pandas Grouped UDFs skip empty partitions 2019-06-22 11:20:35 +09:00
test_pandas_udf_scalar.py [SPARK-26412][PYSPARK][SQL] Allow Pandas UDF to take an iterator of pd.Series or an iterator of tuple of pd.Series 2019-06-15 08:29:20 -07:00
test_pandas_udf_window.py [SPARK-27387][PYTHON][TESTS] Replace sqlutils.assertPandasEqual with Pandas assert_frame_equals 2019-04-10 07:50:25 +09:00
test_readwriter.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00
test_serde.py [SPARK-27629][PYSPARK] Prevent Unpickler from intervening each unpickling 2019-05-04 13:21:08 +09:00
test_session.py [SPARK-27101][PYTHON] Drop the created database after the test in test_session 2019-03-09 09:12:33 +09:00
test_streaming.py [SPARK-23014][SS] Fully remove V1 memory sink. 2019-04-29 09:44:23 -07:00
test_types.py [SPARK-23299][SQL][PYSPARK] Fix __repr__ behaviour for Rows 2019-05-06 10:00:49 -07:00
test_udf.py [MINOR][PYTHON] Remove explain(True) in test_udf.py 2019-05-21 23:39:31 +09:00
test_utils.py [SPARK-26036][PYTHON] Break large tests.py files into smaller files 2018-11-15 12:30:52 +08:00