04a19963e3
### What changes were proposed in this pull request?
Fix `DataFrameGroupBy.apply` without shortcut.
Pandas' `DataFrameGroupBy.apply` behaves inconsistently when the UDF returns a `Series`: the shape of the result depends on whether there is only one group or more than one. E.g.:
```py
>>> pdf = pd.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> pdf.groupby('b').apply(lambda x: x['a'])
b
1  0    1
   1    2
2  2    3
3  3    4
5  4    5
8  5    6
Name: a, dtype: int64
>>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
a  0  1
b
1  1  2
```
If there is only one group, pandas returns a "wide" `DataFrame` instead of a `Series`.
In our non-shortcut path the UDF always sees exactly one group, because it runs inside `groupby-applyInPandas`. The result is therefore a `DataFrame`, and we have to convert it to a `Series` ourselves.
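For illustration only, here is a minimal pandas sketch of one way such a conversion could look. It is not necessarily what this patch does. The helper name `wide_to_series` is hypothetical, and the wide frame is built by hand to mirror the single-group output shown above rather than produced by `groupby.apply`.

```py
import pandas as pd


def wide_to_series(wide: pd.DataFrame) -> pd.Series:
    """Hypothetical helper: turn a one-row-per-group "wide" frame into a long Series."""
    # The wide frame keeps the original Series name as the name of its columns axis.
    name = wide.columns.name
    # stack() moves the column labels into a new innermost index level.
    stacked = wide.stack()
    # That new level is named after the columns axis; the multi-group pandas
    # result leaves it unnamed, so reset the innermost level name to None.
    names = list(stacked.index.names)
    names[-1] = None
    stacked.index = stacked.index.set_names(names)
    stacked.name = name
    return stacked


# Hand-built equivalent of the single-group "wide" result shown above.
wide = pd.DataFrame(
    [[1, 2]],
    index=pd.Index([1], name="b"),
    columns=pd.Index([0, 1], name="a"),
)
series = wide_to_series(wide)
```

The resulting `series` has a two-level index (the group key `b`, then the original row label) and is named `a`, which matches the layout pandas produces when there are several groups.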
### Why are the changes needed?
`DataFrameGroupBy.apply` without shortcut could raise an exception when the UDF returns a `Series`.
```py
>>> ps.options.compute.shortcut_limit = 3
>>> psdf = ps.DataFrame(
... {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
... columns=["a", "b", "c"],
... )
>>> psdf.groupby("b").apply(lambda x: x["a"])
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
```
### Does this PR introduce _any_ user-facing change?
The error above will be gone:
```py
>>> psdf.groupby("b").apply(lambda x: x["a"])
b
1  0    1
   1    2
2  2    3
3  3    4
5  4    5
8  5    6
Name: a, dtype: int64
```
### How was this patch tested?
Added tests.
Closes #34160 from ueshin/issues/SPARK-36907/groupby-apply.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit