spark-instrumented-optimizer/python/pyspark/sql/tests
HyukjinKwon 00d06cad56 [SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs
### What changes were proposed in this pull request?

This is another approach to fixing the issue. See the previous attempt at https://github.com/apache/spark/pull/28745. It was too invasive, so I took a more conservative approach.

This PR proposes to resolve the grouping attributes separately first, so that they can be referenced without ambiguity when `FlatMapGroupsInPandas` and `FlatMapCoGroupsInPandas` are resolved.

Previously,

```python
from pyspark.sql.functions import *
df = spark.createDataFrame([[1, 1]], ["column", "Score"])
@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()
```

failed as below:

```
pyspark.sql.utils.AnalysisException: "Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;"
```
because the unresolved `COLUMN` in `FlatMapGroupsInPandas` matches more than one attribute in the child projection (the prepended grouping key and the original column), so the analyzer cannot tell which reference to take.

After this fix, the child projection is resolved first together with the grouping keys, and the attribute passed to `FlatMapGroupsInPandas` as the grouping key is selected from the child projection by position rather than by name.
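
The difference between name-based and positional selection can be illustrated with a small pure-Python toy model. This is only a sketch under stated assumptions, not Spark's analyzer code: `Attribute`, `resolve_by_name`, and `project_with_grouping` are hypothetical names invented for this illustration, and the lookup is assumed to be case-insensitive (as in Spark's default configuration).

```python
from typing import List, NamedTuple


class Attribute(NamedTuple):
    """Toy stand-in for a resolved attribute: a name plus an expression id."""
    name: str
    expr_id: int


def resolve_by_name(name: str, attrs: List[Attribute]) -> Attribute:
    """Case-insensitive lookup that fails unless exactly one attribute matches."""
    matches = [a for a in attrs if a.name.lower() == name.lower()]
    if not matches:
        raise ValueError(f"cannot resolve '{name}'")
    if len(matches) > 1:
        candidates = ", ".join(a.name for a in matches)
        raise ValueError(f"Reference '{name}' is ambiguous, could be: {candidates}.")
    return matches[0]


def project_with_grouping(grouping: List[str],
                          child_output: List[Attribute]) -> List[Attribute]:
    """Resolve the grouping names against the child, then prepend them to its output."""
    resolved = [resolve_by_name(g, child_output) for g in grouping]
    # The key keeps the name the user typed but shares the id of the real column,
    # mirroring `Project [COLUMN#9L, column#9L, value#10L]` in the plan.
    keys = [Attribute(g, r.expr_id) for g, r in zip(grouping, resolved)]
    return keys + child_output


child_output = [Attribute("column", 9), Attribute("value", 10)]
grouping = ["COLUMN"]
projection = project_with_grouping(grouping, child_output)

# Old behaviour (toy version): looking the key up by name in the projection
# finds both the prepended key and the original column.
try:
    resolve_by_name("COLUMN", projection)
    by_name_error = None
except ValueError as exc:
    by_name_error = str(exc)

# New behaviour (toy version): the first len(grouping) attributes of the
# projection are, by construction, the grouping keys.
grouping_keys = projection[:len(grouping)]
```

In this model the positional slice never consults names at all, so the case of the user-supplied grouping column cannot introduce ambiguity.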

### Why are the changes needed?

To resolve grouping keys correctly.

### Does this PR introduce _any_ user-facing change?

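Yes. The following grouped and cogrouped pandas UDF calls now work when the grouping column is given in a different case: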

```python
from pyspark.sql.functions import *
df = spark.createDataFrame([[1, 1]], ["column", "Score"])
@pandas_udf("column integer, Score float", PandasUDFType.GROUPED_MAP)
def my_pandas_udf(pdf):
    return pdf.assign(Score=0.5)

df.groupby('COLUMN').apply(my_pandas_udf).show()
```

```python
df1 = spark.createDataFrame([(1, 1)], ("column", "value"))
df2 = spark.createDataFrame([(1, 1)], ("column", "value"))

df1.groupby("COLUMN").cogroup(
    df2.groupby("COLUMN")
).applyInPandas(lambda r, l: r + l, df1.schema).show()
```

Before:

```
pyspark.sql.utils.AnalysisException: Reference 'COLUMN' is ambiguous, could be: COLUMN, COLUMN.;
```

```
pyspark.sql.utils.AnalysisException: cannot resolve '`COLUMN`' given input columns: [COLUMN, COLUMN, value, value];;
'FlatMapCoGroupsInPandas ['COLUMN], ['COLUMN], <lambda>(column#9L, value#10L, column#13L, value#14L), [column#22L, value#23L]
:- Project [COLUMN#9L, column#9L, value#10L]
:  +- LogicalRDD [column#9L, value#10L], false
+- Project [COLUMN#13L, column#13L, value#14L]
   +- LogicalRDD [column#13L, value#14L], false
```

After:

```
+------+-----+
|column|Score|
+------+-----+
|     1|  0.5|
+------+-----+
```

```
+------+-----+
|column|value|
+------+-----+
|     2|    2|
+------+-----+
```

### How was this patch tested?

Unit tests were added, and the change was also tested manually.

Closes #28777 from HyukjinKwon/SPARK-31915-another.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-06-10 15:54:07 -07:00
..
__init__.py [SPARK-26032][PYTHON] Break large sql/tests.py files into smaller files 2018-11-14 14:51:11 +08:00
test_arrow.py [SPARK-25351][PYTHON][TEST][FOLLOWUP] Fix test assertions to be consistent 2020-05-28 10:27:15 +09:00
test_catalog.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_column.py [SPARK-29664][PYTHON][SQL][FOLLOW-UP] Add deprecation warnings for getItem instead 2020-04-27 14:49:22 +09:00
test_conf.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_context.py [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions 2020-05-07 14:40:26 +09:00
test_dataframe.py [SPARK-31763][PYSPARK] Add inputFiles method in PySpark DataFrame Class 2020-05-28 09:52:08 +09:00
test_datasources.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_functions.py [SPARK-30569][SQL][PYSPARK][SPARKR] Add percentile_approx DSL functions 2020-03-17 10:44:21 +09:00
test_group.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_pandas_cogrouped_map.py [SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs 2020-06-10 15:54:07 -07:00
test_pandas_grouped_map.py [SPARK-31915][SQL][PYTHON] Resolve the grouping column properly per the case sensitivity in grouped and cogrouped pandas UDFs 2020-06-10 15:54:07 -07:00
test_pandas_map.py [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types 2020-01-22 15:32:58 +09:00
test_pandas_udf.py [SPARK-31849][PYTHON][SQL] Make PySpark SQL exceptions more Pythonic 2020-06-01 09:45:21 +09:00
test_pandas_udf_grouped_agg.py [SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown through Aggregate 2020-04-06 09:36:20 +09:00
test_pandas_udf_scalar.py [SPARK-25351][PYTHON][TEST][FOLLOWUP] Fix test assertions to be consistent 2020-05-28 10:27:15 +09:00
test_pandas_udf_typehints.py [SPARK-31287][PYTHON][SQL] Ignore type hints in groupby.(cogroup.)applyInPandas and mapInPandas 2020-03-29 13:59:18 +09:00
test_pandas_udf_window.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_readwriter.py [SPARK-28411][PYTHON][SQL] InsertInto with overwrite is not honored 2019-07-18 13:37:59 +09:00
test_serde.py [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type 2019-09-12 08:52:25 +09:00
test_session.py [SPARK-30856][SQL][PYSPARK] Fix SQLContext.getOrCreate() when SparkContext is restarted 2020-02-20 12:21:24 +09:00
test_streaming.py [SPARK-28130][PYTHON] Print pretty messages for skipped tests when xmlrunner is available in PySpark 2019-06-24 09:58:17 +09:00
test_types.py [SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy 2020-02-18 20:39:50 +08:00
test_udf.py [SPARK-31945][SQL][PYSPARK] Enable cache for the same Python function 2020-06-10 16:38:59 +09:00
test_utils.py [SPARK-19926][PYSPARK] make captured exception from JVM side user friendly 2019-09-18 23:32:10 +09:00