spark-instrumented-optimizer/sql/catalyst
allisonwang-db d0c97d6ed9 [SPARK-36747][SQL][3.2] Do not collapse Project with Aggregate when correlated subqueries are present in the project list
### What changes were proposed in this pull request?

This PR adds a check in the optimizer rule `CollapseProject` to avoid combining Project with Aggregate when the project list contains one or more correlated scalar subqueries that reference the output of the aggregate. Combining Project with Aggregate can lead to an invalid plan after correlated subquery rewrite. This is because correlated scalar subqueries' references are used as join conditions, which cannot host aggregate expressions.

For example
```sql
select (select sum(c2) from t where c1 = cast(s as int)) from (select sum(c2) s from t)
```

```
== Optimized Logical Plan ==
Aggregate [sum(c2)#10L AS scalarsubquery(s)#11L] <--- Aggregate has neither grouping nor aggregate expressions.
+- Project [sum(c2)#10L]
   +- Join LeftOuter, (c1#2 = cast(sum(c2#3) as int))  <--- Aggregate expression in join condition
      :- LocalRelation [c2#3]
      +- Aggregate [c1#2], [sum(c2#3) AS sum(c2)#10L, c1#2]
         +- LocalRelation [c1#2, c2#3]

java.lang.UnsupportedOperationException: Cannot generate code for expression: sum(input[0, int, false])
```
Currently, we only allow a correlated scalar subquery in Aggregate if it is also in the grouping expressions.
079a9c5292/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala (L661-L666)

### Why are the changes needed?

To fix an existing optimizer issue.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Authored-by: allisonwang-db <allison.wangdatabricks.com>
Signed-off-by: Wenchen Fan <wenchendatabricks.com>
(cherry picked from commit 4a8dc5f7a3)
Signed-off-by: allisonwang-db <allison.wangdatabricks.com>

Closes #34081 from allisonwang-db/cp-spark-36747.

Authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-24 16:14:49 +08:00
..
benchmarks [SPARK-34950][TESTS] Update benchmark results to the ones created by GitHub Actions machines 2021-04-03 23:02:56 +03:00
src [SPARK-36747][SQL][3.2] Do not collapse Project with Aggregate when correlated subqueries are present in the project list 2021-09-24 16:14:49 +08:00
pom.xml Preparing development version 3.2.1-SNAPSHOT 2021-09-23 08:46:28 +00:00