[SPARK-35026][SQL] Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS

### What changes were proposed in this pull request?
PG and Oracle both support use CUBE/ROLLUP/GROUPING SETS in GROUPING SETS's grouping set as a sugar syntax.
![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png)

In this PR, we support it in Spark SQL too

### Why are the changes needed?
Keep consistent with PG and oracle

### Does this PR introduce _any_ user-facing change?
User can write grouping analytics like
```
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
```

### How was this patch tested?
Added Test

Closes #32201 from AngersZhuuuu/SPARK-35026.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This commit is contained in:
Angerszhuuuu 2021-04-22 13:08:22 +00:00 committed by Wenchen Fan
parent 7242d7f774
commit b22d54a58a
6 changed files with 514 additions and 29 deletions

View file

@ -24,8 +24,8 @@ license: |
The `GROUP BY` clause is used to group the rows based on a set of specified grouping expressions and compute aggregations on
the group of rows based on one or more specified aggregate functions. Spark also supports advanced aggregations to do multiple
aggregations for the same input record set via `GROUPING SETS`, `CUBE`, `ROLLUP` clauses.
The grouping expressions and advanced aggregations can be mixed in the `GROUP BY` clause.
See more details in the `Mixed Grouping Analytics` section. When a FILTER clause is attached to
The grouping expressions and advanced aggregations can be mixed in the `GROUP BY` clause and nested in a `GROUPING SETS` clause.
See more details in the `Mixed/Nested Grouping Analytics` section. When a FILTER clause is attached to
an aggregate function, only the matching rows are passed to that function.
### Syntax
@ -95,13 +95,17 @@ aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_ex
(product, warehouse, location), (warehouse), (product), (warehouse, product), ())`.
The N elements of a `CUBE` specification results in 2^N `GROUPING SETS`.
* **Mixed Grouping Analytics**
* **Mixed/Nested Grouping Analytics**
A GROUP BY clause can include multiple `group_expression`s and multiple `CUBE|ROLLUP|GROUPING SETS`s.
A GROUP BY clause can include multiple `group_expression`s and multiple `CUBE|ROLLUP|GROUPING SETS`s.
`GROUPING SETS` can also have nested `CUBE|ROLLUP|GROUPING SETS` clauses, e.g.
`GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))`,
`GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))`.
`CUBE|ROLLUP` is just a syntax sugar for `GROUPING SETS`, please refer to the sections above for
how to translate `CUBE|ROLLUP` to `GROUPING SETS`. `group_expression` can be treated as a single-group
`GROUPING SETS` under this context. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For example,
a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For nested `GROUPING SETS` in the `GROUPING SETS` clause,
we simply take its grouping sets and strip it. For example,
`GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())`
and `GROUP BY warehouse, ROLLUP(product), CUBE(location, size)` is equivalent to
`GROUP BY GROUPING SETS(
@ -114,6 +118,9 @@ aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_ex
(warehouse, size),
(warehouse))`.
`GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))` is equivalent to
`GROUP BY GROUPING SETS((warehouse), (warehouse, product))`.
* **aggregate_name**
Specifies an aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).

View file

@ -606,7 +606,13 @@ groupByClause
;
groupingAnalytics
: (ROLLUP | CUBE | GROUPING SETS) '(' groupingSet (',' groupingSet)* ')'
: (ROLLUP | CUBE) '(' groupingSet (',' groupingSet)* ')'
| GROUPING SETS '(' groupingElement (',' groupingElement)* ')'
;
groupingElement
: groupingAnalytics
| groupingSet
;
groupingSet

View file

@ -993,26 +993,7 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
.map(groupByExpr => {
val groupingAnalytics = groupByExpr.groupingAnalytics
if (groupingAnalytics != null) {
val groupingSets = groupingAnalytics.groupingSet.asScala
.map(_.expression.asScala.map(e => expression(e)).toSeq)
if (groupingAnalytics.CUBE != null) {
// CUBE(A, B, (A, B), ()) is not supported.
if (groupingSets.exists(_.isEmpty)) {
throw new ParseException("Empty set in CUBE grouping sets is not supported.",
groupingAnalytics)
}
Cube(groupingSets.toSeq)
} else if (groupingAnalytics.ROLLUP != null) {
// ROLLUP(A, B, (A, B), ()) is not supported.
if (groupingSets.exists(_.isEmpty)) {
throw new ParseException("Empty set in ROLLUP grouping sets is not supported.",
groupingAnalytics)
}
Rollup(groupingSets.toSeq)
} else {
assert(groupingAnalytics.GROUPING != null && groupingAnalytics.SETS != null)
GroupingSets(groupingSets.toSeq)
}
visitGroupingAnalytics(groupingAnalytics)
} else {
expression(groupByExpr.expression)
}
@ -1021,6 +1002,38 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
}
}
override def visitGroupingAnalytics(
groupingAnalytics: GroupingAnalyticsContext): BaseGroupingSets = {
val groupingSets = groupingAnalytics.groupingSet.asScala
.map(_.expression.asScala.map(e => expression(e)).toSeq)
if (groupingAnalytics.CUBE != null) {
// CUBE(A, B, (A, B), ()) is not supported.
if (groupingSets.exists(_.isEmpty)) {
throw new ParseException("Empty set in CUBE grouping sets is not supported.",
groupingAnalytics)
}
Cube(groupingSets.toSeq)
} else if (groupingAnalytics.ROLLUP != null) {
// ROLLUP(A, B, (A, B), ()) is not supported.
if (groupingSets.exists(_.isEmpty)) {
throw new ParseException("Empty set in ROLLUP grouping sets is not supported.",
groupingAnalytics)
}
Rollup(groupingSets.toSeq)
} else {
assert(groupingAnalytics.GROUPING != null && groupingAnalytics.SETS != null)
val groupingSets = groupingAnalytics.groupingElement.asScala.flatMap { expr =>
val groupingAnalytics = expr.groupingAnalytics()
if (groupingAnalytics != null) {
visitGroupingAnalytics(groupingAnalytics).selectedGroupByExprs
} else {
Seq(expr.groupingSet().expression().asScala.map(e => expression(e)).toSeq)
}
}
GroupingSets(groupingSets.toSeq)
}
}
/**
* Add [[UnresolvedHint]]s to a logical plan.
*/

View file

@ -80,3 +80,14 @@ SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, CUBE(a, b), GROUPING SETS((a, b), (a), ());
SELECT a, b, count(1) FROM testData GROUP BY a, CUBE(a, b), ROLLUP(a, b), GROUPING SETS((a, b), (a), ());
-- Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), GROUPING SETS(ROLLUP(a, b)));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b, a, b), (a, b, a), (a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b, a, b), (a, b, a), (a, b)));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b), CUBE(a, b));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()), GROUPING SETS((a, b), (a), (b), ()));
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), (), (a, b), (a), (b), ());

View file

@ -1,5 +1,5 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 44
-- Number of queries: 52
-- !query
@ -1067,3 +1067,227 @@ struct<a:int,b:int,count(1):bigint>
3 NULL 2
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 2 1
1 NULL 2
1 NULL 2
2 1 1
2 2 1
2 NULL 2
2 NULL 2
3 1 1
3 2 1
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 2 1
1 NULL 2
1 NULL 2
2 1 1
2 2 1
2 NULL 2
2 NULL 2
3 1 1
3 2 1
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), GROUPING SETS(ROLLUP(a, b)))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 2 1
1 2 1
1 NULL 2
1 NULL 2
2 1 1
2 1 1
2 2 1
2 2 1
2 NULL 2
2 NULL 2
3 1 1
3 1 1
3 2 1
3 2 1
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b, a, b), (a, b, a), (a, b))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b, a, b), (a, b, a), (a, b)))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b), CUBE(a, b))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
1 NULL 2
1 NULL 2
1 NULL 2
1 NULL 2
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
2 NULL 2
2 NULL 2
2 NULL 2
2 NULL 2
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
3 NULL 2
3 NULL 2
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()), GROUPING SETS((a, b), (a), (b), ()))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
1 NULL 2
1 NULL 2
1 NULL 2
1 NULL 2
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
2 NULL 2
2 NULL 2
2 NULL 2
2 NULL 2
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
3 NULL 2
3 NULL 2
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), (), (a, b), (a), (b), ())
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
1 NULL 2
1 NULL 2
1 NULL 2
1 NULL 2
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
2 NULL 2
2 NULL 2
2 NULL 2
2 NULL 2
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
3 NULL 2
3 NULL 2
3 NULL 2
3 NULL 2

View file

@ -1,5 +1,5 @@
-- Automatically generated by SQLQueryTestSuite
-- Number of queries: 44
-- Number of queries: 52
-- !query
@ -1087,3 +1087,227 @@ struct<a:int,b:int,count(1):bigint>
3 NULL 2
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 2 1
1 NULL 2
1 NULL 2
2 1 1
2 2 1
2 NULL 2
2 NULL 2
3 1 1
3 2 1
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 2 1
1 NULL 2
1 NULL 2
2 1 1
2 2 1
2 NULL 2
2 NULL 2
3 1 1
3 2 1
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), GROUPING SETS(ROLLUP(a, b)))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 2 1
1 2 1
1 NULL 2
1 NULL 2
2 1 1
2 1 1
2 2 1
2 2 1
2 NULL 2
2 NULL 2
3 1 1
3 1 1
3 2 1
3 2 1
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b, a, b), (a, b, a), (a, b))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b, a, b), (a, b, a), (a, b)))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b), CUBE(a, b))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
1 NULL 2
1 NULL 2
1 NULL 2
1 NULL 2
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
2 NULL 2
2 NULL 2
2 NULL 2
2 NULL 2
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
3 NULL 2
3 NULL 2
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()), GROUPING SETS((a, b), (a), (b), ()))
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
1 NULL 2
1 NULL 2
1 NULL 2
1 NULL 2
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
2 NULL 2
2 NULL 2
2 NULL 2
2 NULL 2
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
3 NULL 2
3 NULL 2
3 NULL 2
3 NULL 2
-- !query
SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), (), (a, b), (a), (b), ())
-- !query schema
struct<a:int,b:int,count(1):bigint>
-- !query output
1 1 1
1 1 1
1 1 1
1 2 1
1 2 1
1 2 1
1 NULL 2
1 NULL 2
1 NULL 2
1 NULL 2
2 1 1
2 1 1
2 1 1
2 2 1
2 2 1
2 2 1
2 NULL 2
2 NULL 2
2 NULL 2
2 NULL 2
3 1 1
3 1 1
3 1 1
3 2 1
3 2 1
3 2 1
3 NULL 2
3 NULL 2
3 NULL 2
3 NULL 2