[SPARK-35026][SQL] Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS

### What changes were proposed in this pull request? PG and Oracle both support use CUBE/ROLLUP/GROUPING SETS in GROUPING SETS's grouping set as a sugar syntax. ![image](https://user-images.githubusercontent.com/46485123/114975588-139a1180-9eb7-11eb-8f53-498c1db934e0.png) In this PR, we support it in Spark SQL too ### Why are the changes needed? Keep consistent with PG and oracle ### Does this PR introduce _any_ user-facing change? User can write grouping analytics like ``` SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b)); SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ()); SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ())); ``` ### How was this patch tested? Added Test Closes #32201 from AngersZhuuuu/SPARK-35026. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-22 13:08:22 +00:00 · 2021-04-22 13:08:22 +00:00 · b22d54a58a
parent 7242d7f774
commit b22d54a58a
6 changed files with 514 additions and 29 deletions
--- a/docs/sql-ref-syntax-qry-select-groupby.md
+++ b/docs/sql-ref-syntax-qry-select-groupby.md
@ -24,8 +24,8 @@ license: |
 The `GROUP BY` clause is used to group the rows based on a set of specified grouping expressions and compute aggregations on
 the group of rows based on one or more specified aggregate functions. Spark also supports advanced aggregations to do multiple
 aggregations for the same input record set via `GROUPING SETS`, `CUBE`, `ROLLUP` clauses.
-The grouping expressions and advanced aggregations can be mixed in the `GROUP BY` clause.
-See more details in the `Mixed Grouping Analytics` section. When a FILTER clause is attached to
+The grouping expressions and advanced aggregations can be mixed in the `GROUP BY` clause and nested in a `GROUPING SETS` clause.
+See more details in the `Mixed/Nested Grouping Analytics` section. When a FILTER clause is attached to
 an aggregate function, only the matching rows are passed to that function.

 ### Syntax
@ -95,13 +95,17 @@ aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_ex
     (product, warehouse, location), (warehouse), (product), (warehouse, product), ())`.
    The N elements of a `CUBE` specification results in 2^N `GROUPING SETS`.

-* **Mixed Grouping Analytics**
+* **Mixed/Nested Grouping Analytics**

-    A GROUP BY clause can include multiple  `group_expression`s and multiple `CUBE|ROLLUP|GROUPING SETS`s.
+    A GROUP BY clause can include multiple `group_expression`s and multiple `CUBE|ROLLUP|GROUPING SETS`s.
+    `GROUPING SETS` can also have nested `CUBE|ROLLUP|GROUPING SETS` clauses, e.g.
+    `GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))`,
+    `GROUPING SETS(warehouse, GROUPING SETS(location, GROUPING SETS(ROLLUP(warehouse, location), CUBE(warehouse, location))))`.
    `CUBE|ROLLUP` is just a syntax sugar for `GROUPING SETS`, please refer to the sections above for
    how to translate `CUBE|ROLLUP` to `GROUPING SETS`. `group_expression` can be treated as a single-group
    `GROUPING SETS` under this context. For multiple `GROUPING SETS` in the `GROUP BY` clause, we generate
-    a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For example,
+    a single `GROUPING SETS` by doing a cross-product of the original `GROUPING SETS`s. For nested `GROUPING SETS` in the `GROUPING SETS` clause,
+    we simply take its grouping sets and strip it. For example,
    `GROUP BY warehouse, GROUPING SETS((product), ()), GROUPING SETS((location, size), (location), (size), ())`
    and `GROUP BY warehouse, ROLLUP(product), CUBE(location, size)` is equivalent to 
    `GROUP BY GROUPING SETS(
@ -114,6 +118,9 @@ aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_ex
        (warehouse, size),
        (warehouse))`.
    
+    `GROUP BY GROUPING SETS(GROUPING SETS(warehouse), GROUPING SETS((warehouse, product)))` is equivalent to 
+    `GROUP BY GROUPING SETS((warehouse), (warehouse, product))`.
+
 * **aggregate_name**

    Specifies an aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
--- a/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
+++ b/sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
@ -606,7 +606,13 @@ groupByClause
    ;

 groupingAnalytics
-    : (ROLLUP | CUBE | GROUPING SETS)  '(' groupingSet (',' groupingSet)* ')'
+    : (ROLLUP | CUBE) '(' groupingSet (',' groupingSet)* ')'
+    | GROUPING SETS '(' groupingElement (',' groupingElement)* ')'
+    ;
+
+groupingElement
+    : groupingAnalytics
+    | groupingSet
    ;

 groupingSet
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala
@ -993,26 +993,7 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
          .map(groupByExpr => {
            val groupingAnalytics = groupByExpr.groupingAnalytics
            if (groupingAnalytics != null) {
-              val groupingSets = groupingAnalytics.groupingSet.asScala
-                .map(_.expression.asScala.map(e => expression(e)).toSeq)
-              if (groupingAnalytics.CUBE != null) {
-                // CUBE(A, B, (A, B), ()) is not supported.
-                if (groupingSets.exists(_.isEmpty)) {
-                  throw new ParseException("Empty set in CUBE grouping sets is not supported.",
-                    groupingAnalytics)
-                }
-                Cube(groupingSets.toSeq)
-              } else if (groupingAnalytics.ROLLUP != null) {
-                // ROLLUP(A, B, (A, B), ()) is not supported.
-                if (groupingSets.exists(_.isEmpty)) {
-                  throw new ParseException("Empty set in ROLLUP grouping sets is not supported.",
-                    groupingAnalytics)
-                }
-                Rollup(groupingSets.toSeq)
-              } else {
-                assert(groupingAnalytics.GROUPING != null && groupingAnalytics.SETS != null)
-                GroupingSets(groupingSets.toSeq)
-              }
+              visitGroupingAnalytics(groupingAnalytics)
            } else {
              expression(groupByExpr.expression)
            }
@ -1021,6 +1002,38 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with SQLConfHelper with Logg
    }
  }

+  override def visitGroupingAnalytics(
+      groupingAnalytics: GroupingAnalyticsContext): BaseGroupingSets = {
+    val groupingSets = groupingAnalytics.groupingSet.asScala
+      .map(_.expression.asScala.map(e => expression(e)).toSeq)
+    if (groupingAnalytics.CUBE != null) {
+      // CUBE(A, B, (A, B), ()) is not supported.
+      if (groupingSets.exists(_.isEmpty)) {
+        throw new ParseException("Empty set in CUBE grouping sets is not supported.",
+          groupingAnalytics)
+      }
+      Cube(groupingSets.toSeq)
+    } else if (groupingAnalytics.ROLLUP != null) {
+      // ROLLUP(A, B, (A, B), ()) is not supported.
+      if (groupingSets.exists(_.isEmpty)) {
+        throw new ParseException("Empty set in ROLLUP grouping sets is not supported.",
+          groupingAnalytics)
+      }
+      Rollup(groupingSets.toSeq)
+    } else {
+      assert(groupingAnalytics.GROUPING != null && groupingAnalytics.SETS != null)
+      val groupingSets = groupingAnalytics.groupingElement.asScala.flatMap { expr =>
+        val groupingAnalytics = expr.groupingAnalytics()
+        if (groupingAnalytics != null) {
+          visitGroupingAnalytics(groupingAnalytics).selectedGroupByExprs
+        } else {
+          Seq(expr.groupingSet().expression().asScala.map(e => expression(e)).toSeq)
+        }
+      }
+      GroupingSets(groupingSets.toSeq)
+    }
+  }
+
  /**
   * Add [[UnresolvedHint]]s to a logical plan.
   */
--- a/sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql
+++ b/sql/core/src/test/resources/sql-tests/inputs/group-analytics.sql
@ -80,3 +80,14 @@ SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), ());
 SELECT a, b, count(1) FROM testData GROUP BY a, CUBE(a, b), GROUPING SETS((a, b), (a), ());
 SELECT a, b, count(1) FROM testData GROUP BY a, CUBE(a, b), ROLLUP(a, b), GROUPING SETS((a, b), (a), ());

+-- Support nested CUBE/ROLLUP/GROUPING SETS in GROUPING SETS
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b));
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()));
+
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), GROUPING SETS(ROLLUP(a, b)));
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b, a, b), (a, b, a), (a, b));
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b, a, b), (a, b, a), (a, b)));
+
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b), CUBE(a, b));
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()), GROUPING SETS((a, b), (a), (b), ()));
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), (), (a, b), (a), (b), ());
--- a/sql/core/src/test/resources/sql-tests/results/ansi/group-analytics.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/ansi/group-analytics.sql.out
@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
-- Number of queries: 44
+-- Number of queries: 52


 -- !query
@ -1067,3 +1067,227 @@ struct<a:int,b:int,count(1):bigint>
 3	NULL	2
 3	NULL	2
 3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	2	1
+1	NULL	2
+1	NULL	2
+2	1	1
+2	2	1
+2	NULL	2
+2	NULL	2
+3	1	1
+3	2	1
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	2	1
+1	NULL	2
+1	NULL	2
+2	1	1
+2	2	1
+2	NULL	2
+2	NULL	2
+3	1	1
+3	2	1
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), GROUPING SETS(ROLLUP(a, b)))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	NULL	2
+1	NULL	2
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	NULL	2
+2	NULL	2
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b, a, b), (a, b, a), (a, b))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b, a, b), (a, b, a), (a, b)))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b), CUBE(a, b))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+1	NULL	2
+1	NULL	2
+1	NULL	2
+1	NULL	2
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+2	NULL	2
+2	NULL	2
+2	NULL	2
+2	NULL	2
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+3	NULL	2
+3	NULL	2
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()), GROUPING SETS((a, b), (a), (b), ()))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+1	NULL	2
+1	NULL	2
+1	NULL	2
+1	NULL	2
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+2	NULL	2
+2	NULL	2
+2	NULL	2
+2	NULL	2
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+3	NULL	2
+3	NULL	2
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), (), (a, b), (a), (b), ())
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+1	NULL	2
+1	NULL	2
+1	NULL	2
+1	NULL	2
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+2	NULL	2
+2	NULL	2
+2	NULL	2
+2	NULL	2
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+3	NULL	2
+3	NULL	2
+3	NULL	2
+3	NULL	2
--- a/sql/core/src/test/resources/sql-tests/results/group-analytics.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/group-analytics.sql.out
@ -1,5 +1,5 @@
 -- Automatically generated by SQLQueryTestSuite
-- Number of queries: 44
+-- Number of queries: 52


 -- !query
@ -1087,3 +1087,227 @@ struct<a:int,b:int,count(1):bigint>
 3	NULL	2
 3	NULL	2
 3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	2	1
+1	NULL	2
+1	NULL	2
+2	1	1
+2	2	1
+2	NULL	2
+2	NULL	2
+3	1	1
+3	2	1
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	2	1
+1	NULL	2
+1	NULL	2
+2	1	1
+2	2	1
+2	NULL	2
+2	NULL	2
+3	1	1
+3	2	1
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), GROUPING SETS(ROLLUP(a, b)))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	NULL	2
+1	NULL	2
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	NULL	2
+2	NULL	2
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b, a, b), (a, b, a), (a, b))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b, a, b), (a, b, a), (a, b)))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(ROLLUP(a, b), CUBE(a, b))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+1	NULL	2
+1	NULL	2
+1	NULL	2
+1	NULL	2
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+2	NULL	2
+2	NULL	2
+2	NULL	2
+2	NULL	2
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+3	NULL	2
+3	NULL	2
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS(GROUPING SETS((a, b), (a), ()), GROUPING SETS((a, b), (a), (b), ()))
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+1	NULL	2
+1	NULL	2
+1	NULL	2
+1	NULL	2
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+2	NULL	2
+2	NULL	2
+2	NULL	2
+2	NULL	2
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+3	NULL	2
+3	NULL	2
+3	NULL	2
+3	NULL	2
+
+
+-- !query
+SELECT a, b, count(1) FROM testData GROUP BY a, GROUPING SETS((a, b), (a), (), (a, b), (a), (b), ())
+-- !query schema
+struct<a:int,b:int,count(1):bigint>
+-- !query output
+1	1	1
+1	1	1
+1	1	1
+1	2	1
+1	2	1
+1	2	1
+1	NULL	2
+1	NULL	2
+1	NULL	2
+1	NULL	2
+2	1	1
+2	1	1
+2	1	1
+2	2	1
+2	2	1
+2	2	1
+2	NULL	2
+2	NULL	2
+2	NULL	2
+2	NULL	2
+3	1	1
+3	1	1
+3	1	1
+3	2	1
+3	2	1
+3	2	1
+3	NULL	2
+3	NULL	2
+3	NULL	2
+3	NULL	2