[SPARK-35028][SQL] ANSI mode: disallow group by aliases

### What changes were proposed in this pull request?

Disallow group by aliases under ANSI mode.

### Why are the changes needed?

As per section 7.12 `<group by clause>` of the ANSI SQL standard:

>Each `grouping column reference` shall unambiguously reference a column of the table resulting from the `from clause`. A column referenced in a `group by clause` is a grouping column.

By forbidding group by aliases, we avoid ambiguous SQL queries like:
```
SELECT col + 1 as col FROM t GROUP BY col
```
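
For illustration, a minimal sketch of the difference, assuming a hypothetical table `t` with an integer column `col` (the table, column, and alias names are illustrative, not from this patch):

```scala
// Hypothetical table; names are illustrative only.
spark.sql("CREATE TABLE t(col INT) USING parquet")

// Default (non-ANSI) mode with spark.sql.groupByAliases=true:
// the select-list alias is visible to GROUP BY.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT col + 1 AS plus_one FROM t GROUP BY plus_one")  // resolves via the alias

// ANSI mode: GROUP BY must reference a column of the FROM-clause table,
// so the alias is no longer resolvable and analysis fails.
spark.conf.set("spark.sql.ansi.enabled", "true")
// spark.sql("SELECT col + 1 AS plus_one FROM t GROUP BY plus_one")  // AnalysisException
spark.sql("SELECT col + 1 AS plus_one FROM t GROUP BY col + 1")      // group by the expression instead
```

In the ambiguous query above, `col` in the GROUP BY clause could plausibly mean either the base column of `t` or the alias for `col + 1`; under the ANSI rule it can only mean the base column.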

### Does this PR introduce _any_ user-facing change?

Yes, group by aliases are not allowed under ANSI mode.

### How was this patch tested?

Unit tests

Closes #32129 from gengliangwang/disallowGroupByAlias.

Authored-by: Gengliang Wang <ltnwgl@gmail.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
Commit 79e55b44f7 (parent 278203d969), Gengliang Wang, 2021-04-13 10:42:57 +08:00
5 changed files with 1089 additions and 14 deletions


```diff
@@ -183,6 +183,7 @@ The behavior of some SQL functions can be different under ANSI mode (`spark.sql.
 The behavior of some SQL operators can be different under ANSI mode (`spark.sql.ansi.enabled=true`).
 - `array_col[index]`: This operator throws `ArrayIndexOutOfBoundsException` if using invalid indices.
 - `map_col[key]`: This operator throws `NoSuchElementException` if key does not exist in map.
+- `GROUP BY`: aliases in a select list can not be used in GROUP BY clauses. Each column referenced in a GROUP BY clause shall unambiguously reference a column of the table resulting from the FROM clause.
 
 ### SQL Keywords
```
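
For context, a quick sketch of the operator behaviors this documentation section lists (the queries are illustrative, not from the patch):

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

// An invalid array index throws under ANSI mode (returns NULL when ANSI mode is off):
// spark.sql("SELECT array(1, 2)[5]").show()   // ArrayIndexOutOfBoundsException

// A missing map key likewise throws under ANSI mode:
// spark.sql("SELECT map(1, 'a')[2]").show()   // NoSuchElementException
```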


```diff
@@ -1847,9 +1847,12 @@ class Analyzer(override val catalogManager: CatalogManager)
       }}
     }
 
+    // Group by alias is not allowed in ANSI mode.
+    private def allowGroupByAlias: Boolean = conf.groupByAliases && !conf.ansiEnabled
+
     override def apply(plan: LogicalPlan): LogicalPlan = plan.resolveOperatorsUp {
       case agg @ Aggregate(groups, aggs, child)
-          if conf.groupByAliases && child.resolved && aggs.forall(_.resolved) &&
+          if allowGroupByAlias && child.resolved && aggs.forall(_.resolved) &&
             groups.exists(!_.resolved) =>
         agg.copy(groupingExpressions = mayResolveAttrByAggregateExprs(groups, aggs, child))
     }
```
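
The net effect is that `spark.sql.groupByAliases` becomes a no-op once ANSI mode is on. A standalone re-statement of the guard's truth table (a sketch, not Spark code):

```scala
// Mirrors the allowGroupByAlias guard added above.
def allowGroupByAlias(groupByAliases: Boolean, ansiEnabled: Boolean): Boolean =
  groupByAliases && !ansiEnabled

assert(allowGroupByAlias(groupByAliases = true, ansiEnabled = false))   // aliases usable
assert(!allowGroupByAlias(groupByAliases = true, ansiEnabled = true))   // ANSI mode wins
assert(!allowGroupByAlias(groupByAliases = false, ansiEnabled = false))
assert(!allowGroupByAlias(groupByAliases = false, ansiEnabled = true))
```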


```diff
@@ -206,6 +206,17 @@ object SQLConf {
     .intConf
     .createWithDefault(100)
 
+  val ANSI_ENABLED = buildConf("spark.sql.ansi.enabled")
+    .doc("When true, Spark SQL uses an ANSI compliant dialect instead of being Hive compliant. " +
+      "For example, Spark will throw an exception at runtime instead of returning null results " +
+      "when the inputs to a SQL operator/function are invalid." +
+      "For full details of this dialect, you can find them in the section \"ANSI Compliance\" of " +
+      "Spark's documentation. Some ANSI dialect features may be not from the ANSI SQL " +
+      "standard directly, but their behaviors align with ANSI SQL's style")
+    .version("3.0.0")
+    .booleanConf
+    .createWithDefault(false)
+
   val OPTIMIZER_EXCLUDED_RULES = buildConf("spark.sql.optimizer.excludedRules")
     .doc("Configures a list of rules to be disabled in the optimizer, in which the rules are " +
       "specified by their rule names and separated by comma. It is not guaranteed that all the " +
@@ -1092,8 +1103,9 @@ object SQLConf {
     .createWithDefault(true)
 
   val GROUP_BY_ALIASES = buildConf("spark.sql.groupByAliases")
-    .doc("When true, aliases in a select list can be used in group by clauses. When false, " +
-      "an analysis exception is thrown in the case.")
+    .doc("This configuration is only effective when ANSI mode is disabled. When it is true and " +
+      s"${ANSI_ENABLED.key} is false, aliases in a select list can be used in group by clauses. " +
+      "Otherwise, an analysis exception is thrown in the case.")
     .version("2.2.0")
     .booleanConf
     .createWithDefault(true)
@@ -2348,17 +2360,6 @@ object SQLConf {
     .checkValues(StoreAssignmentPolicy.values.map(_.toString))
     .createWithDefault(StoreAssignmentPolicy.ANSI.toString)
 
-  val ANSI_ENABLED = buildConf("spark.sql.ansi.enabled")
-    .doc("When true, Spark SQL uses an ANSI compliant dialect instead of being Hive compliant. " +
-      "For example, Spark will throw an exception at runtime instead of returning null results " +
-      "when the inputs to a SQL operator/function are invalid." +
-      "For full details of this dialect, you can find them in the section \"ANSI Compliance\" of " +
-      "Spark's documentation. Some ANSI dialect features may be not from the ANSI SQL " +
-      "standard directly, but their behaviors align with ANSI SQL's style")
-    .version("3.0.0")
-    .booleanConf
-    .createWithDefault(false)
-
   val SORT_BEFORE_REPARTITION =
     buildConf("spark.sql.execution.sortBeforeRepartition")
       .internal()
```
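
The relocation of `ANSI_ENABLED` (deleted near line 2348, re-added near line 206) is presumably required because `GROUP_BY_ALIASES` now interpolates `ANSI_ENABLED.key` into its doc string: vals in a Scala object initialize in declaration order, so a forward reference would still be null when the doc string is built. A minimal standalone sketch of the hazard (not Spark code):

```scala
object InitOrder {
  // Forward reference: `later` is not yet initialized when `early` is evaluated,
  // so it interpolates as null. On a ConfigEntry, calling .key here would NPE.
  val early: String = s"see ${later}"
  val later: String = "spark.sql.ansi.enabled"
}

// InitOrder.early == "see null"; this is why ANSI_ENABLED must be declared
// before GROUP_BY_ALIASES once the latter references it.
```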


```diff
@@ -0,0 +1 @@
+--IMPORT group-analytics.sql
```

This new one-line test input uses the `--IMPORT` directive of Spark's SQLQueryTestSuite to replay the queries of the existing `group-analytics.sql` input file, presumably under the ANSI-enabled test configuration so the same GROUP BY queries are re-run with `spark.sql.ansi.enabled=true`.

File diff suppressed because it is too large.