From e24f0dcd2754a1db27e9b0f3cf27ee6d7229f717 Mon Sep 17 00:00:00 2001
From: Takeshi Yamamuro <yamamuro@apache.org>
Date: Mon, 6 Apr 2020 21:36:51 +0900
Subject: [PATCH] [SPARK-31358][SQL][DOC] Document FILTER clauses of aggregate
 functions in SQL references

### What changes were proposed in this pull request?

This PR intends to improve the SQL document of `GROUP BY`; it added the description about FILTER clauses of aggregate functions.

### Why are the changes needed?

To improve the SQL documents

### Does this PR introduce any user-facing change?

Yes.

<img src="https://user-images.githubusercontent.com/692303/78558612-e2234a80-784d-11ea-9353-b3feac4d57a7.png" width="500">

### How was this patch tested?

Manually checked.

Closes #28134 from maropu/SPARK-31358.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
---
 docs/sql-ref-syntax-qry-select-groupby.md | 44 +++++++++++++++++++++++
 docs/sql-ref-syntax-qry-select.md         |  1 +
 2 files changed, 45 insertions(+)
diff --git a/docs/sql-ref-syntax-qry-select-groupby.md b/docs/sql-ref-syntax-qry-select-groupby.md
index 49a11cafbb..c461a18e00 100644
--- a/docs/sql-ref-syntax-qry-select-groupby.md
+++ b/docs/sql-ref-syntax-qry-select-groupby.md
@@ -21,6 +21,7 @@ license: |
 The <code>GROUP BY</code> clause is used to group the rows based on a set of specified grouping expressions and compute aggregations on
 the group of rows based on one or more specified aggregate functions. Spark also supports advanced aggregations to do multiple
 aggregations for the same input record set via `GROUPING SETS`, `CUBE`, `ROLLUP` clauses.
+When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.
 
 ### Syntax
 {% highlight sql %}
@@ -30,6 +31,11 @@ GROUP BY group_expression [ , group_expression [ , ... ] ]
 GROUP BY GROUPING SETS (grouping_set [ , ...])
 {% endhighlight %}
 
+While aggregate functions are defined as
+{% highlight sql %}
+aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]
+{% endhighlight %}
+
 ### Parameters
 <dl>
   <dt><code><em>GROUPING SETS</em></code></dt>
@@ -70,6 +76,19 @@ GROUP BY GROUPING SETS (grouping_set [ , ...])
     ((warehouse, product), (warehouse), (product), ())</code>.
     The N elements of a <code>CUBE</code> specification results in 2^N <code>GROUPING SETS</code>.
   </dd>
+  <dt><code><em>aggregate_name</em></code></dt>
+  <dd>
+    Specifies an aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
+  </dd>
+  <dt><code><em>DISTINCT</em></code></dt>
+  <dd>
+    Removes duplicates in input rows before they are passed to aggregate functions.
+  </dd>
+  <dt><code><em>FILTER</em></code></dt>
+  <dd>
+    Filters the input rows for which the <code>boolean_expression</code> in the <code>WHERE</code> clause evaluates
+    to true are passed to the aggregate function; other rows are discarded.
+  </dd>
 </dl>
 
 ### Examples
@@ -120,6 +139,31 @@ SELECT id, sum(quantity) AS sum, max(quantity) AS max FROM dealer GROUP BY id OR
   |300|13 |8  |
   +---+---+---+
 
+-- Count the number of distinct dealer cities per car_model.
+SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
+
+  +------------+-----+
+  |   car_model|count|
+  +------------+-----+
+  | Honda Civic|    3|
+  |   Honda CRV|    2|
+  |Honda Accord|    3|
+  +------------+-----+
+
+-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
+SELECT id, sum(quantity) FILTER (
+            WHERE car_model IN ('Honda Civic', 'Honda CRV')
+        ) AS `sum(quantity)` FROM dealer
+    GROUP BY id ORDER BY id;
+
+   +---+-------------+
+   | id|sum(quantity)|
+   +---+-------------+
+   |100|           17|
+   |200|           23|
+   |300|            5|
+   +---+-------------+
+
 -- Aggregations using multiple sets of grouping columns in a single statement.
 -- Following performs aggregations based on four sets of grouping columns.
 -- 1. city, car_model
diff --git a/docs/sql-ref-syntax-qry-select.md b/docs/sql-ref-syntax-qry-select.md
index e87c4a587e..7ad1dd15c4 100644
--- a/docs/sql-ref-syntax-qry-select.md
+++ b/docs/sql-ref-syntax-qry-select.md
@@ -92,6 +92,7 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { named_expression [ , ... ] }
   <dd>
     Specifies the expressions that are used to group the rows. This is used in conjunction with aggregate functions
     (MIN, MAX, COUNT, SUM, AVG, etc.) to group rows based on the grouping expressions and aggregate values in each group.
+    When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.
   </dd>
   <dt><code><em>HAVING</em></code></dt>
   <dd>