[SPARK-31358][SQL][DOC] Document FILTER clauses of aggregate functions in SQL references

### What changes were proposed in this pull request?

This PR intends to improve the SQL document of `GROUP BY`; it added the description about FILTER clauses of aggregate functions.

### Why are the changes needed?

To improve the SQL documents

### Does this PR introduce any user-facing change?

Yes.

<img src="https://user-images.githubusercontent.com/692303/78558612-e2234a80-784d-11ea-9353-b3feac4d57a7.png" width="500">

### How was this patch tested?

Manually checked.

Closes #28134 from maropu/SPARK-31358.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
This commit is contained in:
Takeshi Yamamuro 2020-04-06 21:36:51 +09:00
parent 112caea214
commit e24f0dcd27
2 changed files with 45 additions and 0 deletions

View file

@ -21,6 +21,7 @@ license: |
The <code>GROUP BY</code> clause is used to group the rows based on a set of specified grouping expressions and compute aggregations on
the group of rows based on one or more specified aggregate functions. Spark also supports advanced aggregations to do multiple
aggregations for the same input record set via `GROUPING SETS`, `CUBE`, `ROLLUP` clauses.
When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.
### Syntax
{% highlight sql %}
@ -30,6 +31,11 @@ GROUP BY group_expression [ , group_expression [ , ... ] ]
GROUP BY GROUPING SETS (grouping_set [ , ...])
{% endhighlight %}
While aggregate functions are defined as
{% highlight sql %}
aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]
{% endhighlight %}
### Parameters
<dl>
<dt><code><em>GROUPING SETS</em></code></dt>
@ -70,6 +76,19 @@ GROUP BY GROUPING SETS (grouping_set [ , ...])
((warehouse, product), (warehouse), (product), ())</code>.
The N elements of a <code>CUBE</code> specification results in 2^N <code>GROUPING SETS</code>.
</dd>
<dt><code><em>aggregate_name</em></code></dt>
<dd>
Specifies an aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
</dd>
<dt><code><em>DISTINCT</em></code></dt>
<dd>
Removes duplicates in input rows before they are passed to aggregate functions.
</dd>
<dt><code><em>FILTER</em></code></dt>
<dd>
Filters the input rows for which the <code>boolean_expression</code> in the <code>WHERE</code> clause evaluates
to true are passed to the aggregate function; other rows are discarded.
</dd>
</dl>
### Examples
@ -120,6 +139,31 @@ SELECT id, sum(quantity) AS sum, max(quantity) AS max FROM dealer GROUP BY id OR
|300|13 |8 |
+---+---+---+
-- Count the number of distinct dealer cities per car_model.
SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
+------------+-----+
| car_model|count|
+------------+-----+
| Honda Civic| 3|
| Honda CRV| 2|
|Honda Accord| 3|
+------------+-----+
-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
SELECT id, sum(quantity) FILTER (
WHERE car_model IN ('Honda Civic', 'Honda CRV')
) AS `sum(quantity)` FROM dealer
GROUP BY id ORDER BY id;
+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100| 17|
|200| 23|
|300| 5|
+---+-------------+
-- Aggregations using multiple sets of grouping columns in a single statement.
-- Following performs aggregations based on four sets of grouping columns.
-- 1. city, car_model

View file

@ -92,6 +92,7 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { named_expression [ , ... ] }
<dd>
Specifies the expressions that are used to group the rows. This is used in conjunction with aggregate functions
(MIN, MAX, COUNT, SUM, AVG, etc.) to group rows based on the grouping expressions and aggregate values in each group.
When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.
</dd>
<dt><code><em>HAVING</em></code></dt>
<dd>