485145326a
### What changes were proposed in this pull request?
This PR intends to fix typos in the sub-modules:
* `bin`
* `core`
* `docs`
* `external`
* `mllib`
* `repl`
* `pom.xml`
Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618
NOTE: The misspellings have been reported at 706a726f87 (commitcomment-44064356)
### Why are the changes needed?
Misspelled words make it harder to read / understand content.
### Does this PR introduce _any_ user-facing change?
There are various fixes to documentation, etc...
### How was this patch tested?
No testing was performed
Closes #30530 from jsoref/spelling-bin-core-docs-external-mllib-repl.
Authored-by: Josh Soref <jsoref@users.noreply.github.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
302 lines
10 KiB
Markdown
302 lines
10 KiB
Markdown
---
|
|
layout: global
|
|
title: GROUP BY Clause
|
|
displayTitle: GROUP BY Clause
|
|
license: |
|
|
Licensed to the Apache Software Foundation (ASF) under one or more
|
|
contributor license agreements. See the NOTICE file distributed with
|
|
this work for additional information regarding copyright ownership.
|
|
The ASF licenses this file to You under the Apache License, Version 2.0
|
|
(the "License"); you may not use this file except in compliance with
|
|
the License. You may obtain a copy of the License at
|
|
|
|
http://www.apache.org/licenses/LICENSE-2.0
|
|
|
|
Unless required by applicable law or agreed to in writing, software
|
|
distributed under the License is distributed on an "AS IS" BASIS,
|
|
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
|
See the License for the specific language governing permissions and
|
|
limitations under the License.
|
|
---
|
|
|
|
### Description
|
|
|
|
The `GROUP BY` clause is used to group the rows based on a set of specified grouping expressions and compute aggregations on
|
|
the group of rows based on one or more specified aggregate functions. Spark also supports advanced aggregations to do multiple
|
|
aggregations for the same input record set via `GROUPING SETS`, `CUBE`, `ROLLUP` clauses.
|
|
When a FILTER clause is attached to an aggregate function, only the matching rows are passed to that function.
|
|
|
|
### Syntax
|
|
|
|
```sql
|
|
GROUP BY group_expression [ , group_expression [ , ... ] ]
|
|
[ { WITH ROLLUP | WITH CUBE | GROUPING SETS (grouping_set [ , ...]) } ]
|
|
|
|
GROUP BY GROUPING SETS (grouping_set [ , ...])
|
|
```
|
|
|
|
While aggregate functions are defined as
|
|
```sql
|
|
aggregate_name ( [ DISTINCT ] expression [ , ... ] ) [ FILTER ( WHERE boolean_expression ) ]
|
|
```
|
|
|
|
### Parameters
|
|
|
|
* **GROUPING SETS**
|
|
|
|
Groups the rows for each subset of the expressions specified in the grouping sets. For example,
|
|
`GROUP BY GROUPING SETS (warehouse, product)` is semantically equivalent
|
|
to union of results of `GROUP BY warehouse` and `GROUP BY product`. This clause
|
|
is a shorthand for a `UNION ALL` where each leg of the `UNION ALL`
|
|
operator performs aggregation of subset of the columns specified in the `GROUPING SETS` clause.
|
|
|
|
* **grouping_set**
|
|
|
|
A grouping set is specified by zero or more comma-separated expressions in parentheses.
|
|
|
|
**Syntax:** `( [ expression [ , ... ] ] )`
|
|
|
|
* **grouping_expression**
|
|
|
|
Specifies the criteria based on which the rows are grouped together. The grouping of rows is performed based on
|
|
result values of the grouping expressions. A grouping expression may be a column alias, a column position
|
|
or an expression.
|
|
|
|
* **ROLLUP**
|
|
|
|
Specifies multiple levels of aggregations in a single statement. This clause is used to compute aggregations
|
|
based on multiple grouping sets. `ROLLUP` is a shorthand for `GROUPING SETS`. For example,
|
|
`GROUP BY warehouse, product WITH ROLLUP` is equivalent to `GROUP BY GROUPING SETS
|
|
((warehouse, product), (warehouse), ())`.
|
|
The N elements of a `ROLLUP` specification results in N+1 `GROUPING SETS`.
|
|
|
|
* **CUBE**
|
|
|
|
`CUBE` clause is used to perform aggregations based on combination of grouping columns specified in the
|
|
`GROUP BY` clause. `CUBE` is a shorthand for `GROUPING SETS`. For example,
|
|
`GROUP BY warehouse, product WITH CUBE` is equivalent to `GROUP BY GROUPING SETS
|
|
((warehouse, product), (warehouse), (product), ())`.
|
|
The N elements of a `CUBE` specification results in 2^N `GROUPING SETS`.
|
|
|
|
* **aggregate_name**
|
|
|
|
Specifies an aggregate function name (MIN, MAX, COUNT, SUM, AVG, etc.).
|
|
|
|
* **DISTINCT**
|
|
|
|
Removes duplicates in input rows before they are passed to aggregate functions.
|
|
|
|
* **FILTER**
|
|
|
|
Filters the input rows for which the `boolean_expression` in the `WHERE` clause evaluates
|
|
to true are passed to the aggregate function; other rows are discarded.
|
|
|
|
### Examples
|
|
|
|
```sql
|
|
CREATE TABLE dealer (id INT, city STRING, car_model STRING, quantity INT);
|
|
INSERT INTO dealer VALUES
|
|
(100, 'Fremont', 'Honda Civic', 10),
|
|
(100, 'Fremont', 'Honda Accord', 15),
|
|
(100, 'Fremont', 'Honda CRV', 7),
|
|
(200, 'Dublin', 'Honda Civic', 20),
|
|
(200, 'Dublin', 'Honda Accord', 10),
|
|
(200, 'Dublin', 'Honda CRV', 3),
|
|
(300, 'San Jose', 'Honda Civic', 5),
|
|
(300, 'San Jose', 'Honda Accord', 8);
|
|
|
|
-- Sum of quantity per dealership. Group by `id`.
|
|
SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id;
|
|
+---+-------------+
|
|
| id|sum(quantity)|
|
|
+---+-------------+
|
|
|100| 32|
|
|
|200| 33|
|
|
|300| 13|
|
|
+---+-------------+
|
|
|
|
-- Use column position in GROUP by clause.
|
|
SELECT id, sum(quantity) FROM dealer GROUP BY 1 ORDER BY 1;
|
|
+---+-------------+
|
|
| id|sum(quantity)|
|
|
+---+-------------+
|
|
|100| 32|
|
|
|200| 33|
|
|
|300| 13|
|
|
+---+-------------+
|
|
|
|
-- Multiple aggregations.
|
|
-- 1. Sum of quantity per dealership.
|
|
-- 2. Max quantity per dealership.
|
|
SELECT id, sum(quantity) AS sum, max(quantity) AS max FROM dealer GROUP BY id ORDER BY id;
|
|
+---+---+---+
|
|
| id|sum|max|
|
|
+---+---+---+
|
|
|100| 32| 15|
|
|
|200| 33| 20|
|
|
|300| 13| 8|
|
|
+---+---+---+
|
|
|
|
-- Count the number of distinct dealer cities per car_model.
|
|
SELECT car_model, count(DISTINCT city) AS count FROM dealer GROUP BY car_model;
|
|
+------------+-----+
|
|
| car_model|count|
|
|
+------------+-----+
|
|
| Honda Civic| 3|
|
|
| Honda CRV| 2|
|
|
|Honda Accord| 3|
|
|
+------------+-----+
|
|
|
|
-- Sum of only 'Honda Civic' and 'Honda CRV' quantities per dealership.
|
|
SELECT id, sum(quantity) FILTER (
|
|
WHERE car_model IN ('Honda Civic', 'Honda CRV')
|
|
) AS `sum(quantity)` FROM dealer
|
|
GROUP BY id ORDER BY id;
|
|
+---+-------------+
|
|
| id|sum(quantity)|
|
|
+---+-------------+
|
|
|100| 17|
|
|
|200| 23|
|
|
|300| 5|
|
|
+---+-------------+
|
|
|
|
-- Aggregations using multiple sets of grouping columns in a single statement.
|
|
-- Following performs aggregations based on four sets of grouping columns.
|
|
-- 1. city, car_model
|
|
-- 2. city
|
|
-- 3. car_model
|
|
-- 4. Empty grouping set. Returns quantities for all city and car models.
|
|
SELECT city, car_model, sum(quantity) AS sum FROM dealer
|
|
GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
|
|
ORDER BY city;
|
|
+---------+------------+---+
|
|
| city| car_model|sum|
|
|
+---------+------------+---+
|
|
| null| null| 78|
|
|
| null| HondaAccord| 33|
|
|
| null| HondaCRV| 10|
|
|
| null| HondaCivic| 35|
|
|
| Dublin| null| 33|
|
|
| Dublin| HondaAccord| 10|
|
|
| Dublin| HondaCRV| 3|
|
|
| Dublin| HondaCivic| 20|
|
|
| Fremont| null| 32|
|
|
| Fremont| HondaAccord| 15|
|
|
| Fremont| HondaCRV| 7|
|
|
| Fremont| HondaCivic| 10|
|
|
| San Jose| null| 13|
|
|
| San Jose| HondaAccord| 8|
|
|
| San Jose| HondaCivic| 5|
|
|
+---------+------------+---+
|
|
|
|
-- Alternate syntax for `GROUPING SETS` in which both `GROUP BY` and `GROUPING SETS`
|
|
-- specifications are present.
|
|
SELECT city, car_model, sum(quantity) AS sum FROM dealer
|
|
GROUP BY city, car_model GROUPING SETS ((city, car_model), (city), (car_model), ())
|
|
ORDER BY city, car_model;
|
|
+---------+------------+---+
|
|
| city| car_model|sum|
|
|
+---------+------------+---+
|
|
| null| null| 78|
|
|
| null| HondaAccord| 33|
|
|
| null| HondaCRV| 10|
|
|
| null| HondaCivic| 35|
|
|
| Dublin| null| 33|
|
|
| Dublin| HondaAccord| 10|
|
|
| Dublin| HondaCRV| 3|
|
|
| Dublin| HondaCivic| 20|
|
|
| Fremont| null| 32|
|
|
| Fremont| HondaAccord| 15|
|
|
| Fremont| HondaCRV| 7|
|
|
| Fremont| HondaCivic| 10|
|
|
| San Jose| null| 13|
|
|
| San Jose| HondaAccord| 8|
|
|
| San Jose| HondaCivic| 5|
|
|
+---------+------------+---+
|
|
|
|
-- Group by processing with `ROLLUP` clause.
|
|
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), ())
|
|
SELECT city, car_model, sum(quantity) AS sum FROM dealer
|
|
GROUP BY city, car_model WITH ROLLUP
|
|
ORDER BY city, car_model;
|
|
+---------+------------+---+
|
|
| city| car_model|sum|
|
|
+---------+------------+---+
|
|
| null| null| 78|
|
|
| Dublin| null| 33|
|
|
| Dublin| HondaAccord| 10|
|
|
| Dublin| HondaCRV| 3|
|
|
| Dublin| HondaCivic| 20|
|
|
| Fremont| null| 32|
|
|
| Fremont| HondaAccord| 15|
|
|
| Fremont| HondaCRV| 7|
|
|
| Fremont| HondaCivic| 10|
|
|
| San Jose| null| 13|
|
|
| San Jose| HondaAccord| 8|
|
|
| San Jose| HondaCivic| 5|
|
|
+---------+------------+---+
|
|
|
|
-- Group by processing with `CUBE` clause.
|
|
-- Equivalent GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ())
|
|
SELECT city, car_model, sum(quantity) AS sum FROM dealer
|
|
GROUP BY city, car_model WITH CUBE
|
|
ORDER BY city, car_model;
|
|
+---------+------------+---+
|
|
| city| car_model|sum|
|
|
+---------+------------+---+
|
|
| null| null| 78|
|
|
| null| HondaAccord| 33|
|
|
| null| HondaCRV| 10|
|
|
| null| HondaCivic| 35|
|
|
| Dublin| null| 33|
|
|
| Dublin| HondaAccord| 10|
|
|
| Dublin| HondaCRV| 3|
|
|
| Dublin| HondaCivic| 20|
|
|
| Fremont| null| 32|
|
|
| Fremont| HondaAccord| 15|
|
|
| Fremont| HondaCRV| 7|
|
|
| Fremont| HondaCivic| 10|
|
|
| San Jose| null| 13|
|
|
| San Jose| HondaAccord| 8|
|
|
| San Jose| HondaCivic| 5|
|
|
+---------+------------+---+
|
|
|
|
--Prepare data for ignore nulls example
|
|
CREATE TABLE person (id INT, name STRING, age INT);
|
|
INSERT INTO person VALUES
|
|
(100, 'Mary', NULL),
|
|
(200, 'John', 30),
|
|
(300, 'Mike', 80),
|
|
(400, 'Dan', 50);
|
|
|
|
--Select the first row in column age
|
|
SELECT FIRST(age) FROM person;
|
|
+--------------------+
|
|
| first(age, false) |
|
|
+--------------------+
|
|
| NULL |
|
|
+--------------------+
|
|
|
|
--Get the first row in column `age` ignore nulls,last row in column `id` and sum of column `id`.
|
|
SELECT FIRST(age IGNORE NULLS), LAST(id), SUM(id) FROM person;
|
|
+-------------------+------------------+----------+
|
|
| first(age, true) | last(id, false) | sum(id) |
|
|
+-------------------+------------------+----------+
|
|
| 30 | 400 | 1000 |
|
|
+-------------------+------------------+----------+
|
|
```
|
|
|
|
### Related Statements
|
|
|
|
* [SELECT Main](sql-ref-syntax-qry-select.html)
|
|
* [WHERE Clause](sql-ref-syntax-qry-select-where.html)
|
|
* [HAVING Clause](sql-ref-syntax-qry-select-having.html)
|
|
* [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
|
|
* [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
|
|
* [CLUSTER BY Clause](sql-ref-syntax-qry-select-clusterby.html)
|
|
* [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
|
|
* [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
|
|
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
|
|
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
|
|
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
|