---
layout: global
title: CLUSTER BY Clause
displayTitle: CLUSTER BY Clause
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---
[SPARK-31383][SQL][DOC] Clean up the SQL documents in docs/sql-ref*
### What changes were proposed in this pull request?
This PR intends to clean up the SQL documents in `doc/sql-ref*`.
Main changes are as follows;
- Fixes wrong syntaxes and capitalize sub-titles
- Adds some DDL queries in `Examples` so that users can run examples there
- Makes query output in `Examples` follows the `Dataset.showString` (right-aligned) format
- Adds/Removes spaces, Indents, or blank lines to follow the format below;
```
---
license...
---
### Description
Writes what's the syntax is.
### Syntax
{% highlight sql %}
SELECT...
WHERE... // 4 indents after the second line
...
{% endhighlight %}
### Parameters
<dl>
<dt><code><em>Param Name</em></code></dt>
<dd>
Param Description
</dd>
...
</dl>
### Examples
{% highlight sql %}
-- It is better that users are able to execute example queries here.
-- So, we prepare test data in the first section if possible.
CREATE TABLE t (key STRING, value DOUBLE);
INSERT INTO t VALUES
('a', 1.0), ('a', 2.0), ('b', 3.0), ('c', 4.0);
-- query output has 2 indents and it follows the `Dataset.showString`
-- format (right-aligned).
SELECT * FROM t;
+---+-----+
|key|value|
+---+-----+
| a| 1.0|
| a| 2.0|
| b| 3.0|
| c| 4.0|
+---+-----+
-- Query statements after the second line have 4 indents.
SELECT key, SUM(value)
FROM t
GROUP BY key;
+---+----------+
|key|sum(value)|
+---+----------+
| c| 4.0|
| b| 3.0|
| a| 3.0|
+---+----------+
...
{% endhighlight %}
### Related Statements
* [XXX](xxx.html)
* ...
```
### Why are the changes needed?
The most changes of this PR are pretty minor, but I think the consistent formats/rules to write documents are important for long-term maintenance in our community
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Manually checked.
Closes #28151 from maropu/MakeRightAligned.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-04-13 00:40:36 -04:00

### Description

The `CLUSTER BY` clause first repartitions the data based on the input
expressions and then sorts the data within each partition. This is
semantically equivalent to performing a
[DISTRIBUTE BY](sql-ref-syntax-qry-select-distribute-by.html) followed by a
[SORT BY](sql-ref-syntax-qry-select-sortby.html). This clause only ensures that the
resultant rows are sorted within each partition; it does not guarantee a total order of the output.

### Syntax

```sql
CLUSTER BY { expression [ , ... ] }
```

### Parameters

* **expression**

    Specifies a combination of one or more values, operators, and SQL functions that results in a value.

### Examples

```sql
CREATE TABLE person (name STRING, age INT);
INSERT INTO person VALUES
    ('Zen Hui', 25),
    ('Anil B', 18),
    ('Shone S', 16),
    ('Mike A', 25),
    ('John A', 18),
    ('Jack N', 16);

-- Reduce the number of shuffle partitions to 2 to illustrate the behavior of `CLUSTER BY`.
-- It's easier to see the clustering and sorting behavior with fewer partitions.
SET spark.sql.shuffle.partitions = 2;

-- Select the rows with no ordering. Please note that without any sort directive, the result
-- of the query is not deterministic. It's included here to show the difference in behavior
-- of a query when `CLUSTER BY` is not used vs when it's used. The query below produces rows
-- where the age column is not sorted.
SELECT age, name FROM person;
+---+-------+
|age|   name|
+---+-------+
| 16|Shone S|
| 25|Zen Hui|
| 16| Jack N|
| 25| Mike A|
| 18| John A|
| 18| Anil B|
+---+-------+

-- Produces rows clustered by age. Persons with the same age are clustered together.
-- In the query below, persons with age 18 and 25 are in the first partition and the
-- persons with age 16 are in the second partition. The rows are sorted based
-- on age within each partition.
SELECT age, name FROM person CLUSTER BY age;
+---+-------+
|age|   name|
+---+-------+
| 18| John A|
| 18| Anil B|
| 25|Zen Hui|
| 25| Mike A|
| 16|Shone S|
| 16| Jack N|
+---+-------+
```
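
The equivalence stated in the description can be sketched with the same data. The queries below are an illustrative sketch, assuming the `person` table and the `spark.sql.shuffle.partitions` setting from the example above. No output is shown because the row order across partitions may differ from run to run.

```sql
-- Assumes the `person` table created in the example above.
-- Spell out the two steps that `CLUSTER BY age` combines:
-- `DISTRIBUTE BY` repartitions the rows by age, and `SORT BY`
-- sorts the rows by age within each partition.
SELECT age, name FROM person DISTRIBUTE BY age SORT BY age;

-- Neither form guarantees a total order of the output.
-- When a total order is required, use `ORDER BY` instead.
SELECT age, name FROM person ORDER BY age;
```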

### Related Statements

* [SELECT Main](sql-ref-syntax-qry-select.html)
* [WHERE Clause](sql-ref-syntax-qry-select-where.html)
* [GROUP BY Clause](sql-ref-syntax-qry-select-groupby.html)
* [HAVING Clause](sql-ref-syntax-qry-select-having.html)
* [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
* [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
* [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
* [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)