spark-instrumented-optimizer/docs/sql-ref-syntax-qry-select.md at 9628aca68ba0821b8f3fa934ed4872cabb2a5d7d

[SPARK-31110][DOCS][SQL] refine sql doc for SELECT

### What changes were proposed in this pull request?

A few improvements to the sql ref SELECT doc:
1. correct the syntax of SELECT query
2. correct the default of null sort order
3. correct the GROUP BY syntax
4. several minor fixes

### Why are the changes needed?

refine document

### Does this PR introduce any user-facing change?

N/A

### How was this patch tested?

N/A

Closes #27866 from cloud-fan/doc.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>

2020-03-11 16:52:40 -07:00

6.3 KiB

Raw Blame History

layout	title	displayTitle	license
global	SELECT	SELECT	Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

Spark supports a SELECT statement and conforms to the ANSI SQL standard. Queries are used to retrieve result sets from one or more tables. The following section describes the overall query syntax and the sub-sections cover different constructs of a query along with examples.

Syntax

{% highlight sql %} [ WITH with_query [ , ... ] ] select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_statement, ... ] [ ORDER BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ...] } ] [ SORT BY { expression [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [ , ...] } ] [ CLUSTER BY { expression [ , ...] } ] [ DISTRIBUTE BY { expression [, ...] } ] [ WINDOW { named_window [ , WINDOW named_window, ... ] } ] [ LIMIT { ALL | expression } ] {% endhighlight %}

While select_statement is defined as {% highlight sql %} SELECT [ hints , ... ] [ ALL | DISTINCT ] { named_expression [ , ... ] } FROM { from_item [ , ...] } [ WHERE boolean_expression ] [ GROUP BY expression [ , ...] ] [ HAVING boolean_expression ] {% endhighlight %}

Parameters

with_query

Specifies the common table expressions (CTEs) before the main query block. These table expressions are allowed to be referenced later in the FROM clause. This is useful to abstract out repeated subquery blocks in the FROM clause and improves readability of the query.

hints

Hints can be specified to help spark optimizer make better planning decisions. Currently spark supports hints that influence selection of join strategies and repartitioning of the data.

ALL

Select all matching rows from the relation and is enabled by default.

DISTINCT

Select all matching rows from the relation after removing duplicates in results.

named_expression

An expression with an assigned name. In general, it denotes a column expression.

Syntax:


        expression [AS] [alias]

from_item

Specifies a source of input for the query. It can be one of the following:

Table relation
Join relation
Table valued function
Inlined table
Subquery

WHERE

Filters the result of the FROM clause based on the supplied predicates.

GROUP BY

Specifies the expressions that are used to group the rows. This is used in conjunction with aggregate functions (MIN, MAX, COUNT, SUM, AVG, etc.) to group rows based on the grouping expressions and aggregate values in each group.

HAVING

Specifies the predicates by which the rows produced by GROUP BY are filtered. The HAVING clause is used to filter rows after the grouping is performed. If HAVING is specified without GROUP BY, it indicates a GROUP BY without grouping expressions (global aggregate).

ORDER BY

Specifies an ordering of the rows of the complete result set of the query. The output rows are ordered across the partitions. This parameter is mutually exclusive with SORT BY, CLUSTER BY and DISTRIBUTE BY and can not be specified together.

SORT BY

Specifies an ordering by which the rows are ordered within each partition. This parameter is mutually exclusive with ORDER BY and CLUSTER BY and can not be specified together.

CLUSTER BY

Specifies a set of expressions that is used to repartition and sort the rows. Using this clause has the same effect of using DISTRIBUTE BY and SORT BY together.

DISTRIBUTE BY

Specifies a set of expressions by which the result rows are repartitioned. This parameter is mutually exclusive with ORDER BY and CLUSTER BY and can not be specified together.

LIMIT

Specifies the maximum number of rows that can be returned by a statement or subquery. This clause is mostly used in the conjunction with ORDER BY to produce a deterministic result.

boolean_expression

Specifies an expression with a return type of boolean.

expression

Specifies a combination of one or more values, operators, and SQL functions that evaluates to a value.

named_window

Specifies aliases for one or more source window specifications. The source window specifications can be referenced in the widow definitions in the query.

6.3 KiB Raw Blame History

Syntax

Parameters

Related Clauses

6.3 KiB

Raw Blame History