[SPARK-33976][SQL][DOCS] Add a SQL doc page for a TRANSFORM clause
### What changes were proposed in this pull request? Add doc about `TRANSFORM` and related function. ![image](https://user-images.githubusercontent.com/46485123/114332579-1627fe80-9b79-11eb-8fa7-131f0a20f72f.png) ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #31010 from AngersZhuuuu/SPARK-33976. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This commit is contained in:
parent
b219e37af3
commit
9c956abb1d
|
@ -192,6 +192,8 @@
|
|||
url: sql-ref-syntax-qry-select-lateral-view.html
|
||||
- text: PIVOT Clause
|
||||
url: sql-ref-syntax-qry-select-pivot.html
|
||||
- text: TRANSFORM Clause
|
||||
url: sql-ref-syntax-qry-select-transform.html
|
||||
- text: EXPLAIN
|
||||
url: sql-ref-syntax-qry-explain.html
|
||||
- text: Auxiliary Statements
|
||||
|
|
235
docs/sql-ref-syntax-qry-select-transform.md
Normal file
235
docs/sql-ref-syntax-qry-select-transform.md
Normal file
|
@ -0,0 +1,235 @@
|
|||
---
|
||||
layout: global
|
||||
title: TRANSFORM
|
||||
displayTitle: TRANSFORM
|
||||
license: |
|
||||
Licensed to the Apache Software Foundation (ASF) under one or more
|
||||
contributor license agreements. See the NOTICE file distributed with
|
||||
this work for additional information regarding copyright ownership.
|
||||
The ASF licenses this file to You under the Apache License, Version 2.0
|
||||
(the "License"); you may not use this file except in compliance with
|
||||
the License. You may obtain a copy of the License at
|
||||
|
||||
http://www.apache.org/licenses/LICENSE-2.0
|
||||
|
||||
Unless required by applicable law or agreed to in writing, software
|
||||
distributed under the License is distributed on an "AS IS" BASIS,
|
||||
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
See the License for the specific language governing permissions and
|
||||
limitations under the License.
|
||||
---
|
||||
|
||||
### Description
|
||||
|
||||
The `TRANSFORM` clause is used to specify a Hive-style transform query specification
|
||||
to transform the inputs by running a user-specified command or script.
|
||||
|
||||
### Syntax
|
||||
|
||||
```sql
|
||||
SELECT TRANSFORM ( expression [ , ... ] )
|
||||
[ ROW FORMAT row_format ]
|
||||
[ RECORDWRITER record_writer_class ]
|
||||
USING command_or_script [ AS ( [ col_name [ col_type ] ] [ , ... ] ) ]
|
||||
[ ROW FORMAT row_format ]
|
||||
[ RECORDREADER record_reader_class ]
|
||||
|
||||
row_format:
|
||||
SERDE serde_class [ WITH SERDEPROPERTIES (k1=v1, k2=v2, ... ) ]
|
||||
| DELIMITED [ FIELDS TERMINATED BY fields_terminated_char [ ESCAPED BY escaped_char ] ]
|
||||
[ COLLECTION ITEMS TERMINATED BY collection_items_terminated_char ]
|
||||
[ MAP KEYS TERMINATED BY map_key_terminated_char ]
|
||||
[ LINES TERMINATED BY row_terminated_char ]
|
||||
[ NULL DEFINED AS null_char ]
|
||||
```
|
||||
|
||||
### Parameters
|
||||
|
||||
* **expression**
|
||||
|
||||
Specifies a combination of one or more values, operators and SQL functions that results in a value.
|
||||
|
||||
* **row_format**
|
||||
|
||||
Otherwise, uses the `DELIMITED` clause to specify the native SerDe and state the delimiter, escape character, null character and so on.
|
||||
|
||||
* **SERDE**
|
||||
|
||||
Specifies a custom SerDe for one table.
|
||||
|
||||
* **serde_class**
|
||||
|
||||
Specifies a fully-qualified class name of a custom SerDe.
|
||||
|
||||
* **DELIMITED**
|
||||
|
||||
The `DELIMITED` clause can be used to specify the native SerDe and state the delimiter, escape character, null character and so on.
|
||||
|
||||
* **FIELDS TERMINATED BY**
|
||||
|
||||
Used to define a column separator.
|
||||
|
||||
* **COLLECTION ITEMS TERMINATED BY**
|
||||
|
||||
Used to define a collection item separator.
|
||||
|
||||
* **MAP KEYS TERMINATED BY**
|
||||
|
||||
Used to define a map key separator.
|
||||
|
||||
* **LINES TERMINATED BY**
|
||||
|
||||
Used to define a row separator.
|
||||
|
||||
* **NULL DEFINED AS**
|
||||
|
||||
Used to define the specific value for NULL.
|
||||
|
||||
* **ESCAPED BY**
|
||||
|
||||
Used for escape mechanism.
|
||||
|
||||
* **RECORDWRITER**
|
||||
|
||||
Specifies a fully-qualified class name of a custom RecordWriter. The default value is `org.apache.hadoop.hive.ql.exec.TextRecordWriter`.
|
||||
|
||||
* **RECORDREADER**
|
||||
|
||||
Specifies a fully-qualified class name of a custom RecordReader. The default value is `org.apache.hadoop.hive.ql.exec.TextRecordReader`.
|
||||
|
||||
* **command_or_script**
|
||||
|
||||
Specifies a command or a path to script to process data.
|
||||
|
||||
### SerDe behavior
|
||||
|
||||
Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns will be casted
|
||||
to `STRING` and combined by tabs before feeding to the user script. All `NULL` values will be converted
|
||||
to the literal string `"\N"` in order to differentiate `NULL` values from empty strings. The standard output of the
|
||||
user script will be treated as tab-separated `STRING` columns, any cell containing only `"\N"` will be re-interpreted
|
||||
as a `NULL` value, and then the resulting STRING column will be cast to the data type specified in `col_type`. If the actual
|
||||
number of output columns is less than the number of specified output columns, insufficient output columns will be
|
||||
supplemented with `NULL`. If the actual number of output columns is more than the number of specified output columns,
|
||||
the output columns will only select the corresponding columns and the remaining part will be discarded.
|
||||
If there is no `AS` clause after `USING my_script`, an output schema will be `key: STRING, value: STRING`.
|
||||
The `key` column contains all the characters before the first tab and the `value` column contains the remaining characters after the first tab.
|
||||
If there is no enough tab, Spark will return `NULL` value. These defaults can be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`.
|
||||
|
||||
### Examples
|
||||
|
||||
```sql
|
||||
CREATE TABLE person (zip_code INT, name STRING, age INT);
|
||||
INSERT INTO person VALUES
|
||||
(94588, 'Zen Hui', 50),
|
||||
(94588, 'Dan Li', 18),
|
||||
(94588, 'Anil K', 27),
|
||||
(94588, 'John V', NULL),
|
||||
(94511, 'David K', 42),
|
||||
(94511, 'Aryan B.', 18),
|
||||
(94511, 'Lalit B.', NULL);
|
||||
|
||||
-- With specified output without data type
|
||||
SELECT TRANSFORM(zip_code, name, age)
|
||||
USING 'cat' AS (a, b, c)
|
||||
FROM person
|
||||
WHERE zip_code > 94511;
|
||||
+-------+---------+-----+
|
||||
| a | b| c|
|
||||
+-------+---------+-----+
|
||||
| 94588| Anil K| 27|
|
||||
| 94588| John V| NULL|
|
||||
| 94588| Zen Hui| 50|
|
||||
| 94588| Dan Li| 18|
|
||||
+-------+---------+-----+
|
||||
|
||||
-- With specified output with data type
|
||||
SELECT TRANSFORM(zip_code, name, age)
|
||||
USING 'cat' AS (a STRING, b STRING, c STRING)
|
||||
FROM person
|
||||
WHERE zip_code > 94511;
|
||||
+-------+---------+-----+
|
||||
| a | b| c|
|
||||
+-------+---------+-----+
|
||||
| 94588| Anil K| 27|
|
||||
| 94588| John V| NULL|
|
||||
| 94588| Zen Hui| 50|
|
||||
| 94588| Dan Li| 18|
|
||||
+-------+---------+-----+
|
||||
|
||||
-- Using ROW FORMAT DELIMITED
|
||||
SELECT TRANSFORM(name, age)
|
||||
ROW FORMAT DELIMITED
|
||||
FIELDS TERMINATED BY ','
|
||||
LINES TERMINATED BY '\n'
|
||||
NULL DEFINED AS 'NULL'
|
||||
USING 'cat' AS (name_age string)
|
||||
ROW FORMAT DELIMITED
|
||||
FIELDS TERMINATED BY '@'
|
||||
LINES TERMINATED BY '\n'
|
||||
NULL DEFINED AS 'NULL'
|
||||
FROM person;
|
||||
+---------------+
|
||||
| name_age|
|
||||
+---------------+
|
||||
| Anil K,27|
|
||||
| John V,null|
|
||||
| ryan B.,18|
|
||||
| David K,42|
|
||||
| Zen Hui,50|
|
||||
| Dan Li,18|
|
||||
| Lalit B.,null|
|
||||
+---------------+
|
||||
|
||||
-- Using Hive Serde
|
||||
SELECT TRANSFORM(zip_code, name, age)
|
||||
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
|
||||
WITH SERDEPROPERTIES (
|
||||
'field.delim' = '\t'
|
||||
)
|
||||
USING 'cat' AS (a STRING, b STRING, c STRING)
|
||||
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
|
||||
WITH SERDEPROPERTIES (
|
||||
'field.delim' = '\t'
|
||||
)
|
||||
FROM person
|
||||
WHERE zip_code > 94511;
|
||||
+-------+---------+-----+
|
||||
| a | b| c|
|
||||
+-------+---------+-----+
|
||||
| 94588| Anil K| 27|
|
||||
| 94588| John V| NULL|
|
||||
| 94588| Zen Hui| 50|
|
||||
| 94588| Dan Li| 18|
|
||||
+-------+---------+-----+
|
||||
|
||||
-- Schema-less mode
|
||||
SELECT TRANSFORM(zip_code, name, age)
|
||||
USING 'cat'
|
||||
FROM person
|
||||
WHERE zip_code > 94500;
|
||||
+-------+---------------------+
|
||||
| key| value|
|
||||
+-------+---------------------+
|
||||
| 94588| Anil K 27|
|
||||
| 94588| John V \N|
|
||||
| 94511| Aryan B. 18|
|
||||
| 94511| David K 42|
|
||||
| 94588| Zen Hui 50|
|
||||
| 94588| Dan Li 18|
|
||||
| 94511| Lalit B. \N|
|
||||
+-------+---------------------+
|
||||
```
|
||||
|
||||
### Related Statements
|
||||
|
||||
* [SELECT Main](sql-ref-syntax-qry-select.html)
|
||||
* [WHERE Clause](sql-ref-syntax-qry-select-where.html)
|
||||
* [GROUP BY Clause](sql-ref-syntax-qry-select-groupby.html)
|
||||
* [HAVING Clause](sql-ref-syntax-qry-select-having.html)
|
||||
* [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
|
||||
* [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
|
||||
* [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
|
||||
* [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
|
||||
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
|
||||
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
|
||||
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
|
|
@ -41,7 +41,7 @@ select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_stat
|
|||
|
||||
While `select_statement` is defined as
|
||||
```sql
|
||||
SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_names ] [ , ... ] }
|
||||
SELECT [ hints , ... ] [ ALL | DISTINCT ] { [[ named_expression | regex_column_names ] [ , ... ] | TRANSFORM (...)) ] }
|
||||
FROM { from_item [ , ... ] }
|
||||
[ PIVOT clause ]
|
||||
[ LATERAL VIEW clause ] [ ... ]
|
||||
|
@ -164,6 +164,10 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_na
|
|||
)
|
||||
```
|
||||
|
||||
* **TRANSFORM**
|
||||
|
||||
Specifies a hive-style transform query specification to transform the input by forking and running user-specified command or script.
|
||||
|
||||
### Related Statements
|
||||
|
||||
* [WHERE Clause](sql-ref-syntax-qry-select-where.html)
|
||||
|
@ -187,3 +191,4 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_na
|
|||
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
|
||||
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
|
||||
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
|
||||
* [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
|
||||
|
|
|
@ -49,4 +49,5 @@ ability to generate logical and physical plan for a given query using
|
|||
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
|
||||
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
|
||||
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
|
||||
* [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
|
||||
* [EXPLAIN Statement](sql-ref-syntax-qry-explain.html)
|
||||
|
|
|
@ -70,6 +70,7 @@ Spark SQL is Apache Spark's module for working with structured data. The SQL Syn
|
|||
* [CASE Clause](sql-ref-syntax-qry-select-case.html)
|
||||
* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
|
||||
* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
|
||||
* [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
|
||||
* [EXPLAIN](sql-ref-syntax-qry-explain.html)
|
||||
|
||||
### Auxiliary Statements
|
||||
|
|
Loading…
Reference in a new issue