[SPARK-33976][SQL][DOCS] Add a SQL doc page for a TRANSFORM clause

### What changes were proposed in this pull request? Add doc about `TRANSFORM` and related function. ![image](https://user-images.githubusercontent.com/46485123/114332579-1627fe80-9b79-11eb-8fa7-131f0a20f72f.png) ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #31010 from AngersZhuuuu/SPARK-33976. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-20 10:30:26 +00:00 · 2021-04-20 10:30:26 +00:00 · 9c956abb1d
parent b219e37af3
commit 9c956abb1d
5 changed files with 245 additions and 1 deletions
--- a/docs/_data/menu-sql.yaml
+++ b/docs/_data/menu-sql.yaml
@ -192,6 +192,8 @@
                  url: sql-ref-syntax-qry-select-lateral-view.html
                - text: PIVOT Clause
                  url: sql-ref-syntax-qry-select-pivot.html
+                - text: TRANSFORM Clause
+                  url: sql-ref-syntax-qry-select-transform.html
            - text: EXPLAIN
              url: sql-ref-syntax-qry-explain.html
        - text: Auxiliary Statements
--- a/docs/sql-ref-syntax-qry-select-transform.md
+++ b/docs/sql-ref-syntax-qry-select-transform.md
@ -0,0 +1,235 @@
+---
+layout: global
+title: TRANSFORM
+displayTitle: TRANSFORM
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+     http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+### Description
+
+The `TRANSFORM` clause is used to specify a Hive-style transform query specification 
+to transform the inputs by running a user-specified command or script.
+
+### Syntax
+
+```sql
+SELECT TRANSFORM ( expression [ , ... ] )
+    [ ROW FORMAT row_format ]
+    [ RECORDWRITER record_writer_class ]
+    USING command_or_script [ AS ( [ col_name [ col_type ] ] [ , ... ] ) ]
+    [ ROW FORMAT row_format ]
+    [ RECORDREADER record_reader_class ]
+
+row_format:    
+    SERDE serde_class [ WITH SERDEPROPERTIES (k1=v1, k2=v2, ... ) ]
+    | DELIMITED [ FIELDS TERMINATED BY fields_terminated_char [ ESCAPED BY escaped_char ] ] 
+        [ COLLECTION ITEMS TERMINATED BY collection_items_terminated_char ] 
+        [ MAP KEYS TERMINATED BY map_key_terminated_char ]
+        [ LINES TERMINATED BY row_terminated_char ]
+        [ NULL DEFINED AS null_char ]
+```
+
+### Parameters
+
+* **expression**
+    
+    Specifies a combination of one or more values, operators and SQL functions that results in a value.
+
+* **row_format**    
+
+    Otherwise, uses the `DELIMITED` clause to specify the native SerDe and state the delimiter, escape character, null character and so on.
+
+* **SERDE**
+
+    Specifies a custom SerDe for one table.
+
+* **serde_class**
+
+    Specifies a fully-qualified class name of a custom SerDe.
+
+* **DELIMITED**
+
+    The `DELIMITED` clause can be used to specify the native SerDe and state the delimiter, escape character, null character and so on.
+
+* **FIELDS TERMINATED BY**
+
+    Used to define a column separator.
+    
+* **COLLECTION ITEMS TERMINATED BY**
+
+    Used to define a collection item separator.
+   
+* **MAP KEYS TERMINATED BY**
+
+    Used to define a map key separator.
+    
+* **LINES TERMINATED BY**
+
+    Used to define a row separator.
+    
+* **NULL DEFINED AS**
+
+    Used to define the specific value for NULL.
+    
+* **ESCAPED BY**
+
+    Used for escape mechanism.
+
+* **RECORDWRITER**
+
+    Specifies a fully-qualified class name of a custom RecordWriter. The default value is `org.apache.hadoop.hive.ql.exec.TextRecordWriter`.
+
+* **RECORDREADER**
+
+    Specifies a fully-qualified class name of a custom RecordReader. The default value is `org.apache.hadoop.hive.ql.exec.TextRecordReader`.
+
+* **command_or_script**
+
+    Specifies a command or a path to script to process data.
+
+### SerDe behavior
+
+Spark uses the Hive SerDe `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe` by default, so columns will be casted
+to `STRING` and combined by tabs before feeding to the user script. All `NULL` values will be converted
+to the literal string `"\N"` in order to differentiate `NULL` values from empty strings. The standard output of the
+user script will be treated as tab-separated `STRING` columns, any cell containing only `"\N"` will be re-interpreted
+as a `NULL` value, and then the resulting STRING column will be cast to the data type specified in `col_type`. If the actual
+number of output columns is less than the number of specified output columns, insufficient output columns will be
+supplemented with `NULL`. If the actual number of output columns is more than the number of specified output columns,
+the output columns will only select the corresponding columns and the remaining part will be discarded.
+If there is no `AS` clause after `USING my_script`, an output schema will be `key: STRING, value: STRING`.
+The `key` column contains all the characters before the first tab and the `value` column contains the remaining characters after the first tab.
+If there is no enough tab, Spark will return `NULL` value. These defaults can be overridden with `ROW FORMAT SERDE` or `ROW FORMAT DELIMITED`. 
+
+### Examples
+
+```sql
+CREATE TABLE person (zip_code INT, name STRING, age INT);
+INSERT INTO person VALUES
+    (94588, 'Zen Hui', 50),
+    (94588, 'Dan Li', 18),
+    (94588, 'Anil K', 27),
+    (94588, 'John V', NULL),
+    (94511, 'David K', 42),
+    (94511, 'Aryan B.', 18),
+    (94511, 'Lalit B.', NULL);
+
+-- With specified output without data type
+SELECT TRANSFORM(zip_code, name, age)
+   USING 'cat' AS (a, b, c)
+FROM person
+WHERE zip_code > 94511;
+-------+---------+-----+
+|    a  |        b|    c|
+-------+---------+-----+
+|  94588|   Anil K|   27|
+|  94588|   John V| NULL|
+|  94588|  Zen Hui|   50|
+|  94588|   Dan Li|   18|
+-------+---------+-----+
+
+-- With specified output with data type
+SELECT TRANSFORM(zip_code, name, age)
+   USING 'cat' AS (a STRING, b STRING, c STRING)
+FROM person
+WHERE zip_code > 94511;
+-------+---------+-----+
+|    a  |        b|    c|
+-------+---------+-----+
+|  94588|   Anil K|   27|
+|  94588|   John V| NULL|
+|  94588|  Zen Hui|   50|
+|  94588|   Dan Li|   18|
+-------+---------+-----+
+
+-- Using ROW FORMAT DELIMITED
+SELECT TRANSFORM(name, age)
+    ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY ','
+    LINES TERMINATED BY '\n'
+    NULL DEFINED AS 'NULL'
+    USING 'cat' AS (name_age string)
+    ROW FORMAT DELIMITED
+    FIELDS TERMINATED BY '@'
+    LINES TERMINATED BY '\n'
+    NULL DEFINED AS 'NULL'
+FROM person;
+---------------+
+|       name_age|
+---------------+
+|      Anil K,27|
+|    John V,null|
+|     ryan B.,18|
+|     David K,42|
+|     Zen Hui,50|
+|      Dan Li,18|
+|  Lalit B.,null|
+---------------+
+
+-- Using Hive Serde
+SELECT TRANSFORM(zip_code, name, age)
+    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
+    WITH SERDEPROPERTIES (
+      'field.delim' = '\t'
+    )
+    USING 'cat' AS (a STRING, b STRING, c STRING)
+    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
+    WITH SERDEPROPERTIES (
+      'field.delim' = '\t'
+    )
+FROM person
+WHERE zip_code > 94511;
+-------+---------+-----+
+|    a  |        b|    c|
+-------+---------+-----+
+|  94588|   Anil K|   27|
+|  94588|   John V| NULL|
+|  94588|  Zen Hui|   50|
+|  94588|   Dan Li|   18|
+-------+---------+-----+
+
+-- Schema-less mode
+SELECT TRANSFORM(zip_code, name, age)
+    USING 'cat'
+FROM person
+WHERE zip_code > 94500;
+-------+---------------------+
+|    key|                value|
+-------+---------------------+
+|  94588|	  Anil K    27|
+|  94588|	  John V    \N|
+|  94511|	Aryan B.    18|
+|  94511|	 David K    42|
+|  94588|	 Zen Hui    50|
+|  94588|	  Dan Li    18|
+|  94511|	Lalit B.    \N|
+-------+---------------------+
+```
+
+### Related Statements
+
+* [SELECT Main](sql-ref-syntax-qry-select.html)
+* [WHERE Clause](sql-ref-syntax-qry-select-where.html)
+* [GROUP BY Clause](sql-ref-syntax-qry-select-groupby.html)
+* [HAVING Clause](sql-ref-syntax-qry-select-having.html)
+* [ORDER BY Clause](sql-ref-syntax-qry-select-orderby.html)
+* [SORT BY Clause](sql-ref-syntax-qry-select-sortby.html)
+* [DISTRIBUTE BY Clause](sql-ref-syntax-qry-select-distribute-by.html)
+* [LIMIT Clause](sql-ref-syntax-qry-select-limit.html)
+* [CASE Clause](sql-ref-syntax-qry-select-case.html)
+* [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
+* [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
--- a/docs/sql-ref-syntax-qry-select.md
+++ b/docs/sql-ref-syntax-qry-select.md
@ -41,7 +41,7 @@ select_statement [ { UNION | INTERSECT | EXCEPT } [ ALL | DISTINCT ] select_stat

 While `select_statement` is defined as
 ```sql
-SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_names ] [ , ... ] }
+SELECT [ hints , ... ] [ ALL | DISTINCT ] { [[ named_expression | regex_column_names ] [ , ... ] | TRANSFORM (...)) ] }
    FROM { from_item [ , ... ] }
    [ PIVOT clause ]
    [ LATERAL VIEW clause ] [ ... ] 
@ -164,6 +164,10 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_na
     )
     ```

+* **TRANSFORM**
+
+     Specifies a hive-style transform query specification to transform the input by forking and running user-specified command or script.
+
 ### Related Statements

 * [WHERE Clause](sql-ref-syntax-qry-select-where.html)
@ -187,3 +191,4 @@ SELECT [ hints , ... ] [ ALL | DISTINCT ] { [ named_expression | regex_column_na
 * [CASE Clause](sql-ref-syntax-qry-select-case.html)
 * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
 * [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
+* [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
--- a/docs/sql-ref-syntax-qry.md
+++ b/docs/sql-ref-syntax-qry.md
@ -49,4 +49,5 @@ ability to generate logical and physical plan for a given query using
  * [CASE Clause](sql-ref-syntax-qry-select-case.html)
  * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
  * [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
+  * [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
 * [EXPLAIN Statement](sql-ref-syntax-qry-explain.html)
--- a/docs/sql-ref-syntax.md
+++ b/docs/sql-ref-syntax.md
@ -70,6 +70,7 @@ Spark SQL is Apache Spark's module for working with structured data. The SQL Syn
   * [CASE Clause](sql-ref-syntax-qry-select-case.html)
   * [PIVOT Clause](sql-ref-syntax-qry-select-pivot.html)
   * [LATERAL VIEW Clause](sql-ref-syntax-qry-select-lateral-view.html)
+   * [TRANSFORM Clause](sql-ref-syntax-qry-select-transform.html)
 * [EXPLAIN](sql-ref-syntax-qry-explain.html)

 ### Auxiliary Statements