[SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page

### What changes were proposed in this pull request?

This PR proposes to move the text data source options out of the Python, Scala and Java API docs and into a single page.

### Why are the changes needed?

So far, the documentation for text data source options has been spread across the separate API documents for each language. This makes the options inconvenient to maintain, so it is more efficient to document them all on a single page and link to that page from each language's API documentation.

### Does this PR introduce _any_ user-facing change?

Yes, the documentation will look as shown below after this change:

- "Text Files" page
<img width="823" alt="Screen Shot 2021-05-26 at 3 20 11 PM" src="https://user-images.githubusercontent.com/44108233/119611669-f5202200-be35-11eb-9307-45846949d300.png">

- Python
<img width="791" alt="Screen Shot 2021-05-25 at 5 04 26 PM" src="https://user-images.githubusercontent.com/44108233/119462469-b9c11d00-bd7b-11eb-8f19-2ba7b9ceb318.png">

- Scala
<img width="683" alt="Screen Shot 2021-05-25 at 5 05 10 PM" src="https://user-images.githubusercontent.com/44108233/119462483-bd54a400-bd7b-11eb-8177-74e4d7035e63.png">

- Java
<img width="665" alt="Screen Shot 2021-05-25 at 5 05 36 PM" src="https://user-images.githubusercontent.com/44108233/119462501-bfb6fe00-bd7b-11eb-8161-12c58fabe7e2.png">

### How was this patch tested?

Manually built the docs and confirmed the pages.

Closes #32660 from itholic/SPARK-35509.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
parent e3c6907c99
commit 79a6b0cc8a
6 changed files with 64 additions and 78 deletions


@@ -21,8 +21,6 @@ license: |
Spark SQL provides `spark.read().text("file_name")` to read a file or directory of text files into a Spark DataFrame, and `dataframe.write().text("path")` to write to a text file. When reading a text file, each line becomes a row with a string "value" column by default. The line separator can be changed as shown in the example below. The `option()` function can be used to customize the read or write behavior, such as the line separator, compression, and so on.
<!--TODO(SPARK-34491): add `option()` document reference-->
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -38,3 +36,36 @@ Spark SQL provides `spark.read().text("file_name")` to read a file or directory
</div>
</div>
## Data Source Option
Data source options of text can be set via:
* the `.option`/`.options` methods of
  * `DataFrameReader`
  * `DataFrameWriter`
  * `DataStreamReader`
  * `DataStreamWriter`
* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
<tr>
<td><code>wholetext</code></td>
<td>false</td>
<td>If true, read each file from input path(s) as a single row.</td>
<td>read</td>
</tr>
<tr>
<td><code>lineSep</code></td>
<td><code>\r</code>, <code>\r\n</code>, <code>\n</code> for reading, <code>\n</code> for writing</td>
<td>Defines the line separator that should be used for reading or writing.</td>
<td>read/write</td>
</tr>
<tr>
<td><code>compression</code></td>
<td>(none)</td>
<td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).</td>
<td>write</td>
</tr>
</table>
Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html"> Generic File Source Options</a>.
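For illustration, here is a minimal sketch (the table name and path are placeholders) of setting one of these options through the `OPTIONS` clause listed above; the same keys work with the `.option`/`.options` methods:

```scala
// Hypothetical table name and path, shown only to illustrate the OPTIONS clause;
// the option keys are the ones documented in the table above.
spark.sql(
  """CREATE TABLE readme
    |USING text
    |OPTIONS (path '/path/to/spark/README.md', wholetext 'true')
    |""".stripMargin)
```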


@@ -313,28 +313,13 @@ class DataFrameReader(OptionUtils):
----------
paths : str or list
string, or list of strings, for input path(s).
wholetext : str or bool, optional
if true, read each file from input path(s) as a single row.
lineSep : str, optional
defines the line separator that should be used for parsing. If None is
set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
pathGlobFilter : str or bool, optional
an optional glob pattern to only include files with paths matching
the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
It does not change the behavior of
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
recursiveFileLookup : str or bool, optional
recursively scan a directory for files. Using this option disables
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
modifiedBefore (batch only) : an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
modifiedAfter (batch only) : an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa
in the version you use.
Examples
--------
@@ -1038,13 +1023,13 @@ class DataFrameWriter(OptionUtils):
----------
path : str
the path in any Hadoop supported file system
compression : str, optional
compression codec to use when saving to file. This can be one of the
known case-insensitive shorten names (none, bzip2, gzip, lz4,
snappy and deflate).
lineSep : str, optional
defines the line separator that should be used for writing. If None is
set, it uses the default value, ``\\n``.
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa
in the version you use.
The DataFrame must have only one column that is of string type.
Each row becomes a new line in the output file.
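The parameters dropped from this docstring (`pathGlobFilter`, `recursiveFileLookup`, `modifiedBefore`/`modifiedAfter`) are generic file source options, now covered only by the linked pages. A minimal sketch of passing them, in Scala for brevity and with a placeholder path and timestamp; the same option names are accepted as keyword arguments in PySpark:

```scala
// Placeholder path and timestamp; the option names match the generic file source options.
val filtered = spark.read
  .option("pathGlobFilter", "*.txt")                // only include files matching the glob
  .option("recursiveFileLookup", "true")            // scan recursively; disables partition discovery
  .option("modifiedBefore", "2020-06-01T13:00:00")  // batch-only modification-time filter
  .text("/path/to/dir")
```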


@@ -593,19 +593,13 @@ class DataStreamReader(OptionUtils):
----------
paths : str or list
string, or list of strings, for input path(s).
wholetext : str or bool, optional
if true, read each file from input path(s) as a single row.
lineSep : str, optional
defines the line separator that should be used for parsing. If None is
set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
pathGlobFilter : str or bool, optional
an optional glob pattern to only include files with paths matching
the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
It does not change the behavior of `partition discovery`_.
recursiveFileLookup : str or bool, optional
recursively scan a directory for files. Using this option
disables
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_ # noqa
in the version you use.
Notes
-----
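The same consolidation applies to the streaming reader; as a reminder, the options can also be supplied in bulk via the `.options` method mentioned in the markdown page above. A minimal sketch with a placeholder directory:

```scala
// Placeholder directory; a Map can supply several text options at once.
val streamed = spark.readStream
  .options(Map("wholetext" -> "true", "lineSep" -> "\n"))
  .text("/path/to/directory/")
```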


@@ -773,24 +773,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* spark.read().text("/path/to/spark/README.md")
* }}}
*
* You can set the following text-specific option(s) for reading text files:
* <ul>
* <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
* </li>
* <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
* that should be used for parsing.</li>
* <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
* the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
* It does not change the behavior of partition discovery.</li>
* <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
* modification times occurring before the specified Time. The provided timestamp
* must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
* <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
* modification times occurring after the specified Time. The provided timestamp
* must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
* <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
* disables partition discovery</li>
* </ul>
* You can find the text-specific options for reading text files in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @param paths input paths
* @since 1.6.0
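As a quick reminder of the reading options that the removed list enumerated, a minimal sketch with a placeholder path:

```scala
// Placeholder path; the option names are the ones documented on the linked page.
val whole = spark.read.option("wholetext", true).text("/path/to/spark/README.md")  // one row per file
val lines = spark.read.option("lineSep", ",").text("/path/to/spark/README.md")     // custom separator
```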


@@ -833,13 +833,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* }}}
* The text files will be encoded as UTF-8.
*
* You can set the following option(s) for writing text files:
* <ul>
* <li>`compression` (default `null`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
* `snappy` and `deflate`). </li>
* <li>`lineSep` (default `\n`): defines the line separator that should be used for writing.</li>
* </ul>
* You can find the text-specific options for writing text files in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 1.6.0
*/
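And the corresponding write-side options, sketched with placeholder paths; note that the written DataFrame must have a single string column:

```scala
// Placeholder paths; compression accepts the case-insensitive codec short names.
val df = spark.read.text("/path/to/input")
df.write
  .option("compression", "gzip")
  .option("lineSep", "\n")
  .text("/path/to/output")
```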


@@ -413,21 +413,16 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Logging {
* spark.readStream().text("/path/to/directory/")
* }}}
*
* You can set the following text-specific options to deal with text files:
* You can set the following option(s):
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
* </li>
* <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
* that should be used for parsing.</li>
* <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
* the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
* It does not change the behavior of partition discovery.</li>
* <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
* disables partition discovery</li>
* </ul>
*
* You can find the text-specific options for reading text files in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 2.0.0
*/
def text(path: String): DataFrame = format("text").load(path)
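For completeness, a small streaming sketch (the directory path is a placeholder) combining the stream-specific `maxFilesPerTrigger` with the shared text options:

```scala
// maxFilesPerTrigger stays stream-specific; wholetext/lineSep are the shared text options.
val stream = spark.readStream
  .option("maxFilesPerTrigger", "1")
  .option("wholetext", "true")
  .text("/path/to/directory/")
```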