From 79a6b0cc8a75875329849650227120aa1324403b Mon Sep 17 00:00:00 2001
From: itholic
Date: Wed, 26 May 2021 17:12:49 +0900
Subject: [PATCH] [SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page

### What changes were proposed in this pull request?

This PR proposes to move the text data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for text data source options has been separated into a different page for each language API. This makes the many options inconvenient to manage, so it is more efficient to document all options on a single page and provide a link to that page from the API documentation of each language.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will be shown as below after this change:

- "Text Files" page: Screen Shot 2021-05-26 at 3 20 11 PM
- Python: Screen Shot 2021-05-25 at 5 04 26 PM
- Scala: Screen Shot 2021-05-25 at 5 05 10 PM
- Java: Screen Shot 2021-05-25 at 5 05 36 PM

### How was this patch tested?

Manually built the docs and confirmed the pages.

Closes #32660 from itholic/SPARK-35509.

Authored-by: itholic
Signed-off-by: Hyukjin Kwon
---
 docs/sql-data-sources-text.md                    | 35 +++++++++++++++-
 python/pyspark/sql/readwriter.py                 | 41 ++++++-------------
 python/pyspark/sql/streaming.py                  | 20 ++++-----
 .../apache/spark/sql/DataFrameReader.scala       | 21 ++--------
 .../apache/spark/sql/DataFrameWriter.scala       | 10 ++---
 .../sql/streaming/DataStreamReader.scala         | 15 +++----
 6 files changed, 64 insertions(+), 78 deletions(-)

diff --git a/docs/sql-data-sources-text.md b/docs/sql-data-sources-text.md
index c32395f8eb..d72b543f54 100644
--- a/docs/sql-data-sources-text.md
+++ b/docs/sql-data-sources-text.md
@@ -21,8 +21,6 @@ license: |
 
 Spark SQL provides `spark.read().text("file_name")` to read a file or directory of text files into a Spark DataFrame, and `dataframe.write().text("path")` to write to a text file. When reading a text file, each line becomes each row that has string "value" column by default. The line separator can be changed as shown in the example below. The `option()` function can be used to customize the behavior of reading or writing, such as controlling behavior of the line separator, compression, and so on.
-
-
@@ -38,3 +36,36 @@ Spark SQL provides `spark.read().text("file_name")` to read a file or directory
+
+## Data Source Option
+
+Data source options of text can be set via:
+* the `.option`/`.options` methods of
+  * `DataFrameReader`
+  * `DataFrameWriter`
+  * `DataStreamReader`
+  * `DataStreamWriter`
+* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
+
+<table class="table">
+  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
+  <tr>
+    <td><code>wholetext</code></td>
+    <td><code>false</code></td>
+    <td>If true, read each file from input path(s) as a single row.</td>
+    <td>read</td>
+  </tr>
+  <tr>
+    <td><code>lineSep</code></td>
+    <td><code>\r</code>, <code>\r\n</code>, <code>\n</code> for reading, <code>\n</code> for writing</td>
+    <td>Defines the line separator that should be used for reading or writing.</td>
+    <td>read/write</td>
+  </tr>
+  <tr>
+    <td><code>compression</code></td>
+    <td>(none)</td>
+    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).</td>
+    <td>write</td>
+  </tr>
+</table>
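As a rough illustration of the options in the table above (not something this patch adds; the `/tmp` paths, the `||` separator and the `logs` table name are made up), setting them through the Scala API and the `OPTIONS` clause might look like this:

```scala
import org.apache.spark.sql.SparkSession

object TextOptionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TextOptionsSketch").getOrCreate()

    // Default read: one row per line, in a single string column named "value".
    val lines = spark.read.text("/tmp/text-input")

    // lineSep (read): parse records separated by "||" instead of \r, \r\n or \n.
    val records = spark.read.option("lineSep", "||").text("/tmp/text-input")

    // wholetext: each input file becomes a single row.
    val files = spark.read.option("wholetext", "true").text("/tmp/text-input")

    // compression and lineSep (write): the DataFrame must have exactly one string column.
    lines.write.option("compression", "gzip").option("lineSep", "\n").text("/tmp/text-output")

    // The same options can be passed in SQL through the OPTIONS clause of CREATE TABLE ... USING.
    spark.sql("CREATE TABLE logs USING TEXT OPTIONS (path '/tmp/text-input', lineSep '||')")

    spark.stop()
  }
}
```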
+Other generic options can be found in Generic File Source Options.
diff --git a/python/pyspark/sql/readwriter.py b/python/pyspark/sql/readwriter.py
index b9a975ffdc..7719d48f6e 100644
--- a/python/pyspark/sql/readwriter.py
+++ b/python/pyspark/sql/readwriter.py
@@ -313,28 +313,13 @@ class DataFrameReader(OptionUtils):
         ----------
         paths : str or list
             string, or list of strings, for input path(s).
-        wholetext : str or bool, optional
-            if true, read each file from input path(s) as a single row.
-        lineSep : str, optional
-            defines the line separator that should be used for parsing. If None is
-            set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
-        pathGlobFilter : str or bool, optional
-            an optional glob pattern to only include files with paths matching
-            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-            It does not change the behavior of
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
-        recursiveFileLookup : str or bool, optional
-            recursively scan a directory for files. Using this option disables
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
-            modification times occurring before the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        modifiedBefore (batch only) : an optional timestamp to only include files with
-            modification times occurring before the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
-        modifiedAfter (batch only) : an optional timestamp to only include files with
-            modification times occurring after the specified time. The provided timestamp
-            must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
+
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_  # noqa
+            in the version you use.
 
         Examples
         --------
@@ -1038,13 +1023,13 @@ class DataFrameWriter(OptionUtils):
         ----------
         path : str
             the path in any Hadoop supported file system
-        compression : str, optional
-            compression codec to use when saving to file. This can be one of the
-            known case-insensitive shorten names (none, bzip2, gzip, lz4,
-            snappy and deflate).
-        lineSep : str, optional
-            defines the line separator that should be used for writing. If None is
-            set, it uses the default value, ``\\n``.
+
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_  # noqa
+            in the version you use.
 
         The DataFrame must have only one column that is of string type.
         Each row becomes a new line in the output file.
diff --git a/python/pyspark/sql/streaming.py b/python/pyspark/sql/streaming.py
index 63f5e245cd..f7ec69a414 100644
--- a/python/pyspark/sql/streaming.py
+++ b/python/pyspark/sql/streaming.py
@@ -593,19 +593,13 @@ class DataStreamReader(OptionUtils):
         ----------
         paths : str or list
             string, or list of strings, for input path(s).
-        wholetext : str or bool, optional
-            if true, read each file from input path(s) as a single row.
-        lineSep : str, optional
-            defines the line separator that should be used for parsing. If None is
-            set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
-        pathGlobFilter : str or bool, optional
-            an optional glob pattern to only include files with paths matching
-            the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
-            It does not change the behavior of `partition discovery`_.
-        recursiveFileLookup : str or bool, optional
-            recursively scan a directory for files. Using this option
-            disables
-            `partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_.  # noqa
+
+        Other Parameters
+        ----------------
+        Extra options
+            For the extra options, refer to
+            `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option>`_  # noqa
+            in the version you use.
 
         Notes
         -----
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
index e2c9e3126c..ea84785f27 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala
@@ -773,24 +773,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
    * spark.read().text("/path/to/spark/README.md")
    * }}}
    *
-   * You can set the following text-specific option(s) for reading text files:
-   * <ul>
-   * <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
-   * </li>
-   * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
-   * that should be used for parsing.</li>
-   * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
-   * the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
-   * It does not change the behavior of partition discovery.</li>
-   * <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
-   * modification times occurring before the specified Time. The provided timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
-   * <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
-   * modification times occurring after the specified Time. The provided timestamp
-   * must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
-   * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
-   * disables partition discovery</li>
-   * </ul>
+   * You can find the text-specific options for reading text files in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
    *
    * @param paths input paths
    * @since 1.6.0
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
index 8c8def396a..cb1029579a 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala
@@ -833,13 +833,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
    * }}}
    * The text files will be encoded as UTF-8.
    *
-   * You can set the following option(s) for writing text files:
-   * <ul>
-   * <li>`compression` (default `null`): compression codec to use when saving to file. This can be
-   * one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
-   * `snappy` and `deflate`).</li>
-   * <li>`lineSep` (default `\n`): defines the line separator that should be used for writing.</li>
-   * </ul>
+   * You can find the text-specific options for writing text files in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
    *
    * @since 1.6.0
    */
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
index b369a0a59a..6c3fbaf00e 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/streaming/DataStreamReader.scala
@@ -413,21 +413,16 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
    * spark.readStream().text("/path/to/directory/")
    * }}}
    *
-   * You can set the following text-specific options to deal with text files:
+   * You can set the following option(s):
    * <ul>
    * <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
    * considered in every trigger.</li>
-   * <li>`wholetext` (default `false`): If true, read a file as a single row and not split by "\n".
-   * </li>
-   * <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
-   * that should be used for parsing.</li>
-   * <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
-   * the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
-   * It does not change the behavior of partition discovery.</li>
-   * <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
-   * disables partition discovery</li>
    * </ul>
    *
+   * You can find the text-specific options for reading text files in
+   * <a href="https://spark.apache.org/docs/latest/sql-data-sources-text.html#data-source-option">
+   *   Data Source Option</a> in the version you use.
+   *
    * @since 2.0.0
    */
   def text(path: String): DataFrame = format("text").load(path)
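Not part of the patch, but as a minimal sketch of the streaming side that the `DataStreamReader` scaladoc above points at (the input directory and checkpoint location below are made up), `maxFilesPerTrigger` could be combined with the text options like this:

```scala
import org.apache.spark.sql.SparkSession

object StreamingTextSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("StreamingTextSketch").getOrCreate()

    // Stream a directory of text files, one row per line in the "value" column.
    // maxFilesPerTrigger caps how many new files are picked up per micro-batch.
    val lines = spark.readStream
      .option("maxFilesPerTrigger", "1")
      .option("lineSep", "\n")
      .text("/tmp/streaming-text-input")   // hypothetical input directory

    val query = lines.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/streaming-text-checkpoint")  // hypothetical
      .start()

    query.awaitTermination()
  }
}
```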