[SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page

### What changes were proposed in this pull request?

This PR proposes to move the JSON data source options from Python, Scala and Java into a single page.

### Why are the changes needed?

So far, the documentation for the JSON data source options has been duplicated across the API documentation of each language. This makes the many options inconvenient to maintain, so it is more efficient to manage all of the options on a single page and link to that page from the API of each language.

### Does this PR introduce _any_ user-facing change?

Yes, after this change the documentation is rendered as shown below:

- "JSON Files" page
<img width="876" alt="Screen Shot 2021-05-20 at 8 48 27 PM" src="https://user-images.githubusercontent.com/44108233/118973662-ddb3e580-b9ac-11eb-987c-8139aa9c3fe2.png">

- Python
<img width="714" alt="Screen Shot 2021-04-16 at 5 04 11 PM" src="https://user-images.githubusercontent.com/44108233/114992491-ca0cef00-9ed5-11eb-9d0f-4de60d8b2516.png">

- Scala
<img width="726" alt="Screen Shot 2021-04-16 at 5 04 54 PM" src="https://user-images.githubusercontent.com/44108233/114992594-e315a000-9ed5-11eb-8bd3-af7e568fcfe1.png">

- Java
<img width="911" alt="Screen Shot 2021-04-16 at 5 06 11 PM" src="https://user-images.githubusercontent.com/44108233/114992751-10624e00-9ed6-11eb-888c-8668d3c74289.png">

### How was this patch tested?

Manually built the docs and confirmed the page.

Closes #32204 from itholic/SPARK-35081.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

View file

@ -94,3 +94,168 @@ SELECT * FROM jsonTable
</div>
</div>
## Data Source Option
Data source options of JSON can be set via:
* the `.option`/`.options` methods of
* `DataFrameReader`
* `DataFrameWriter`
* `DataStreamReader`
* `DataStreamWriter`
<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
<tr>
<!-- TODO(SPARK-35433): Add timeZone to Data Source Option for CSV, too. -->
<td><code>timeZone</code></td>
<td>None</td>
<td>Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of <code>timeZone</code> are supported:<br>
<ul>
<li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li>
<li>Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
</ul>
Other short names like 'CST' are not recommended to use because they can be ambiguous. If it isn't set, the current value of the SQL config <code>spark.sql.session.timeZone</code> is used by default.
</td>
<td>read/write</td>
</tr>
<tr>
<td><code>primitivesAsString</code></td>
<td>None</td>
<td>Infers all primitive values as a string type. If None is set, it uses the default value, <code>false</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>prefersDecimal</code></td>
<td>None</td>
<td>Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. If None is set, it uses the default value, <code>false</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>allowComments</code></td>
<td>None</td>
<td>Ignores Java/C++ style comments in JSON records. If None is set, it uses the default value, <code>false</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>allowUnquotedFieldNames</code></td>
<td>None</td>
<td>Allows unquoted JSON field names. If None is set, it uses the default value, <code>false</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>allowSingleQuotes</code></td>
<td>None</td>
<td>Allows single quotes in addition to double quotes. If None is set, it uses the default value, <code>true</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>allowNumericLeadingZero</code></td>
<td>None</td>
<td>Allows leading zeros in numbers (e.g. 00012). If None is set, it uses the default value, <code>false</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>allowBackslashEscapingAnyCharacter</code></td>
<td>None</td>
<td>Allows accepting quoting of all characters using the backslash quoting mechanism. If None is set, it uses the default value, <code>false</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>mode</code></td>
<td>None</td>
<td>Allows a mode for dealing with corrupt records during parsing. If None is set, it uses the default value, <code>PERMISSIVE</code>.<br>
<ul>
<li><code>PERMISSIVE</code>: when it meets a corrupted record, puts the malformed string into a field configured by <code>columnNameOfCorruptRecord</code>, and sets malformed fields to <code>null</code>. To keep corrupt records, a user can set a string type field named <code>columnNameOfCorruptRecord</code> in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a <code>columnNameOfCorruptRecord</code> field in an output schema.</li>
<li><code>DROPMALFORMED</code>: ignores the whole corrupted records.</li>
<li><code>FAILFAST</code>: throws an exception when it meets corrupted records.</li>
</ul>
</td>
<td>read</td>
</tr>
<tr>
<td><code>columnNameOfCorruptRecord</code></td>
<td>None</td>
<td>Allows renaming the new field having malformed string created by <code>PERMISSIVE</code> mode. This overrides <code>spark.sql.columnNameOfCorruptRecord</code>. If None is set, it uses the value specified in <code>spark.sql.columnNameOfCorruptRecord</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>dateFormat</code></td>
<td>None</td>
<td>Sets the string that indicates a date format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to date type. If None is set, it uses the default value, <code>yyyy-MM-dd</code>.</td>
<td>read/write</td>
</tr>
<tr>
<td><code>timestampFormat</code></td>
<td>None</td>
<td>Sets the string that indicates a timestamp format. Custom date formats follow the formats at <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html"> datetime pattern</a>. This applies to timestamp type. If None is set, it uses the default value, <code>yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]</code>.</td>
<td>read/write</td>
</tr>
<tr>
<td><code>multiLine</code></td>
<td>None</td>
<td>Parse one record, which may span multiple lines, per file. If None is set, it uses the default value, <code>false</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>allowUnquotedControlChars</code></td>
<td>None</td>
<td>Allows JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not.</td>
<td>read</td>
</tr>
<tr>
<td><code>encoding</code></td>
<td>None</td>
<td>For reading, allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE, UTF-32LE. If None is set, the encoding of the input JSON will be detected automatically when the multiLine option is set to <code>true</code>. For writing, specifies the encoding (charset) of the saved JSON files. If None is set, the default UTF-8 charset will be used.</td>
<td>read/write</td>
</tr>
<tr>
<td><code>lineSep</code></td>
<td>None</td>
<td>Defines the line separator that should be used for parsing. If None is set, it covers all <code>\r</code>, <code>\r\n</code> and <code>\n</code>.</td>
<td>read/write</td>
</tr>
<tr>
<td><code>samplingRatio</code></td>
<td>None</td>
<td>Defines fraction of input JSON objects used for schema inferring. If None is set, it uses the default value, <code>1.0</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>dropFieldIfAllNull</code></td>
<td>None</td>
<td>Whether to ignore column of all null values or empty array/struct during schema inference. If None is set, it uses the default value, <code>false</code>.</td>
<td>read</td>
</tr>
<tr>
<td><code>locale</code></td>
<td>None</td>
<td>Sets a locale as language tag in IETF BCP 47 format. If None is set, it uses the default value, <code>en-US</code>. For instance, <code>locale</code> is used while parsing dates and timestamps.</td>
<td>read</td>
</tr>
<tr>
<td><code>allowNonNumericNumbers</code></td>
<td>None</td>
<td>Allows the JSON parser to recognize a set of “Not-a-Number” (NaN) tokens as legal floating number values. If None is set, it uses the default value, <code>true</code>.<br>
<ul>
<li><code>+INF</code>: for positive infinity, as well as alias of <code>+Infinity</code> and <code>Infinity</code>.</li>
<li><code>-INF</code>: for negative infinity, alias <code>-Infinity</code>.</li>
<li><code>NaN</code>: for other not-a-numbers, like result of division by zero.</li>
</ul>
</td>
<td>read</td>
</tr>
<tr>
<td><code>compression</code></td>
<td>None</td>
<td>Compression codec to use when saving to file. This can be one of the known case-insensitive shorten names (none, bzip2, gzip, lz4, snappy and deflate).</td>
<td>write</td>
</tr>
<tr>
<td><code>ignoreNullFields</code></td>
<td>None</td>
<td>Whether to ignore null fields when generating JSON objects. If None is set, it uses the default value, <code>true</code>.</td>
<td>write</td>
</tr>
</table>
Other generic options can be found in <a href="https://spark.apache.org/docs/latest/sql-data-sources-generic-options.html"> Generic File Source Options</a>.
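For example, the options above can be set through the `.option`/`.options` methods before calling the JSON method. A minimal PySpark sketch, assuming a hypothetical input path and a made-up corrupt-record column name:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read multi-line JSON, keeping malformed records in a dedicated column
# instead of dropping them (PERMISSIVE is the default mode).
df = (
    spark.read
    .option("multiLine", "true")
    .option("mode", "PERMISSIVE")
    .option("columnNameOfCorruptRecord", "_corrupt_record")
    .json("/tmp/people.json")  # hypothetical path
)

# .options(**kwargs) is equivalent to chaining .option calls:
df2 = spark.read.options(multiLine="true", prefersDecimal="true").json("/tmp/people.json")
```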

View file

@ -3711,7 +3711,9 @@ def schema_of_json(json, options=None):
json : :class:`~pyspark.sql.Column` or str
a JSON string or a foldable string column containing a JSON string.
options : dict, optional
options to control parsing. accepts the same options as the JSON datasource
options to control parsing. accepts the same options as the JSON datasource.
See `Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ # noqa
in the version you use.
.. versionchanged:: 3.0
It accepts `options` parameter to control schema inferring.
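As a hedged sketch of what passing `options` to `schema_of_json` looks like in PySpark (the JSON literal below is made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import schema_of_json

spark = SparkSession.builder.getOrCreate()

# Infer the schema of a JSON literal; the field name is unquoted, so the
# corresponding JSON-source option has to be enabled.
schema = schema_of_json("{a: 1}", {"allowUnquotedFieldNames": "true"})
spark.range(1).select(schema.alias("schema")).show(truncate=False)
```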

View file

@ -108,29 +108,6 @@ class DataFrameReader(OptionUtils):
@since(1.5)
def option(self, key, value):
"""Adds an input option for the underlying data source.
You can set the following option(s) for reading files:
* ``timeZone``: sets the string that indicates a time zone ID to be used to parse
timestamps in the JSON/CSV datasources or partition values. The following
formats of `timeZone` are supported:
* Region-based zone ID: It should have the form 'area/city', such as \
'America/Los_Angeles'.
* Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or \
'+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
Other short names like 'CST' are not recommended to use because they can be
ambiguous. If it isn't set, the current value of the SQL config
``spark.sql.session.timeZone`` is used by default.
* ``pathGlobFilter``: an optional glob pattern to only include files with paths matching
the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
It does not change the behavior of partition discovery.
* ``modifiedBefore``: an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
* ``modifiedAfter``: an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
"""
self._jreader = self._jreader.option(key, to_str(value))
return self
@ -138,29 +115,6 @@ class DataFrameReader(OptionUtils):
@since(1.4)
def options(self, **options):
"""Adds input options for the underlying data source.
You can set the following option(s) for reading files:
* ``timeZone``: sets the string that indicates a time zone ID to be used to parse
timestamps in the JSON/CSV datasources or partition values. The following
formats of `timeZone` are supported:
* Region-based zone ID: It should have the form 'area/city', such as \
'America/Los_Angeles'.
* Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or \
'+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
Other short names like 'CST' are not recommended to use because they can be
ambiguous. If it isn't set, the current value of the SQL config
``spark.sql.session.timeZone`` is used by default.
* ``pathGlobFilter``: an optional glob pattern to only include files with paths matching
the pattern. The syntax follows org.apache.hadoop.fs.GlobFilter.
It does not change the behavior of partition discovery.
* ``modifiedBefore``: an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
* ``modifiedAfter``: an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
"""
for k in options:
self._jreader = self._jreader.option(k, to_str(options[k]))
@ -236,112 +190,13 @@ class DataFrameReader(OptionUtils):
schema : :class:`pyspark.sql.types.StructType` or str, optional
an optional :class:`pyspark.sql.types.StructType` for the input schema or
a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
primitivesAsString : str or bool, optional
infers all primitive values as a string type. If None is set,
it uses the default value, ``false``.
prefersDecimal : str or bool, optional
infers all floating-point values as a decimal type. If the values
do not fit in decimal, then it infers them as doubles. If None is
set, it uses the default value, ``false``.
allowComments : str or bool, optional
ignores Java/C++ style comment in JSON records. If None is set,
it uses the default value, ``false``.
allowUnquotedFieldNames : str or bool, optional
allows unquoted JSON field names. If None is set,
it uses the default value, ``false``.
allowSingleQuotes : str or bool, optional
allows single quotes in addition to double quotes. If None is
set, it uses the default value, ``true``.
allowNumericLeadingZero : str or bool, optional
allows leading zeros in numbers (e.g. 00012). If None is
set, it uses the default value, ``false``.
allowBackslashEscapingAnyCharacter : str or bool, optional
allows accepting quoting of all character
using backslash quoting mechanism. If None is
set, it uses the default value, ``false``.
mode : str, optional
allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, ``PERMISSIVE``.
* ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \
into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \
fields to ``null``. To keep corrupt records, an user can set a string type \
field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
schema does not have the field, it drops corrupt records during parsing. \
When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
field in an output schema.
* ``DROPMALFORMED``: ignores the whole corrupted records.
* ``FAILFAST``: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord: str, optional
allows renaming the new field having malformed string
created by ``PERMISSIVE`` mode. This overrides
``spark.sql.columnNameOfCorruptRecord``. If None is set,
it uses the value specified in
``spark.sql.columnNameOfCorruptRecord``.
dateFormat : str, optional
sets the string that indicates a date format. Custom date formats
follow the formats at
`datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa
This applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
timestampFormat : str, optional
sets the string that indicates a timestamp format.
Custom date formats follow the formats at
`datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``.
multiLine : str or bool, optional
parse one record, which may span multiple lines, per file. If None is
set, it uses the default value, ``false``.
allowUnquotedControlChars : str or bool, optional
allows JSON Strings to contain unquoted control
characters (ASCII characters with value less than 32,
including tab and line feed characters) or not.
encoding : str or bool, optional
allows to forcibly set one of standard basic or extended encoding for
the JSON files. For example UTF-16BE, UTF-32LE. If None is set,
the encoding of input JSON will be detected automatically
when the multiLine option is set to ``true``.
lineSep : str, optional
defines the line separator that should be used for parsing. If None is
set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
samplingRatio : str or float, optional
defines fraction of input JSON objects used for schema inferring.
If None is set, it uses the default value, ``1.0``.
dropFieldIfAllNull : str or bool, optional
whether to ignore column of all null values or empty
array/struct during schema inference. If None is set, it
uses the default value, ``false``.
locale : str, optional
sets a locale as language tag in IETF BCP 47 format. If None is set,
it uses the default value, ``en-US``. For instance, ``locale`` is used while
parsing dates and timestamps.
pathGlobFilter : str or bool, optional
an optional glob pattern to only include files with paths matching
the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
It does not change the behavior of
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
recursiveFileLookup : str or bool, optional
recursively scan a directory for files. Using this option
disables
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
allowNonNumericNumbers : str or bool
allows JSON parser to recognize set of "Not-a-Number" (NaN)
tokens as legal floating number values. If None is set,
it uses the default value, ``true``.
* ``+INF``: for positive infinity, as well as alias of
``+Infinity`` and ``Infinity``.
* ``-INF``: for negative infinity, alias ``-Infinity``.
* ``NaN``: for other not-a-numbers, like result of division by zero.
modifiedBefore : an optional timestamp to only include files with
modification times occurring before the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
modifiedAfter : an optional timestamp to only include files with
modification times occurring after the specified time. The provided timestamp
must be in the following format: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ # noqa
in the version you use.
Examples
--------
@ -942,20 +797,6 @@ class DataFrameWriter(OptionUtils):
@since(1.5)
def option(self, key, value):
"""Adds an output option for the underlying data source.
You can set the following option(s) for writing files:
* ``timeZone``: sets the string that indicates a time zone ID to be used to format
timestamps in the JSON/CSV datasources or partition values. The following
formats of `timeZone` are supported:
* Region-based zone ID: It should have the form 'area/city', such as \
'America/Los_Angeles'.
* Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or \
'+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
Other short names like 'CST' are not recommended to use because they can be
ambiguous. If it isn't set, the current value of the SQL config
``spark.sql.session.timeZone`` is used by default.
"""
self._jwrite = self._jwrite.option(key, to_str(value))
return self
@ -963,20 +804,6 @@ class DataFrameWriter(OptionUtils):
@since(1.4)
def options(self, **options):
"""Adds output options for the underlying data source.
You can set the following option(s) for writing files:
* ``timeZone``: sets the string that indicates a time zone ID to be used to format
timestamps in the JSON/CSV datasources or partition values. The following
formats of `timeZone` are supported:
* Region-based zone ID: It should have the form 'area/city', such as \
'America/Los_Angeles'.
* Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or \
'+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
Other short names like 'CST' are not recommended to use because they can be
ambiguous. If it isn't set, the current value of the SQL config
``spark.sql.session.timeZone`` is used by default.
"""
for k in options:
self._jwrite = self._jwrite.option(k, to_str(options[k]))
@ -1189,31 +1016,13 @@ class DataFrameWriter(OptionUtils):
* ``ignore``: Silently ignore this operation if data already exists.
* ``error`` or ``errorifexists`` (default case): Throw an exception if data already \
exists.
compression : str, optional
compression codec to use when saving to file. This can be one of the
known case-insensitive shorten names (none, bzip2, gzip, lz4,
snappy and deflate).
dateFormat : str, optional
sets the string that indicates a date format. Custom date formats
follow the formats at
`datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa
This applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
timestampFormat : str, optional
sets the string that indicates a timestamp format.
Custom date formats follow the formats at
`datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``.
encoding : str, optional
specifies encoding (charset) of saved json files. If None is set,
the default UTF-8 charset will be used.
lineSep : str, optional
defines the line separator that should be used for writing. If None is
set, it uses the default value, ``\\n``.
ignoreNullFields : str or bool, optional
Whether to ignore null fields when generating JSON objects.
If None is set, it uses the default value, ``true``.
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ # noqa
in the version you use.
Examples
--------

View file

@ -378,20 +378,6 @@ class DataStreamReader(OptionUtils):
def option(self, key, value):
"""Adds an input option for the underlying data source.
You can set the following option(s) for reading files:
* ``timeZone``: sets the string that indicates a time zone ID to be used to parse
timestamps in the JSON/CSV datasources or partition values. The following
formats of `timeZone` are supported:
* Region-based zone ID: It should have the form 'area/city', such as \
'America/Los_Angeles'.
* Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or \
'+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
Other short names like 'CST' are not recommended to use because they can be
ambiguous. If it isn't set, the current value of the SQL config
``spark.sql.session.timeZone`` is used by default.
.. versionadded:: 2.0.0
Notes
@ -408,20 +394,6 @@ class DataStreamReader(OptionUtils):
def options(self, **options):
"""Adds input options for the underlying data source.
You can set the following option(s) for reading files:
* ``timeZone``: sets the string that indicates a time zone ID to be used to parse
timestamps in the JSON/CSV datasources or partition values. The following
formats of `timeZone` are supported:
* Region-based zone ID: It should have the form 'area/city', such as \
'America/Los_Angeles'.
* Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or \
'+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
Other short names like 'CST' are not recommended to use because they can be
ambiguous. If it isn't set, the current value of the SQL config
``spark.sql.session.timeZone`` is used by default.
.. versionadded:: 2.0.0
Notes
@ -507,102 +479,13 @@ class DataStreamReader(OptionUtils):
schema : :class:`pyspark.sql.types.StructType` or str, optional
an optional :class:`pyspark.sql.types.StructType` for the input schema
or a DDL-formatted string (For example ``col0 INT, col1 DOUBLE``).
primitivesAsString : str or bool, optional
infers all primitive values as a string type. If None is set,
it uses the default value, ``false``.
prefersDecimal : str or bool, optional
infers all floating-point values as a decimal type. If the values
do not fit in decimal, then it infers them as doubles. If None is
set, it uses the default value, ``false``.
allowComments : str or bool, optional
ignores Java/C++ style comment in JSON records. If None is set,
it uses the default value, ``false``.
allowUnquotedFieldNames : str or bool, optional
allows unquoted JSON field names. If None is set,
it uses the default value, ``false``.
allowSingleQuotes : str or bool, optional
allows single quotes in addition to double quotes. If None is
set, it uses the default value, ``true``.
allowNumericLeadingZero : str or bool, optional
allows leading zeros in numbers (e.g. 00012). If None is
set, it uses the default value, ``false``.
allowBackslashEscapingAnyCharacter : str or bool, optional
allows accepting quoting of all character
using backslash quoting mechanism. If None is
set, it uses the default value, ``false``.
mode : str, optional
allows a mode for dealing with corrupt records during parsing. If None is
set, it uses the default value, ``PERMISSIVE``.
* ``PERMISSIVE``: when it meets a corrupted record, puts the malformed string \
into a field configured by ``columnNameOfCorruptRecord``, and sets malformed \
fields to ``null``. To keep corrupt records, an user can set a string type \
field named ``columnNameOfCorruptRecord`` in an user-defined schema. If a \
schema does not have the field, it drops corrupt records during parsing. \
When inferring a schema, it implicitly adds a ``columnNameOfCorruptRecord`` \
field in an output schema.
* ``DROPMALFORMED``: ignores the whole corrupted records.
* ``FAILFAST``: throws an exception when it meets corrupted records.
columnNameOfCorruptRecord : str, optional
allows renaming the new field having malformed string
created by ``PERMISSIVE`` mode. This overrides
``spark.sql.columnNameOfCorruptRecord``. If None is set,
it uses the value specified in
``spark.sql.columnNameOfCorruptRecord``.
dateFormat : str, optional
sets the string that indicates a date format. Custom date formats
follow the formats at
`datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa
This applies to date type. If None is set, it uses the
default value, ``yyyy-MM-dd``.
timestampFormat : str, optional
sets the string that indicates a timestamp format.
Custom date formats follow the formats at
`datetime pattern <https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html>`_. # noqa
This applies to timestamp type. If None is set, it uses the
default value, ``yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]``.
multiLine : str or bool, optional
parse one record, which may span multiple lines, per file. If None is
set, it uses the default value, ``false``.
allowUnquotedControlChars : str or bool, optional
allows JSON Strings to contain unquoted control
characters (ASCII characters with value less than 32,
including tab and line feed characters) or not.
lineSep : str, optional
defines the line separator that should be used for parsing. If None is
set, it covers all ``\\r``, ``\\r\\n`` and ``\\n``.
locale : str, optional
sets a locale as language tag in IETF BCP 47 format. If None is set,
it uses the default value, ``en-US``. For instance, ``locale`` is used while
parsing dates and timestamps.
dropFieldIfAllNull : str or bool, optional
whether to ignore column of all null values or empty
array/struct during schema inference. If None is set, it
uses the default value, ``false``.
encoding : str or bool, optional
allows to forcibly set one of standard basic or extended encoding for
the JSON files. For example UTF-16BE, UTF-32LE. If None is set,
the encoding of input JSON will be detected automatically
when the multiLine option is set to ``true``.
pathGlobFilter : str or bool, optional
an optional glob pattern to only include files with paths matching
the pattern. The syntax follows `org.apache.hadoop.fs.GlobFilter`.
It does not change the behavior of
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
recursiveFileLookup : str or bool, optional
recursively scan a directory for files. Using this option
disables
`partition discovery <https://spark.apache.org/docs/latest/sql-data-sources-parquet.html#partition-discovery>`_. # noqa
allowNonNumericNumbers : str or bool, optional
allows JSON parser to recognize set of "Not-a-Number" (NaN)
tokens as legal floating number values. If None is set,
it uses the default value, ``true``.
* ``+INF``: for positive infinity, as well as alias of
``+Infinity`` and ``Infinity``.
* ``-INF``: for negative infinity, alias ``-Infinity``.
* ``NaN``: for other not-a-numbers, like result of division by zero.
Other Parameters
----------------
Extra options
For the extra options, refer to
`Data Source Option <https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option>`_ # noqa
in the version you use.
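The streaming reader resolves the same options via the linked page. A minimal sketch, assuming a hypothetical input directory and schema (streaming file sources require an explicit schema):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("name", StringType()),
    StructField("age", LongType()),
])

stream_df = (
    spark.readStream
    .schema(schema)                        # required for streaming file sources
    .option("maxFilesPerTrigger", 1)       # stream-specific option
    .option("allowSingleQuotes", "true")   # JSON option from the shared page
    .json("/tmp/json-input")               # hypothetical directory
)
```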
Notes
-----
@ -1075,20 +958,6 @@ class DataStreamWriter(object):
def option(self, key, value):
"""Adds an output option for the underlying data source.
You can set the following option(s) for writing files:
* ``timeZone``: sets the string that indicates a time zone ID to be used to format
timestamps in the JSON/CSV datasources or partition values. The following
formats of `timeZone` are supported:
* Region-based zone ID: It should have the form 'area/city', such as \
'America/Los_Angeles'.
* Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or \
'+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
Other short names like 'CST' are not recommended to use because they can be
ambiguous. If it isn't set, the current value of the SQL config
``spark.sql.session.timeZone`` is used by default.
.. versionadded:: 2.0.0
Notes
@ -1101,20 +970,6 @@ class DataStreamWriter(object):
def options(self, **options):
"""Adds output options for the underlying data source.
You can set the following option(s) for writing files:
* ``timeZone``: sets the string that indicates a time zone ID to be used to format
timestamps in the JSON/CSV datasources or partition values. The following
formats of `timeZone` are supported:
* Region-based zone ID: It should have the form 'area/city', such as \
'America/Los_Angeles'.
* Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00' or \
'+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.
Other short names like 'CST' are not recommended to use because they can be
ambiguous. If it isn't set, the current value of the SQL config
``spark.sql.session.timeZone`` is used by default.
.. versionadded:: 2.0.0
Notes
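On the streaming writer side, JSON options are set the same way before starting the query. A hedged sketch using the built-in rate source, with hypothetical sink and checkpoint paths:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
stream_df = spark.readStream.format("rate").load()    # built-in test source

query = (
    stream_df.writeStream
    .format("json")
    .option("path", "/tmp/json-sink")                  # hypothetical paths
    .option("checkpointLocation", "/tmp/json-checkpoint")
    .option("ignoreNullFields", "false")               # JSON option from the shared page
    .start()
)
query.stop()  # stop right away; a real job would call awaitTermination()
```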

View file

@ -101,23 +101,6 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* All options are maintained in a case-insensitive way in terms of key names.
* If a new option has the same key case-insensitively, it will override the existing option.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to parse timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 1.4.0
*/
def option(key: String, value: String): DataFrameReader = {
@ -161,23 +144,6 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* All options are maintained in a case-insensitive way in terms of key names.
* If a new option has the same key case-insensitively, it will override the existing option.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to parse timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 1.4.0
*/
def options(options: scala.collection.Map[String, String]): DataFrameReader = {
@ -191,23 +157,6 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* All options are maintained in a case-insensitive way in terms of key names.
* If a new option has the same key case-insensitively, it will override the existing option.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to parse timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 1.4.0
*/
def options(options: java.util.Map[String, String]): DataFrameReader = {
@ -441,81 +390,9 @@ class DataFrameReader private[sql](sparkSession: SparkSession) extends Logging {
* This function goes through the input once to determine the input schema. If you know the
* schema in advance, use the version that specifies the schema to avoid the extra scan.
*
* You can set the following JSON-specific options to deal with non-standard JSON files:
* <ul>
* <li>`primitivesAsString` (default `false`): infers all primitive values as a string type</li>
* <li>`prefersDecimal` (default `false`): infers all floating-point values as a decimal
* type. If the values do not fit in decimal, then it infers them as doubles.</li>
* <li>`allowComments` (default `false`): ignores Java/C++ style comment in JSON records</li>
* <li>`allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names</li>
* <li>`allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* </li>
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
* (e.g. 00012)</li>
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
* character using backslash quoting mechanism</li>
* <li>`allowUnquotedControlChars` (default `false`): allows JSON Strings to contain unquoted
* control characters (ASCII characters with value less than 32, including tab and line feed
* characters) or not.</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.
* <ul>
* <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
* field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To
* keep corrupt records, an user can set a string type field named
* `columnNameOfCorruptRecord` in an user-defined schema. If a schema does not have the
* field, it drops corrupt records during parsing. When inferring a schema, it implicitly
* adds a `columnNameOfCorruptRecord` field in an output schema.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* </li>
* <li>`columnNameOfCorruptRecord` (default is the value specified in
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
* created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at
* <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
* Datetime Patterns</a>.
* This applies to date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
* Datetime Patterns</a>.
* This applies to timestamp type.</li>
* <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
* per file</li>
* <li>`encoding` (by default it is not set): allows to forcibly set one of standard basic
* or extended encoding for the JSON files. For example UTF-16BE, UTF-32LE. If the encoding
* is not specified and `multiLine` is set to `true`, it will be detected automatically.</li>
* <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
* that should be used for parsing.</li>
* <li>`samplingRatio` (default is 1.0): defines fraction of input JSON objects used
* for schema inferring.</li>
* <li>`dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
* empty array/struct during schema inference.</li>
* <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format.
* For instance, this is used while parsing dates and timestamps.</li>
* <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
* the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
* It does not change the behavior of partition discovery.</li>
* <li>`modifiedBefore` (batch only): an optional timestamp to only include files with
* modification times occurring before the specified Time. The provided timestamp
* must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
* <li>`modifiedAfter` (batch only): an optional timestamp to only include files with
* modification times occurring after the specified Time. The provided timestamp
* must be in the following form: YYYY-MM-DDTHH:mm:ss (e.g. 2020-06-01T13:00:00)</li>
* <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
* disables partition discovery</li>
* <li>`allowNonNumericNumbers` (default `true`): allows JSON parser to recognize set of
* "Not-a-Number" (NaN) tokens as legal floating number values:
* <ul>
* <li>`+INF` for positive infinity, as well as alias of `+Infinity` and `Infinity`.
* <li>`-INF` for negative infinity), alias `-Infinity`.
* <li>`NaN` for other not-a-numbers, like result of division by zero.
* </ul>
* </li>
* </ul>
* You can find the JSON-specific options for reading JSON files in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 2.0.0
*/
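The `timeZone` option removed from the Scaladoc blocks above is now described on the shared page; a hedged PySpark sketch of the two supported forms (paths are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Region-based zone ID.
df1 = spark.read.option("timeZone", "America/Los_Angeles").json("/tmp/events.json")

# Zone offset; 'UTC' and 'Z' are accepted as aliases of '+00:00'.
df2 = spark.read.option("timeZone", "+01:00").json("/tmp/events.json")

# When timeZone is not set, the session-level default applies.
spark.conf.set("spark.sql.session.timeZone", "UTC")
```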

View file

@ -108,23 +108,6 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* All options are maintained in a case-insensitive way in terms of key names.
* If a new option has the same key case-insensitively, it will override the existing option.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to format timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 1.4.0
*/
def option(key: String, value: String): DataFrameWriter[T] = {
@ -168,23 +151,6 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* All options are maintained in a case-insensitive way in terms of key names.
* If a new option has the same key case-insensitively, it will override the existing option.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to format timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 1.4.0
*/
def options(options: scala.collection.Map[String, String]): DataFrameWriter[T] = {
@ -198,23 +164,6 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* All options are maintained in a case-insensitive way in terms of key names.
* If a new option has the same key case-insensitively, it will override the existing option.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to format timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 1.4.0
*/
def options(options: java.util.Map[String, String]): DataFrameWriter[T] = {
@ -825,27 +774,9 @@ final class DataFrameWriter[T] private[sql](ds: Dataset[T]) {
* format("json").save(path)
* }}}
*
* You can set the following JSON-specific option(s) for writing JSON files:
* <ul>
* <li>`compression` (default `null`): compression codec to use when saving to file. This can be
* one of the known case-insensitive shorten names (`none`, `bzip2`, `gzip`, `lz4`,
* `snappy` and `deflate`). </li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at
* <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
* Datetime Patterns</a>.
* This applies to date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
* Datetime Patterns</a>.
* This applies to timestamp type.</li>
* <li>`encoding` (by default it is not set): specifies encoding (charset) of saved json
* files. If it is not set, the UTF-8 charset will be used. </li>
* <li>`lineSep` (default `\n`): defines the line separator that should be used for writing.</li>
* <li>`ignoreNullFields` (default `true`): Whether to ignore null fields
* when generating JSON objects. </li>
* </ul>
* You can find the JSON-specific options for writing JSON files in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 1.4.0
*/
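The writer options that used to be listed here (compression, date/timestamp formats, `ignoreNullFields`, and so on) now live on the shared page; a hedged PySpark sketch with a hypothetical output path:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "a"), (2, None)], "id INT, name STRING")

(
    df.write
    .mode("overwrite")
    .option("compression", "gzip")
    .option("ignoreNullFields", "false")   # keep null fields in the output JSON
    .option("dateFormat", "yyyy-MM-dd")
    .json("/tmp/json-output")              # hypothetical path
)
```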

View file

@ -4142,6 +4142,7 @@ object functions {
JsonTuple(json.expr +: fields.map(Literal.apply))
}
// scalastyle:off line.size.limit
/**
* (Scala-specific) Parses a column containing a JSON string into a `StructType` with the
* specified schema. Returns `null`, in the case of an unparseable string.
@ -4150,13 +4151,19 @@ object functions {
* @param schema the schema to use when parsing the json string
* @param options options to control how the json is parsed. Accepts the same options as the
* json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @group collection_funcs
* @since 2.1.0
*/
// scalastyle:on line.size.limit
def from_json(e: Column, schema: StructType, options: Map[String, String]): Column =
from_json(e, schema.asInstanceOf[DataType], options)
// scalastyle:off line.size.limit
/**
* (Scala-specific) Parses a column containing a JSON string into a `MapType` with `StringType`
* as keys type, `StructType` or `ArrayType` with the specified schema.
@ -4166,14 +4173,20 @@ object functions {
* @param schema the schema to use when parsing the json string
* @param options options to control how the json is parsed. accepts the same options and the
* json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @group collection_funcs
* @since 2.2.0
*/
// scalastyle:on line.size.limit
def from_json(e: Column, schema: DataType, options: Map[String, String]): Column = withExpr {
JsonToStructs(CharVarcharUtils.failIfHasCharVarchar(schema), options, e.expr)
}
// scalastyle:off line.size.limit
/**
* (Java-specific) Parses a column containing a JSON string into a `StructType` with the
* specified schema. Returns `null`, in the case of an unparseable string.
@ -4182,13 +4195,19 @@ object functions {
* @param schema the schema to use when parsing the json string
* @param options options to control how the json is parsed. accepts the same options and the
* json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @group collection_funcs
* @since 2.1.0
*/
// scalastyle:on line.size.limit
def from_json(e: Column, schema: StructType, options: java.util.Map[String, String]): Column =
from_json(e, schema, options.asScala.toMap)
// scalastyle:off line.size.limit
/**
* (Java-specific) Parses a column containing a JSON string into a `MapType` with `StringType`
* as keys type, `StructType` or `ArrayType` with the specified schema.
@ -4198,10 +4217,15 @@ object functions {
* @param schema the schema to use when parsing the json string
* @param options options to control how the json is parsed. accepts the same options and the
* json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @group collection_funcs
* @since 2.2.0
*/
// scalastyle:on line.size.limit
def from_json(e: Column, schema: DataType, options: java.util.Map[String, String]): Column = {
from_json(e, CharVarcharUtils.failIfHasCharVarchar(schema), options.asScala.toMap)
}
@ -4233,6 +4257,7 @@ object functions {
def from_json(e: Column, schema: DataType): Column =
from_json(e, schema, Map.empty[String, String])
// scalastyle:off line.size.limit
/**
* (Java-specific) Parses a column containing a JSON string into a `MapType` with `StringType`
* as keys type, `StructType` or `ArrayType` with the specified schema.
@ -4240,14 +4265,22 @@ object functions {
*
* @param e a string column containing JSON data.
* @param schema the schema as a DDL-formatted string.
* @param options options to control how the json is parsed. accepts the same options and the
* json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @group collection_funcs
* @since 2.1.0
*/
// scalastyle:on line.size.limit
def from_json(e: Column, schema: String, options: java.util.Map[String, String]): Column = {
from_json(e, schema, options.asScala.toMap)
}
// scalastyle:off line.size.limit
/**
* (Scala-specific) Parses a column containing a JSON string into a `MapType` with `StringType`
* as keys type, `StructType` or `ArrayType` with the specified schema.
@ -4255,10 +4288,17 @@ object functions {
*
* @param e a string column containing JSON data.
* @param schema the schema as a DDL-formatted string.
* @param options options to control how the json is parsed. accepts the same options and the
* json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @group collection_funcs
* @since 2.3.0
*/
// scalastyle:on line.size.limit
def from_json(e: Column, schema: String, options: Map[String, String]): Column = {
val dataType = parseTypeWithFallback(
schema,
@ -4283,6 +4323,7 @@ object functions {
from_json(e, schema, Map.empty[String, String].asJava)
}
// scalastyle:off line.size.limit
/**
* (Java-specific) Parses a column containing a JSON string into a `MapType` with `StringType`
* as keys type, `StructType` or `ArrayType` of `StructType`s with the specified schema.
@ -4292,10 +4333,15 @@ object functions {
* @param schema the schema to use when parsing the json string
* @param options options to control how the json is parsed. accepts the same options and the
* json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @group collection_funcs
* @since 2.4.0
*/
// scalastyle:on line.size.limit
def from_json(e: Column, schema: Column, options: java.util.Map[String, String]): Column = {
withExpr(new JsonToStructs(e.expr, schema.expr, options.asScala.toMap))
}
@ -4320,21 +4366,28 @@ object functions {
*/
def schema_of_json(json: Column): Column = withExpr(new SchemaOfJson(json.expr))
// scalastyle:off line.size.limit
/**
* Parses a JSON string and infers its schema in DDL format using options.
*
* @param json a foldable string column containing JSON data.
* @param options options to control how the json is parsed. accepts the same options and the
* json data source. See [[DataFrameReader#json]].
* json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
* @return a column with string literal containing schema in DDL format.
*
* @group collection_funcs
* @since 3.0.0
*/
// scalastyle:on line.size.limit
def schema_of_json(json: Column, options: java.util.Map[String, String]): Column = {
withExpr(SchemaOfJson(json.expr, options.asScala.toMap))
}
// scalastyle:off line.size.limit
/**
* (Scala-specific) Converts a column containing a `StructType`, `ArrayType` or
* a `MapType` into a JSON string with the specified schema.
@ -4343,16 +4396,22 @@ object functions {
* @param e a column containing a struct, an array or a map.
* @param options options to control how the struct column is converted into a json string.
* accepts the same options and the json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
* Additionally the function supports the `pretty` option which enables
* pretty JSON generation.
*
* @group collection_funcs
* @since 2.1.0
*/
// scalastyle:on line.size.limit
def to_json(e: Column, options: Map[String, String]): Column = withExpr {
StructsToJson(options, e.expr)
}
// scalastyle:off line.size.limit
/**
* (Java-specific) Converts a column containing a `StructType`, `ArrayType` or
* a `MapType` into a JSON string with the specified schema.
@ -4361,12 +4420,17 @@ object functions {
* @param e a column containing a struct, an array or a map.
* @param options options to control how the struct column is converted into a json string.
* accepts the same options and the json data source.
* See
* <a href=
* "https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
* Additionally the function supports the `pretty` option which enables
* pretty JSON generation.
*
* @group collection_funcs
* @since 2.1.0
*/
// scalastyle:on line.size.limit
def to_json(e: Column, options: java.util.Map[String, String]): Column =
to_json(e, options.asScala.toMap)
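As a hedged illustration of the `options` parameter these functions now document via the shared page, a PySpark sketch (column and field names are made up; the Scala API mirrors it):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import from_json, to_json, col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("{name: 'Alice', age: 2}",)], ["value"])

# from_json accepts the same options as the JSON data source; here the input
# uses unquoted field names, so the matching option is enabled.
parsed = df.select(
    from_json(col("value"), "name STRING, age INT",
              {"allowUnquotedFieldNames": "true"}).alias("parsed")
)

# to_json accepts the same options plus the extra `pretty` option.
parsed.select(to_json(col("parsed"), {"pretty": "true"})).show(truncate=False)
```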

View file

@ -85,23 +85,6 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
/**
* Adds an input option for the underlying data source.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to parse timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 2.0.0
*/
def option(key: String, value: String): DataStreamReader = {
@ -133,23 +116,6 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
/**
* (Scala-specific) Adds input options for the underlying data source.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to parse timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 2.0.0
*/
def options(options: scala.collection.Map[String, String]): DataStreamReader = {
@ -160,23 +126,6 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
/**
* (Java-specific) Adds input options for the underlying data source.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to parse timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 2.0.0
*/
def options(options: java.util.Map[String, String]): DataStreamReader = {
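Even with the inline description gone, `timeZone` (and any other JSON/CSV option) is still passed through `option`/`options` as before; a minimal sketch, assuming a `spark` session and a pre-built `jsonSchema`:

```scala
// Region-based zone ID shown here; a zone offset such as "+01:00" or "UTC" also works.
val reader = spark.readStream
  .schema(jsonSchema)                           // file stream sources generally need an explicit schema
  .option("timeZone", "America/Los_Angeles")    // single key/value via option(...)
  .options(Map("multiLine" -> "false"))         // bulk form via the Scala-specific options(...)
```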
@@ -269,73 +218,16 @@ final class DataStreamReader private[sql](sparkSession: SparkSession) extends Lo
* This function goes through the input once to determine the input schema. If you know the
* schema in advance, use the version that specifies the schema to avoid the extra scan.
*
* You can set the following JSON-specific options to deal with non-standard JSON files:
* You can set the following option(s):
* <ul>
* <li>`maxFilesPerTrigger` (default: no max limit): sets the maximum number of new files to be
* considered in every trigger.</li>
* <li>`primitivesAsString` (default `false`): infers all primitive values as a string type</li>
* <li>`prefersDecimal` (default `false`): infers all floating-point values as a decimal
* type. If the values do not fit in decimal, then it infers them as doubles.</li>
* <li>`allowComments` (default `false`): ignores Java/C++ style comment in JSON records</li>
* <li>`allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names</li>
* <li>`allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* </li>
* <li>`allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers
* (e.g. 00012)</li>
* <li>`allowBackslashEscapingAnyCharacter` (default `false`): allows accepting quoting of all
* character using backslash quoting mechanism</li>
* <li>`allowUnquotedControlChars` (default `false`): allows JSON Strings to contain unquoted
* control characters (ASCII characters with value less than 32, including tab and line feed
* characters) or not.</li>
* <li>`mode` (default `PERMISSIVE`): allows a mode for dealing with corrupt records
* during parsing.
* <ul>
* <li>`PERMISSIVE` : when it meets a corrupted record, puts the malformed string into a
* field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To
* keep corrupt records, an user can set a string type field named
* `columnNameOfCorruptRecord` in an user-defined schema. If a schema does not have the
* field, it drops corrupt records during parsing. When inferring a schema, it implicitly
* adds a `columnNameOfCorruptRecord` field in an output schema.</li>
* <li>`DROPMALFORMED` : ignores the whole corrupted records.</li>
* <li>`FAILFAST` : throws an exception when it meets corrupted records.</li>
* </ul>
* </li>
* <li>`columnNameOfCorruptRecord` (default is the value specified in
* `spark.sql.columnNameOfCorruptRecord`): allows renaming the new field having malformed string
* created by `PERMISSIVE` mode. This overrides `spark.sql.columnNameOfCorruptRecord`.</li>
* <li>`dateFormat` (default `yyyy-MM-dd`): sets the string that indicates a date format.
* Custom date formats follow the formats at
* <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
* Datetime Patterns</a>.
* This applies to date type.</li>
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* <a href="https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html">
* Datetime Patterns</a>.
* This applies to timestamp type.</li>
* <li>`multiLine` (default `false`): parse one record, which may span multiple lines,
* per file</li>
* <li>`lineSep` (default covers all `\r`, `\r\n` and `\n`): defines the line separator
* that should be used for parsing.</li>
* <li>`dropFieldIfAllNull` (default `false`): whether to ignore column of all null values or
* empty array/struct during schema inference.</li>
* <li>`locale` (default is `en-US`): sets a locale as language tag in IETF BCP 47 format.
* For instance, this is used while parsing dates and timestamps.</li>
* <li>`pathGlobFilter`: an optional glob pattern to only include files with paths matching
* the pattern. The syntax follows <code>org.apache.hadoop.fs.GlobFilter</code>.
* It does not change the behavior of partition discovery.</li>
* <li>`recursiveFileLookup`: recursively scan a directory for files. Using this option
* disables partition discovery</li>
* <li>`allowNonNumericNumbers` (default `true`): allows JSON parser to recognize set of
* "Not-a-Number" (NaN) tokens as legal floating number values:
* <ul>
* <li>`+INF` for positive infinity, as well as alias of `+Infinity` and `Infinity`.
* <li>`-INF` for negative infinity, alias `-Infinity`.
* <li>`NaN` for other not-a-numbers, like result of division by zero.
* </ul>
* </li>
* </ul>
*
* You can find the JSON-specific options for reading JSON file stream in
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-json.html#data-source-option">
* Data Source Option</a> in the version you use.
*
* @since 2.0.0
*/
def json(path: String): DataFrame = format("json").load(path)

@@ -167,23 +167,6 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {
/**
* Adds an output option for the underlying data source.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to format timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 2.0.0
*/
def option(key: String, value: String): DataStreamWriter[T] = {
@@ -215,23 +198,6 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {
/**
* (Scala-specific) Adds output options for the underlying data source.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to format timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 2.0.0
*/
def options(options: scala.collection.Map[String, String]): DataStreamWriter[T] = {
@@ -242,23 +208,6 @@ final class DataStreamWriter[T] private[sql](ds: Dataset[T]) {
/**
* Adds output options for the underlying data source.
*
* You can set the following option(s):
* <ul>
* <li>`timeZone` (default session local timezone): sets the string that indicates a time zone ID
* to be used to format timestamps in the JSON/CSV datasources or partition values. The following
* formats of `timeZone` are supported:
* <ul>
* <li> Region-based zone ID: It should have the form 'area/city', such as
* 'America/Los_Angeles'.</li>
* <li> Zone offset: It should be in the format '(+|-)HH:mm', for example '-08:00'
* or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li>
* </ul>
* Other short names like 'CST' are not recommended to use because they can be ambiguous.
* If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is
* used by default.
* </li>
* </ul>
*
* @since 2.0.0
*/
def options(options: java.util.Map[String, String]): DataStreamWriter[T] = {
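The write side follows the same pattern; a minimal sketch, assuming a streaming Dataset `events` and hypothetical output/checkpoint paths:

```scala
// `timeZone` controls how timestamps are formatted in the emitted JSON files.
val query = events.writeStream
  .format("json")
  .option("timeZone", "UTC")
  .option("path", "/tmp/json-out")
  .option("checkpointLocation", "/tmp/json-out/_checkpoints")
  .start()

query.awaitTermination()
```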