spark-instrumented-optimizer/docs/sql-data-sources-json.md at e0bccc1831a0fc8997c2d3a7ed826336fb6962d9

itholic e0bccc1831 [SPARK-35528][DOCS] Add more options at Data Source Options pages

### What changes were proposed in this pull request?

This PR proposes adding more ways of setting data source options to the `Data Source Option` page of each data source.

For example, the Data Source Option page for JSON changes as below:

- Before
<img width="322" alt="Screen Shot 2021-06-03 at 10 51 54 AM" src="https://user-images.githubusercontent.com/44108233/120574245-eb13aa00-c459-11eb-9f81-0b356023bcb5.png">

- After
<img width="470" alt="Screen Shot 2021-06-03 at 10 52 21 AM" src="https://user-images.githubusercontent.com/44108233/120574253-ed760400-c459-11eb-9008-1f075e0b9267.png">

### Why are the changes needed?

To give users more ways to set options for a data source.

### Does this PR introduce _any_ user-facing change?

Yes, the document now provides more methods for setting options than before, as shown in the screen captures above.

### How was this patch tested?

Manually built the docs and checked them one by one.

Closes #32757 from itholic/SPARK-35528.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>

2021-06-03 12:49:10 +09:00


---
layout: global
title: JSON Files
displayTitle: JSON Files
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Spark SQL can automatically infer the schema of a JSON dataset and load it as a `Dataset[Row]`. This conversion can be done using `SparkSession.read.json()` on either a `Dataset[String]`, or a JSON file.

Note that the file that is offered as *a JSON file* is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.

For a regular multi-line JSON file, set the `multiLine` option to `true`.

{% include_example json_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}

Spark SQL can automatically infer the schema of a JSON dataset and load it as a `Dataset<Row>`. This conversion can be done using `SparkSession.read().json()` on either a `Dataset<String>`, or a JSON file.

For a regular multi-line JSON file, set the `multiLine` option to `true`.

{% include_example json_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame. This conversion can be done using `SparkSession.read.json` on a JSON file.

For a regular multi-line JSON file, set the `multiLine` parameter to `True`.

{% include_example json_dataset python/sql/datasource.py %}
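The difference between the line-delimited layout and a regular multi-line JSON file can be illustrated without Spark. The following is a minimal sketch in plain Python using only the standard `json` module; it is not how Spark reads files, and the sample records and the helper names `parse_line_by_line` and `can_parse_line_by_line` are hypothetical, introduced only for illustration.

```python
import json

# JSON Lines: every line is a separate, self-contained JSON object.
# The records below are made-up sample data.
json_lines_text = (
    '{"name": "Michael"}\n'
    '{"name": "Andy", "age": 30}\n'
    '{"name": "Justin", "age": 19}\n'
)

# A regular JSON document: a single value that spans several lines.
multi_line_text = """[
  {"name": "Michael"},
  {"name": "Andy", "age": 30}
]"""


def parse_line_by_line(text):
    """Parse each non-empty line as its own JSON value (the JSON Lines reading)."""
    return [json.loads(line) for line in text.split("\n") if line.strip()]


def can_parse_line_by_line(text):
    """Return True if every non-empty line is valid JSON on its own."""
    try:
        parse_line_by_line(text)
        return True
    except json.JSONDecodeError:
        return False


records = parse_line_by_line(json_lines_text)
print(records)                                  # one dict per input line
print(can_parse_line_by_line(multi_line_text))  # the first line "[" is not valid JSON alone
print(json.loads(multi_line_text))              # parsing the whole text at once works
```

The multi-line document only parses when it is treated as a whole, which is the situation the `multiLine` setting is meant for.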

Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame using the `read.json()` function, which loads data from a directory of JSON files where each line of the files is a JSON object.

For a regular multi-line JSON file, set the named parameter `multiLine` to `TRUE`.

{% include_example json_dataset r/RSparkSQLExample.R %}

{% highlight sql %}

CREATE TEMPORARY VIEW jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
)

SELECT * FROM jsonTable

{% endhighlight %}

## Data Source Option

Data source options of JSON can be set via:

- the `.option`/`.options` methods of
  - `DataFrameReader`
  - `DataFrameWriter`
  - `DataStreamReader`
  - `DataStreamWriter`
- the built-in functions below
  - `from_json`
  - `to_json`
  - `schema_of_json`
- `OPTIONS` clause at CREATE TABLE USING DATA_SOURCE

| Property Name | Default | Meaning | Scope |
| --- | --- | --- | --- |
| `timeZone` | None | Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of `timeZone` are supported: <ul><li>Region-based zone ID: It should have the form 'area/city', such as 'America/Los_Angeles'.</li><li>Zone offset: It should be in the format '(+\|-)HH:mm', for example '-08:00' or '+01:00'. Also 'UTC' and 'Z' are supported as aliases of '+00:00'.</li></ul> Other short names like 'CST' are not recommended because they can be ambiguous. If it isn't set, the current value of the SQL config `spark.sql.session.timeZone` is used by default. | read/write |
| `primitivesAsString` | None | Infers all primitive values as a string type. If None is set, it uses the default value, `false`. | read |
| `prefersDecimal` | None | Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. If None is set, it uses the default value, `false`. | read |
| `allowComments` | None | Ignores Java/C++ style comments in JSON records. If None is set, it uses the default value, `false`. | read |
| `allowUnquotedFieldNames` | None | Allows unquoted JSON field names. If None is set, it uses the default value, `false`. | read |
| `allowSingleQuotes` | None | Allows single quotes in addition to double quotes. If None is set, it uses the default value, `true`. | read |
| `allowNumericLeadingZero` | None | Allows leading zeros in numbers (e.g. 00012). If None is set, it uses the default value, `false`. | read |
| `allowBackslashEscapingAnyCharacter` | None | Allows quoting of any character using the backslash quoting mechanism. If None is set, it uses the default value, `false`. | read |
| `mode` | None | Sets the mode for dealing with corrupt records during parsing. If None is set, it uses the default value, `PERMISSIVE`. <ul><li>`PERMISSIVE`: when it meets a corrupted record, puts the malformed string into a field configured by `columnNameOfCorruptRecord`, and sets malformed fields to `null`. To keep corrupt records, a user can set a string type field named `columnNameOfCorruptRecord` in a user-defined schema. If a schema does not have the field, it drops corrupt records during parsing. When inferring a schema, it implicitly adds a `columnNameOfCorruptRecord` field in an output schema.</li><li>`DROPMALFORMED`: ignores the whole corrupted records.</li><li>`FAILFAST`: throws an exception when it meets corrupted records.</li></ul> | read |
| `columnNameOfCorruptRecord` | None | Allows renaming the new field, created by `PERMISSIVE` mode, that holds the malformed string. This overrides `spark.sql.columnNameOfCorruptRecord`. If None is set, it uses the value specified in `spark.sql.columnNameOfCorruptRecord`. | read |
| `dateFormat` | None | Sets the string that indicates a date format. Custom date formats follow the formats at datetime pattern. This applies to date type. If None is set, it uses the default value, `yyyy-MM-dd`. | read/write |
| `timestampFormat` | None | Sets the string that indicates a timestamp format. Custom date formats follow the formats at datetime pattern. This applies to timestamp type. If None is set, it uses the default value, `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]`. | read/write |
| `multiLine` | None | Parse one record, which may span multiple lines, per file. If None is set, it uses the default value, `false`. | read |
| `allowUnquotedControlChars` | None | Sets whether JSON strings may contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters). | read |
| `encoding` | None | For reading, allows forcibly setting one of the standard basic or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. If None is set, the encoding of input JSON will be detected automatically when the `multiLine` option is set to `true`. For writing, specifies the encoding (charset) of saved JSON files. If None is set, the default UTF-8 charset will be used. | read/write |
| `lineSep` | None | Defines the line separator that should be used for parsing. If None is set, it covers all `\r`, `\r\n` and `\n`. | read/write |
| `samplingRatio` | None | Defines the fraction of input JSON objects used for schema inference. If None is set, it uses the default value, `1.0`. | read |
| `dropFieldIfAllNull` | None | Whether to ignore columns of all null values or empty arrays/structs during schema inference. If None is set, it uses the default value, `false`. | read |
| `locale` | None | Sets a locale as a language tag in IETF BCP 47 format. If None is set, it uses the default value, `en-US`. For instance, `locale` is used while parsing dates and timestamps. | read |
| `allowNonNumericNumbers` | None | Allows the JSON parser to recognize a set of "Not-a-Number" (NaN) tokens as legal floating-point number values. If None is set, it uses the default value, `true`. <ul><li>`+INF`: for positive infinity, as well as alias of `+Infinity` and `Infinity`.</li><li>`-INF`: for negative infinity, alias `-Infinity`.</li><li>`NaN`: for other not-a-numbers, like result of division by zero.</li></ul> | read |
| `compression` | None | Compression codec to use when saving to file. This can be one of the known case-insensitive short names (none, bzip2, gzip, lz4, snappy and deflate). | write |
| `ignoreNullFields` | None | Whether to ignore null fields when generating JSON objects. If None is set, it uses the default value, `true`. | write |

Other generic options can be found in Generic File Source Options.
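The three `mode` values described in the table can be illustrated with a small plain-Python sketch. This is only an imitation of the documented behaviour on a list of strings, not Spark's parser: the helper `parse_records` and the sample lines are hypothetical, and the column name `_corrupt_record` is assumed here as the value of `spark.sql.columnNameOfCorruptRecord`.

```python
import json


def parse_records(lines, mode="PERMISSIVE", corrupt_column="_corrupt_record"):
    """Hypothetical helper that imitates the three documented modes on a list of lines."""
    rows = []
    for line in lines:
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError as err:
            if mode == "FAILFAST":
                # Stop at the first corrupted record.
                raise ValueError(f"Malformed record: {line!r}") from err
            if mode == "DROPMALFORMED":
                # Skip the whole corrupted record.
                continue
            # PERMISSIVE: keep the raw text in the corrupt-record column.
            # The other fields are simply absent here (Spark sets them to null).
            rows.append({corrupt_column: line})
    return rows


sample = ['{"a": 1}', '{"a": ', '{"a": 3}']  # the second line is deliberately truncated
print(parse_records(sample))                        # keeps the bad line as raw text
print(parse_records(sample, mode="DROPMALFORMED"))  # silently drops the bad line
```

With `mode="FAILFAST"` the same input raises an exception at the truncated line instead of returning rows.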
