### What changes were proposed in this pull request?

This PR proposes to fix the default values in the Data Source Option page based on the Scala documentation.

### Why are the changes needed?

Some of the existing default values in the Data Source Option page follow the Python documentation, which has `None` as the default value for all options.

### Does this PR introduce _any_ user-facing change?

Yes, the default values in the Data Source Option page are fixed (from `None` to the proper default values).

- Before

<img width="361" alt="Screen Shot 2021-06-02 at 6 31 12 PM" src="https://user-images.githubusercontent.com/44108233/120456594-b8719f00-c3d0-11eb-9778-071ab2ba9f45.png">

- After

<img width="562" alt="Screen Shot 2021-06-02 at 6 32 47 PM" src="https://user-images.githubusercontent.com/44108233/120456844-f1117880-c3d0-11eb-9c7c-9dcd66776444.png">

### How was this patch tested?

Manually built the docs and checked one by one.

Closes #32745 from itholic/SPARK-35523.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
layout | title | displayTitle | license |
---|---|---|---|
global | JSON Files | JSON Files | Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements. See the NOTICE file distributed with this work for additional information regarding copyright ownership. The ASF licenses this file to You under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. |
Note that the file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the `multiLine` option to `true`.
{% include_example json_dataset scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
Note that the file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the `multiLine` option to `true`.
{% include_example json_dataset java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java %}
Note that the file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set the `multiLine` parameter to `True`.
{% include_example json_dataset python/sql/datasource.py %}
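The distinction drawn in the note above can be illustrated without Spark. The following is a minimal sketch using only Python's standard `json` module; the sample records and the helper names `parse_json_lines` and `parse_multi_line` are hypothetical, and the code shows the two file layouts rather than how Spark itself parses them.

```python
import json

# Hypothetical sample data in JSON Lines form: one self-contained object per line.
json_lines_text = '{"name": "Michael"}\n{"name": "Andy", "age": 30}\n'

# The same records as a regular multi-line JSON document (a single array).
multi_line_text = """[
  {"name": "Michael"},
  {"name": "Andy", "age": 30}
]"""


def parse_json_lines(text):
    """Parse every non-empty line as its own JSON object."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def parse_multi_line(text):
    """Parse the whole text as one JSON document."""
    return json.loads(text)


records_a = parse_json_lines(json_lines_text)
records_b = parse_multi_line(multi_line_text)
```

Both calls yield the same list of records, while feeding the multi-line text to `parse_json_lines` raises `json.JSONDecodeError` because its first line is not a complete JSON value. This mirrors why a regular multi-line file needs the `multiLine` setting.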
Note that the file offered as a JSON file is not a typical JSON file. Each line must contain a separate, self-contained valid JSON object. For more information, please see JSON Lines text format, also called newline-delimited JSON.
For a regular multi-line JSON file, set a named parameter `multiLine` to `TRUE`.
{% include_example json_dataset r/RSparkSQLExample.R %}
{% highlight sql %}
CREATE TEMPORARY VIEW jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path "examples/src/main/resources/people.json"
)

SELECT * FROM jsonTable
{% endhighlight %}
## Data Source Option
Data source options of JSON can be set via:
- the `.option`/`.options` methods of
  - `DataFrameReader`
  - `DataFrameWriter`
  - `DataStreamReader`
  - `DataStreamWriter`
- the built-in functions below
  - `from_json`
  - `to_json`
  - `schema_of_json`
- the `OPTIONS` clause at CREATE TABLE USING DATA_SOURCE
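As a minimal sketch of the first route in the list above, the helpers below set read options through `.option` and `.options` on a `DataFrameReader`. The function names are hypothetical and not part of any Spark API, and `spark` is assumed to be an existing `SparkSession` passed in by the caller.

```python
def read_multiline_json(spark, path):
    """Read a regular multi-line JSON file that may contain comments.

    `spark` is assumed to be an existing SparkSession; this helper name
    is hypothetical and only chains DataFrameReader calls.
    """
    return (
        spark.read
        .option("multiLine", "true")      # one record may span several lines
        .option("allowComments", "true")  # ignore Java/C++ style comments
        .json(path)
    )


def read_json_as_strings(spark, path):
    """Set several options in one call with .options(...)."""
    return (
        spark.read
        .options(primitivesAsString="true", dropFieldIfAllNull="true")
        .json(path)
    )
```

With a live session, a call might look like `read_multiline_json(spark, "path/to/file.json")`, where the path is a placeholder. Chained `.option` calls and a single `.options` call are interchangeable here; the choice is a matter of style.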
Property Name | Default | Meaning | Scope |
---|---|---|---|
`timeZone` | (value of `spark.sql.session.timeZone` configuration) | Sets the string that indicates a time zone ID to be used to format timestamps in the JSON datasources or partition values. The following formats of `timeZone` are supported: a region-based zone ID of the form 'area/city', such as 'America/Los_Angeles', or a zone offset of the form '+HH:mm' or '-HH:mm', such as '-08:00' or '+01:00'. 'UTC' and 'Z' are also supported as aliases of '+00:00'. | read/write |
`primitivesAsString` | `false` | Infers all primitive values as a string type. | read |
`prefersDecimal` | `false` | Infers all floating-point values as a decimal type. If the values do not fit in decimal, then it infers them as doubles. | read |
`allowComments` | `false` | Ignores Java/C++ style comments in JSON records. | read |
`allowUnquotedFieldNames` | `false` | Allows unquoted JSON field names. | read |
`allowSingleQuotes` | `true` | Allows single quotes in addition to double quotes. | read |
`allowNumericLeadingZero` | `false` | Allows leading zeros in numbers (e.g. 00012). | read |
`allowBackslashEscapingAnyCharacter` | `false` | Allows quoting of any character using the backslash quoting mechanism. | read |
`mode` | `PERMISSIVE` | Allows a mode for dealing with corrupt records during parsing. Supported modes are `PERMISSIVE`, `DROPMALFORMED`, and `FAILFAST`. | read |
`columnNameOfCorruptRecord` | (value of `spark.sql.columnNameOfCorruptRecord` configuration) | Allows renaming the new field, created by `PERMISSIVE` mode, that holds the malformed string. This overrides `spark.sql.columnNameOfCorruptRecord`. | read |
`dateFormat` | `yyyy-MM-dd` | Sets the string that indicates a date format. Custom date formats follow the formats at datetime pattern. This applies to date type. | read/write |
`timestampFormat` | `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` | Sets the string that indicates a timestamp format. Custom date formats follow the formats at datetime pattern. This applies to timestamp type. | read/write |
`multiLine` | `false` | Parse one record, which may span multiple lines, per file. | read |
`allowUnquotedControlChars` | `false` | Whether to allow JSON strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters). | read |
`encoding` | Detected automatically when `multiLine` is set to `true` (for reading), `UTF-8` (for writing) | For reading, forcibly sets one of the standard basic or extended encodings for the JSON files, for example UTF-16BE or UTF-32LE. For writing, specifies the encoding (charset) of saved JSON files. | read/write |
`lineSep` | `\r`, `\r\n`, `\n` (for reading), `\n` (for writing) | Defines the line separator that should be used for parsing. | read/write |
`samplingRatio` | `1.0` | Defines the fraction of input JSON objects used for schema inference. | read |
`dropFieldIfAllNull` | `false` | Whether to ignore columns of all null values or empty arrays/structs during schema inference. | read |
`locale` | `en-US` | Sets a locale as a language tag in IETF BCP 47 format. For instance, `locale` is used while parsing dates and timestamps. | read |
`allowNonNumericNumbers` | `true` | Allows the JSON parser to recognize a set of “Not-a-Number” (NaN) tokens as legal floating number values. | read |
`compression` | (none) | Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, bzip2, gzip, lz4, snappy and deflate). | write |
`ignoreNullFields` | (value of `spark.sql.jsonGenerator.ignoreNullFields` configuration) | Whether to ignore null fields when generating JSON objects. | write |
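
The write-scoped options in the table can be set the same way on a `DataFrameWriter`. The sketch below is illustrative only: `write_gzip_json` is a hypothetical helper name, and `df` is assumed to be an existing DataFrame supplied by the caller.

```python
def write_gzip_json(df, path):
    """Save a DataFrame as gzip-compressed JSON, omitting null fields.

    `df` is assumed to be an existing DataFrame; this helper name is
    hypothetical and only chains DataFrameWriter calls.
    """
    (
        df.write
        .option("compression", "gzip")       # one of the listed codec names
        .option("ignoreNullFields", "true")  # skip null fields in the output
        .json(path)
    )
```

Setting `ignoreNullFields` explicitly overrides whatever `spark.sql.jsonGenerator.ignoreNullFields` is configured to for that single write.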