

layout: global
title: Parquet Files
displayTitle: Parquet Files
license: |
Licensed to the Apache Software Foundation (ASF) under one or more
contributor license agreements. See the NOTICE file distributed with
this work for additional information regarding copyright ownership.
The ASF licenses this file to You under the Apache License, Version 2.0
(the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
See the License for the specific language governing permissions and
limitations under the License.
* Table of contents
[Parquet]( is a columnar format that is supported by many other data processing systems.
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
of the original data. When reading Parquet files, all columns are automatically converted to be nullable for
compatibility reasons.
### Loading Data Programmatically
Using the data from the above example:
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example basic_parquet_example scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
<div data-lang="java" markdown="1">
{% include_example basic_parquet_example java/org/apache/spark/examples/sql/ %}
<div data-lang="python" markdown="1">
{% include_example basic_parquet_example python/sql/ %}
<div data-lang="r" markdown="1">
{% include_example basic_parquet_example r/RSparkSQLExample.R %}
<div data-lang="SQL" markdown="1">
{% highlight sql %}
CREATE TEMPORARY VIEW parquetTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "examples/src/main/resources/people.parquet"
)

SELECT * FROM parquetTable
{% endhighlight %}
### Partition Discovery
Table partitioning is a common optimization approach used in systems like Hive. In a partitioned
table, data are usually stored in different directories, with partitioning column values encoded in
the path of each partition directory. All built-in file sources (including Text/CSV/JSON/ORC/Parquet)
are able to discover and infer partitioning information automatically.
For example, we can store all our previously used
population data into a partitioned table using the following directory structure, with two extra
columns, `gender` and `country` as partitioning columns:
{% highlight text %}
path
└── to
    └── table
        ├── gender=male
        │   ├── ...
        │   │
        │   ├── country=US
        │   │   └── data.parquet
        │   ├── country=CN
        │   │   └── data.parquet
        │   └── ...
        └── gender=female
            ├── ...
            │
            ├── country=US
            │   └── data.parquet
            ├── country=CN
            │   └── data.parquet
            └── ...
{% endhighlight %}
By passing `path/to/table` to either `SparkSession.read.parquet` or `SparkSession.read.load`, Spark SQL
will automatically extract the partitioning information from the paths.
Now the schema of the returned DataFrame becomes:
{% highlight text %}
root
|-- name: string (nullable = true)
|-- age: long (nullable = true)
|-- gender: string (nullable = true)
|-- country: string (nullable = true)
{% endhighlight %}
Notice that the data types of the partitioning columns are automatically inferred. Currently,
numeric data types, date, timestamp and string type are supported. Sometimes users may not want
to automatically infer the data types of the partitioning columns. For these use cases, the
automatic type inference can be configured by
`spark.sql.sources.partitionColumnTypeInference.enabled`, which defaults to `true`. When type
inference is disabled, string type will be used for the partitioning columns.
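
To make the path encoding concrete, the following is a minimal pure-Python sketch of how partition column values could be recovered from a path such as `path/to/table/gender=male/country=US`. It is an illustration only, not Spark's implementation: the names `parse_partition_path` and `_infer`, the `infer_types` flag (standing in for `spark.sql.sources.partitionColumnTypeInference.enabled`), and the simplified inference (integers and floats only, everything else kept as a string) are assumptions made for this example.

```python
def _infer(raw):
    """Simplified inference (assumption): try int, then float, else keep the string."""
    for cast in (int, float):
        try:
            return cast(raw)
        except ValueError:
            continue
    return raw


def parse_partition_path(path, base_path, infer_types=True):
    """Return {column: value} for the key=value directories below base_path."""
    base = base_path.rstrip("/")
    if path != base and not path.startswith(base + "/"):
        raise ValueError(f"{path!r} is not under base path {base!r}")
    relative = path[len(base):].strip("/")

    partitions = {}
    for segment in relative.split("/"):
        if "=" not in segment:
            continue  # not a partition directory, e.g. a data file name
        column, raw = segment.split("=", 1)
        partitions[column] = _infer(raw) if infer_types else raw
    return partitions
```

Under these assumptions, `parse_partition_path("path/to/table/gender=male/country=US/data.parquet", "path/to/table")` yields both `gender` and `country`, while passing `infer_types=False` keeps every value as a string.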
Starting from Spark 1.6.0, partition discovery only finds partitions under the given paths
by default. For the above example, if users pass `path/to/table/gender=male` to either
`SparkSession.read.parquet` or `SparkSession.read.load`, `gender` will not be considered as a
partitioning column. If users need to specify the base path that partition discovery
should start with, they can set `basePath` in the data source options. For example,
when `path/to/table/gender=male` is the path of the data and
users set `basePath` to `path/to/table/`, `gender` will be a partitioning column.
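
As a hedged PySpark sketch of the `basePath` option described above: the helper name `read_partition_with_base_path` is hypothetical, and `spark` is assumed to be an existing `SparkSession` supplied by the caller.

```python
def read_partition_with_base_path(spark, data_path, base_path):
    """Read one partition directory while keeping its partition columns.

    `spark` is assumed to be an existing SparkSession (not created here).
    """
    return (
        spark.read
        .option("basePath", base_path)  # where partition discovery starts
        .parquet(data_path)             # the directory that is actually read
    )
```

For example, `read_partition_with_base_path(spark, "path/to/table/gender=male", "path/to/table/")` reads only the `gender=male` directory, and the resulting DataFrame should still expose `gender` as a partitioning column.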
### Schema Merging
Like Protocol Buffer, Avro, and Thrift, Parquet also supports schema evolution. Users can start with
a simple schema, and gradually add more columns to the schema as needed. In this way, users may end
up with multiple Parquet files with different but mutually compatible schemas. The Parquet data
source is now able to automatically detect this case and merge schemas of all these files.
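
The idea of merging mutually compatible schemas can be sketched in plain Python. This is a conceptual illustration, not Spark's merge logic: schemas are modeled as flat `{column: type}` dictionaries, the function name `merge_schemas` is hypothetical, and nested types and nullability are ignored.

```python
def merge_schemas(schemas):
    """Merge {column: type} mappings, keeping first-seen column order.

    Raises ValueError when the same column appears with two different types,
    which is what "incompatible" means in this simplified model.
    """
    merged = {}
    for schema in schemas:
        for column, dtype in schema.items():
            existing = merged.setdefault(column, dtype)
            if existing != dtype:
                raise ValueError(
                    f"incompatible types for {column!r}: {existing} vs {dtype}"
                )
    return merged
```

Two files with columns `value, square` and `value, cube` would merge into a single schema with `value`, `square`, and `cube`, whereas the same column with two different types is rejected.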
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we
turned it off by default starting from 1.5.0. You may enable it by
1. setting data source option `mergeSchema` to `true` when reading Parquet files (as shown in the
examples below), or
2. setting the global SQL option `spark.sql.parquet.mergeSchema` to `true`.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% include_example schema_merging scala/org/apache/spark/examples/sql/SQLDataSourceExample.scala %}
<div data-lang="java" markdown="1">
{% include_example schema_merging java/org/apache/spark/examples/sql/ %}
<div data-lang="python" markdown="1">
{% include_example schema_merging python/sql/ %}
<div data-lang="r" markdown="1">
{% include_example schema_merging r/RSparkSQLExample.R %}
### Hive metastore Parquet table conversion
When reading from Hive metastore Parquet tables and writing to non-partitioned Hive metastore
Parquet tables, Spark SQL will try to use its own Parquet support instead of Hive SerDe for
better performance. This behavior is controlled by the `spark.sql.hive.convertMetastoreParquet`
configuration, and is turned on by default.
#### Hive/Parquet Schema Reconciliation
There are two key differences between Hive and Parquet from the perspective of table schema:
1. Hive is case insensitive, while Parquet is not
1. Hive considers all columns nullable, while nullability in Parquet is significant
For this reason, we must reconcile the Hive metastore schema with the Parquet schema when converting a
Hive metastore Parquet table to a Spark SQL Parquet table. The reconciliation rules are:
1. Fields that have the same name in both schemas must have the same data type regardless of
nullability. The reconciled field should have the data type of the Parquet side, so that
nullability is respected.
1. The reconciled schema contains exactly those fields defined in the Hive metastore schema.
- Any fields that only appear in the Parquet schema are dropped in the reconciled schema.
- Any fields that only appear in the Hive metastore schema are added as nullable fields in the
reconciled schema.
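
The reconciliation rules above can be sketched in plain Python. This is a simplified model rather than Spark's code: the `Field` tuple and the `reconcile` function are hypothetical names, types are plain strings, names are matched case-insensitively because Hive is case insensitive, and keeping the Parquet-side name casing for fields present on both sides is an assumption of this sketch.

```python
from typing import NamedTuple


class Field(NamedTuple):
    name: str
    dtype: str
    nullable: bool


def reconcile(hive_fields, parquet_fields):
    """Apply the reconciliation rules to two lists of Field tuples."""
    # Hive is case insensitive, so match on lower-cased names.
    parquet_by_name = {f.name.lower(): f for f in parquet_fields}

    reconciled = []
    for hive_field in hive_fields:
        parquet_field = parquet_by_name.get(hive_field.name.lower())
        if parquet_field is None:
            # Only in the Hive metastore schema: added as a nullable field.
            reconciled.append(Field(hive_field.name, hive_field.dtype, True))
        elif parquet_field.dtype != hive_field.dtype:
            raise ValueError(
                f"type mismatch for {hive_field.name!r}: "
                f"{hive_field.dtype} vs {parquet_field.dtype}"
            )
        else:
            # In both schemas: take the Parquet side so nullability is respected.
            reconciled.append(parquet_field)
    # Parquet-only fields are dropped because only Hive fields are iterated.
    return reconciled
```

With this model, a non-nullable Parquet column stays non-nullable after reconciliation, a column known only to the Hive metastore appears as nullable, and a column present only in the Parquet file is omitted.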
#### Metadata Refreshing
Spark SQL caches Parquet metadata for better performance. When Hive metastore Parquet table
conversion is enabled, metadata of those converted tables is also cached. If these tables are
updated by Hive or other external tools, you need to refresh them manually to ensure consistent
metadata.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
// spark is an existing SparkSession
spark.catalog.refreshTable("my_table")  // "my_table" is a placeholder table name
{% endhighlight %}
<div data-lang="java" markdown="1">
{% highlight java %}
// spark is an existing SparkSession
spark.catalog().refreshTable("my_table");  // "my_table" is a placeholder table name
{% endhighlight %}
<div data-lang="python" markdown="1">
{% highlight python %}
# spark is an existing SparkSession
spark.catalog.refreshTable("my_table")  # "my_table" is a placeholder table name
{% endhighlight %}
<div data-lang="r" markdown="1">
{% highlight r %}
refreshTable("my_table")  # "my_table" is a placeholder table name
{% endhighlight %}
<div data-lang="SQL" markdown="1">
{% highlight sql %}
REFRESH TABLE my_table;  -- my_table is a placeholder table name
{% endhighlight %}
## Columnar Encryption
Since Spark 3.2, columnar encryption is supported for Parquet tables with Apache Parquet 1.12+.
Parquet uses the envelope encryption practice, where file parts are encrypted with "data encryption keys" (DEKs), and the DEKs are encrypted with "master encryption keys" (MEKs). The DEKs are randomly generated by Parquet for each encrypted file/column. The MEKs are generated, stored and managed in a Key Management Service (KMS) of the user's choice. The Parquet Maven [repository]( has a jar with a mock KMS implementation that allows running column encryption and decryption using only a spark-shell, without deploying a KMS server (download the `parquet-hadoop-tests.jar` file and place it in the Spark `jars` folder):
<div data-lang="scala" markdown="1">
{% highlight scala %}
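sc.hadoopConfiguration.set("parquet.encryption.kms.client.class" ,
                           "org.apache.parquet.crypto.keytools.mocks.InMemoryKMS")

// Explicit master keys (base64 encoded) - required only for mock InMemoryKMS
// The two key values below are sample keys for demonstration only.
sc.hadoopConfiguration.set("parquet.encryption.key.list" ,
                           "keyA:AAECAwQFBgcICQoLDA0ODw== ,  keyB:AAECAAECAAECAAECAAECAA==")

// Activate Parquet encryption, driven by Hadoop properties
sc.hadoopConfiguration.set("parquet.crypto.factory.class" ,
                           "org.apache.parquet.crypto.keytools.PropertiesDrivenCryptoFactory")

// Write encrypted dataframe files.
// squaresDF is assumed to be an existing DataFrame with a column named "square".
// Column "square" will be protected with master key "keyA".
// Parquet file footers will be protected with master key "keyB"
squaresDF.write.
  option("parquet.encryption.column.keys" , "keyA:square").
  option("parquet.encryption.footer.key" , "keyB").
  parquet("/path/to/table.parquet.encrypted")

// Read encrypted dataframe files
val df2 = spark.read.parquet("/path/to/table.parquet.encrypted")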
{% endhighlight %}
#### KMS Client
The InMemoryKMS class is provided only for illustration and simple demonstration of Parquet encryption functionality. **It should not be used in a real deployment**. The master encryption keys must be kept and managed in a production-grade KMS system, deployed in the user's organization. Rolling out Spark with Parquet encryption requires implementing a client class for the KMS server. Parquet provides a plug-in [interface]( for development of such classes,
<div data-lang="java" markdown="1">
{% highlight java %}
public interface KmsClient {
  // Wraps a key - encrypts it with the master key.
  public String wrapKey(byte[] keyBytes, String masterKeyIdentifier);

  // Decrypts (unwraps) a key with the master key.
  public byte[] unwrapKey(String wrappedKey, String masterKeyIdentifier);

  // Use of initialization parameters is optional.
  public void initialize(Configuration configuration, String kmsInstanceID,
                         String kmsInstanceURL, String accessToken);
}
{% endhighlight %}
An [example]( of such a class for an open source [KMS]( can be found in the parquet-mr repository. The production KMS client should be designed in cooperation with the organization's security administrators, and built by developers with experience in access control management. Once such a class is created, it can be passed to applications via the `parquet.encryption.kms.client.class` parameter and leveraged by general Spark users as shown in the encrypted dataframe write/read sample above.
Note: By default, Parquet implements a "double envelope encryption" mode that minimizes the interaction of Spark executors with a KMS server. In this mode, the DEKs are encrypted with "key encryption keys" (KEKs, randomly generated by Parquet). The KEKs are encrypted with MEKs in the KMS; the result and the KEK itself are cached in Spark executor memory. Users interested in regular envelope encryption can switch to it by setting the `parquet.encryption.double.wrapping` parameter to `false`. For more details on Parquet encryption parameters, visit the parquet-hadoop configuration [page](
## Data Source Option
Data source options of Parquet can be set via:
* the `.option`/`.options` methods of
* `DataFrameReader`
* `DataFrameWriter`
* `DataStreamReader`
* `DataStreamWriter`
* `OPTIONS` clause at [CREATE TABLE USING DATA_SOURCE](sql-ref-syntax-ddl-create-table-datasource.html)
<table class="table">
  <tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
  <tr>
    <td><code>datetimeRebaseMode</code></td>
    <td>(value of <code>spark.sql.parquet.datetimeRebaseModeInRead</code> configuration)</td>
    <td>The <code>datetimeRebaseMode</code> option specifies the rebasing mode for the values of the <code>DATE</code>, <code>TIMESTAMP_MILLIS</code>, <code>TIMESTAMP_MICROS</code> logical types from the Julian to Proleptic Gregorian calendar.<br>
      Currently supported modes are:
      <ul>
        <li><code>EXCEPTION</code>: fails in reads of ancient dates/timestamps that are ambiguous between the two calendars.</li>
        <li><code>CORRECTED</code>: loads dates/timestamps without rebasing.</li>
        <li><code>LEGACY</code>: performs rebasing of ancient dates/timestamps from the Julian to Proleptic Gregorian calendar.</li>
      </ul>
    </td>
    <td>read</td>
  </tr>
  <tr>
    <td><code>int96RebaseMode</code></td>
    <td>(value of <code>spark.sql.parquet.int96RebaseModeInRead</code> configuration)</td>
    <td>The <code>int96RebaseMode</code> option specifies the rebasing mode for INT96 timestamps from the Julian to Proleptic Gregorian calendar.<br>
      Currently supported modes are:
      <ul>
        <li><code>EXCEPTION</code>: fails in reads of ancient INT96 timestamps that are ambiguous between the two calendars.</li>
        <li><code>CORRECTED</code>: loads INT96 timestamps without rebasing.</li>
        <li><code>LEGACY</code>: performs rebasing of ancient timestamps from the Julian to Proleptic Gregorian calendar.</li>
      </ul>
    </td>
    <td>read</td>
  </tr>
  <tr>
    <td><code>mergeSchema</code></td>
    <td>(value of <code>spark.sql.parquet.mergeSchema</code> configuration)</td>
    <td>Sets whether we should merge schemas collected from all Parquet part-files. This will override <code>spark.sql.parquet.mergeSchema</code>.</td>
    <td>read</td>
  </tr>
  <tr>
    <td><code>compression</code></td>
    <td><code>snappy</code></td>
    <td>Compression codec to use when saving to file. This can be one of the known case-insensitive shortened names (none, uncompressed, snappy, gzip, lzo, brotli, lz4, and zstd). This will override <code>spark.sql.parquet.compression.codec</code>.</td>
    <td>write</td>
  </tr>
</table>
Other generic options can be found in <a href=""> Generic Files Source Options</a>
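
As a hedged PySpark sketch of passing these options through `.option`: the helper name `copy_with_options` and its path arguments are hypothetical, and `spark` is assumed to be an existing `SparkSession`. It sets `mergeSchema` on the reader and `compression` on the writer.

```python
def copy_with_options(spark, source_path, target_path):
    """Read Parquet with schema merging, then write it back gzip-compressed.

    `spark` is assumed to be an existing SparkSession (not created here).
    """
    df = (
        spark.read
        .option("mergeSchema", "true")  # applies when reading
        .parquet(source_path)
    )
    (
        df.write
        .option("compression", "gzip")  # applies when writing
        .parquet(target_path)
    )
    return df
```

Options set this way affect only the single read or write they are attached to; the corresponding `spark.sql.parquet.*` configurations remain unchanged for the rest of the session.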
### Configuration
Configuration of Parquet can be done using the `setConf` method on `SparkSession` or by running
`SET key=value` commands using SQL.
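
A minimal PySpark sketch of both approaches follows. In PySpark, runtime configuration is set through `spark.conf.set`; the helper name `set_parquet_conf` and its `use_sql` flag are hypothetical, and `spark` is assumed to be an existing `SparkSession`.

```python
def set_parquet_conf(spark, key, value, use_sql=False):
    """Set a Parquet-related SQL configuration on an existing SparkSession.

    With use_sql=True the setting is applied through a SQL SET command,
    otherwise through the runtime configuration API.
    """
    if use_sql:
        spark.sql(f"SET {key}={value}")
    else:
        spark.conf.set(key, value)
```

For instance, `set_parquet_conf(spark, "spark.sql.parquet.mergeSchema", "true")` and the same call with `use_sql=True` are intended to have the same effect on the session.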
<table class="table">
[SPARK-31295][DOC] Supplement version for configuration appear in doc ### What changes were proposed in this pull request? This PR supplements version for configuration appear in docs. I sorted out some information show below. **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | -- spark.deploy.retainedApplications | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.retainedDrivers | 1.1.0 | None | 7446f5ff93142d2dd5c79c63fa947f47a1d4db8b#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.spreadOut | 0.6.1 | None | bb2b9ff37cd2503cc6ea82c5dd395187b0910af0#diff-0e7ae91819fc8f7b47b0f97be7116325 |   spark.deploy.defaultCores | 0.9.0 | None | d8bcc8e9a095c1b20dd7a17b6535800d39bff80e#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.maxExecutorRetries | 1.6.3 | SPARK-16956 | ace458f0330f22463ecf7cbee7c0465e10fba8a8#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.worker.resource.{resourceName}.amount | 3.0.0 | SPARK-27371 | cbad616d4cb0c58993a88df14b5e30778c7f7e85#diff-d25032e4a3ae1b85a59e4ca9ccf189a8 |   spark.worker.resource.{resourceName}.discoveryScript | 3.0.0 | SPARK-27371 | cbad616d4cb0c58993a88df14b5e30778c7f7e85#diff-d25032e4a3ae1b85a59e4ca9ccf189a8 |   spark.worker.resourcesFile | 3.0.0 | SPARK-27369 | 7cbe01e8efc3f6cd3a0cac4bcfadea8fcc74a955#diff-b2fc8d6ab7ac5735085e2d6cfacb95da |   spark.shuffle.service.db.enabled | 3.0.0 | SPARK-26288 | 8b0aa59218c209d39cbba5959302d8668b885cf6#diff-6bdad48cfc34314e89599655442ff210 | | 2.4.0 | SPARK-24340 | 8ef167a5f9ba8a79bb7ca98a9844fe9cfcfea060#diff-916ca56b663f178f302c265b7ef38499 |   spark.deploy.recoveryMode | 0.8.1 | None | d66c01f2b6defb3db6c1be99523b734a4d960532#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.recoveryDirectory | 0.8.1 | None | d66c01f2b6defb3db6c1be99523b734a4d960532#diff-29dffdccd5a7f4c8b496c293e87c8668 |   **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | 
-- spark.sql.legacy.replaceDatabricksSparkAvro.enabled | 2.4.0 | SPARK-25129 | ac0174e55af2e935d41545721e9f430c942b3a0c#diff-9a6b543db706f1a90f790783d6930a13 |   spark.sql.avro.compression.codec | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |   spark.sql.avro.deflate.level | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |   **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | -- spark.sql.orc.impl | 2.3.0 | SPARK-20728 | 326f1d6728a7734c228d8bfaa69442a1c7b92e9b#diff-9a6b543db706f1a90f790783d6930a13 |   spark.sql.orc.enableVectorizedReader | 2.3.0 | SPARK-16060 | 60f6b994505e3f82091a04eed2dc0a9e8bd523ce#diff-9a6b543db706f1a90f790783d6930a13 |   **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | -- spark.sql.parquet.binaryAsString | 1.1.1 | SPARK-2927 | de501e169f24e4573747aec85b7651c98633c028#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.parquet.int96AsTimestamp | 1.3.0 | SPARK-4987 | 67d52207b5cf2df37ca70daff2a160117510f55e#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.parquet.compression.codec | 1.1.1 | SPARK-3131 | 3a9d874d7a46ab8b015631d91ba479d9a0ba827f#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.parquet.filterPushdown | 1.2.0 | SPARK-4391 | 576688aa2a19bd4ba239a2b93af7947f983e5124#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.hive.convertMetastoreParquet | 1.1.1 | SPARK-2406 | cc4015d2fa3785b92e6ab079b3abcf17627f7c56#diff-ff50aea397a607b79df9bec6f2a841db |   spark.sql.parquet.mergeSchema | 1.5.0 | SPARK-8690 | 246265f2bb056d5e9011d3331b809471a24ff8d7#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.parquet.writeLegacyFormat | 1.6.0 | SPARK-10400 | 01cd688f5245cbb752863100b399b525b31c3510#diff-41ef65b9ef5b518f77e2a03559893f4d |   ### Why are the changes needed? Supplemental configuration version information. ### Does this PR introduce any user-facing change? 
'No'. ### How was this patch tested? Jenkins test Closes #28064 from beliefer/supplement-doc-for-data-sources. Authored-by: beliefer <> Signed-off-by: HyukjinKwon <>
2020-03-30 23:33:46 -04:00
<tr><th>Property Name</th><th>Default</th><th>Meaning</th><th>Since Version</th></tr>
Some other Parquet-producing systems, in particular Impala, Hive, and older versions of Spark SQL, do
not differentiate between binary data and strings when writing out the Parquet schema. This
flag tells Spark SQL to interpret binary data as a string to provide compatibility with these systems.
[SPARK-31295][DOC] Supplement version for configuration appear in doc ### What changes were proposed in this pull request? This PR supplements version for configuration appear in docs. I sorted out some information show below. **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | -- spark.deploy.retainedApplications | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.retainedDrivers | 1.1.0 | None | 7446f5ff93142d2dd5c79c63fa947f47a1d4db8b#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.spreadOut | 0.6.1 | None | bb2b9ff37cd2503cc6ea82c5dd395187b0910af0#diff-0e7ae91819fc8f7b47b0f97be7116325 |   spark.deploy.defaultCores | 0.9.0 | None | d8bcc8e9a095c1b20dd7a17b6535800d39bff80e#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.maxExecutorRetries | 1.6.3 | SPARK-16956 | ace458f0330f22463ecf7cbee7c0465e10fba8a8#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.worker.resource.{resourceName}.amount | 3.0.0 | SPARK-27371 | cbad616d4cb0c58993a88df14b5e30778c7f7e85#diff-d25032e4a3ae1b85a59e4ca9ccf189a8 |   spark.worker.resource.{resourceName}.discoveryScript | 3.0.0 | SPARK-27371 | cbad616d4cb0c58993a88df14b5e30778c7f7e85#diff-d25032e4a3ae1b85a59e4ca9ccf189a8 |   spark.worker.resourcesFile | 3.0.0 | SPARK-27369 | 7cbe01e8efc3f6cd3a0cac4bcfadea8fcc74a955#diff-b2fc8d6ab7ac5735085e2d6cfacb95da |   spark.shuffle.service.db.enabled | 3.0.0 | SPARK-26288 | 8b0aa59218c209d39cbba5959302d8668b885cf6#diff-6bdad48cfc34314e89599655442ff210 | | 2.4.0 | SPARK-24340 | 8ef167a5f9ba8a79bb7ca98a9844fe9cfcfea060#diff-916ca56b663f178f302c265b7ef38499 |   spark.deploy.recoveryMode | 0.8.1 | None | d66c01f2b6defb3db6c1be99523b734a4d960532#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.recoveryDirectory | 0.8.1 | None | d66c01f2b6defb3db6c1be99523b734a4d960532#diff-29dffdccd5a7f4c8b496c293e87c8668 |   **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | 
-- spark.sql.legacy.replaceDatabricksSparkAvro.enabled | 2.4.0 | SPARK-25129 | ac0174e55af2e935d41545721e9f430c942b3a0c#diff-9a6b543db706f1a90f790783d6930a13 |   spark.sql.avro.compression.codec | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |   spark.sql.avro.deflate.level | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |   **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | -- spark.sql.orc.impl | 2.3.0 | SPARK-20728 | 326f1d6728a7734c228d8bfaa69442a1c7b92e9b#diff-9a6b543db706f1a90f790783d6930a13 |   spark.sql.orc.enableVectorizedReader | 2.3.0 | SPARK-16060 | 60f6b994505e3f82091a04eed2dc0a9e8bd523ce#diff-9a6b543db706f1a90f790783d6930a13 |   **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | -- spark.sql.parquet.binaryAsString | 1.1.1 | SPARK-2927 | de501e169f24e4573747aec85b7651c98633c028#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.parquet.int96AsTimestamp | 1.3.0 | SPARK-4987 | 67d52207b5cf2df37ca70daff2a160117510f55e#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.parquet.compression.codec | 1.1.1 | SPARK-3131 | 3a9d874d7a46ab8b015631d91ba479d9a0ba827f#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.parquet.filterPushdown | 1.2.0 | SPARK-4391 | 576688aa2a19bd4ba239a2b93af7947f983e5124#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.hive.convertMetastoreParquet | 1.1.1 | SPARK-2406 | cc4015d2fa3785b92e6ab079b3abcf17627f7c56#diff-ff50aea397a607b79df9bec6f2a841db |   spark.sql.parquet.mergeSchema | 1.5.0 | SPARK-8690 | 246265f2bb056d5e9011d3331b809471a24ff8d7#diff-41ef65b9ef5b518f77e2a03559893f4d |   spark.sql.parquet.writeLegacyFormat | 1.6.0 | SPARK-10400 | 01cd688f5245cbb752863100b399b525b31c3510#diff-41ef65b9ef5b518f77e2a03559893f4d |   ### Why are the changes needed? Supplemental configuration version information. ### Does this PR introduce any user-facing change? 
'No'. ### How was this patch tested? Jenkins test Closes #28064 from beliefer/supplement-doc-for-data-sources. Authored-by: beliefer <> Signed-off-by: HyukjinKwon <>
2020-03-30 23:33:46 -04:00
Some Parquet-producing systems, in particular Impala and Hive, store Timestamp into INT96. This
flag tells Spark SQL to interpret INT96 data as a timestamp to provide compatibility with these systems.
[SPARK-31295][DOC] Supplement version for configuration appear in doc ### What changes were proposed in this pull request? This PR supplements version for configuration appear in docs. I sorted out some information show below. **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | -- spark.deploy.retainedApplications | 0.8.0 | None | 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.retainedDrivers | 1.1.0 | None | 7446f5ff93142d2dd5c79c63fa947f47a1d4db8b#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.spreadOut | 0.6.1 | None | bb2b9ff37cd2503cc6ea82c5dd395187b0910af0#diff-0e7ae91819fc8f7b47b0f97be7116325 |   spark.deploy.defaultCores | 0.9.0 | None | d8bcc8e9a095c1b20dd7a17b6535800d39bff80e#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.maxExecutorRetries | 1.6.3 | SPARK-16956 | ace458f0330f22463ecf7cbee7c0465e10fba8a8#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.worker.resource.{resourceName}.amount | 3.0.0 | SPARK-27371 | cbad616d4cb0c58993a88df14b5e30778c7f7e85#diff-d25032e4a3ae1b85a59e4ca9ccf189a8 |   spark.worker.resource.{resourceName}.discoveryScript | 3.0.0 | SPARK-27371 | cbad616d4cb0c58993a88df14b5e30778c7f7e85#diff-d25032e4a3ae1b85a59e4ca9ccf189a8 |   spark.worker.resourcesFile | 3.0.0 | SPARK-27369 | 7cbe01e8efc3f6cd3a0cac4bcfadea8fcc74a955#diff-b2fc8d6ab7ac5735085e2d6cfacb95da |   spark.shuffle.service.db.enabled | 3.0.0 | SPARK-26288 | 8b0aa59218c209d39cbba5959302d8668b885cf6#diff-6bdad48cfc34314e89599655442ff210 | | 2.4.0 | SPARK-24340 | 8ef167a5f9ba8a79bb7ca98a9844fe9cfcfea060#diff-916ca56b663f178f302c265b7ef38499 |   spark.deploy.recoveryMode | 0.8.1 | None | d66c01f2b6defb3db6c1be99523b734a4d960532#diff-29dffdccd5a7f4c8b496c293e87c8668 |   spark.deploy.recoveryDirectory | 0.8.1 | None | d66c01f2b6defb3db6c1be99523b734a4d960532#diff-29dffdccd5a7f4c8b496c293e87c8668 |   **docs/** Item name | Since version | JIRA ID | Commit ID | Note -- | -- | -- | -- | 
Sets the compression codec used when writing Parquet files. If either <code>compression</code> or
<code>parquet.compression</code> is specified in the table-specific options/properties, the order of precedence is
<code>compression</code>, <code>parquet.compression</code>, <code>spark.sql.parquet.compression.codec</code>. Acceptable values include:
none, uncompressed, snappy, gzip, lzo, brotli, lz4, zstd.
Note that <code>brotli</code> requires <code>BrotliCodec</code> to be installed, and that <code>zstd</code> requires
<code>ZStandardCodec</code> to be installed on Hadoop versions before 2.9.0.
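
The precedence rule above can be illustrated with a small sketch. This is not Spark's implementation: `resolve_parquet_codec` is a hypothetical helper, the table options are modeled as a plain dictionary, and the case-insensitive comparison of codec names is an assumption.

```python
from typing import Mapping

# Codec names taken from the list of acceptable values above.
ACCEPTED_CODECS = {"none", "uncompressed", "snappy", "gzip", "lzo", "brotli", "lz4", "zstd"}


def resolve_parquet_codec(table_options: Mapping[str, str], session_codec: str) -> str:
    """Hypothetical helper: pick the codec by the documented order of precedence.

    table_options stands in for the table-specific options/properties;
    session_codec stands in for the value of spark.sql.parquet.compression.codec.
    """
    chosen = session_codec  # lowest precedence: the session-level setting
    # Check the table-level keys from highest to lowest precedence.
    for key in ("compression", "parquet.compression"):
        if key in table_options:
            chosen = table_options[key]
            break
    codec = chosen.lower()  # assumption: names are matched case-insensitively
    if codec not in ACCEPTED_CODECS:
        raise ValueError(f"Unsupported Parquet codec: {chosen!r}")
    return codec


if __name__ == "__main__":
    # `compression` wins over `parquet.compression` and the session setting.
    print(resolve_parquet_codec({"compression": "gzip", "parquet.compression": "zstd"}, "snappy"))
    # With no table-level option, the session setting applies.
    print(resolve_parquet_codec({}, "snappy"))
```

The helper only models which setting wins; it says nothing about whether the chosen codec is actually installed in the runtime.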
<td>Enables Parquet filter push-down optimization when set to true.</td>
When set to false, Spark SQL will use the Hive SerDe for Parquet tables instead of the built-in support.
When true, the Parquet data source merges schemas collected from all data files; otherwise, the
schema is picked from the summary file, or from a random data file if no summary file is available.
If true, data will be written in the format used by Spark 1.4 and earlier. For example, decimal values
will be written in Apache Parquet's fixed-length byte array format, which other systems such as
Apache Hive and Apache Impala use. If false, the newer format in Parquet will be used. For
example, decimals will be written in an int-based format. If Parquet output is intended for use
with systems that do not support this newer format, set this to true.
<td>The rebasing mode for values of the <code>DATE</code>, <code>TIMESTAMP_MILLIS</code>, and <code>TIMESTAMP_MICROS</code> logical types from the Julian to the Proleptic Gregorian calendar:<br>
<li><code>EXCEPTION</code>: Spark fails the read if it sees ancient dates/timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark does not rebase and reads the dates/timestamps as they are.</li>
<li><code>LEGACY</code>: Spark rebases dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files.</li>
This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.
<td>The rebasing mode for values of the <code>DATE</code>, <code>TIMESTAMP_MILLIS</code>, and <code>TIMESTAMP_MICROS</code> logical types from the Proleptic Gregorian to the Julian calendar:<br>
<li><code>EXCEPTION</code>: Spark fails the write if it sees ancient dates/timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark does not rebase and writes the dates/timestamps as they are.</li>
<li><code>LEGACY</code>: Spark rebases dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files.</li>
<td>The rebasing mode for values of the <code>INT96</code> timestamp type from the Julian to the Proleptic Gregorian calendar:<br>
<li><code>EXCEPTION</code>: Spark fails the read if it sees ancient INT96 timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark does not rebase and reads the timestamps as they are.</li>
<li><code>LEGACY</code>: Spark rebases INT96 timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files.</li>
This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.
<td>The rebasing mode for values of the <code>INT96</code> timestamp type from the Proleptic Gregorian to the Julian calendar:<br>
<li><code>EXCEPTION</code>: Spark fails the write if it sees ancient INT96 timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark does not rebase and writes the timestamps as they are.</li>
<li><code>LEGACY</code>: Spark rebases INT96 timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files.</li>
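
To make the three rebasing modes described above concrete, here is a simplified sketch of the write-side behavior for a single date. It is not Spark's code: `rebase_date_for_write` and `julian_day_number` are hypothetical names, the cutover date of 1582-10-15 is an assumption about which dates count as "ancient", and dates that fall in the 1582-10-05 to 1582-10-14 gap get no special handling. The sketch assumes that rebasing keeps the year, month, and day fields and only changes the calendar used to turn them into a day count.

```python
from datetime import date

EPOCH = date(1970, 1, 1)
EPOCH_JDN = 2440588  # Julian Day Number of 1970-01-01 (Gregorian)
# Assumption: dates on or after the Gregorian switch are identical in both calendars.
GREGORIAN_CUTOVER = date(1582, 10, 15)


def julian_day_number(year: int, month: int, day: int) -> int:
    """Julian Day Number of a date whose fields are read in the Julian calendar."""
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083


def rebase_date_for_write(d: date, mode: str) -> int:
    """Hypothetical helper: days since the epoch that would be stored for date d.

    d carries Proleptic Gregorian fields, as Python's datetime.date always does.
    """
    if mode not in ("EXCEPTION", "CORRECTED", "LEGACY"):
        raise ValueError(f"Unknown rebase mode: {mode!r}")
    gregorian_days = (d - EPOCH).days
    if d >= GREGORIAN_CUTOVER or mode == "CORRECTED":
        # No rebase: store the Proleptic Gregorian day count as it is.
        return gregorian_days
    if mode == "EXCEPTION":
        raise ValueError(f"{d} is ambiguous between the two calendars")
    # LEGACY: same year/month/day, but counted in the Julian calendar.
    return julian_day_number(d.year, d.month, d.day) - EPOCH_JDN


if __name__ == "__main__":
    old = date(1000, 1, 1)
    corrected = rebase_date_for_write(old, "CORRECTED")
    legacy = rebase_date_for_write(old, "LEGACY")
    print(legacy - corrected)  # 5: the calendars are five days apart at this date
```

Modern dates pass through unchanged in every mode, which is why the choice of mode only matters for data that contains dates before the cutover.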