[SPARK-34437][SQL][DOCS] Update Spark SQL guide about the rebasing DS options and SQL configs

### What changes were proposed in this pull request?
In the PR, I propose to update the Spark SQL guide about the SQL configs that are related to datetime rebasing:
- spark.sql.parquet.int96RebaseModeInWrite
- spark.sql.parquet.datetimeRebaseModeInWrite
- spark.sql.parquet.int96RebaseModeInRead
- spark.sql.parquet.datetimeRebaseModeInRead
- spark.sql.avro.datetimeRebaseModeInWrite
- spark.sql.avro.datetimeRebaseModeInRead

Parquet options added by #31489:
- datetimeRebaseMode
- int96RebaseMode

and Avro options added by #31529:
- datetimeRebaseMode

<img width="998" alt="Screenshot 2021-02-17 at 21 42 09" src="https://user-images.githubusercontent.com/1580697/108252043-3afb8900-7169-11eb-8568-511e21fa7f78.png">
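For reference, the per-read/per-write DS options default to the corresponding SQL configs, so they can override a session-wide setting for a single read. A minimal Scala sketch (hypothetical path, `spark` as provided by spark-shell):

```scala
// Session-wide default for all Parquet reads in this session.
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "LEGACY")

// Per-read override via the new DS option (hypothetical path).
val df = spark.read
  .option("datetimeRebaseMode", "CORRECTED")
  .parquet("/tmp/ancient_dates_parquet")
```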

### Why are the changes needed?
To inform users about supported DS options and SQL configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By generating the doc and manually checking:
```
$ SKIP_API=1 SKIP_SCALADOC=1 SKIP_PYTHONDOC=1 SKIP_RDOC=1 jekyll serve --watch
```

Closes #31564 from MaxGekk/doc-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Max Gekk 2021-02-18 17:48:50 +09:00 committed by HyukjinKwon
parent 7b549c3e53
commit b58f0976a9
2 changed files with 124 additions and 0 deletions

@@ -283,6 +283,19 @@ Data source options of Avro can be set via:
</td>
<td>function <code>from_avro</code></td>
</tr>
<tr>
<td><code>datetimeRebaseMode</code></td>
<td>The SQL config <code>spark.sql.avro</code> <code>.datetimeRebaseModeInRead</code>, which is <code>EXCEPTION</code> by default</td>
<td>The <code>datetimeRebaseMode</code> option allows specifying the rebasing mode for the values of the <code>date</code>, <code>timestamp-micros</code>, <code>timestamp-millis</code> logical types from the Julian to the Proleptic Gregorian calendar.<br>
Currently supported modes are:
<ul>
<li><code>EXCEPTION</code>: fails on reads of ancient dates/timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: loads dates/timestamps without rebasing.</li>
<li><code>LEGACY</code>: performs rebasing of ancient dates/timestamps from the Julian to the Proleptic Gregorian calendar.</li>
</ul>
</td>
<td>read and function <code>from_avro</code></td>
</tr>
</table>
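For illustration, a minimal Scala sketch of passing the option to an Avro read (hypothetical path, `spark` as provided by spark-shell); the same key can be supplied in the options map of the `from_avro` function:

```scala
// Per-read override of spark.sql.avro.datetimeRebaseModeInRead (hypothetical path).
val df = spark.read
  .format("avro")
  .option("datetimeRebaseMode", "CORRECTED")
  .load("/tmp/ancient_events_avro")
```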
## Configuration
@@ -318,6 +331,31 @@ Configuration of Avro can be done using the `setConf` method on SparkSession or
</td>
<td>2.4.0</td>
</tr>
<tr>
<td>spark.sql.avro.datetimeRebaseModeInRead</td>
<td><code>EXCEPTION</code></td>
<td>The rebasing mode for the values of the <code>date</code>, <code>timestamp-micros</code>, <code>timestamp-millis</code> logical types from the Julian to the Proleptic Gregorian calendar:<br>
<ul>
<li><code>EXCEPTION</code>: Spark will fail the read if it sees ancient dates/timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark will not rebase and will read the dates/timestamps as they are.</li>
<li><code>LEGACY</code>: Spark will rebase dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Avro files.</li>
</ul>
This config is only effective if the writer info (like Spark, Hive) of the Avro files is unknown.
</td>
<td>3.0.0</td>
</tr>
<tr>
<td>spark.sql.avro.datetimeRebaseModeInWrite</td>
<td><code>EXCEPTION</code></td>
<td>The rebasing mode for the values of the <code>date</code>, <code>timestamp-micros</code>, <code>timestamp-millis</code> logical types from the Proleptic Gregorian to the Julian calendar:<br>
<ul>
<li><code>EXCEPTION</code>: Spark will fail the write if it sees ancient dates/timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark will not rebase and will write the dates/timestamps as they are.</li>
<li><code>LEGACY</code>: Spark will rebase dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Avro files.</li>
</ul>
</td>
<td>3.0.0</td>
</tr>
</table>
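A minimal Scala sketch (illustrative values, `spark` as provided by spark-shell) of setting these configs for the whole session:

```scala
// Read side: load ancient dates/timestamps without rebasing.
spark.conf.set("spark.sql.avro.datetimeRebaseModeInRead", "CORRECTED")
// Write side: rebase to the hybrid calendar so legacy readers interpret the values correctly.
spark.conf.set("spark.sql.avro.datetimeRebaseModeInWrite", "LEGACY")
```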
## Compatibility with Databricks spark-avro

@@ -252,6 +252,42 @@ REFRESH TABLE my_table;
</div>
## Data Source Option
Data source options of Parquet can be set via:
* the `.option`/`.options` methods of `DataFrameReader` or `DataFrameWriter`
* the `.option`/`.options` methods of `DataStreamReader` or `DataStreamWriter`
<table class="table">
<tr><th><b>Property Name</b></th><th><b>Default</b></th><th><b>Meaning</b></th><th><b>Scope</b></th></tr>
<tr>
<td><code>datetimeRebaseMode</code></td>
<td>The SQL config <code>spark.sql.parquet</code> <code>.datetimeRebaseModeInRead</code>, which is <code>EXCEPTION</code> by default</td>
<td>The <code>datetimeRebaseMode</code> option allows specifying the rebasing mode for the values of the <code>DATE</code>, <code>TIMESTAMP_MILLIS</code>, <code>TIMESTAMP_MICROS</code> logical types from the Julian to the Proleptic Gregorian calendar.<br>
Currently supported modes are:
<ul>
<li><code>EXCEPTION</code>: fails on reads of ancient dates/timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: loads dates/timestamps without rebasing.</li>
<li><code>LEGACY</code>: performs rebasing of ancient dates/timestamps from the Julian to the Proleptic Gregorian calendar.</li>
</ul>
</td>
<td>read</td>
</tr>
<tr>
<td><code>int96RebaseMode</code></td>
<td>The SQL config <code>spark.sql.parquet</code> <code>.int96RebaseModeInRead</code>, which is <code>EXCEPTION</code> by default</td>
<td>The <code>int96RebaseMode</code> option allows specifying the rebasing mode for INT96 timestamps from the Julian to the Proleptic Gregorian calendar.<br>
Currently supported modes are:
<ul>
<li><code>EXCEPTION</code>: fails on reads of ancient INT96 timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: loads INT96 timestamps without rebasing.</li>
<li><code>LEGACY</code>: performs rebasing of ancient timestamps from the Julian to the Proleptic Gregorian calendar.</li>
</ul>
</td>
<td>read</td>
</tr>
</table>
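For illustration, a minimal Scala sketch of passing both options to a Parquet read (hypothetical path, `spark` as provided by spark-shell):

```scala
val df = spark.read
  .option("datetimeRebaseMode", "CORRECTED") // DATE, TIMESTAMP_MILLIS, TIMESTAMP_MICROS values
  .option("int96RebaseMode", "CORRECTED")    // INT96 timestamps
  .parquet("/tmp/legacy_hive_table")         // hypothetical path
```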
### Configuration
Configuration of Parquet can be done using the `setConf` method on `SparkSession` or by running
@@ -329,4 +365,54 @@ Configuration of Parquet can be done using the `setConf` method on `SparkSession
</td>
<td>1.6.0</td>
</tr>
<tr>
<td>spark.sql.parquet.datetimeRebaseModeInRead</td>
<td><code>EXCEPTION</code></td>
<td>The rebasing mode for the values of the <code>DATE</code>, <code>TIMESTAMP_MILLIS</code>, <code>TIMESTAMP_MICROS</code> logical types from the Julian to the Proleptic Gregorian calendar:<br>
<ul>
<li><code>EXCEPTION</code>: Spark will fail the read if it sees ancient dates/timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark will not rebase and will read the dates/timestamps as they are.</li>
<li><code>LEGACY</code>: Spark will rebase dates/timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files.</li>
</ul>
This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.
</td>
<td>3.0.0</td>
</tr>
<tr>
<td>spark.sql.parquet.datetimeRebaseModeInWrite</td>
<td><code>EXCEPTION</code></td>
<td>The rebasing mode for the values of the <code>DATE</code>, <code>TIMESTAMP_MILLIS</code>, <code>TIMESTAMP_MICROS</code> logical types from the Proleptic Gregorian to the Julian calendar:<br>
<ul>
<li><code>EXCEPTION</code>: Spark will fail the write if it sees ancient dates/timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark will not rebase and will write the dates/timestamps as they are.</li>
<li><code>LEGACY</code>: Spark will rebase dates/timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files.</li>
</ul>
</td>
<td>3.0.0</td>
</tr>
<tr>
<td>spark.sql.parquet.int96RebaseModeInRead</td>
<td><code>EXCEPTION</code></td>
<td>The rebasing mode for the values of the <code>INT96</code> timestamp type from the Julian to the Proleptic Gregorian calendar:<br>
<ul>
<li><code>EXCEPTION</code>: Spark will fail the read if it sees ancient INT96 timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark will not rebase and will read the timestamps as they are.</li>
<li><code>LEGACY</code>: Spark will rebase INT96 timestamps from the legacy hybrid (Julian + Gregorian) calendar to the Proleptic Gregorian calendar when reading Parquet files.</li>
</ul>
This config is only effective if the writer info (like Spark, Hive) of the Parquet files is unknown.
</td>
<td>3.1.0</td>
</tr>
<tr>
<td>spark.sql.parquet.int96RebaseModeInWrite</td>
<td><code>EXCEPTION</code></td>
<td>The rebasing mode for the values of the <code>INT96</code> timestamp type from the Proleptic Gregorian to the Julian calendar:<br>
<ul>
<li><code>EXCEPTION</code>: Spark will fail the write if it sees ancient timestamps that are ambiguous between the two calendars.</li>
<li><code>CORRECTED</code>: Spark will not rebase and will write the timestamps as they are.</li>
<li><code>LEGACY</code>: Spark will rebase INT96 timestamps from the Proleptic Gregorian calendar to the legacy hybrid (Julian + Gregorian) calendar when writing Parquet files.</li>
</ul>
</td>
<td>3.1.0</td>
</tr>
</table>
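A minimal Scala sketch (illustrative values, `spark` as provided by spark-shell) of setting the Parquet rebasing configs for the whole session; the equivalent `SET key=value` commands can be run from SQL:

```scala
// Fail fast on ambiguous ancient values when reading files whose writer info is unknown.
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInRead", "EXCEPTION")
spark.conf.set("spark.sql.parquet.int96RebaseModeInRead", "EXCEPTION")
// Write in the legacy hybrid calendar so older Spark/Hive readers interpret the values correctly.
spark.conf.set("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")
spark.conf.set("spark.sql.parquet.int96RebaseModeInWrite", "LEGACY")
```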