spark-instrumented-optimizer/python/pyspark/sql
Max Gekk a85490659f [SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read
### What changes were proposed in this pull request?
In the PR, I propose new options for the Parquet datasource:
1. `datetimeRebaseMode`
2. `int96RebaseMode`

Both options control how ancient date and timestamp column values are loaded from parquet files. The `datetimeRebaseMode` option affects values of the `DATE`, `TIMESTAMP_MICROS` and `TIMESTAMP_MILLIS` types, while `int96RebaseMode` affects `INT96` timestamps.

The options accept the same values as the SQL configs `spark.sql.legacy.parquet.datetimeRebaseModeInRead` and `spark.sql.legacy.parquet.int96RebaseModeInRead`, namely (a short PySpark sketch follows the list):
- `"LEGACY"`: Spark rebases dates/timestamps from the legacy hybrid calendar (Julian + Gregorian) to the Proleptic Gregorian calendar.
- `"CORRECTED"`: dates/timestamps are read AS IS from parquet files.
- `"EXCEPTION"`: Spark fails the read if it encounters ancient dates/timestamps that are ambiguous between the two calendars.
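The sketch below is illustrative only: the path and the chosen mode values are hypothetical, while the option names come from this PR (which also documents them in `readwriter.py`).
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Per-read rebase modes: DATE/TIMESTAMP_MICROS/TIMESTAMP_MILLIS values are
# read as is ("CORRECTED"), while INT96 timestamps are rebased from the
# hybrid Julian + Gregorian calendar to Proleptic Gregorian ("LEGACY").
# "/tmp/ancient_parquet" is a placeholder path.
df = (spark.read
      .option("datetimeRebaseMode", "CORRECTED")
      .option("int96RebaseMode", "LEGACY")
      .parquet("/tmp/ancient_parquet"))
df.show()
```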

### Why are the changes needed?
1. The new options allow loading parquet files from two (or more) sources with different rebasing modes in the same query. For instance:
```scala
val df1 = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder1)
val df2 = spark.read.option("datetimeRebaseMode", "corrected").parquet(folder2)
df1.join(df2, ...)
```
Before the changes, this is impossible because the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` influences both reads.

2. Mixing the Dataset/DataFrame and RDD APIs also becomes possible. Since SQL configs are not propagated through RDDs, the following code fails on ancient timestamps:
```scala
spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "legacy")
spark.read.parquet(folder).distinct.rdd.collect()
```
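For comparison, a hedged sketch of the option-based alternative: the rebase mode is attached to the read itself rather than to the session config, so it does not rely on SQL conf propagation through RDDs.
```python
# The rebase mode travels with the Parquet scan, so converting to an RDD
# afterwards does not lose it. `spark` and `folder` are assumed to be
# defined as in the snippets above.
df = spark.read.option("datetimeRebaseMode", "legacy").parquet(folder)
df.distinct().rdd.collect()
```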

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
By running the modified test suites:
```
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV1Suite"
$ build/sbt "sql/test:testOnly *ParquetRebaseDatetimeV2Suite"
```

Closes #31489 from MaxGekk/parquet-rebase-options.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-02-08 13:28:40 +00:00
avro [SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of dev/lint-python 2021-02-02 09:30:50 +09:00
pandas [SPARK-33489][PYSPARK] Add NullType support for Arrow executions 2021-01-25 11:34:47 +09:00
tests [SPARK-34319][SQL] Resolve duplicate attributes for FlatMapCoGroupsInPandas/MapInPandas 2021-02-02 16:25:32 +09:00
__init__.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
__init__.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
_typing.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
catalog.py [SPARK-33730][PYTHON] Standardize warning types 2021-01-18 09:32:55 +09:00
catalog.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
column.py [SPARK-33730][PYTHON] Standardize warning types 2021-01-18 09:32:55 +09:00
column.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
conf.py [SPARK-32138] Drop Python 2.7, 3.4 and 3.5 2020-07-14 11:22:44 +09:00
conf.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
context.py [SPARK-34157][SQL] Unify output of SHOW TABLES and pass output attributes properly 2021-02-08 08:39:58 +00:00
context.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
dataframe.py [PYTHON][MINOR] Fix docstring of DataFrame.join 2021-02-06 09:08:49 -06:00
dataframe.pyi [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
functions.py [SPARK-34300][PYSPARK][DOCS][MINOR] Fix some typos and syntax issues in docstrings and output of dev/lint-python 2021-02-02 09:30:50 +09:00
functions.pyi [SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs 2021-02-02 09:29:40 +09:00
group.py [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
group.pyi [SPARK-32714][PYTHON] Initial pyspark-stubs port 2020-09-24 14:15:36 +09:00
readwriter.py [SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read 2021-02-08 13:28:40 +00:00
readwriter.pyi [SPARK-33566][CORE][SQL][SS][PYTHON] Make unescapedQuoteHandling option configurable when read CSV 2020-11-27 15:47:39 +09:00
session.py [SPARK-33989][SQL] Strip auto-generated cast when using Cast.sql 2021-01-14 15:27:14 +00:00
session.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
streaming.py [SPARK-34377][SQL] Add new parquet datasource options to control datetime rebasing in read 2021-02-08 13:28:40 +00:00
streaming.pyi [SPARK-33836][SS][PYTHON] Expose DataStreamReader.table and DataStreamWriter.toTable 2020-12-21 19:42:59 +09:00
types.py [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
types.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
udf.py [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
udf.pyi [SPARK-33457][PYTHON] Adjust mypy configuration 2020-11-25 09:27:04 +09:00
utils.py Spelling r common dev mlib external project streaming resource managers python 2020-11-27 10:22:45 -06:00
window.py [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00
window.pyi [SPARK-33250][PYTHON][DOCS] Migration to NumPy documentation style in SQL (pyspark.sql.*) 2020-11-03 10:00:49 +09:00