spark-instrumented-optimizer/sql
Max Gekk c77c0d41e1 [SPARK-36825][SQL] Read/write dataframes with ANSI intervals from/to parquet files
### What changes were proposed in this pull request?
Allow saving and loading of ANSI intervals (`YearMonthIntervalType` and `DayTimeIntervalType`) to/from the Parquet datasource. After the changes, Spark saves ANSI intervals as primitive physical Parquet types without any modification of the values:
- year-month intervals as `INT32`
- day-time intervals as `INT64`

To load the values back as intervals, Spark records the interval types in the extra key `org.apache.spark.sql.parquet.row.metadata`:
```
$ java -jar parquet-tools-1.12.0.jar meta ./part-...-c000.snappy.parquet

creator:     parquet-mr version 1.12.1 (build 2a5c06c58fa987f85aa22170be14d927d5ff6e7d)
extra:       org.apache.spark.version = 3.3.0
extra:       org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[...,{"name":"i","type":"interval year to month","nullable":false,"metadata":{}}]}

file schema: spark_schema
--------------------------------------------------------------------------------
...
i:           REQUIRED INT32 R:0 D:0
```
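
For illustration, a minimal spark-shell sketch of the write path behind the metadata shown above (not part of the PR; the output path is hypothetical):
```scala
// Year-month intervals are encoded from java.time.Period and stored as INT32.
import java.time.Period
import spark.implicits._   // already in scope in spark-shell

val df = Seq((1, Period.ofMonths(13))).toDF("id", "i")
df.write.mode("overwrite").parquet("/tmp/ym_interval")   // hypothetical path

// On read, the interval type is recovered from the extra metadata key.
spark.read.parquet("/tmp/ym_interval").printSchema()
```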

**Note:** This PR focuses on support for ANSI intervals in the Parquet datasource when they are written or read as columns of a `Dataset`.

### Why are the changes needed?
To improve user experience with Spark SQL. At the moment, users can produce ANSI intervals "inside" Spark or parallelize Java collections of `Period`/`Duration` objects, but they cannot save the intervals to any built-in datasource. After the changes, users can save datasets/dataframes with year-month/day-time intervals and load them back later with Apache Spark.

For example:
```scala
scala> sql("select date'today' - date'2021-01-01' as diff").write.parquet("/Users/maximgekk/tmp/parquet_interval")

scala> val readback = spark.read.parquet("/Users/maximgekk/tmp/parquet_interval")
readback: org.apache.spark.sql.DataFrame = [diff: interval day]

scala> readback.printSchema
root
 |-- diff: interval day (nullable = true)

scala> readback.show
+------------------+
|              diff|
+------------------+
|INTERVAL '264' DAY|
+------------------+
```
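
A complementary sketch (not from the PR) that starts from the Java time types mentioned above; the path is hypothetical:
```scala
// Day-time intervals are encoded from java.time.Duration and stored as INT64.
import java.time.Duration
import spark.implicits._

val ds = Seq(Duration.ofDays(264), Duration.ofHours(5)).toDS()
ds.write.mode("overwrite").parquet("/tmp/dt_interval")   // hypothetical path

spark.read.parquet("/tmp/dt_interval").printSchema()
// expected: a single column "value" of type interval day to second
```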

### Does this PR introduce _any_ user-facing change?
In some sense, yes. Before the changes, users got an error when saving ANSI intervals as dataframe columns to parquet files; after the changes, the operation completes successfully.

### How was this patch tested?
1. By running the existing test suites:
```
$ build/sbt "test:testOnly *ParquetFileFormatV2Suite"
$ build/sbt "test:testOnly *FileBasedDataSourceSuite"
$ build/sbt "sql/test:testOnly *JsonV2Suite"
```
2. Added new tests:
```
$ build/sbt "sql/test:testOnly *ParquetIOSuite"
$ build/sbt "sql/test:testOnly *ParquetSchemaSuite"
```
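
For context, a self-contained sketch of the kind of round-trip check the new tests exercise (not the actual suite code; names and values are illustrative):
```scala
// Illustrative round-trip check in spark-shell on a build that includes this change.
import java.nio.file.Files
import java.time.{Duration, Period}
import org.apache.spark.sql.types.{DayTimeIntervalType, YearMonthIntervalType}
import spark.implicits._

val path = Files.createTempDirectory("ansi-intervals").toString
val input = Seq((Period.ofMonths(13), Duration.ofDays(1))).toDF("ym", "dt")
input.write.mode("overwrite").parquet(path)

val readback = spark.read.parquet(path)
assert(readback.schema("ym").dataType == YearMonthIntervalType())   // default YEAR TO MONTH
assert(readback.schema("dt").dataType == DayTimeIntervalType())     // default DAY TO SECOND
assert(readback.collect().toSeq == input.collect().toSeq)
```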

Closes #34057 from MaxGekk/ansi-interval-save-parquet.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-24 09:55:11 +03:00
| Entry | Latest commit | Date |
|---|---|---|
| catalyst | [SPARK-36825][SQL] Read/write dataframes with ANSI intervals from/to parquet files | 2021-09-24 09:55:11 +03:00 |
| core | [SPARK-36825][SQL] Read/write dataframes with ANSI intervals from/to parquet files | 2021-09-24 09:55:11 +03:00 |
| hive | [SPARK-32709][SQL] Support writing Hive bucketed table (Parquet/ORC format with Hive hash) | 2021-09-17 14:28:51 +08:00 |
| hive-thriftserver | [SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite | 2021-09-16 14:28:47 -07:00 |
| create-docs.sh | [SPARK-34010][SQL][DODCS] Use python3 instead of python in SQL documentation build | 2021-01-05 19:48:10 +09:00 |
| gen-sql-api-docs.py | [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document | 2021-03-19 10:19:26 +09:00 |
| gen-sql-config-docs.py | [SPARK-36657][SQL] Update comment in 'gen-sql-config-docs.py' | 2021-09-02 18:50:59 -07:00 |
| gen-sql-functions-docs.py | | |
| mkdocs.yml | | |
| README.md | | |

# Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

  - Catalyst (`sql/catalyst`) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  - Execution (`sql/core`) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, `SQLContext`, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files (a minimal usage sketch follows this list).
  - Hive Support (`sql/hive`) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  - HiveServer and CLI support (`sql/hive-thriftserver`) - Includes support for the SQL CLI (`bin/spark-sql`) and a HiveServer2 (for JDBC/ODBC) compatible server.
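
A minimal usage sketch for the `sql/core` entry point described above, using the modern `SparkSession` wrapper around `SQLContext` (the path and view name are illustrative):
```scala
import org.apache.spark.sql.SparkSession

// Build a local session; SparkSession wraps the SQLContext interface mentioned above.
val spark = SparkSession.builder()
  .appName("sql-core-example")
  .master("local[*]")
  .getOrCreate()

// Register a Parquet file as a temporary view and query it with SQL.
spark.read.parquet("/tmp/events").createOrReplaceTempView("events")   // hypothetical path
spark.sql("SELECT count(*) AS n FROM events").show()
```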

Running `./sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`, and SQL configuration documentation that gets included as part of `configuration.md` in the main `docs` directory.