c77c0d41e1
### What changes were proposed in this pull request?

Allow saving and loading of ANSI intervals - `YearMonthIntervalType` and `DayTimeIntervalType` - to/from the Parquet datasource. After the changes, Spark saves ANSI intervals as primitive physical Parquet types:
- year-month intervals as `INT32`
- day-time intervals as `INT64`

without any modifications. To load the values back as intervals, Spark puts the info about interval types into the extra key `org.apache.spark.sql.parquet.row.metadata`:
```
$ java -jar parquet-tools-1.12.0.jar meta ./part-...-c000.snappy.parquet
creator:       parquet-mr version 1.12.1 (build 2a5c06c58fa987f85aa22170be14d927d5ff6e7d)
extra:         org.apache.spark.version = 3.3.0
extra:         org.apache.spark.sql.parquet.row.metadata = {"type":"struct","fields":[...,{"name":"i","type":"interval year to month","nullable":false,"metadata":{}}]}

file schema:   spark_schema
--------------------------------------------------------------------------------
...
i:             REQUIRED INT32 R:0 D:0
```

**Note:** This PR focuses on supporting ANSI intervals in the Parquet datasource when they are written or read as a column in a `Dataset`.

### Why are the changes needed?

To improve the user experience with Spark SQL. At the moment, users can create ANSI intervals "inside" Spark or parallelize Java collections of `Period`/`Duration` objects, but they cannot save the intervals to any built-in datasource. After the changes, users can save datasets/dataframes with year-month/day-time intervals and load them back later with Apache Spark.
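To make the mapping above concrete, here is a minimal sketch of the physical encoding in plain Scala (no Spark dependency; the object and method names are hypothetical, not Spark APIs): a year-month interval collapses to a single `INT32` month count, and a day-time interval to a single `INT64` microsecond count.

```scala
import java.time.{Duration, Period}

// Hypothetical helpers illustrating the Parquet physical encoding
// described above; not part of Spark's API.
object IntervalEncoding {
  // Year-month interval -> single INT32 month count.
  def yearMonthToInt32(p: Period): Int =
    p.getYears * 12 + p.getMonths

  // Day-time interval -> single INT64 microsecond count.
  def dayTimeToInt64(d: Duration): Long =
    d.getSeconds * 1000000L + d.getNano / 1000L
}
```

For instance, `yearMonthToInt32(Period.of(1, 2, 0))` yields `14` (months), and `dayTimeToInt64(Duration.ofDays(264))` yields `22809600000000` (microseconds), matching the `INTERVAL '264' DAY` value shown in the example below.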
For example:
```scala
scala> sql("select date'today' - date'2021-01-01' as diff").write.parquet("/Users/maximgekk/tmp/parquet_interval")

scala> val readback = spark.read.parquet("/Users/maximgekk/tmp/parquet_interval")
readback: org.apache.spark.sql.DataFrame = [diff: interval day]

scala> readback.printSchema
root
 |-- diff: interval day (nullable = true)

scala> readback.show
+------------------+
|              diff|
+------------------+
|INTERVAL '264' DAY|
+------------------+
```

### Does this PR introduce _any_ user-facing change?

In some sense, yes. Before the changes, users got an error when saving ANSI intervals as dataframe columns to Parquet files; after the changes, the operation completes successfully.

### How was this patch tested?

1. By running the existing test suites:
```
$ build/sbt "test:testOnly *ParquetFileFormatV2Suite"
$ build/sbt "test:testOnly *FileBasedDataSourceSuite"
$ build/sbt "sql/test:testOnly *JsonV2Suite"
```
2. Added new tests:
```
$ build/sbt "sql/test:testOnly *ParquetIOSuite"
$ build/sbt "sql/test:testOnly *ParquetSchemaSuite"
```

Closes #34057 from MaxGekk/ansi-interval-save-parquet.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Spark SQL
This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.
Spark SQL is broken up into four subprojects:
- Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
- Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
- Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
- HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
Running `./sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`, and SQL configuration documentation that gets included as part of `configuration.md` in the main `docs` directory.