spark-instrumented-optimizer/sql
Hyukjin Kwon 34f80ef313 [SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark
### What changes were proposed in this pull request?

This PR adds:
- the support of `TimestampNTZType` in pandas API on Spark.
- the support of Py4J handling of `spark.sql.timestampType` configuration

### Why are the changes needed?

To complete `TimestampNTZ` support.

In more details:

- ([#33876](https://github.com/apache/spark/pull/33876)) For `TimestampNTZType` in Spark SQL at PySpark, we can successfully ser/de `TimestampNTZType` instances to naive `datetime` (see also https://docs.python.org/3/library/datetime.html#aware-and-naive-objects). This naive `datetime` interpretation is up to the program to decide how to interpret, e.g.) whether a local time vs UTC time as an example. Although some Python built-in APIs assume they are local time in general (see also https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp):

    > Because naive datetime objects are treated by many datetime methods as local times ...

  semantically it is legitimate to assume:
    - that naive `datetime` is mapped to `TimestampNTZType` (unknown timezone).
    - if you want to handle them as if a local timezone, this interpretation is matched to `TimestamType` (local time)

- ([#33875](https://github.com/apache/spark/pull/33875)) For `TimestampNTZType` in Arrow, they provide the same semantic (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs#L240-L278):
    - `Timestamp(..., timezone=sparkLocalTimezone)` ->  `TimestamType`
    - `Timestamp(..., timezone=null)` ->  `TimestampNTZType`

- (this PR) For `TimestampNTZType` in pandas API on Spark, it follows Python side in general - pandas implements APIs based on the assumption of time (e.g., naive `datetime` is a local time or a UTC time).

    One example is that pandas allows to convert these naive `datetime` as if they are in UTC by default:

    ```python
    >>> pd.Series(datetime.datetime(1970, 1, 1)).astype("int")
    0    0
    ```

    whereas in Spark:

    ```python
    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +------+
    |    _1|
    +------+
    |-32400|
    +------+

    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +---+
    | _1|
    +---+
    |  0|
    +---+
    ```

    In contrast, some APIs like `pandas.fromtimestamp` assume they are local times:

    ```python
    >>> pd.Timestamp.fromtimestamp(pd.Series(datetime(1970, 1, 1, 0, 0, 0)).astype("int").iloc[0])
    Timestamp('1970-01-01 09:00:00')
    ```

    For native Python, users can decide how to interpret native `datetime` so it's fine. The problem is that pandas API on Spark case would require to have two implementations of the same pandas behavior for `TimestampType` and `TimestampNTZType` respectively, which might be non-trivial overhead and work.

    As far as I know, pandas API on Spark has not yet implemented such ambiguous APIs so they are left as future work.

### Does this PR introduce _any_ user-facing change?

Yes, now pandas API on Spark can handle `TimestampNTZType`.

```python
import datetime
spark.createDataFrame([(datetime.datetime.now(),)], schema="dt timestamp_ntz").to_pandas_on_spark()
```

```
                          dt
0 2021-08-31 19:58:55.024410
```

This PR also adds the support of Py4J handling with `spark.sql.timestampType` configuration:

```python
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP '2021-09-03 19:34:03.949998''>
```
```python
>>> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP_NTZ '2021-09-03 19:34:24.864632''>
```

### How was this patch tested?

Unittests were added.

Closes #33877 from HyukjinKwon/SPARK-36625.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 09:57:38 +09:00
..
catalyst [SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark 2021-09-09 09:57:38 +09:00
core [SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark 2021-09-09 09:57:38 +09:00
hive [SPARK-36602][COER][SQL] Clean up redundant asInstanceOf casts 2021-09-05 08:22:28 -05:00
hive-thriftserver [SPARK-36400][TEST][FOLLOWUP] Add test for redacting sensitive information in UI by config 2021-09-02 17:04:49 +09:00
create-docs.sh [SPARK-34010][SQL][DODCS] Use python3 instead of python in SQL documentation build 2021-01-05 19:48:10 +09:00
gen-sql-api-docs.py [SPARK-34747][SQL][DOCS] Add virtual operators to the built-in function document 2021-03-19 10:19:26 +09:00
gen-sql-config-docs.py [SPARK-36657][SQL] Update comment in 'gen-sql-config-docs.py' 2021-09-02 18:50:59 -07:00
gen-sql-functions-docs.py
mkdocs.yml
README.md

Spark SQL

This module provides support for executing relational queries expressed in either SQL or the DataFrame/Dataset API.

Spark SQL is broken up into four subprojects:

  • Catalyst (sql/catalyst) - An implementation-agnostic framework for manipulating trees of relational operators and expressions.
  • Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
  • Hive Support (sql/hive) - Includes extensions that allow users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allow users to run queries that include Hive UDFs, UDAFs, and UDTFs.
  • HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.

Running ./sql/create-docs.sh generates SQL documentation for built-in functions under sql/site, and SQL configuration documentation that gets included as part of configuration.md in the main docs directory.