34f80ef313
### What changes were proposed in this pull request?

This PR adds:
- support for `TimestampNTZType` in pandas API on Spark.
- support for Py4J handling of the `spark.sql.timestampType` configuration.

### Why are the changes needed?

To complete `TimestampNTZ` support. In more detail:

- ([#33876](https://github.com/apache/spark/pull/33876)) For `TimestampNTZType` in Spark SQL at PySpark, we can successfully ser/de `TimestampNTZType` instances to naive `datetime` (see also https://docs.python.org/3/library/datetime.html#aware-and-naive-objects). How a naive `datetime` is interpreted, e.g., as a local time vs. a UTC time, is up to the program to decide. Although some Python built-in APIs assume naive datetimes are local times in general (see also https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp):

  > Because naive datetime objects are treated by many datetime methods as local times ...

  semantically it is legitimate to assume:
  - that a naive `datetime` maps to `TimestampNTZType` (unknown timezone).
  - that, if you want to handle naive datetimes as if they were in the local timezone, this interpretation matches `TimestampType` (local time).

- ([#33875](https://github.com/apache/spark/pull/33875)) For `TimestampNTZType` in Arrow, the same semantics are provided (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs#L240-L278):
  - `Timestamp(..., timezone=sparkLocalTimezone)` -> `TimestampType`
  - `Timestamp(..., timezone=null)` -> `TimestampNTZType`

- (this PR) For `TimestampNTZType` in pandas API on Spark, it follows the Python side in general; pandas implements APIs based on an assumed interpretation of time (e.g., that a naive `datetime` is a local time or a UTC time).
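The naive-vs-aware distinction underpinning the mapping above can be sketched with the standard library alone (a minimal illustration; the `naive`/`aware` names are just for this example):

```python
import datetime

# A naive datetime carries no timezone information: this is the kind of
# value that maps to TimestampNTZType (unknown timezone).
naive = datetime.datetime(1970, 1, 1)
print(naive.tzinfo)  # None

# Attaching an explicit UTC offset makes it aware; interpreted as UTC,
# 1970-01-01 00:00:00 is exactly the Unix epoch.
aware = naive.replace(tzinfo=datetime.timezone.utc)
print(aware.timestamp())  # 0.0

# By contrast, naive.timestamp() interprets the value in the machine's
# local timezone, which mirrors the TimestampType (local time) reading.
```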
One example is that pandas converts these naive `datetime` values as if they were in UTC by default:

```python
>>> pd.Series(datetime.datetime(1970, 1, 1)).astype("int")
0    0
```

whereas in Spark:

```python
>>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0)]]).selectExpr("CAST(_1 as BIGINT)").show()
+------+
|    _1|
+------+
|-32400|
+------+

>>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)]]).selectExpr("CAST(_1 as BIGINT)").show()
+---+
| _1|
+---+
|  0|
+---+
```

In contrast, some APIs such as `pandas.Timestamp.fromtimestamp` assume local times:

```python
>>> pd.Timestamp.fromtimestamp(pd.Series(datetime(1970, 1, 1, 0, 0, 0)).astype("int").iloc[0])
Timestamp('1970-01-01 09:00:00')
```

For native Python, users can decide how to interpret naive `datetime` objects, so this is fine. The problem is that pandas API on Spark would require two implementations of the same pandas behavior, for `TimestampType` and `TimestampNTZType` respectively, which might be non-trivial overhead and work. As far as I know, pandas API on Spark has not yet implemented such ambiguous APIs, so they are left as future work.

### Does this PR introduce _any_ user-facing change?

Yes, pandas API on Spark can now handle `TimestampNTZType`.

```python
import datetime
spark.createDataFrame([(datetime.datetime.now(),)], schema="dt timestamp_ntz").to_pandas_on_spark()
```

```
                          dt
0 2021-08-31 19:58:55.024410
```

This PR also adds support for Py4J handling of the `spark.sql.timestampType` configuration:

```python
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP '2021-09-03 19:34:03.949998''>
```

```python
>>> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP_NTZ '2021-09-03 19:34:24.864632''>
```

### How was this patch tested?

Unittests were added.

Closes #33877 from HyukjinKwon/SPARK-36625.
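As an aside, the `-32400` in the Spark `CAST(_1 as BIGINT)` example earlier can be reproduced with the standard library: the session timezone in those examples appears to be UTC+9 (e.g., KST), so local midnight on 1970-01-01 falls nine hours before the UTC epoch. A hypothetical sketch, assuming that fixed offset:

```python
import datetime

# Assumed fixed UTC+9 offset matching the example output in this description.
utc_plus_9 = datetime.timezone(datetime.timedelta(hours=9))

# 1970-01-01 00:00:00 at UTC+9 is 9 hours *before* the UTC epoch,
# hence the epoch-second value -32400 (= -9 * 3600).
local_midnight = datetime.datetime(1970, 1, 1, tzinfo=utc_plus_9)
print(int(local_midnight.timestamp()))  # -32400
```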
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>