spark-instrumented-optimizer/python/pyspark/sql
0x0FFF bf550a4b55 [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function
This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
The issue is with the DataFrame `filter()` function when a `datetime.datetime` is passed to it:
* The timezone information of the datetime is ignored
* The datetime is assumed to be in the local timezone, which depends on the OS timezone setting
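
The two symptoms above can be illustrated without Spark. The following is a minimal sketch, not the code from this change: the function names `naive_local_micros` and `tz_aware_micros` are hypothetical, and Python 3's standard-library `datetime.timezone` stands in for `pytz` so the example has no third-party dependencies. The first function shows one way the problem can arise (`timetuple()` carries no UTC offset, and `time.mktime()` interprets the fields as local time); the second shows one way to honor `tzinfo`.

```python
import calendar
import time
from datetime import datetime, timedelta, timezone


def naive_local_micros(dt):
    """Hypothetical helper: convert to epoch microseconds while ignoring tzinfo.

    timetuple() drops the UTC offset, and mktime() then reads the fields
    as local time, so the result depends on the OS timezone setting.
    """
    return int(time.mktime(dt.timetuple())) * 1000000 + dt.microsecond


def tz_aware_micros(dt):
    """Hypothetical helper: convert to epoch microseconds, honoring tzinfo."""
    if dt.tzinfo is not None:
        # utctimetuple() shifts the fields to UTC; timegm() reads them as UTC.
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        # Naive datetimes are still treated as local time.
        seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond


utc_midnight = datetime(2000, 1, 1, tzinfo=timezone.utc)
# Fixed offset of UTC-3, used here as a stand-in for pytz's 'Etc/GMT+3'.
minus3_midnight = datetime(2000, 1, 1, tzinfo=timezone(timedelta(hours=-3)))

# Ignoring tzinfo makes the two distinct instants indistinguishable.
print(naive_local_micros(utc_midnight) == naive_local_micros(minus3_midnight))  # True

# Honoring tzinfo yields two values that are three hours apart.
print(tz_aware_micros(utc_midnight))     # 946684800000000
print(tz_aware_micros(minus3_midnight))  # 946695600000000
```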

The fix includes both a code change and a regression test. Code to reproduce the problem on master:
```python
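import pytz
from datetime import datetime
from pyspark.sql import SQLContext
from pyspark.sql.types import StructType, StructField, TimestampType

# `sc` is assumed to be an existing SparkContext (e.g. the one the pyspark shell creates)
sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))

m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')  # POSIX-style name: 3 hours behind UTC

# Same wall-clock time in two zones, so the two filter literals should differ
df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 1, 1, tzinfo=m2)).explain()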
```
Both filters get the same timestamp; the time zone is ignored:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]
```
After the fix:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
 Scan PhysicalRDD[dt#0]
```
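
As a sanity check on the literals above, the sketch below decodes each value back into a UTC datetime. The helper name `micros_to_utc` is hypothetical, and the code assumes that the numbers in the `explain()` output are microseconds since the Unix epoch, which is consistent with the values shown. `Etc/GMT+3` follows the POSIX sign convention, so it is three hours behind UTC.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def micros_to_utc(micros):
    """Hypothetical helper: epoch microseconds -> timezone-aware UTC datetime."""
    return EPOCH + timedelta(microseconds=micros)


before_fix = 946713600000000        # same literal for both filters
after_fix_utc = 946684800000000     # tzinfo=UTC
after_fix_gmt3 = 946695600000000    # tzinfo=Etc/GMT+3

print(micros_to_utc(after_fix_utc))   # 2000-01-01 00:00:00+00:00
print(micros_to_utc(after_fix_gmt3))  # 2000-01-01 03:00:00+00:00
print(micros_to_utc(before_fix))      # 2000-01-01 08:00:00+00:00
```

The two post-fix values differ by exactly three hours, as expected for midnight in UTC versus midnight in a zone three hours behind it. The pre-fix value decodes to 08:00 UTC, which is what midnight would be on a machine whose local zone is eight hours behind UTC; the source does not state which zone the reproduction machine used.
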
PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed when I dropped the repo

Author: 0x0FFF <programmerag@gmail.com>

Closes #8555 from 0x0FFF/SPARK-10162.
2015-09-01 14:34:59 -07:00

| File | Last commit | Date |
| --- | --- | --- |
| `__init__.py` | [SPARK-8060] Improve DataFrame Python test coverage and documentation. | 2015-06-03 00:23:34 -07:00 |
| `column.py` | [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters | 2015-08-25 12:33:13 +01:00 |
| `context.py` | [SPARK-9942] [PYSPARK] [SQL] ignore exceptions while try to import pandas | 2015-08-13 14:03:55 -07:00 |
| `dataframe.py` | [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters | 2015-08-25 12:33:13 +01:00 |
| `functions.py` | [DOCS] [SQL] [PYSPARK] Fix typo in ntile function | 2015-08-19 09:42:41 +01:00 |
| `group.py` | [SPARK-8770][SQL] Create BinaryOperator abstract class. | 2015-07-01 21:14:13 -07:00 |
| `readwriter.py` | [SPARK-9964] [PYSPARK] [SQL] PySpark DataFrameReader accept RDD of String for JSON | 2015-08-26 22:19:11 -07:00 |
| `tests.py` | [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function | 2015-09-01 14:34:59 -07:00 |
| `types.py` | [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function | 2015-09-01 14:34:59 -07:00 |
| `utils.py` | [SPARK-9166][SQL][PYSPARK] Capture and hide IllegalArgumentException in Python API | 2015-07-19 00:32:56 -07:00 |
| `window.py` | [SPARK-9978] [PYSPARK] [SQL] fix Window.orderBy and doc of ntile() | 2015-08-14 13:55:29 -07:00 |