spark-instrumented-optimizer

hyukjinkwon 224e0e785b [SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column

## What changes were proposed in this pull request?

This PR proposes to remove the incorrect implementation of the `in` operator, which has never been executed (at least since Spark 1.5.2), and to throw a correct exception rather than one complaining that the column cannot be converted to a bool.

I tested this in 1.5.2, 1.6.3, 2.1.0 and the master branch, as below:

1.5.2

```python
>>> df = sqlContext.createDataFrame([[1]])
>>> 1 in df._1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

1.6.3

```python
>>> 1 in sqlContext.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

2.1.0

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

Current Master

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

After

```python
>>> 1 in spark.range(1).id
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
    raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
```

In more detail, it seems the implementation was intended to support this:

```python
1 in df.column
```

However, it currently throws an exception, as below:

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
    raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
```

What happens here is as below:

```python
class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
    def __nonzero__(self):
        raise Exception("I am nonzero.")

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in __nonzero__
Exception: I am nonzero.
```

It seems `__contains__` is called first, and then `__nonzero__` or `__bool__` is called against the returned `Column()` to make the result a bool (or an int, to be specific).

It seems `__nonzero__` (for Python 2), `__bool__` (for Python 3) and `__contains__` force the return value into a bool, unlike other operators. There are a few references about this, as below:

https://bugs.python.org/issue16011
http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777

It seems we can't override `__nonzero__` or `__bool__` as a workaround to make this work, because these force the return type to be a bool, as below:

```python
class Column(object):
    def __contains__(self, item):
        print "I am contains"
        return Column()
    def __nonzero__(self):
        return "a"

>>> 1 in Column()
I am contains
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: __nonzero__ should return bool or int, returned str
```

## How was this patch tested?

Added unit tests in `tests.py`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17160 from HyukjinKwon/SPARK-19701.

2017-03-05 18:04:52 -08:00
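The commit message's examples use Python 2 syntax (`print` statement, `__nonzero__`). As a rough Python 3 illustration of the same mechanism, the sketch below contrasts the old behaviour, where `__contains__` returns a non-bool object that is then coerced via `__bool__`, with the new behaviour, where `__contains__` raises directly. The names `OldColumn`, `NewColumn`, `try_in` and `calls` are hypothetical stand-ins for illustration only; this is not PySpark's actual `Column` implementation.

```python
# Records which special methods the `in` operator triggers (illustrative only).
calls = []


class OldColumn:
    """Hypothetical stand-in for the pre-fix behaviour described above."""

    def __contains__(self, item):
        calls.append("__contains__")
        # Returning a non-bool object: Python then coerces it with __bool__.
        return OldColumn()

    def __bool__(self):
        calls.append("__bool__")
        raise ValueError("Cannot convert column into bool")


class NewColumn:
    """Hypothetical stand-in for the post-fix behaviour: fail inside __contains__."""

    def __contains__(self, item):
        raise ValueError(
            "Cannot apply 'in' operator against a column: please use 'contains' "
            "in a string column or 'array_contains' function for an array column."
        )

    def __bool__(self):
        raise ValueError("Cannot convert column into bool")


def try_in(column):
    """Evaluate `1 in column` and return the error message, or None if no error."""
    try:
        1 in column
    except ValueError as exc:
        return str(exc)
    return None
```

Calling `try_in(OldColumn())` surfaces the bool-conversion error only after `__contains__` has already run, whereas `try_in(NewColumn())` reports the `in`-specific message straight away, which is the user-facing change this commit describes.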
..
__init__.py	[SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes"	2016-08-06 05:02:59 +01:00
catalog.py	[SPARK-19148][SQL] do not expose the external table concept in Catalog	2017-01-17 12:54:50 +08:00
column.py	[SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column	2017-03-05 18:04:52 -08:00
conf.py	[SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code	2016-05-23 18:14:48 -07:00
context.py	[SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a Dataframe on a new SQLContext object fails with a Derby error	2017-01-13 18:35:51 +08:00
dataframe.py	[SPARK-19497][SS] Implement streaming deduplication	2017-02-23 11:25:39 -08:00
functions.py	[SPARK-19595][SQL] Support json array in from_json	2017-03-05 14:35:06 -08:00
group.py	[MINOR][PYSPARK][DOC] Fix wrongly formatted examples in PySpark documentation	2016-07-06 10:45:51 -07:00
readwriter.py	[SPARK-18352][DOCS] wholeFile JSON update doc and programming guide	2017-03-02 01:02:38 -08:00
session.py	[SPARK-19055][SQL][PYSPARK] Fix SparkSession initialization when SparkContext is stopped	2017-01-12 20:53:31 +08:00
streaming.py	[SPARK-18352][DOCS] wholeFile JSON update doc and programming guide	2017-03-02 01:02:38 -08:00
tests.py	[SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column	2017-03-05 18:04:52 -08:00
types.py	[SPARK-13748][PYSPARK][DOC] Add the description for explicitly setting None for a named argument for a Row	2017-01-07 12:52:41 +00:00
utils.py	[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo	2017-01-04 15:07:29 +00:00
window.py	[SPARK-18690][PYTHON][SQL] Backward compatibility of unbounded frames	2016-12-02 17:39:28 -08:00