[SPARK-10417] [SQL] Iterating through Column results in infinite loop

`pyspark.sql.column.Column` defines a `__getitem__` method so that a column holding a list or dict can be indexed through the DataFrame API. A side effect is that Python's legacy iteration protocol treats any object with `__getitem__` as iterable, so `for x in column` silently iterates forever (each index returns a new `Column` instead of raising `IndexError`). This is confusing for people coming to Spark DataFrames from Pandas, where iterating over a column is normal.

Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]: print i
```
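
The loop above never terminates because of Python's sequence-protocol fallback: when a class has `__getitem__` but no `__iter__`, `iter()` builds an iterator that calls `obj[0]`, `obj[1]`, ... until `IndexError`. A minimal sketch (with a hypothetical `Col` class standing in for `Column`) showing both the fallback and the fix this patch applies:

```python
# Hypothetical stand-in for pyspark Column: __getitem__ never raises
# IndexError, so the fallback iterator would run forever.
class Col:
    def __getitem__(self, key):
        return "field-%s" % key  # pretend field access

# zip() bounds the otherwise-infinite fallback iteration:
print([x for _, x in zip(range(3), Col())])
# ['field-0', 'field-1', 'field-2']

# The fix: define __iter__ to raise, which disables the fallback.
class SafeCol(Col):
    def __iter__(self):
        raise TypeError("Column is not iterable")

try:
    iter(SafeCol())
except TypeError as e:
    print(e)  # Column is not iterable
```

Indexing (`SafeCol()["name"]`) still works; only iteration is blocked.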

Author: 0x0FFF <programmerag@gmail.com>

Closes #8574 from 0x0FFF/SPARK-10417.
Authored by 0x0FFF on 2015-09-02 13:36:36 -07:00, committed by Davies Liu.
parent 2da3a9e98e, commit 6cd98c1878
2 changed files with 12 additions and 0 deletions


```
@@ -226,6 +226,9 @@ class Column(object):
             raise AttributeError(item)
         return self.getField(item)
 
+    def __iter__(self):
+        raise TypeError("Column is not iterable")
+
     # string methods
     rlike = _bin_op("rlike")
     like = _bin_op("like")
```


```
@@ -1066,6 +1066,15 @@ class SQLTests(ReusedPySparkTestCase):
         keys = self.df.withColumn("key", self.df.key).select("key").collect()
         self.assertEqual([r.key for r in keys], list(range(100)))
 
+    # regression test for SPARK-10417
+    def test_column_iterator(self):
+
+        def foo():
+            for x in self.df.key:
+                break
+
+        self.assertRaises(TypeError, foo)
+
 
 class HiveContextSQLTests(ReusedPySparkTestCase):
```