[SPARK-10417] [SQL] Iterating through Column results in infinite loop
`pyspark.sql.column.Column` defines `__getitem__`, which makes it iterable under Python's legacy sequence-iteration protocol. The `__getitem__` method exists so that, when a column holds a list or dict, you can access individual elements of it through the DataFrame API. The ability to iterate over the column is just a side effect, and one that can confuse people getting familiar with Spark DataFrames (since you might iterate this way over a pandas DataFrame, for instance).

Issue reproduction:

```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]:
    print i
```

Author: 0x0FFF <programmerag@gmail.com>

Closes #8574 from 0x0FFF/SPARK-10417.
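To see why this happens, here is a minimal standalone sketch (not Spark code; the class names are hypothetical): when a class defines `__getitem__` but not `__iter__`, Python falls back to calling `__getitem__(0)`, `__getitem__(1)`, … until `IndexError` is raised. A field-access `__getitem__` never raises `IndexError`, so iteration never terminates. Defining `__iter__` to raise `TypeError`, as this patch does, disables the fallback:

```python
class FieldAccessor:
    """Mimics Column's __getitem__-based field access (hypothetical stand-in)."""
    def __getitem__(self, item):
        # Never raises IndexError, so the legacy iteration
        # protocol would loop forever.
        return "field[%s]" % item

# for x in FieldAccessor(): ...   # would be an infinite loop

class FixedFieldAccessor(FieldAccessor):
    """Mimics the fix: explicitly refuse iteration."""
    def __iter__(self):
        raise TypeError("Column is not iterable")

try:
    iter(FixedFieldAccessor())
except TypeError as e:
    print(e)  # Column is not iterable
```

The fix works because `iter()` checks for an explicit `__iter__` before falling back to `__getitem__`; raising `TypeError` there matches what `iter()` itself raises for non-iterable objects, so callers see the standard failure mode.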
This commit is contained in:

parent 2da3a9e98e
commit 6cd98c1878
```diff
@@ -226,6 +226,9 @@ class Column(object):
             raise AttributeError(item)
         return self.getField(item)
 
+    def __iter__(self):
+        raise TypeError("Column is not iterable")
+
     # string methods
     rlike = _bin_op("rlike")
     like = _bin_op("like")
```
```diff
@@ -1066,6 +1066,15 @@ class SQLTests(ReusedPySparkTestCase):
         keys = self.df.withColumn("key", self.df.key).select("key").collect()
         self.assertEqual([r.key for r in keys], list(range(100)))
 
+    # regression test for SPARK-10417
+    def test_column_iterator(self):
+
+        def foo():
+            for x in self.df.key:
+                break
+
+        self.assertRaises(TypeError, foo)
+
 
 class HiveContextSQLTests(ReusedPySparkTestCase):
```
|