spark-instrumented-optimizer

History

Takuya UESHIN 87be3641eb [SPARK-31441] Support duplicated column names for toPandas with arrow execution ### What changes were proposed in this pull request? This PR is adding support duplicated column names for `toPandas` with Arrow execution. ### Why are the changes needed? When we execute `toPandas()` with Arrow execution, it fails if the column names have duplicates. ```py >>> spark.sql("select 1 v, 1 v").toPandas() Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas pdf = table.to_pandas() File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager columns = _deserialize_column_index(table, all_columns, column_indexes) File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index columns = _flatten_single_level_multiindex(columns) File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex raise ValueError('Found non-unique column index') ValueError: Found non-unique column index ``` ### Does this PR introduce any user-facing change? Yes, previously we will face an error above, but after this PR, we will see the result: ```py >>> spark.sql("select 1 v, 1 v").toPandas() v v 0 1 1 ``` ### How was this patch tested? Added and modified related tests. Closes #28210 from ueshin/issues/SPARK-31441/to_pandas. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>		2020-04-14 14:08:56 +09:00
..
__init__.py	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package	2020-01-09 10:22:50 +09:00
conversion.py	[SPARK-31441] Support duplicated column names for toPandas with arrow execution	2020-04-14 14:08:56 +09:00
functions.py	[SPARK-30722][DOCS][FOLLOW-UP] Explicitly mention the same entire input/output length restriction of Series Iterator UDF	2020-04-09 16:46:27 +09:00
group_ops.py	[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints	2020-02-12 10:49:46 +09:00
map_ops.py	[SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints	2020-02-12 10:49:46 +09:00
serializers.py	[SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy	2020-02-18 20:39:50 +08:00
typehints.py	[SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types	2020-01-22 15:32:58 +09:00
types.py	[SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion	2020-01-26 15:21:06 -08:00
utils.py	[SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package	2020-01-09 10:22:50 +09:00