spark-instrumented-optimizer/python/pyspark/sql/pandas
Takuya UESHIN 87be3641eb [SPARK-31441] Support duplicated column names for toPandas with arrow execution
### What changes were proposed in this pull request?

This PR adds support for duplicated column names in `toPandas` with Arrow execution.

### Why are the changes needed?

When `toPandas()` runs with Arrow execution, it fails if the DataFrame has duplicated column names.

```py
>>> spark.sql("select 1 v, 1 v").toPandas()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/path/to/lib/python3.7/site-packages/pyspark/sql/dataframe.py", line 2132, in toPandas
    pdf = table.to_pandas()
  File "pyarrow/array.pxi", line 441, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1367, in pyarrow.lib.Table._to_pandas
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 653, in table_to_blockmanager
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 704, in _deserialize_column_index
    columns = _flatten_single_level_multiindex(columns)
  File "/path/to/lib/python3.7/site-packages/pyarrow/pandas_compat.py", line 937, in _flatten_single_level_multiindex
    raise ValueError('Found non-unique column index')
ValueError: Found non-unique column index
```

### Does this PR introduce any user-facing change?

Yes. Previously, the call raised the error above; after this PR, it returns the result:

```py
>>> spark.sql("select 1 v, 1 v").toPandas()
   v  v
0  1  1
```

### How was this patch tested?

Added and modified related tests.

Closes #28210 from ueshin/issues/SPARK-31441/to_pandas.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-14 14:08:56 +09:00
| File | Last commit | Date |
| --- | --- | --- |
| __init__.py | [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package | 2020-01-09 10:22:50 +09:00 |
| conversion.py | [SPARK-31441] Support duplicated column names for toPandas with arrow execution | 2020-04-14 14:08:56 +09:00 |
| functions.py | [SPARK-30722][DOCS][FOLLOW-UP] Explicitly mention the same entire input/output length restriction of Series Iterator UDF | 2020-04-09 16:46:27 +09:00 |
| group_ops.py | [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints | 2020-02-12 10:49:46 +09:00 |
| map_ops.py | [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints | 2020-02-12 10:49:46 +09:00 |
| serializers.py | [SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy | 2020-02-18 20:39:50 +08:00 |
| typehints.py | [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types | 2020-01-22 15:32:58 +09:00 |
| types.py | [SPARK-30640][PYTHON][SQL] Prevent unnecessary copies of data during Arrow to Pandas conversion | 2020-01-26 15:21:06 -08:00 |
| utils.py | [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package | 2020-01-09 10:22:50 +09:00 |