spark-instrumented-optimizer/python/pyspark/sql/pandas
HyukjinKwon 1af19a7b68 [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow
### What changes were proposed in this pull request?

When you use floats are index of pandas, it creates a Spark DataFrame with a wrong results as below when Arrow is enabled:

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```

```python
>>> import pandas as pd
>>> spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

This is because direct slicing uses the value as index when the index contains floats:

```python
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])[2:]
     a
2.0  1
3.0  2
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.]).iloc[2:]
     a
4.0  3
>>> pd.DataFrame({'a': [1,2,3]}, index=[2, 3, 4])[2:]
   a
4  3
```

This PR proposes to explicitly use `iloc` to positionally slide when we create a DataFrame from a pandas DataFrame with Arrow enabled.

FWIW, I was trying to investigate why direct slicing refers the index value or the positional index sometimes but I stopped investigating further after reading this https://pandas.pydata.org/pandas-docs/stable/getting_started/10min.html#selection

> While standard Python / Numpy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, `.at`, `.iat`, `.loc` and `.iloc`.

### Why are the changes needed?

To create the correct Spark DataFrame from a pandas DataFrame without a data loss.

### Does this PR introduce _any_ user-facing change?

Yes, it is a bug fix.

```bash
./bin/pyspark --conf spark.sql.execution.arrow.pyspark.enabled=true
```
```python
import pandas as pd
spark.createDataFrame(pd.DataFrame({'a': [1,2,3]}, index=[2., 3., 4.])).show()
```

Before:

```
+---+
|  a|
+---+
|  1|
|  1|
|  2|
+---+
```

After:

```
+---+
|  a|
+---+
|  1|
|  2|
|  3|
+---+
```

### How was this patch tested?

Manually tested and unittest were added.

Closes #28928 from HyukjinKwon/SPARK-32098.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
2020-06-25 11:04:47 -07:00
..
__init__.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00
conversion.py [SPARK-32098][PYTHON] Use iloc for positional slicing instead of direct slicing in createDataFrame with Arrow 2020-06-25 11:04:47 -07:00
functions.py [SPARK-30722][DOCS][FOLLOW-UP] Explicitly mention the same entire input/output length restriction of Series Iterator UDF 2020-04-09 16:46:27 +09:00
group_ops.py [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints 2020-02-12 10:49:46 +09:00
map_ops.py [SPARK-30722][PYTHON][DOCS] Update documentation for Pandas UDF with Python type hints 2020-02-12 10:49:46 +09:00
serializers.py [SPARK-31964][PYTHON] Use Pandas is_categorical on Arrow category type conversion 2020-06-11 10:26:40 +09:00
typehints.py [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types 2020-01-22 15:32:58 +09:00
types.py [SPARK-25351][SQL][PYTHON] Handle Pandas category type when converting from Python with Arrow 2020-05-27 17:27:29 -07:00
utils.py [SPARK-30434][PYTHON][SQL] Move pandas related functionalities into 'pandas' sub-package 2020-01-09 10:22:50 +09:00