spark-instrumented-optimizer/python/pyspark/pandas/tests
itholic b9aeeb4e6c [SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side
### What changes were proposed in this pull request?

This PR fixes the incorrect behavior of `Index.difference` in the pandas API on Spark, based on the comments at https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007. Previously:
- It could not properly handle the case where `self` is an `Index` or `MultiIndex` and `other` is a `MultiIndex` or `Index`, respectively.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1.difference(idx1)
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```
- It collected all of the data to the driver side whenever `other` was a list-like object, which is very dangerous when `other` is a distributed object such as a `Series`.

This PR also adds the related test cases.

### Why are the changes needed?

To fix behavior that is incompatible with pandas, and to avoid a case that can easily cause an OOM. With this fix, the earlier example no longer raises:

```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1.difference(idx1)
MultiIndex([('a', 'x', 1),
            ('b', 'z', 2),
            ('k', 'z', 3)],
           )
```
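One way to read the result above is as a set difference: none of the three-element tuples in `midx1` equals a scalar label in `idx1`, so nothing is removed. The following is a minimal sketch using plain Python sets as a stand-in model. It is not the pandas-on-Spark implementation, and the variable names are hypothetical.

```python
# Plain-Python model of a label set difference (illustrative only;
# this is not how pyspark.pandas computes Index.difference).
multi_labels = [("a", "x", 1), ("b", "z", 2), ("k", "z", 3)]
flat_labels = [1, 2, 3]

# Keep the labels on the left side that do not appear on the right side.
# A tuple label never compares equal to an integer label, so all are kept.
remaining = sorted(set(multi_labels) - set(flat_labels))
print(remaining)
```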

A for loop is now used only when `other` is a `list`, `set`, or `dict`.
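The type check described here might look roughly like the sketch below. It is an assumption-laden illustration, not the code from this PR: `DistributedValues` and `local_items` are hypothetical names, and the class only imitates an object whose iteration would pull data to the driver.

```python
from typing import Any, List


class DistributedValues:
    """Hypothetical stand-in for a distributed object such as a Series."""

    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        # Iterating a distributed object would collect its data to the driver.
        raise NotImplementedError("iteration would collect data to the driver")


def local_items(other: Any) -> List[Any]:
    """Loop over `other` only when it is a driver-local list, set, or dict."""
    if isinstance(other, (list, set, dict)):
        # Safe: the data already lives on the driver. A dict yields its keys.
        return [item for item in other]
    # Anything else must be handled without a Python-level loop.
    raise TypeError(
        "refusing to iterate over {}".format(type(other).__name__)
    )


print(local_items([3, 1, 2]))
print(local_items({"a": 1, "b": 2}))
```

Restricting the loop to these three built-in types keeps the driver-side work bounded by data the caller already holds in local memory.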

### Does this PR introduce _any_ user-facing change?

Yes, the previous bug is fixed as described in the above code examples.

### How was this patch tested?

Tested manually with the linter and unit tests locally; it should also pass on CI.

Closes #32853 from itholic/SPARK-35683.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 14:18:54 +09:00
data_type_ops [SPARK-35616][PYTHON] Make astype method data-type-based 2021-06-14 16:33:15 -07:00
indexes [SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side 2021-06-15 14:18:54 +09:00
plot [SPARK-35738][PYTHON] Support 'y' properly in DataFrame with non-numeric columns with plots 2021-06-12 14:36:46 +09:00
__init__.py
test_categorical.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
test_config.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_csv.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_dataframe.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
test_dataframe_conversion.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_dataframe_spark_io.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
test_default_index.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_expanding.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
test_extension.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_frame_spark.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_groupby.py [SPARK-35705][PYTHON] Adjust pandas-on-spark test_groupby_multiindex_columns test for different pandas versions 2021-06-10 10:36:19 +09:00
test_indexing.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_indexops_spark.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_internal.py [SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes 2021-06-07 13:12:12 -07:00
test_namespace.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
test_numpy_compat.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_ops_on_diff_frames.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
test_ops_on_diff_frames_groupby.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
test_ops_on_diff_frames_groupby_expanding.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_ops_on_diff_frames_groupby_rolling.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_repr.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_reshape.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_rolling.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_series.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
test_series_conversion.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_series_datetime.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_series_string.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_sql.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_stats.py [SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true 2021-05-28 17:35:01 +09:00
test_typedef.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_utils.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00
test_window.py [SPARK-35364][PYTHON] Renaming the existing Koalas related codes 2021-05-20 15:08:30 -07:00