b9aeeb4e6c
### What changes were proposed in this pull request? This PR fix the wrong behavior of `Index.difference` in pandas APIs on Spark, based on the comment https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007 - it couldn't handle the case properly when `self` is `Index` or `MultiIndex` and `other` is `MultiIndex` or `Index`. ```python >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)]) >>> idx1 = ps.Index([1, 2, 3]) >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)]) >>> midx1.difference(idx1) pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead. ``` - it's collecting the all data into the driver side when the other is list-like objects, especially when the `other` is distributed object such as Series which is very dangerous. And added the related test cases. ### Why are the changes needed? To correct the incompatible behavior with pandas, and to prevent the case which potentially cause the OOM easily. ```python >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)]) >>> idx1 = ps.Index([1, 2, 3]) >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)]) >>> midx1.difference(idx1) MultiIndex([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)], ) ``` And now it only using the for loop when the `other` is only the case `list`, `set` or `dict`. ### Does this PR introduce _any_ user-facing change? Yes, the previous bug is fixed as described in the above code examples. ### How was this patch tested? Manually tested with linter and unittest in local, and it might be passed on CI. Closes #32853 from itholic/SPARK-35683. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> |
||
---|---|---|
.. | ||
data_type_ops | ||
indexes | ||
plot | ||
__init__.py | ||
test_categorical.py | ||
test_config.py | ||
test_csv.py | ||
test_dataframe.py | ||
test_dataframe_conversion.py | ||
test_dataframe_spark_io.py | ||
test_default_index.py | ||
test_expanding.py | ||
test_extension.py | ||
test_frame_spark.py | ||
test_groupby.py | ||
test_indexing.py | ||
test_indexops_spark.py | ||
test_internal.py | ||
test_namespace.py | ||
test_numpy_compat.py | ||
test_ops_on_diff_frames.py | ||
test_ops_on_diff_frames_groupby.py | ||
test_ops_on_diff_frames_groupby_expanding.py | ||
test_ops_on_diff_frames_groupby_rolling.py | ||
test_repr.py | ||
test_reshape.py | ||
test_rolling.py | ||
test_series.py | ||
test_series_conversion.py | ||
test_series_datetime.py | ||
test_series_string.py | ||
test_sql.py | ||
test_stats.py | ||
test_typedef.py | ||
test_utils.py | ||
test_window.py |