spark-instrumented-optimizer

itholic b9aeeb4e6c [SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side

### What changes were proposed in this pull request?

This PR fixes the incorrect behavior of `Index.difference` in pandas APIs on Spark, based on the comments https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007

- It couldn't properly handle the case where `self` is an `Index` or `MultiIndex` and `other` is a `MultiIndex` or `Index`.

```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1.difference(idx1)
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```

- It collected all the data to the driver side when `other` was a list-like object. This is very dangerous, especially when `other` is a distributed object such as a Series.

Related test cases are also added.

### Why are the changes needed?

To correct the behavior that is incompatible with pandas, and to prevent cases that can easily cause an OOM.

```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1.difference(idx1)
MultiIndex([('a', 'x', 1),
            ('b', 'z', 2),
            ('k', 'z', 3)],
           )
```

It now uses the for loop only when `other` is a `list`, `set` or `dict`.

### Does this PR introduce _any_ user-facing change?

Yes, the previous bug is fixed as described in the code examples above.

### How was this patch tested?

Manually tested with the linter and unit tests locally; it is also expected to pass on CI.

Closes #32853 from itholic/SPARK-35683.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 14:18:54 +09:00
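The rule stated in the commit message (loop on the driver only when `other` is a `list`, `set` or `dict`) can be illustrated with a small pure-Python sketch. This is not the code from the PR: `FakeDistributed`, `can_loop_locally` and `local_difference` are hypothetical names invented for illustration, and the sketch does not use Spark at all. It only shows the type check that keeps a distributed object from being iterated, and therefore collected, on the driver.

```python
from typing import Any, List, Tuple


class FakeDistributed:
    """Hypothetical stand-in for a distributed object such as a Series."""

    def __iter__(self):
        # Iterating would pull every row to the driver, so refuse.
        raise NotImplementedError("iteration would collect data to the driver")


def can_loop_locally(other: Any) -> bool:
    """Return True only for plain Python containers already held on the driver."""
    return isinstance(other, (list, set, dict))


def local_difference(values: List[Tuple], other: Any) -> List[Tuple]:
    """Driver-side difference, allowed only for local containers (assumption)."""
    if not can_loop_locally(other):
        # A distributed `other` must take a path that never iterates it here.
        raise TypeError("other is not a local list, set or dict")
    excluded = set(other)  # for a dict, this takes its keys
    # Deduplicate, drop excluded items, and return them in sorted order.
    return sorted(v for v in set(values) if v not in excluded)


values = [("a", "x", 1), ("b", "z", 2), ("k", "z", 3)]
print(local_difference(values, [("b", "z", 2)]))
print(can_loop_locally(FakeDistributed()))
```

In this sketch a distributed `other` is rejected outright; how the real implementation computes the difference for such inputs without collecting them is not described in the commit message and is left out here.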
data_type_ops	[SPARK-35616][PYTHON] Make `astype` method data-type-based	2021-06-14 16:33:15 -07:00
indexes	[SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side	2021-06-15 14:18:54 +09:00
plot	[SPARK-35738][PYTHON] Support 'y' properly in DataFrame with non-numeric columns with plots	2021-06-12 14:36:46 +09:00
__init__.py
test_categorical.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
test_config.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_csv.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_dataframe.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
test_dataframe_conversion.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_dataframe_spark_io.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
test_default_index.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_expanding.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
test_extension.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_frame_spark.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_groupby.py	[SPARK-35705][PYTHON] Adjust pandas-on-spark `test_groupby_multiindex_columns` test for different pandas versions	2021-06-10 10:36:19 +09:00
test_indexing.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_indexops_spark.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_internal.py	[SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes	2021-06-07 13:12:12 -07:00
test_namespace.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
test_numpy_compat.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_ops_on_diff_frames.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
test_ops_on_diff_frames_groupby.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
test_ops_on_diff_frames_groupby_expanding.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_ops_on_diff_frames_groupby_rolling.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_repr.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_reshape.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_rolling.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_series.py	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes	2021-06-06 17:30:07 -07:00
test_series_conversion.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_series_datetime.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_series_string.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_sql.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_stats.py	[SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true	2021-05-28 17:35:01 +09:00
test_typedef.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_utils.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00
test_window.py	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes	2021-05-20 15:08:30 -07:00