spark-instrumented-optimizer/python/pyspark/pandas
Xinrong Meng 44cfce8548 [SPARK-36274][PYTHON] Fix equality comparison of unordered Categoricals
### What changes were proposed in this pull request?
Fix equality comparison of unordered Categoricals.

### Why are the changes needed?
Codes of a Categorical Series are used for Series equality comparison. However, that doesn't work for unordered Categoricals, where the same value can have different codes when the two Series hold the same categories in a different order.

So we should map the codes back to their values and then compare the values for equality.
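
The mismatch can be illustrated with plain pandas, without Spark. The following is a minimal sketch, not the code from this patch: the helper `decode` and the variable names are hypothetical, and it assumes neither Categorical contains missing values (code `-1`).

```python
import pandas as pd

# Same set of categories, but listed in a different order.
left = pd.Categorical(list("abca"))                           # categories: a, b, c
right = pd.Categorical(list("bcaa"), categories=list("bca"))  # categories: b, c, a

# Comparing raw codes: "a" is code 0 on the left but code 2 on the right,
# so the codes agree exactly where the values differ.
codes_equal = [int(l) == int(r) for l, r in zip(left.codes, right.codes)]


def decode(cat):
    """Hypothetical helper: map each code back to its category value.

    Assumes no missing values, i.e. no code equal to -1.
    """
    return [cat.categories[code] for code in cat.codes]


# Comparing decoded values gives the element-wise result one would expect.
values_equal = [l == r for l, r in zip(decode(left), decode(right))]

print(codes_equal)
print(values_equal)
```

Here the code-based comparison corresponds to the old behavior shown below, and the value-based comparison to the new one.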

### Does this PR introduce _any_ user-facing change?
Yes.
From:
```py
>>> import pandas as pd
>>> import pyspark.pandas as ps
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0     True
1     True
2     True
3    False
dtype: bool
```

To:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0    False
1    False
2    False
3     True
dtype: bool
```

### How was this patch tested?
Unit tests.

Closes #33497 from xinrong-databricks/cat_bug.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 85adc2ff60)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 18:31:18 -07:00
..
data_type_ops [SPARK-36274][PYTHON] Fix equality comparison of unordered Categoricals 2021-07-23 18:31:18 -07:00
indexes [SPARK-36264][PYTHON] Add reorder_categories to CategoricalAccessor and CategoricalIndex 2021-07-23 17:19:32 -07:00
missing [SPARK-36264][PYTHON] Add reorder_categories to CategoricalAccessor and CategoricalIndex 2021-07-23 17:19:32 -07:00
plot [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark 2021-06-28 19:03:42 -07:00
spark [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark 2021-06-29 10:52:24 -07:00
tests [SPARK-36274][PYTHON] Fix equality comparison of unordered Categoricals 2021-07-23 18:31:18 -07:00
typedef [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs 2021-07-16 11:41:53 +09:00
usage_logging [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
__init__.py [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package 2021-07-22 14:21:53 +09:00
_typing.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
accessors.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
base.py [SPARK-36265][PYTHON] Use __getitem__ instead of getItem to suppress warnings 2021-07-23 11:27:42 +09:00
categorical.py [SPARK-36264][PYTHON] Add reorder_categories to CategoricalAccessor and CategoricalIndex 2021-07-23 17:19:32 -07:00
config.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
datetimes.py [SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor 2021-06-01 10:33:10 +09:00
exceptions.py [SPARK-35465][PYTHON] Set up the mypy configuration to enable disallow_untyped_defs check for pandas APIs on Spark module 2021-05-21 11:03:35 -07:00
extensions.py [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark 2021-06-29 10:52:24 -07:00
frame.py [SPARK-36265][PYTHON] Use __getitem__ instead of getItem to suppress warnings 2021-07-23 11:27:42 +09:00
generic.py [SPARK-35806][PYTHON] Mapping the mode argument to pandas in DataFrame.to_csv 2021-07-19 19:58:19 +09:00
groupby.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
indexing.py [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs 2021-07-16 11:41:53 +09:00
internal.py [SPARK-36167][PYTHON][3.2] Revisit more InternalField managements 2021-07-20 09:30:35 +09:00
ml.py [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs 2021-07-16 11:41:53 +09:00
mlflow.py [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs 2021-07-16 11:41:53 +09:00
namespace.py [SPARK-35810][PYTHON][FOLLWUP] Deprecate ps.broadcast API 2021-07-22 17:10:14 +09:00
numpy_compat.py [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark 2021-06-28 19:03:42 -07:00
series.py [SPARK-36167][PYTHON][3.2] Revisit more InternalField managements 2021-07-20 09:30:35 +09:00
sql_processor.py [SPARK-35809][PYTHON] Add index_col argument for ps.sql 2021-07-22 17:08:42 +09:00
strings.py [SPARK-35761][PYTHON] Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings 2021-06-15 11:17:56 +09:00
utils.py [SPARK-35806][PYTHON] Mapping the mode argument to pandas in DataFrame.to_csv 2021-07-19 19:58:19 +09:00
window.py [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark 2021-06-29 10:52:24 -07:00