spark-instrumented-optimizer/python/pyspark/pandas
Xinrong Meng 6e4e04f2a1 [SPARK-35615][PYTHON] Make unary and comparison operators data-type-based
### What changes were proposed in this pull request?
Make unary and comparison operators data-type-based. Refactored operators include:
- Unary operators: `__neg__`, `__abs__`, `__invert__`
- Comparison operators: `>`, `>=`, `<`, `<=`, `==`, `!=`
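
As a rough illustration of what "data-type-based" means here, the sketch below routes each operator through a per-type operations object, so that every data type decides for itself whether an operator is supported. All names in it (`DataTypeOps`, `NumericOps`, `BooleanOps`, `BinaryOps`, `ops_for`, `MiniSeries`) are hypothetical, and plain Python lists stand in for Spark columns. It is a simplified sketch of the pattern, not the code in `data_type_ops`.

```python
from typing import Any, List


class DataTypeOps:
    """Hypothetical base class: every operator is unsupported by default."""

    pretty_name = "objects"

    def neg(self, values: List[Any]) -> List[Any]:
        raise TypeError("Unary - can not be applied to %s." % self.pretty_name)

    def lt(self, left: List[Any], right: List[Any]) -> List[bool]:
        raise TypeError("< can not be applied to %s." % self.pretty_name)


class NumericOps(DataTypeOps):
    pretty_name = "numerics"

    def neg(self, values: List[Any]) -> List[Any]:
        return [-v for v in values]

    def lt(self, left: List[Any], right: List[Any]) -> List[bool]:
        return [a < b for a, b in zip(left, right)]


class BooleanOps(DataTypeOps):
    pretty_name = "booleans"

    def neg(self, values: List[Any]) -> List[Any]:
        # Assumption for this sketch: unary minus on booleans is logical NOT.
        return [not v for v in values]


class BinaryOps(DataTypeOps):
    # Inherits the defaults, so unary minus raises a TypeError.
    pretty_name = "binaries"


def ops_for(values: List[Any]) -> DataTypeOps:
    """Pick the operations object from the first element's Python type."""
    first = values[0] if values else None
    if isinstance(first, bool):  # check bool first: bool is a subclass of int
        return BooleanOps()
    if isinstance(first, (int, float)):
        return NumericOps()
    if isinstance(first, bytes):
        return BinaryOps()
    return DataTypeOps()


class MiniSeries:
    """Toy stand-in for a Series that delegates operators to its ops object."""

    def __init__(self, values: List[Any]) -> None:
        self.values = list(values)
        self._dtype_ops = ops_for(self.values)

    def __neg__(self) -> "MiniSeries":
        return MiniSeries(self._dtype_ops.neg(self.values))

    def __lt__(self, other: "MiniSeries") -> "MiniSeries":
        return MiniSeries(self._dtype_ops.lt(self.values, other.values))
```

With this layout, supporting an operator for a new data type means overriding one method in that type's class, and an unsupported combination fails early with a `TypeError` instead of surfacing a later analysis error.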

Non-goals: the follow-up tasks below were identified during the development of this PR and are out of its scope.
- [[SPARK-35997] Implement comparison operators for CategoricalDtype in pandas API on Spark](https://issues.apache.org/jira/browse/SPARK-35997)
- [[SPARK-36000] Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled](https://issues.apache.org/jira/browse/SPARK-36000)
- [[SPARK-36001] Assume result's index to be disordered in tests with operations on different Series](https://issues.apache.org/jira/browse/SPARK-36001)
- [[SPARK-36002] Consolidate tests for data-type-based operations of decimal Series](https://issues.apache.org/jira/browse/SPARK-36002)
- [[SPARK-36003] Implement unary operator `invert` of numeric ps.Series/Index](https://issues.apache.org/jira/browse/SPARK-36003)

### Why are the changes needed?

We have been refactoring basic operators to be data-type-based for readability, flexibility, and extensibility.
Unary and comparison operators are not yet data-type-based; this PR fills that gap.

### Does this PR introduce _any_ user-facing change?

Yes.

- Better error messages. For example,

Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```

After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
TypeError: Unary - can not be applied to binaries.
```
- Support unary `-` of `bool` Series. For example,

Before:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```

After:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
0    False
1     True
2    False
dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #33162 from xinrong-databricks/datatypeops_refactor.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-07 13:46:50 -07:00
data_type_ops [SPARK-35615][PYTHON] Make unary and comparison operators data-type-based 2021-07-07 13:46:50 -07:00
indexes [SPARK-35684][INFRA][PYTHON] Bump up mypy version in GitHub Actions 2021-07-07 13:26:28 +09:00
missing [SPARK-35071][PYTHON] Rename Koalas to pandas-on-Spark in main codes 2021-04-15 12:48:59 +09:00
plot [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark 2021-06-28 19:03:42 -07:00
spark [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark 2021-06-29 10:52:24 -07:00
tests [SPARK-35615][PYTHON] Make unary and comparison operators data-type-based 2021-07-07 13:46:50 -07:00
typedef [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark 2021-06-29 10:52:24 -07:00
usage_logging [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
__init__.py [SPARK-35873][PYTHON] Cleanup the version logic from the pandas API on Spark 2021-06-30 10:01:51 +09:00
_typing.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
accessors.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
base.py [SPARK-35615][PYTHON] Make unary and comparison operators data-type-based 2021-07-07 13:46:50 -07:00
categorical.py [SPARK-35465][PYTHON] Set up the mypy configuration to enable disallow_untyped_defs check for pandas APIs on Spark module 2021-05-21 11:03:35 -07:00
config.py [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes 2021-06-06 17:30:07 -07:00
datetimes.py [SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor 2021-06-01 10:33:10 +09:00
exceptions.py [SPARK-35465][PYTHON] Set up the mypy configuration to enable disallow_untyped_defs check for pandas APIs on Spark module 2021-05-21 11:03:35 -07:00
extensions.py [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark 2021-06-29 10:52:24 -07:00
frame.py [SPARK-35684][INFRA][PYTHON] Bump up mypy version in GitHub Actions 2021-07-07 13:26:28 +09:00
generic.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
groupby.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
indexing.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
internal.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
ml.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
mlflow.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
namespace.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
numpy_compat.py [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark 2021-06-28 19:03:42 -07:00
series.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
sql_processor.py [SPARK-35465][PYTHON] Set up the mypy configuration to enable disallow_untyped_defs check for pandas APIs on Spark module 2021-05-21 11:03:35 -07:00
strings.py [SPARK-35761][PYTHON] Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings 2021-06-15 11:17:56 +09:00
utils.py [SPARK-35944][PYTHON] Introduce Name and Label type aliases 2021-07-01 09:40:07 +09:00
window.py [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark 2021-06-29 10:52:24 -07:00