133 commits

Author SHA1 Message Date
itholic b8508f4876 [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow latest pandas behavior.

As of pandas 1.3, `RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in the values.

Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       A    B
A
1 0  NaN  NaN
  1  2.0  1.0
2 2  NaN  NaN
3 3  NaN  NaN
```

After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       B
A
1 0  NaN
  1  1.0
2 2  NaN
3 3  NaN
```

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the result of `RollingGroupBy` and `ExpandingGroupBy` is changed as described above.

### How was this patch tested?

Unit tests.

Closes #33646 from itholic/SPARK-36388.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-10 10:12:52 +09:00
itholic a9f371c247 [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes fixing the `Index.union` to follow the behavior of pandas 1.3.

Before:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

After:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

This bug is fixed in https://github.com/pandas-dev/pandas/issues/36289.
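
The "After" result can be read as a multiset union: each value is kept as many times as its larger count across the two inputs. The following is a minimal pure-Python sketch of that rule, not the pandas-on-Spark implementation; `union_with_duplicates` is a hypothetical helper name used only for illustration.

```python
from collections import Counter


def union_with_duplicates(left, right):
    """Multiset union: keep each value max(count_left, count_right) times."""
    # For Counters, `|` takes the per-key maximum of the two counts.
    merged = Counter(left) | Counter(right)
    return sorted(merged.elements())


result = union_with_duplicates([1, 1, 1, 1, 1, 2, 2], [1, 1, 2, 2, 2, 2, 2])
print(result)  # [1, 1, 1, 1, 1, 2, 2, 2, 2, 2]
```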

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the result will change in some cases where the indexes contain duplicate values.

### How was this patch tested?

Unit test.

Closes #33634 from itholic/SPARK-36369.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-09 11:10:01 +09:00
itholic 3d72c20e64 [SPARK-35811][PYTHON][FOLLOWUP] Deprecate DataFrame.to_spark_io
### What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/spark/pull/32964 that improves the warning message.

### Why are the changes needed?

To improve the warning message.

### Does this PR introduce _any_ user-facing change?

The warning is changed from "Deprecated in 3.2, Use `spark.to_spark_io` instead." to "Deprecated in 3.2, Use `DataFrame.spark.to_spark_io` instead."

### How was this patch tested?

Manually run `dev/lint-python`

Closes #33631 from itholic/SPARK-35811-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-04 16:20:29 +09:00
Xinrong Meng 8ca11fe39f [SPARK-36192][PYTHON] Better error messages for DataTypeOps against lists
### What changes were proposed in this pull request?
Better error messages for DataTypeOps against lists.

### Why are the changes needed?
Currently, DataTypeOps against lists throw a Py4JJavaError; we should throw a TypeError with a proper message instead.

### Does this PR introduce _any_ user-facing change?
Yes. A TypeError message will be shown rather than a Py4JJavaError.

From:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
py4j.protocol.Py4JJavaError: An error occurred while calling o107.gt.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [3, 2, 1]
...
```

To:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
TypeError: The operation can not be applied to list.
```

### How was this patch tested?
Unit tests.

Closes #33581 from xinrong-databricks/data_type_ops_list.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 16:25:49 +09:00
Takuya UESHIN 8cb9cf39b6 [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3
### What changes were proposed in this pull request?

Disable tests failed by the incompatible behavior of pandas 1.3.

### Why are the changes needed?

Pandas 1.3 has been released.
There are some behavior changes that we should follow, but the fixes are not ready yet.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Disabled some tests related to the behavior change.

Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 14:02:18 +09:00
Yikun Jiang f04e991e6a [SPARK-35976][PYTHON] Adjust astype method for ExtensionDtype in pandas API on Spark
### What changes were proposed in this pull request?
This patch sets missing values to `<NA>` (pd.NA) in BooleanExtensionOps and StringExtensionOps.

### Why are the changes needed?
The pandas behavior:
```python
>>> pd.Series([True, False, None], dtype="boolean").astype(str).tolist()
['True', 'False', '<NA>']
>>> pd.Series(['s1', 's2', None], dtype="string").astype(str).tolist()
['s1', 's2', '<NA>']
```

The pandas API on Spark behavior:
```python
>>> import pandas as pd
>>> from pyspark import pandas as ps

# Before
>>> ps.from_pandas(pd.Series([True, False, None], dtype="boolean")).astype(str).tolist()
['True', 'False', 'None']
>>> ps.from_pandas(pd.Series(['s1', 's2', None], dtype="string")).astype(str).tolist()
['s1', 's2', 'None']

# After
>>> ps.from_pandas(pd.Series([True, False, None], dtype="boolean")).astype(str).tolist()
['True', 'False', '<NA>']
>>> ps.from_pandas(pd.Series(['s1', 's2', None], dtype="string")).astype(str).tolist()
['s1', 's2', '<NA>']
```

See more in [SPARK-35976](https://issues.apache.org/jira/browse/SPARK-35976)

### Does this PR introduce _any_ user-facing change?
Yes, `<NA>` is returned for `None` values, following the pandas behavior.

### How was this patch tested?
Updated the unit tests to cover this scenario.

Closes #33585 from Yikun/SPARK-35976.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:37:25 +09:00
Takuya UESHIN 90d31dfcb7 [SPARK-36365][PYTHON] Remove old workarounds related to null ordering
### What changes were proposed in this pull request?

Remove old workarounds related to null ordering.

### Why are the changes needed?

In pandas-on-Spark, there are still some places that call `Column._jc.(asc|desc)_nulls_(first|last)` as a workaround inherited from Koalas to support Spark 2.3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified a couple of tests and existing tests.

Closes #33597 from ueshin/issues/SPARK-36365/nulls_first_last.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:33:25 +09:00
Hyukjin Kwon 74a6b9d23b [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark
### What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/spark/pull/33570, which mistakenly changed the default value of the default index.

### Why are the changes needed?

It was mistakenly changed. It was changed to check if the tests actually pass but I forgot to change it back.

### Does this PR introduce _any_ user-facing change?

No, it's not released yet. It reverts the default value that was mistakenly changed.
(The changed default value makes the tests flaky because the row order is affected by an extra shuffle.)

### How was this patch tested?

Manually tested.

Closes #33596 from HyukjinKwon/SPARK-36338-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-31 08:31:10 +09:00
Takuya UESHIN 895e3f5e2a [SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps
### What changes were proposed in this pull request?

Move some logic related to `F.nanvl` to `DataTypeOps`.

### Why are the changes needed?

Several places branch on `FloatType` or `DoubleType` to use `F.nanvl`, but `DataTypeOps` should handle that.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33582 from ueshin/issues/SPARK-36350/nan_to_null.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-30 11:19:49 -07:00
Hyukjin Kwon c6140d4d0a [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side
### What changes were proposed in this pull request?

This PR proposes to implement the `distributed-sequence` index on the Scala side.

### Why are the changes needed?

- Avoid unnecessary (de)serialization
- Keep the nullability of the input DataFrame when `distributed-sequence` is enabled. During serialization, all fields currently become nullable (see https://github.com/apache/spark/pull/32775#discussion_r645882104)

### Does this PR introduce _any_ user-facing change?

No to end users since pandas API on Spark is not released yet.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(1).spark.print_schema()
```

Before:

```
root
 |-- id: long (nullable = true)
```

After:

```
root
 |-- id: long (nullable = false)
```

### How was this patch tested?

Manually tested, and existing tests should cover them.

Closes #33570 from HyukjinKwon/SPARK-36338.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 22:29:23 +09:00
Hyukjin Kwon dd2ca0aee2 [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark
### What changes were proposed in this pull request?

This PR is a partial revert of https://github.com/apache/spark/pull/33567 that keeps the logic to skip mlflow related tests if that's not installed.

### Why are the changes needed?

It's consistent with how other libraries are handled, e.g., PyArrow.
It also fixes a potential dev breakage (see also https://github.com/apache/spark/pull/33567#issuecomment-889841829).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

This is a partial revert. CI should test it out too.

Closes #33589 from HyukjinKwon/SPARK-36254.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 22:28:19 +09:00
itholic abce61f3fd [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
### What changes were proposed in this pull request?

This PR proposes adding the Python packages `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.

### Why are the changes needed?

To enable the MLflow test in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

No, it's test-only.

### How was this patch tested?

Manually tested locally with `python/run-tests --testnames pyspark.pandas.mlflow`.

Closes #33567 from itholic/SPARK-36254.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-30 00:04:48 -07:00
itholic 94cb2bbbc2 [SPARK-35806][PYTHON][FOLLOW-UP] Mapping the mode argument to pandas in DataFrame.to_csv
### What changes were proposed in this pull request?

This PR is a follow-up to https://github.com/apache/spark/pull/33414 that supports more options for the `mode` argument in all APIs that have a `mode` argument, not only `DataFrame.to_csv`.

### Why are the changes needed?

To keep usage consistent across arguments that have the same name.

### Does this PR introduce _any_ user-facing change?

More options are available for all APIs that have a `mode` argument, the same as for `DataFrame.to_csv`.

### How was this patch tested?

Manually tested locally.

Closes #33569 from itholic/SPARK-35085-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 12:48:24 +09:00
Takuya UESHIN 07ed82be0b [SPARK-36333][PYTHON] Reuse isnull where the null check is needed
### What changes were proposed in this pull request?

Reuse `IndexOpsMixin.isnull()` where the null check is needed.

### Why are the changes needed?

There are some places where we can reuse `IndexOpsMixin.isnull()` instead of directly using Spark `Column`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33562 from ueshin/issues/SPARK-36333/reuse_isnull.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-29 15:33:11 -07:00
Xinrong Meng 9c5cb99d6e [SPARK-36190][PYTHON] Improve the rest of DataTypeOps tests by avoiding joins
### What changes were proposed in this pull request?
Improve the rest of DataTypeOps tests by avoiding joins.

### Why are the changes needed?
bool, string, numeric DataTypeOps tests have been improved by avoiding joins.
We should improve the rest of the DataTypeOps tests in the same way.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33546 from xinrong-databricks/test_no_join.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 15:53:38 -07:00
Xinrong Meng 01213095e2 [SPARK-36143][PYTHON] Adjust astype of fractional Series with missing values to follow pandas
### What changes were proposed in this pull request?
Adjust `astype` of fractional Series with missing values to follow pandas.

Non-goal: Adjust the issue of `astype` of Decimal Series with missing values to follow pandas.

### Why are the changes needed?
`astype` of a fractional Series with missing values doesn't behave the same as in pandas. For example, a float Series returns itself when cast with `astype` to integer, while pandas raises a ValueError.

We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes.

From:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
0    1.0
1    2.0
2    NaN
dtype: float64

```

To:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert fractions with missing values to integer

```

### How was this patch tested?
Unit tests.

Closes #33466 from xinrong-databricks/extension_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 11:26:48 -07:00
Takuya UESHIN 3c76a924ce [SPARK-36320][PYTHON] Fix Series/Index.copy() to drop extra columns
### What changes were proposed in this pull request?

Fix `Series`/`Index.copy()` to drop extra columns.

### Why are the changes needed?

Currently `Series`/`Index.copy()` keeps a copy of the anchor DataFrame, which holds unnecessary columns.
We can drop those columns in `Series`/`Index.copy()`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33549 from ueshin/issues/SPARK-36320/index_ops_copy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 18:39:53 +09:00
Takuya UESHIN bcc595c112 [SPARK-36310][PYTHON] Fix IndexOpsMixin.hasnans to use isnull().any()
### What changes were proposed in this pull request?

Fix `IndexOpsMixin.hasnans` to use `IndexOpsMixin.isnull().any()`.

### Why are the changes needed?

`IndexOpsMixin.hasnans` can potentially cause an `a window function inside an aggregate function` error.
It also returns a wrong value when the `Series`/`Index` is empty.

```py
>>> ps.Series([]).hasnans
None
```

whereas:

```py
>>> pd.Series([]).hasnans
False
```

`IndexOpsMixin.any()` is safe for both cases.
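
The empty case falls out of how an "any" reduction works: with no elements there is nothing that can be true, so the result is `False` rather than missing. A minimal pure-Python sketch of the idea, assuming missing values are represented as `None` or a float NaN; the `hasnans` function below is a hypothetical stand-in, not the pandas-on-Spark code:

```python
import math


def hasnans(values):
    """Return True if any element is missing (None or a float NaN)."""
    return any(
        v is None or (isinstance(v, float) and math.isnan(v)) for v in values
    )


empty_result = hasnans([])                    # no elements, so any() is False
nan_result = hasnans([1.0, float("nan")])     # one NaN is enough
clean_result = hasnans([1.0, 2.0])
```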

### Does this PR introduce _any_ user-facing change?

`IndexOpsMixin.hasnans` will return `False` when empty.

### How was this patch tested?

Added some tests.

Closes #33547 from ueshin/issues/SPARK-36310/hasnan.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 09:21:12 +09:00
Takuya UESHIN c40d9d46f1 [SPARK-36267][PYTHON] Clean up CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Clean up `CategoricalAccessor` and `CategoricalIndex`.

- Clean up the classes
- Add deprecation warnings
- Clean up the docs

### Why are the changes needed?

To finalize the series of PRs for `CategoricalAccessor` and `CategoricalIndex`, we should clean up the classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33528 from ueshin/issues/SPARK-36267/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:17:18 +09:00
Yikun Jiang d52c2de08b [SPARK-36142][PYTHON] Follow Pandas when pow between fractional series with Na and bool literal
### What changes were proposed in this pull request?

Set the result to 1 when the exponent is 0 (or False).

### Why are the changes needed?
Currently, exponentiation between fractional series and bools is not consistent with pandas' behavior.
```
 >>> pser = pd.Series([1, 2, np.nan], dtype=float)
 >>> psser = ps.from_pandas(pser)
 >>> pser ** False
 0 1.0
 1 1.0
 2 1.0
 dtype: float64
 >>> psser ** False
 0 1.0
 1 1.0
 2 NaN
 dtype: float64
```
We ought to adjust that.
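
Plain Python floats follow the same convention as the pandas result shown above: raising any value, including NaN, to the power 0 (or `False`, which is the integer 0) yields 1.0. The snippet below is only an illustration of that convention with built-in floats; it does not touch Spark.

```python
import math

nan = float("nan")

# `False` behaves as the integer 0, and x ** 0 is 1.0 even when x is NaN.
results = [x ** False for x in (1.0, 2.0, nan)]
print(results)  # [1.0, 1.0, 1.0]

# math.pow follows the same rule.
pow_result = math.pow(nan, 0)
```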

See more in [SPARK-36142](https://issues.apache.org/jira/browse/SPARK-36142)

### Does this PR introduce _any_ user-facing change?
Yes. The result of pow between a fractional Series with missing values and a bool literal changes to follow the pandas behavior.

### How was this patch tested?
- Added the `test_pow_with_float_nan` unit test
- Existing tests in `test_pow`

Closes #33521 from Yikun/SPARK-36142.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:06:05 +09:00
Xinrong Meng 55971b70fe [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?
Add set_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
set_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to use `set_categories`.

### How was this patch tested?
Unit tests.

Closes #33506 from xinrong-databricks/set_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-26 17:12:33 -07:00
Takuya UESHIN 663cbdfbe5 [SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9
### What changes were proposed in this pull request?

Fix `lint-python` to pick `PYTHON_EXECUTABLE` from the environment variable first to switch the Python and explicitly specify `PYTHON_EXECUTABLE` to use `python3.9` in CI.

### Why are the changes needed?

Currently `lint-python` uses `python3`, but it's not the one we expect in CI.
As a result, `black` check is not working.

```
The python3 -m black command was not found. Skipping black checks for now.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The `black` check in `lint-python` should work.

Closes #33507 from ueshin/issues/SPARK-36279/lint-python.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:49:11 +09:00
Xinrong Meng 85adc2ff60 [SPARK-36274][PYTHON] Fix equality comparison of unordered Categoricals
### What changes were proposed in this pull request?
Fix equality comparison of unordered Categoricals.

### Why are the changes needed?
Codes of a Categorical Series are used for Series equality comparison. However, that doesn't work for unordered Categoricals, where the same value can have different codes when the two Series list the same categories in a different order.

So we should map the codes of each Series back to their values and then compare the values for equality.
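
To illustrate why comparing codes goes wrong, here is a minimal pure-Python sketch using the same data as the example in the user-facing change section, written as explicit (codes, categories) pairs. The `decode` helper is hypothetical and only stands in for the idea of mapping codes back to values; it is not the actual implementation.

```python
def decode(codes, categories):
    """Map each integer code back to the category value it stands for."""
    return [categories[c] for c in codes]


# list("abca") with categories a, b, c
codes1, cats1 = [0, 1, 2, 0], ["a", "b", "c"]
# list("bcaa") with categories b, c, a
codes2, cats2 = [0, 1, 2, 2], ["b", "c", "a"]

# Comparing raw codes ignores that the two category lists are ordered differently.
by_code = [c1 == c2 for c1, c2 in zip(codes1, codes2)]

# Comparing decoded values gives the intended result.
by_value = [
    v1 == v2 for v1, v2 in zip(decode(codes1, cats1), decode(codes2, cats2))
]

print(by_code)   # [True, True, True, False]
print(by_value)  # [False, False, False, True]
```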

### Does this PR introduce _any_ user-facing change?
Yes.
From:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0     True
1     True
2     True
3    False
dtype: bool
```

To:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0    False
1    False
2    False
3     True
dtype: bool
```

### How was this patch tested?
Unit tests.

Closes #33497 from xinrong-databricks/cat_bug.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 18:30:59 -07:00
Takuya UESHIN e12bc4d31d [SPARK-36264][PYTHON] Add reorder_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `reorder_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `reorder_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `reorder_categories`.

### How was this patch tested?

Added some tests.

Closes #33499 from ueshin/issues/SPARK-36264/reorder_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 17:19:20 -07:00
Takuya UESHIN 2fe12a7520 [SPARK-36261][PYTHON] Add remove_unused_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `remove_unused_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `remove_unused_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `remove_unused_categories`.

### How was this patch tested?

Added some tests.

Closes #33485 from ueshin/issues/SPARK-36261/remove_unused_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 14:04:59 +09:00
Xinrong Meng 8b3d84bb7e [SPARK-36248][PYTHON] Add rename_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?
Add rename_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
rename_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes. `rename_categories` is supported in pandas API on Spark now.

```py
# CategoricalIndex
>>> psser = ps.CategoricalIndex(["a", "a", "b"])
>>> psser.rename_categories([0, 1])
CategoricalIndex([0, 0, 1], categories=[0, 1], ordered=False, dtype='category')
>>> psser.rename_categories({'a': 'A', 'c': 'C'})
CategoricalIndex(['A', 'A', 'b'], categories=['A', 'b'], ordered=False, dtype='category')
>>> psser.rename_categories(lambda x: x.upper())
CategoricalIndex(['A', 'A', 'B'], categories=['A', 'B'], ordered=False, dtype='category')

# CategoricalAccessor
>>> s = ps.Series(["a", "a", "b"], dtype="category")
>>> s.cat.rename_categories([0, 1])
0    0
1    0
2    1
dtype: category
Categories (2, int64): [0, 1]
>>> s.cat.rename_categories({'a': 'A', 'c': 'C'})
0    A
1    A
2    b
dtype: category
Categories (2, object): ['A', 'b']
>>> s.cat.rename_categories(lambda x: x.upper())
0    A
1    A
2    B
dtype: category
Categories (2, object): ['A', 'B']
```

### How was this patch tested?
Unit tests.

Closes #33471 from xinrong-databricks/category_rename_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 12:26:24 +09:00
Xinrong Meng 75fd1f5b82 [SPARK-36189][PYTHON] Improve bool, string, numeric DataTypeOps tests by avoiding joins
### What changes were proposed in this pull request?
Improve bool, string, numeric DataTypeOps tests by avoiding joins.

Previously, bool, string, numeric DataTypeOps tests were conducted between two different Series.
After this PR, bool, string, numeric DataTypeOps tests are performed on a single DataFrame.

### Why are the changes needed?
A considerable number of DataTypeOps tests have operations on different Series, so joining is needed, which takes a long time.
We shall avoid joins for a shorter test duration.

The majority of joins happen in bool, string, numeric DataTypeOps tests, so we improve them first.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33402 from xinrong-databricks/datatypeops_diffframe.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 12:20:35 +09:00
Takuya UESHIN a76a087f7f [SPARK-36265][PYTHON] Use __getitem__ instead of getItem to suppress warnings
### What changes were proposed in this pull request?

Use `Column.__getitem__` instead of `Column.getItem` to suppress warnings.

### Why are the changes needed?

In the pandas API on Spark code base, there are some places that call `Column.getItem` with a `Column` object, which shows a deprecation warning.

### Does this PR introduce _any_ user-facing change?

Yes, users won't see the warnings anymore.

- before

```py
>>> s = ps.Series(list("abbccc"), dtype="category")
>>> s.astype(str)
/path/to/spark/python/pyspark/sql/column.py:322: FutureWarning: A column as 'key' in getItem is deprecated as of Spark 3.0, and will not be supported in the future release. Use `column[key]` or `column.key` syntax instead.
  warnings.warn(
0    a
1    b
2    b
3    c
4    c
5    c
dtype: object
```

- after

```py
>>> s = ps.Series(list("abbccc"), dtype="category")
>>> s.astype(str)
0    a
1    b
2    b
3    c
4    c
5    c
dtype: object
```

### How was this patch tested?

Existing tests.

Closes #33486 from ueshin/issues/SPARK-36265/getitem.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 11:27:31 +09:00
itholic d1a037a27c [SPARK-35810][PYTHON][FOLLWUP] Deprecate ps.broadcast API
### What changes were proposed in this pull request?

This PR follows up #33379 to fix a build error in Sphinx.

### Why are the changes needed?

The Sphinx build fails because of a missing newline in a docstring.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually tested the Sphinx build.

Closes #33479 from itholic/SPARK-35810-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:10:03 +09:00
itholic 6578f0b135 [SPARK-35809][PYTHON] Add index_col argument for ps.sql
### What changes were proposed in this pull request?

This PR proposes adding an argument `index_col` for `ps.sql` function, to preserve the index when users want.

Note that `reset_index()` has to be performed before using `ps.sql` with `index_col`.

```python
>>> psdf
   A  B
a  1  4
b  2  5
c  3  6
>>> psdf_reset_index = psdf.reset_index()
>>> ps.sql("SELECT * from {psdf_reset_index} WHERE A > 1", index_col="index")
       A  B
index
b      2  5
c      3  6
```

Otherwise, the index is always lost.

```python
>>> ps.sql("SELECT * from {psdf} WHERE A > 1")
   A  B
0  2  5
1  3  6
```

### Why are the changes needed?

The index is one of the key objects for existing pandas users, so we should provide a way to keep the index after running `ps.sql`.

### Does this PR introduce _any_ user-facing change?

Yes, the new argument is added.

### How was this patch tested?

Added a unit test and manually checked that the build passes.

Closes #33450 from itholic/SPARK-35809.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:08:34 +09:00
Takuya UESHIN a3c7ae18e2 [SPARK-36249][PYTHON] Add remove_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `remove_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `remove_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `remove_categories`.

### How was this patch tested?

Added some tests.

Closes #33474 from ueshin/issues/SPARK-36249/remove_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:06:12 +09:00
Takuya UESHIN dcc0aaa3ef [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `add_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `add_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `add_categories`.

### How was this patch tested?

Added some tests.

Closes #33470 from ueshin/issues/SPARK-36214/add_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-21 22:34:04 -07:00
Hyukjin Kwon f3e29574d9 [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
### What changes were proposed in this pull request?

This PR adds the version that added pandas API on Spark in PySpark documentation.

### Why are the changes needed?

To document the version added.

### Does this PR introduce _any_ user-facing change?

No to end users. Spark 3.2 is not released yet.

### How was this patch tested?

Linter and documentation build.

Closes #33473 from HyukjinKwon/SPARK-36253.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 14:21:43 +09:00
Takuya UESHIN d506815a92 [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add categories setter to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement categories setter in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use categories setter.

### How was this patch tested?

Added some tests.

Closes #33448 from ueshin/issues/SPARK-36188/categories_setter.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-21 11:31:30 -07:00
Takuya UESHIN 376fadc89c [SPARK-36186][PYTHON] Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `as_ordered`/`as_unordered` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `as_ordered`/`as_unordered` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `as_ordered`/`as_unordered`.

### How was this patch tested?

Added some tests.

Closes #33400 from ueshin/issues/SPARK-36186/as_ordered_unordered.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-20 18:23:54 -07:00
Takuya UESHIN c459c707c5 [SPARK-36167][PYTHON][FOLLOWUP] Fix test failures with older versions of pandas
### What changes were proposed in this pull request?

Fix test failures with `pandas < 1.2`.

### Why are the changes needed?

There are some test failures with `pandas < 1.2`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Fixed tests.

Closes #33398 from ueshin/issues/SPARK-36167/test.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 09:21:46 +09:00
Xinrong Meng 8dd43351d5 [SPARK-36127][PYTHON] Support comparison between a Categorical and a scalar
### What changes were proposed in this pull request?
Support comparison between a Categorical and a scalar.
There are 3 main changes:
- Modify `==` and `!=` from comparing **codes** of the Categorical to the scalar to comparing **actual values** of the Categorical to the scalar.
- Support `<`, `<=`, `>`, `>=` between a Categorical and a scalar.
- TypeError message fix.

### Why are the changes needed?
pandas supports comparison between a Categorical and a scalar, so we should follow pandas' behavior.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0     True
1    False
2    False
dtype: bool
>>> psser <= 1
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

After:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0    False
1     True
2    False
dtype: bool
>>> psser <= 1
0    True
1    True
2    True
dtype: bool
```
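The difference between comparing **codes** and comparing **actual values** can be reproduced without pandas or Spark. The following is only a pure-Python sketch of the idea, not the implementation in this PR; the variable names are illustrative, and a plain list stands in for the ordered Categorical from the example above.

```python
# Pure-Python stand-in for an ordered Categorical (illustrative only).
categories = [3, 2, 1]  # ordered categories: 3 < 2 < 1
values = [1, 2, 3]

# Each value is stored as its position (its "code") in `categories`.
codes = [categories.index(v) for v in values]  # [2, 1, 0]

# Comparing the scalar with the codes reproduces the "Before" result.
eq_by_code = [c == 2 for c in codes]  # [True, False, False]

# Comparing the scalar with the actual values reproduces the "After" result.
eq_by_value = [v == 2 for v in values]  # [False, True, False]

# For an ordered comparison, translate the scalar to its code first,
# then compare codes, so the category order decides the result.
scalar_code = categories.index(1)  # 2
le_scalar = [c <= scalar_code for c in codes]  # [True, True, True]
```

Because `1` is the largest category in this ordering, every element is `<= 1`, which matches the "After" output.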

### How was this patch tested?
Unit tests.

Closes #33373 from xinrong-databricks/categorical_eq.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-19 15:06:44 -07:00
itholic 2f42afc53a [SPARK-35806][PYTHON] Mapping the mode argument to pandas in DataFrame.to_csv
### What changes were proposed in this pull request?

`DataFrame.to_csv` has a `mode` argument in both pandas and pandas API on Spark.

However, pandas allows the strings "w", "w+", "a", "a+", whereas pandas-on-Spark allows "append", "overwrite", "ignore", "error" or "errorifexists".

We should map the pandas values to their Spark equivalents, while `mode` still accepts the existing parameters ("append", "overwrite", "ignore", "error" or "errorifexists") as well.
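
A minimal sketch of such a mapping is shown below. The helper name `normalize_mode` is hypothetical, and the pairing of "w"/"w+" with "overwrite" and "a"/"a+" with "append" is an assumption based on the usual meaning of those file modes, not a copy of the code in this PR.

```python
# Hypothetical helper: translate pandas-style file modes to Spark save modes.
# Assumption: "w"/"w+" mean overwrite and "a"/"a+" mean append.
_PANDAS_TO_SPARK_MODE = {
    "w": "overwrite",
    "w+": "overwrite",
    "a": "append",
    "a+": "append",
}
_SPARK_MODES = {"append", "overwrite", "ignore", "error", "errorifexists"}


def normalize_mode(mode: str) -> str:
    """Return a Spark save mode for either a pandas or a Spark mode string."""
    mode = _PANDAS_TO_SPARK_MODE.get(mode, mode)
    if mode not in _SPARK_MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    return mode
```

With this sketch, `normalize_mode("w")` and `normalize_mode("overwrite")` both return `"overwrite"`, so existing callers keep working.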

### Why are the changes needed?

APIs in pandas-on-Spark should follow the behavior of pandas to prevent existing pandas code from breaking.

### Does this PR introduce _any_ user-facing change?

`DataFrame.to_csv` can now accept "w", "w+", "a", "a+" as well, same as pandas.

### How was this patch tested?

Added a unit test and manually wrote files with the newly accepted strings.

Closes #33414 from itholic/SPARK-35806.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:58:11 +09:00
itholic 67e6120a85 [SPARK-35810][PYTHON] Deprecate ps.broadcast API
### What changes were proposed in this pull request?

The `broadcast` function in `pyspark.pandas` duplicates `DataFrame.spark.hint` with `"broadcast"`.

```python
# The below 2 lines are the same
df.spark.hint("broadcast")
ps.broadcast(df)
```

So, we should remove `broadcast` in the future, and show a deprecation warning for now.

### Why are the changes needed?

To remove duplicated functions.

### Does this PR introduce _any_ user-facing change?

Users see a deprecation warning when using `broadcast` in `pyspark.pandas`.

```python
>>> ps.broadcast(df)
FutureWarning: `broadcast` has been deprecated and will be removed in a future version. use `DataFrame.spark.hint` with 'broadcast' for `name` parameter instead.
  warnings.warn(
```
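A deprecation of this kind is usually implemented by having the old function emit a `FutureWarning` and then delegate to its replacement. The sketch below illustrates that pattern with the standard `warnings` module only; `hint` here is a hypothetical stand-in for `DataFrame.spark.hint`, and the body is not the code from this PR.

```python
import warnings


def hint(df, name):
    """Hypothetical stand-in for `DataFrame.spark.hint`; returns a tagged pair."""
    return (name, df)


def broadcast(df):
    """Deprecated wrapper: warn, then delegate to the replacement."""
    warnings.warn(
        "`broadcast` has been deprecated and will be removed in a future version. "
        "use `DataFrame.spark.hint` with 'broadcast' for `name` parameter instead.",
        FutureWarning,
        stacklevel=2,  # attribute the warning to the caller, not this wrapper
    )
    return hint(df, "broadcast")
```

`FutureWarning` is shown to end users by default, which is why it suits a user-facing API removal.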

### How was this patch tested?

Manually checked the warning message and confirmed that the build passed.

Closes #33379 from itholic/SPARK-35810.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 10:44:59 +09:00
Takuya UESHIN c22f7a4834 [SPARK-36167][PYTHON] Revisit more InternalField managements
### What changes were proposed in this pull request?

Revisit and manage `InternalField` in more places.

### Why are the changes needed?

There are other places we can manage `InternalField`, and we can keep extension dtypes or `CategoricalDtype`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some tests.

Closes #33377 from ueshin/issues/SPARK-36167/internal_field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-15 19:25:20 -07:00
Hyukjin Kwon a71dd6af2f [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs
### What changes were proposed in this pull request?

This PR proposes to use Python 3.9 for the documentation build and linter in GitHub Actions. This PR also contains fixes for the mypy check errors introduced by the Python 3.9 upgrade:

```
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name "np.ndarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name "np.recarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name "np.ndarray" is not defined
python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/pandas/typedef/typehints.py:163: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?
python/pyspark/pandas/typedef/typehints.py:174: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float96"?
python/pyspark/pandas/typedef/typehints.py:180: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/ml.py:81: error: Value of type variable "_DTypeScalar_co" of "dtype" cannot be "object"
python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute "tolist"
python/pyspark/pandas/series.py:1030: error: Module has no attribute "_NoValue"
python/pyspark/pandas/series.py:1031: error: Module has no attribute "_NoValue"
python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" has incompatible type "float"; expected "str"
python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment (expression has type "Type[floating[Any]]", variable has type "str")
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:100: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is not defined
python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" defined the type as "List[ndarray[Any, Any]]")
python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" incompatible with supertype "LinearClassificationModel"
Found 32 errors in 15 files (checked 315 source files)
1
```

### Why are the changes needed?

Python 3.6 was deprecated in SPARK-35938.

### Does this PR introduce _any_ user-facing change?

No. Static analysis and similar tooling may be affected by some type hints, but the changes are non-breaking.

### How was this patch tested?

I manually checked with a GitHub Actions build in a forked repository.

Closes #33356 from HyukjinKwon/SPARK-36146.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-15 08:01:54 -07:00
Xinrong Meng 0cb120f390 [SPARK-36125][PYTHON] Implement non-equality comparison operators between two Categoricals
### What changes were proposed in this pull request?
Implement non-equality comparison operators between two Categoricals.
Non-goal: supporting Scalar input will be a follow-up task.

### Why are the changes needed?
pandas supports non-equality comparisons between two Categoricals. We should follow that.

### Does this PR introduce _any_ user-facing change?
Yes. The `<`, `<=`, `>`, `>=` operators between two Categoricals no longer raise `NotImplementedError`. An example is shown below:

From:
```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> psser = ps.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> other_psser = ps.Series([2, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     psser <= other_psser
...
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

To:
```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> psser = ps.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> other_psser = ps.Series([2, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     psser <= other_psser
...
0    False
1     True
2     True
dtype: bool
```
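The "To" result can be checked by hand, because two ordered Categoricals with the same categories compare by category position. The following pure-Python sketch uses plain lists in place of the two Series; `to_codes` is a hypothetical helper and this is not the code added by the PR.

```python
# Illustrative only: plain lists stand in for the two categorical Series.
categories = [3, 2, 1]  # ordered categories: 3 < 2 < 1
left = [1, 2, 3]
right = [2, 1, 3]


def to_codes(values, categories):
    """Map each value to its position in the ordered category list."""
    return [categories.index(v) for v in values]


left_codes = to_codes(left, categories)    # [2, 1, 0]
right_codes = to_codes(right, categories)  # [1, 2, 0]

# Element-wise `<=` on the codes follows the category order.
le = [a <= b for a, b in zip(left_codes, right_codes)]  # [False, True, True]
```
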
### How was this patch tested?
Unit tests.

Closes #33331 from xinrong-databricks/categorical_compare.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-14 14:01:10 -07:00
Kousuke Saruta 47fd3173a5 [SPARK-36104][PYTHON][FOLLOWUP] Remove unused import "typing.cast"
### What changes were proposed in this pull request?

This is a followup PR for SPARK-36104 (#33307) and removes unused import `typing.cast`.
After that change, the Python linter fails:
```
   ./dev/lint-python
  shell: sh -e {0}
  env:
    LC_ALL: C.UTF-8
    LANG: C.UTF-8
    pythonLocation: /__t/Python/3.6.13/x64
    LD_LIBRARY_PATH: /__t/Python/3.6.13/x64/lib
starting python compilation test...
python compilation succeeded.

starting black test...
black checks passed.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks failed:
./python/pyspark/pandas/data_type_ops/num_ops.py:19:1: F401 'typing.cast' imported but unused
from typing import cast, Any, Union
^
1     F401 'typing.cast' imported but unused
```

### Why are the changes needed?

To recover CI.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33315 from sarutak/followup-SPARK-36104.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 13:13:35 +09:00
Xinrong Meng 5afc27f899 [SPARK-36104][PYTHON] Manage InternalField in DataTypeOps.neg/abs
### What changes were proposed in this pull request?
Manage InternalField for DataTypeOps.neg/abs.

### Why are the changes needed?
The Spark data type and nullability must stay the same as the original for DataTypeOps.neg/abs.
We should manage InternalField for this case.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33307 from xinrong-databricks/internalField.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 12:07:05 +09:00
Takuya UESHIN e2021daafb [SPARK-36103][PYTHON] Manage InternalField in DataTypeOps.invert
### What changes were proposed in this pull request?

Properly set `InternalField` for `DataTypeOps.invert`.

### Why are the changes needed?

The Spark data type and nullability must stay the same as the original for `DataTypeOps.invert`.
We should manage `InternalField` for this case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33306 from ueshin/issues/SPARK-36103/invert.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 09:22:27 +09:00
Xinrong Meng badb0393d4 [SPARK-36003][PYTHON] Implement unary operator invert of integral ps.Series/Index
### What changes were proposed in this pull request?
Implement unary operator `invert` of integral ps.Series/Index.

### Why are the changes needed?
Currently, unary operator `invert` of integral ps.Series/Index is not supported. We ought to implement that following pandas' behaviors.

### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> ~psser
Traceback (most recent call last):
...
NotImplementedError: Unary ~ can not be applied to integrals.
```

After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> ~psser
0   -2
1   -3
2   -4
dtype: int64
```
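The values in the "After" output follow from the two's-complement identity `~x == -x - 1` for integers. The check below uses plain Python integers and does not require pandas or Spark.

```python
# Bitwise invert on integers: ~x equals -x - 1 (two's complement).
values = [1, 2, 3]
inverted = [~v for v in values]           # [-2, -3, -4]
by_identity = [-v - 1 for v in values]    # same values via the identity

# Inverting twice returns the original values.
round_trip = [~v for v in inverted]       # [1, 2, 3]
```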

### How was this patch tested?
Unit tests.

Closes #33285 from xinrong-databricks/numeric_invert.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-12 15:10:06 +09:00
Takuya UESHIN 95e6c6e3e9 [SPARK-36064][PYTHON] Manage InternalField more in DataTypeOps
### What changes were proposed in this pull request?

Properly set `InternalField` more in `DataTypeOps`.

### Why are the changes needed?

There are more places in `DataTypeOps` where we can manage `InternalField`.
We should manage `InternalField` for these cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33275 from ueshin/issues/SPARK-36064/fields.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-12 11:55:05 +09:00
Xinrong Meng 698c4ec16b [SPARK-36035][PYTHON] Adjust test_astype, test_neg for old pandas versions
### What changes were proposed in this pull request?
Adjust `test_astype`, `test_neg`  for old pandas versions.

### Why are the changes needed?
There are issues in old pandas versions that fail tests in pandas API on Spark. We ought to adjust `test_astype` and `test_neg` for old pandas versions.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests. Please refer to https://github.com/apache/spark/pull/33272 for test results with pandas 1.0.1.

Closes #33250 from xinrong-databricks/SPARK-36035.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 17:24:20 +09:00
Yikun Jiang fdc50f4452 [SPARK-36002][PYTHON] Consolidate tests for data-type-based operations of decimal Series
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops

- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which is already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)

### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:

- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py

We'd better merge test_decimal_ops into test_num_ops.

See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests passed.

Closes #33206 from Yikun/SPARK-36002.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 14:08:13 +09:00
Xinrong Meng af81ad0d7e [SPARK-36001][PYTHON] Assume result's index to be disordered in tests with operations on different Series
### What changes were proposed in this pull request?
For tests with operations on different Series, sort index of results before comparing them with pandas.

### Why are the changes needed?
We have many tests with operations on different Series in `spark/python/pyspark/pandas/tests/data_type_ops/` that assume the result's index is sorted and then compare the result with pandas' behavior.

The assumption about the result's index ordering is wrong, since a Spark DataFrame join is used internally and the order is not preserved if the data is in different partitions.

So we should assume the result to be disordered and sort the index of such results before comparing them with pandas.
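
The idea can be sketched in plain Python, with `(index, value)` tuples standing in for the rows of a Series. The helper name `assert_same_rows` is hypothetical and is not one of the test utilities used in the Spark test suite.

```python
# Rows are (index, value) pairs; the expected result is in index order.
expected = [(0, 10), (1, 20), (2, 30)]

# A join across partitions may return the same rows in any order.
actual = [(2, 30), (0, 10), (1, 20)]


def assert_same_rows(actual, expected):
    """Compare two results after sorting both by index."""
    by_index = lambda row: row[0]
    assert sorted(actual, key=by_index) == sorted(expected, key=by_index)


# A direct comparison would fail here even though the rows are identical.
order_sensitive_equal = actual == expected  # False
assert_same_rows(actual, expected)          # passes once both sides are sorted
```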

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33274 from xinrong-databricks/datatypeops_testdiffframe.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 12:42:48 +09:00