Commit graph

2933 commits

Author SHA1 Message Date
Gengliang Wang 5d45a415f3 Preparing Spark release v3.2.0-rc7 2021-10-06 11:45:26 +00:00
Takuya UESHIN 04a19963e3 [SPARK-36907][PYTHON] Fix DataFrameGroupBy.apply without shortcut
### What changes were proposed in this pull request?

Fix `DataFrameGroupBy.apply` without shortcut.

pandas' `DataFrameGroupBy.apply` behaves inconsistently when the UDF returns a `Series`, depending on whether there is only one group or more. E.g.:

```py
>>> pdf = pd.DataFrame(
...      {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...      columns=["a", "b", "c"],
... )

>>> pdf.groupby('b').apply(lambda x: x['a'])
b
1  0    1
   1    2
2  2    3
3  3    4
5  4    5
8  5    6
Name: a, dtype: int64
>>> pdf[pdf['b'] == 1].groupby('b').apply(lambda x: x['a'])
a  0  1
b
1  1  2
```

If there is only one group, it returns a "wide" `DataFrame` instead of a `Series`.

In our non-shortcut path there is always only one group, because the UDF is run per group via `groupby-applyInPandas`, so we always get the "wide" `DataFrame` and have to convert it back to a `Series` ourselves.
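
A minimal sketch of that conversion in plain pandas (illustrative of the idea, not the actual patch):

```python
import pandas as pd

pdf = pd.DataFrame({"a": [1, 2], "b": [1, 1]})
# With a single group, apply() returns a "wide" DataFrame: one row per group
# key and one column per original position.
wide = pdf.groupby("b").apply(lambda x: x["a"])
# Stacking restores the Series layout that the multi-group path produces.
print(wide.stack())
```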

### Why are the changes needed?

`DataFrameGroupBy.apply` without shortcut could raise an exception when the UDF returns a `Series`.

```py
>>> ps.options.compute.shortcut_limit = 3
>>> psdf = ps.DataFrame(
...     {"a": [1, 2, 3, 4, 5, 6], "b": [1, 1, 2, 3, 5, 8], "c": [1, 4, 9, 16, 25, 36]},
...     columns=["a", "b", "c"],
... )
>>> psdf.groupby("b").apply(lambda x: x["a"])
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
...
ValueError: Length mismatch: Expected axis has 2 elements, new values have 3 elements
```

### Does this PR introduce _any_ user-facing change?

The error above will be gone:

```py
>>> psdf.groupby("b").apply(lambda x: x["a"])
b
1  0    1
   1    2
2  2    3
3  3    4
5  4    5
8  5    6
Name: a, dtype: int64
```

### How was this patch tested?

Added tests.

Closes #34160 from ueshin/issues/SPARK-36907/groupby-apply.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 38d39812c1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-10-03 12:25:27 +09:00
Kousuke Saruta 8b2b6bb0d3 [SPARK-36865][PYTHON][DOCS] Add PySpark API document of session_window
### What changes were proposed in this pull request?

This PR adds PySpark API document of `session_window`.
The docstring of the function doesn't comply with the numpydoc format, so this PR also fixes it.
Further, the API document of `window` doesn't have a `Parameters` section, so it's also added in this PR.
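
For reference, a minimal sketch of the numpydoc layout the docstring was aligned to (illustrative wording, not the actual docstring):

```python
def session_window(timeColumn, gapDuration):
    """Generates session windows given a timestamp-specifying column.

    Parameters
    ----------
    timeColumn : :class:`~pyspark.sql.Column` or str
        The column or the expression to use as the timestamp for windowing by session.
    gapDuration : :class:`~pyspark.sql.Column` or str
        A column or string specifying the session gap duration, e.g. `10 minutes`.
    """
    ...
```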

### Why are the changes needed?

To provide PySpark users with the API document of the newly added function.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran `make html` in `python/docs` and got the following docs.

[window]
![time-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963797-ce25b268-20ca-48e3-ac8d-cbcbd85ebb3e.png)

[session_window]
![session-window-python-doc-after](https://user-images.githubusercontent.com/4736016/134963853-dd9d8417-139b-41ee-9924-14544b1a91af.png)

Closes #34118 from sarutak/python-session-window-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 5a32e41e9c)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-30 16:51:27 +09:00
Gengliang Wang 4bd358474b Preparing development version 3.2.1-SNAPSHOT 2021-09-28 10:53:42 +00:00
Gengliang Wang dde73e2e1c Preparing Spark release v3.2.0-rc6 2021-09-28 10:53:35 +00:00
Gengliang Wang 0c57bb8f7f Preparing development version 3.2.1-SNAPSHOT 2021-09-27 08:24:50 +00:00
Gengliang Wang 49aea14c5a Preparing Spark release v3.2.0-rc5 2021-09-27 08:24:44 +00:00
Gengliang Wang 2348cce37e Preparing development version 3.2.1-SNAPSHOT 2021-09-26 12:28:46 +00:00
Gengliang Wang 2ed8c08c5b Preparing Spark release v3.2.0-rc5 2021-09-26 12:28:40 +00:00
Gengliang Wang da722d43cb Preparing development version 3.2.1-SNAPSHOT 2021-09-24 10:03:23 +00:00
Gengliang Wang 9e35703211 Preparing Spark release v3.2.0-rc5 2021-09-24 10:03:16 +00:00
Gengliang Wang 0fb7127f85 Preparing development version 3.2.1-SNAPSHOT 2021-09-23 08:46:28 +00:00
Gengliang Wang b609f2fe0c Preparing Spark release v3.2.0-rc4 2021-09-23 08:46:22 +00:00
Xinrong Meng 423cff4567 [SPARK-36818][PYTHON] Fix filtering a Series by a boolean Series
### What changes were proposed in this pull request?
Fix filtering a Series (without a name) by a boolean Series.

### Why are the changes needed?
A bugfix. The issue was raised as https://github.com/databricks/koalas/issues/2199.

### Does this PR introduce _any_ user-facing change?
Yes.

#### From
```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
Traceback (most recent call last):
...
KeyError: 'none key'

```

#### To
```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
0    0
1    1
2    2
dtype: int64
```

### How was this patch tested?
Unit test.

Closes #34061 from xinrong-databricks/filter_series.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 6a5ee0283c)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-22 12:53:06 -07:00
Xinrong Meng 4543ac62bc [SPARK-36771][PYTHON][3.2] Fix pop of Categorical Series
### What changes were proposed in this pull request?
Fix `pop` of Categorical Series to be consistent with the latest pandas (1.3.2) behavior.

This is a backport of https://github.com/apache/spark/pull/34052.

### Why are the changes needed?
As reported in https://github.com/databricks/koalas/issues/2198, pandas API on Spark behaves differently from pandas for `pop` on a Categorical Series.

### Does this PR introduce _any_ user-facing change?
Yes, results of `pop` of Categorical Series change.

#### From
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
0
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
0
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```

#### To
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
'a'
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
'a'
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

```

### How was this patch tested?
Unit tests.

Closes #34063 from xinrong-databricks/backport_cat_pop.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-21 19:16:27 -07:00
dgd-contributor 3d47c692d2 [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value
### What changes were proposed in this pull request?
Fix `DataFrame.isin` when the DataFrame has NaN values.

### Why are the changes needed?
`DataFrame.isin` returns `None` instead of `False` when the DataFrame has NaN values or the value list contains `None`, diverging from pandas:

``` python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]

>>> psdf.isin(other)
      a     b     c
0  None  None  True
1  True  None  None
2  None  None  True
3  None  None  None
4  None  True  True
5  None  True  True
6  None  None  True
7  None  None  None
8  None  None  None

>>> psdf.to_pandas().isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```
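
The `None`s come from SQL null semantics: `x IN (v1, ..., NULL)` evaluates to NULL both for NaN rows and for non-matching rows when the value list contains `None`. A minimal PySpark sketch of the pandas semantics (illustrative, not the actual patch) strips `None` from the values and coalesces the remaining nulls to `False`:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(None, 1), (2.0, 5), (4.0, 3)], ["a", "c"])
values = [v for v in [1, 2, None] if v is not None]
sdf.select(
    # isin() yields NULL for null inputs; pandas expects False instead.
    *[F.coalesce(F.col(c).isin(values), F.lit(False)).alias(c) for c in sdf.columns]
).show()
```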

### Does this PR introduce _any_ user-facing change?
Yes. After this PR:

``` python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]

>>> psdf.isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```

### How was this patch tested?
Unit tests

Closes #34040 from dgd-contributor/SPARK-36785_dataframe.isin_fix.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit cc182fe6f6)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-20 17:53:02 -07:00
Gengliang Wang b0249851f6 Preparing development version 3.2.1-SNAPSHOT 2021-09-18 11:30:12 +00:00
Gengliang Wang 96044e9735 Preparing Spark release v3.2.0-rc3 2021-09-18 11:30:06 +00:00
dgd-contributor 36ce9cce55 [SPARK-36762][PYTHON] Fix Series.isin when Series has NaN values
### What changes were proposed in this pull request?
Fix `Series.isin` when the Series has NaN values.

### Why are the changes needed?
`Series.isin` returns `None` (with object dtype) instead of `False` when the Series has NaN values or the value list contains `None`, diverging from pandas:
``` python
>>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
>>> psser = ps.from_pandas(pser)
>>> pser.isin([1, 3, 5, None])
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool
>>> psser.isin([1, 3, 5, None])
0    None
1    True
2    None
3    True
4    None
5    True
6    None
7    None
8    None
dtype: object
```

### Does this PR introduce _any_ user-facing change?
Yes. After this PR:
``` python
>>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
>>> psser = ps.from_pandas(pser)
>>> psser.isin([1, 3, 5, None])
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool

```

### How was this patch tested?
unit tests

Closes #34005 from dgd-contributor/SPARK-36762_fix_series.isin_when_values_have_NaN.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 32b8512912)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-17 17:48:27 -07:00
dgd-contributor 017bce7b11 [SPARK-36722][PYTHON] Fix Series.update with another in same frame
### What changes were proposed in this pull request?
Fix `Series.update` with another Series in the same frame.

Also adds a test for updating with a Series in a different frame.

### Why are the changes needed?
`Series.update` currently raises an `AssertionError` when the other Series is anchored to the same DataFrame, because `combine_frames` requires two distinct frames (see Before below), whereas pandas handles this case fine.

Pandas behavior:
``` python
>>> pdf = pd.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> pdf
     a    b
0  NaN  NaN
1  2.0  5.0
2  3.0  NaN
3  4.0  3.0
4  5.0  2.0
5  6.0  1.0
6  7.0  NaN
7  8.0  0.0
8  NaN  0.0
>>> pdf.a.update(pdf.b)
>>> pdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
```
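
The semantics being restored amount to a null-aware overwrite: non-NA values of the other Series replace the target's. A pandas-level sketch of the equivalence (illustrative, not the actual patch):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"a": [np.nan, 2.0, 3.0], "b": [np.nan, 5.0, np.nan]})
# Equivalent to pdf.a.update(pdf.b): take b where it is non-NA, else keep a.
pdf["a"] = pdf["b"].combine_first(pdf["a"])
print(pdf)
```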

### Does this PR introduce _any_ user-facing change?
Before
```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in update
    combined = combine_frames(self._psdf, other._psdf, how="leftouter")
  File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in combine_frames
    assert not same_anchor(
AssertionError: We don't need to combine. `this` and `that` are same.
>>>
```

After
```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
>>> psdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
>>>
```

### How was this patch tested?
unit tests

Closes #33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c15072cc73)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-15 11:08:12 -07:00
Leona Yoda b3488a50d7 [SPARK-36739][DOCS][PYTHON] Add apache license headers to makefiles
### What changes were proposed in this pull request?

Add apache license headers to makefiles of PySpark documents.

### Why are the changes needed?

The makefiles of the PySpark documentation do not have Apache license headers, while the other files do.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Closes #33979 from yoda-mon/add-license-header-makefiles.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a440025f08)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-14 09:16:39 +09:00
Xinrong Meng 88bba0c94b [SPARK-36697][PYTHON] Fix dropping all columns of a DataFrame
### What changes were proposed in this pull request?
Fix dropping all columns of a DataFrame

### Why are the changes needed?
When dropping all columns of a pandas-on-Spark DataFrame, a `ValueError` is raised, whereas pandas returns an empty DataFrame preserving the index. We should follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes.

From
```py
>>> psdf = ps.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
>>> psdf
   x  y  z
0  1  3  5
1  2  4  6

>>> psdf.drop(['x', 'y', 'z'])
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)

```
To
```py
>>> psdf = ps.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
>>> psdf
   x  y  z
0  1  3  5
1  2  4  6

>>> psdf.drop(['x', 'y', 'z'])
Empty DataFrame
Columns: []
Index: [0, 1]
```

### How was this patch tested?
Unit tests.

Closes #33938 from xinrong-databricks/frame_drop_col.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 33bb7b39e9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 09:59:50 +09:00
itholic 3d50760a3e [SPARK-36531][SPARK-36515][PYTHON] Improve test coverage for data_type_ops/* and groupby
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark data types & GroupBy code base, which is written in `data_type_ops/*.py` and `groupby.py` separately.

This PR did the following to improve coverage:
- Add unittest for untested code
- Fix unittest which is not tested properly
- Remove unused code

**NOTE**: This PR does not only include test-only updates; for example, it also fixes `astype` for binary ops.

For a pandas-on-Spark Series such as:
```python
>>> psser
0    [49]
1    [50]
2    [51]
dtype: object
```

before:
```python
>>> psser.astype(bool)
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'CAST(`0` AS BOOLEAN)' due to data type mismatch: cannot cast binary to boolean;
...
```

after:
```python
>>> psser.astype(bool)
0    True
1    True
2    True
dtype: bool
```
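
A hedged sketch of how Python truthiness for binary values could be expressed in Spark (hypothetical approach, not necessarily the actual patch): empty or missing bytes map to `False`, anything else to `True`.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame([(bytearray(b"1"),), (bytearray(b""),), (None,)], ["v"])
# length() returns the byte length of a binary column; coalesce handles NULLs.
sdf.select(F.coalesce(F.length("v") > 0, F.lit(False)).alias("as_bool")).show()
```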

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33850 from itholic/SPARK-36531.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 71dbd03fbe)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-08 10:23:06 +09:00
Cary Lee 11d10fc994 [SPARK-36617][PYTHON] Fix type hints for approxQuantile to support multi-column version
### What changes were proposed in this pull request?
Update both `DataFrame.approxQuantile` and `DataFrameStatFunctions.approxQuantile` to support overloaded definitions when multiple columns are supplied.
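
A hedged sketch of what such overloads look like (illustrative signatures; the actual stubs may differ):

```python
from typing import List, Tuple, Union, overload

class DataFrame:
    @overload
    def approxQuantile(self, col: str, probabilities: List[float],
                       relativeError: float) -> List[float]: ...
    @overload
    def approxQuantile(self, col: Union[List[str], Tuple[str, ...]],
                       probabilities: List[float],
                       relativeError: float) -> List[List[float]]: ...
    def approxQuantile(self, col, probabilities, relativeError):
        ...  # the real implementation lives in pyspark.sql.dataframe
```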

### Why are the changes needed?
The current type hints don't support the multi-column signature, a form that was added in Spark 2.2 (see [the approxQuantile docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.approxQuantile.html)). This change was also introduced to pyspark-stubs (https://github.com/zero323/pyspark-stubs/pull/552). zero323 asked me to open a PR for the upstream change.

### Does this PR introduce _any_ user-facing change?
This change only affects type hints - it brings the `approxQuantile` type hints up to date with the actual code.

### How was this patch tested?
Ran `./dev/lint-python`.

Closes #33880 from carylee/master.

Authored-by: Cary Lee <cary@amperity.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
(cherry picked from commit 37f5ab07fa)
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-09-02 15:03:08 +02:00
Leona Yoda b81e9741cd [SPARK-36621][PYTHON][DOCS] Add Apache license headers to Pandas API on Spark documents
### What changes were proposed in this pull request?

Add Apache license headers to Pandas API on Spark documents.

### Why are the changes needed?

Pandas API on Spark document sources do not have license headers, while the other docs do.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Closes #33871 from yoda-mon/add-license-header.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 77fdf5f0e4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-02 12:35:51 +09:00
Gengliang Wang 1bad04d028 Preparing development version 3.2.1-SNAPSHOT 2021-08-31 17:04:14 +00:00
Gengliang Wang 03f5d23e96 Preparing Spark release v3.2.0-rc2 2021-08-31 17:04:08 +00:00
itholic 396b76466b [SPARK-36388][SPARK-36386][PYTHON][FOLLOWUP] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
This PR is followup for https://github.com/apache/spark/pull/33646 to add missing tests.

Some tests are missing

No

Unittest

Closes #33776 from itholic/SPARK-36388-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c91ae544fd)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 20:46:54 +09:00
Huaxin Gao 786d773585 [SPARK-36578][ML] UnivariateFeatureSelector API doc improvement
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`

### Why are the changes needed?
Make the doc look better.

### Does this PR introduce _any_ user-facing change?
Yes, API doc change.

### How was this patch tested?
Manually checked

Closes #33855 from huaxingao/ml_doc.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
(cherry picked from commit 15e42b4442)
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-26 21:16:59 -07:00
Hyukjin Kwon 069f326e36 [MINOR] Address conflicts for SPARK-36367 cherry-pick 2021-08-27 10:24:18 +09:00
itholic 2dc15d9d84 [SPARK-36537][PYTHON] Revisit disabled tests for CategoricalDtype
This PR proposes to re-enable the tests that were disabled because of behavior differences in pandas 1.3.

- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3, and it appears to have a bug, so we manually construct the expected results and test against them (see the sketch after this list).
- Fixed `GroupBy.transform`, since it didn't work properly for `CategoricalDtype`.
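
A hedged sketch of the manually-constructed-expected-result pattern mentioned above (hypothetical test code, not the actual suite):

```python
import pandas as pd
from pandas.testing import assert_series_equal

# Instead of comparing against pandas output (buggy here), build the
# expected Series by hand and compare the pandas-on-Spark result to it.
expected = pd.Series(pd.Categorical(["a", "b", "c"], categories=["a", "b", "c"]))
actual = expected.copy()  # stand-in for psser.to_pandas() in a real test
assert_series_equal(actual, expected)
```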

We should enable the tests as much as possible even if pandas has a bug.

And we should follow the behavior of latest pandas.

Yes, `GroupBy.transform` now follow the behavior of latest pandas.

Unittests.

Closes #33817 from itholic/SPARK-36537.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fe486185c4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:00:23 +09:00
itholic 8829406366 [SPARK-36368][PYTHON] Fix CategoricalOps.astype to follow pandas 1.3
This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.

**Before:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```

**After:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']  # CategoricalDtype is not updated if dtype is the same
```

`CategoricalDtype` is treated as the same `dtype` if the sets of categories are the same.

```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```

We should follow the latest pandas as much as possible.

Yes, the behavior is changed as example in the PR description.

Unittest

Closes #33757 from itholic/SPARK-36368.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit f2e593bcf1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:00:12 +09:00
itholic 31557d4759 [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string
This PR proposes to fix `Series.astype` when converting datetime type to StringDtype, to match the behavior of pandas 1.3.

In pandas < 1.3,
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                    NaT
Name: datetime, dtype: string
```

This is changed to

```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                   <NA>
Name: datetime, dtype: string
```

in pandas >= 1.3, so we follow the behavior of latest pandas.

Because pandas-on-Spark always follows the behavior of the latest pandas.

Yes, the behavior is changed to match the latest pandas when converting datetime to nullable string (`StringDtype`).

Unittest passed

Closes #33735 from itholic/SPARK-36387.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit c0441bb7e8)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 10:00:01 +09:00
itholic 0fc8c393b4 [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow latest pandas behavior.

`RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in the values as of pandas 1.3.

Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       A    B
A
1 0  NaN  NaN
  1  2.0  1.0
2 2  NaN  NaN
3 3  NaN  NaN
```

After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       B
A
1 0  NaN
  1  1.0
2 2  NaN
3 3  NaN
```

We should follow the behavior of pandas as much as possible.

Yes, the result of `RollingGroupBy` and `ExpandingGroupBy` is changed as described above.

Unit tests.

Closes #33646 from itholic/SPARK-36388.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit b8508f4876)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 09:59:48 +09:00
itholic f2f09e4cdb [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
This PR proposes fixing `Index.union` to follow the behavior of pandas 1.3.

Before:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

After:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

This bug was fixed on the pandas side in https://github.com/pandas-dev/pandas/issues/36289.

We should follow the behavior of pandas as much as possible.

Yes, the results will change for some cases that have duplicate values.

Unit test.

Closes #33634 from itholic/SPARK-36369.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a9f371c247)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 09:59:32 +09:00
Takuya UESHIN cb075b5301 [SPARK-36345][SPARK-36367][INFRA][PYTHON] Disable tests failed by the incompatible behavior of pandas 1.3
Disable tests failed by the incompatible behavior of pandas 1.3.

Pandas 1.3 has been released.
There are some behavior changes that we should follow, but we're not ready yet.

No.

Disabled some tests related to the behavior change.

Closes #33598 from ueshin/issues/SPARK-36367/disable_tests.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8cb9cf39b6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-27 09:58:42 +09:00
Leona Yoda 36be232eea [SPARK-36541][DOCS][PYTHON] Replace the word Koalas to pandas-on-Spark
### What changes were proposed in this pull request?

Replace images in the pandas-on-Spark documentation because those images use the word Koalas.

### Why are the changes needed?

Images in the "Transform and apply a function" documentation still use the word Koalas, although the word was replaced with pandas-on-Spark by https://github.com/apache/spark/pull/32835.

We should make the images consistent with that wording.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Screen shots
![130179112-8485fdde-b422-4834-8b23-fe69e7402118](https://user-images.githubusercontent.com/14937752/130186051-d6ff65f0-c121-40bd-b4f1-2fbc10e76f3e.png)
![130179239-8dae7812-4d81-4f8c-8558-b75e4eae3787](https://user-images.githubusercontent.com/14937752/130186063-17d4a95f-0b9d-49d3-85c7-13ea07e4b6bb.png)
![130179273-10f9fbc3-0a62-4e1a-ab6e-7049d75653a1](https://user-images.githubusercontent.com/14937752/130186074-7d684669-b9ef-4a4e-8a2d-c63bb9800ddb.png)
![130179311-616545af-dde2-4dec-807f-dde0a0d4bfbe](https://user-images.githubusercontent.com/14937752/130186095-20669673-b1d3-4552-97bf-86bbc1a5d43b.png)
Environment
- Windows 10
- Google Chrome 92.0.4515.159

[images.pptx](https://github.com/apache/spark/files/7029087/images.pptx)

Closes #33786 from yoda-mon/replace-pyspark-doc-images.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit aeb3da2798)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 19:03:11 +09:00
itholic 0feb19c53a [SPARK-36505][PYTHON] Improve test coverage for frame.py
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark DataFrame code base, which is written in `frame.py`.

This PR did the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33833 from itholic/SPARK-36505.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 97e7d6e667)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 17:43:12 +09:00
Dongjoon Hyun d841679ecc [MINOR][3.2] Remove unused numpy import
### What changes were proposed in this pull request?

This fixes a Python linter failure.

### Why are the changes needed?

```
flake8 checks failed:
./python/pyspark/ml/tests/test_tuning.py:21:1: F401 'numpy as np' imported but unused
import numpy as np
F401 'numpy as np' imported but unused
Error: Process completed with exit code 1.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the GitHub Action Linter job.

Closes #33841 from dongjoon-hyun/unused_import.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 09:52:54 +09:00
Hyukjin Kwon 26ae9e93da [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage the optimizations of the SQL engine and avoid the unnecessary shuffle when creating the default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 93cec49212)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:03:00 +09:00
Gengliang Wang 5463caac0d Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843890 and 5a48eb8d00

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch-3.2 and fix it on the master branch later.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

CI tests

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit de932f51ce)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:39:29 -07:00
Xinrong Meng 56c211bd6a [SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndex.map
Implement `CategoricalIndex.map` and `DatetimeIndex.map`

`MultiIndex.map` cannot be implemented in the same way as the `map` of other indexes. It should be taken care of separately if necessary.

Mapping values using input correspondence is a common operation that is supported in pandas. We shall support that as well.

Yes. `CategoricalIndex.map` and `DatetimeIndex.map` can be used now.

- CategoricalIndex.map

```py
>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'],  categories=['A', 'B', 'C'], ordered=False, dtype='category')

>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
```

- DatetimeIndex.map

```py
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)

>>> mapper_dict = {
...   datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
...   datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)

>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)

>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
       'August 10, 2020, 12:00:00 AM'],
      dtype='object')
```

Unit tests.

Closes #33756 from xinrong-databricks/other_indexes_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 0b6af464dc)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 10:11:21 +09:00
Gengliang Wang 69be513c5e Preparing development version 3.2.1-SNAPSHOT 2021-08-20 12:40:47 +00:00
Gengliang Wang 6bb3523d8e Preparing Spark release v3.2.0-rc1 2021-08-20 12:40:40 +00:00
Gengliang Wang fafdc1482b Revert "Preparing Spark release v3.2.0-rc1"
This reverts commit 8e58fafb05.
2021-08-20 20:07:02 +08:00
Gengliang Wang c829ed53ff Revert "Preparing development version 3.2.1-SNAPSHOT"
This reverts commit 4f1d21571d.
2021-08-20 20:07:01 +08:00
Gengliang Wang 4f1d21571d Preparing development version 3.2.1-SNAPSHOT 2021-08-19 14:08:32 +00:00
Gengliang Wang 8e58fafb05 Preparing Spark release v3.2.0-rc1 2021-08-19 14:08:26 +00:00
Takuya UESHIN 528fca8944 [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of pkg_resources.parse_version
### What changes were proposed in this pull request?

This is a follow-up of #33687.

Use `LooseVersion` instead of `pkg_resources.parse_version`.

### Why are the changes needed?

In the previous PR, `pkg_resources.parse_version` was used, but we should use `LooseVersion` instead to be consistent in the code base.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33768 from ueshin/issues/SPARK-36370/LooseVersion.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 7fb8ea319e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-18 10:36:17 +09:00
Cedric-Magnan e15daa31b3 [SPARK-36370][PYTHON] _builtin_table directly imported from pandas instead of being redefined
### What changes were proposed in this pull request?
This PR suggests refactoring the way `_builtin_table` is defined in the `python/pyspark/pandas/groupby.py` module.
pandas has recently refactored `_builtin_table`: it is now part of the `pandas.core.common` module instead of being an attribute of the `pandas.core.base.SelectionMixin` class.
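
A hedged sketch of such a version-dependent import (illustrative; per the follow-up commit above, the version comparison ended up using `LooseVersion`):

```python
from distutils.version import LooseVersion

import pandas as pd

if LooseVersion(pd.__version__) >= LooseVersion("1.3.0"):
    # pandas >= 1.3 exposes the table in pandas.core.common.
    from pandas.core.common import _builtin_table  # type: ignore[attr-defined]
else:
    from pandas.core.base import SelectionMixin

    _builtin_table = SelectionMixin._builtin_table  # type: ignore[attr-defined]
```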

### Why are the changes needed?
This change is not strictly needed, but the current implementation redefines this table within PySpark, so any change to this table in the pandas library would also need to be replicated in the PySpark repository.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran the following command successfully:
```sh
python/run-tests --testnames 'pyspark.pandas.tests.test_groupby'
```
Tests passed in 327 seconds

Closes #33687 from Cedric-Magnan/_builtin_table_from_pandas.

Authored-by: Cedric-Magnan <cedric.magnan@artefact.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 964dfe254f)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-17 10:47:01 -07:00