Commit graph

2787 commits

Author SHA1 Message Date
Liang Zhang ef7441bb42 [SPARK-36642][SQL] Add df.withMetadata pyspark API
This PR adds the pyspark API `df.withMetadata(columnName, metadata)`. The Scala API was added in https://github.com/apache/spark/pull/33853.

### What changes were proposed in this pull request?

To make it easy to use/modify the semantic annotation, we want a shorter API to update the metadata in a DataFrame. Currently we have `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))` to update the metadata without changing the column name, which is too verbose. We want a syntax sugar API `df.withMetadata("col1", metadata=metadata)` to achieve the same functionality.

### Why are the changes needed?

A bit of background on how frequently the metadata is updated: we are working on inferring semantic data types, using them in AutoML, and storing the semantic annotation in the metadata. In many cases we will suggest that the user update the metadata to correct a wrong inference or add an annotation where the inference is weak.

### Does this PR introduce _any_ user-facing change?

Yes.
A syntax sugar API `df.withMetadata("col1", metadata=metadata)` achieves the same functionality as `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))`.
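
A minimal usage sketch of the new API (the column name and metadata dictionary here are illustrative):

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["col1"])

# existing, verbose way to attach metadata without renaming the column
verbose = df.withColumn("col1", col("col1").alias("col1", metadata={"semantic": "id"}))

# new shorthand added by this PR
short = df.withMetadata("col1", {"semantic": "id"})

print(short.schema["col1"].metadata)  # {'semantic': 'id'}
```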

### How was this patch tested?

doctest.

Closes #34021 from liangz1/withMetadataPython.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-09-24 08:30:18 +08:00
itholic 5268904742 [SPARK-36506][PYTHON] Improve test coverage for series.py and indexes/*.py
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark Series & Index code base, which is written in `series.py` and `indexes/*.py` separately.

This PR did the following to improve coverage:
- Add unittest for untested code
- Fix unittests that were not testing properly
- Remove unused code

**NOTE**: This PR does not only include test-only updates; for example, it also adds a new warning for `__xor__`, `__and__`, and `__or__`.

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33844 from itholic/SPARK-36506.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-23 14:23:52 +09:00
Hyukjin Kwon cc2fcb4794 [SPARK-36708][PYTHON] Support numpy.typing for annotating ArrayType in pandas API on Spark
### What changes were proposed in this pull request?

This PR adds support for understanding the `numpy.typing` package introduced in NumPy 1.21.

### Why are the changes needed?

For user-friendly return type specification in type hints for function apply APIs in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, this PR will enable users to specify return type as `numpy.typing.NDArray[...]` to internally specify pandas UDF's return type.

For example,

```python
import numpy as np
import numpy.typing as ntp
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame(
    {"a": [1, 2, 3, 4, 5, 6, 7, 8, 9], "b": [[e] for e in [4, 5, 6, 3, 2, 1, 0, 0, 0]]},
    index=np.random.rand(9),
)
psdf = ps.from_pandas(pdf)

def func(x) -> ps.DataFrame[float, [int, ntp.NDArray[int]]]:
    return x

psdf.pandas_on_spark.apply_batch(func)
```

### How was this patch tested?

Unittest and e2e tests were added.

Closes #34028 from HyukjinKwon/SPARK-36708.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-23 10:50:10 +09:00
Xinrong Meng 6a5ee0283c [SPARK-36818][PYTHON] Fix filtering a Series by a boolean Series
### What changes were proposed in this pull request?
Fix filtering a Series (without a name) by a boolean Series.

### Why are the changes needed?
A bugfix. The issue is raised as https://github.com/databricks/koalas/issues/2199.

### Does this PR introduce _any_ user-facing change?
Yes.

#### From
```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
Traceback (most recent call last):
...
KeyError: 'none key'

```

#### To
```py
>>> psser = ps.Series([0, 1, 2, 3, 4])
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> psser.loc[ps.Series([True, True, True, False, False])]
0    0
1    1
2    2
dtype: int64
```

### How was this patch tested?
Unit test.

Closes #34061 from xinrong-databricks/filter_series.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-22 12:52:56 -07:00
Xinrong Meng 079a9c5292 [SPARK-36771][PYTHON] Fix pop of Categorical Series
### What changes were proposed in this pull request?
Fix `pop` of Categorical Series to be consistent with the latest pandas (1.3.2) behavior.

### Why are the changes needed?
As reported in https://github.com/databricks/koalas/issues/2198, pandas API on Spark behaves differently from pandas for `pop` of a Categorical Series.

### Does this PR introduce _any_ user-facing change?
Yes, results of `pop` of Categorical Series change.

#### From
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
0
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
0
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']
```

#### To
```py
>>> psser = ps.Series(["a", "b", "c", "a"], dtype="category")
>>> psser
0    a
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(0)
'a'
>>> psser
1    b
2    c
3    a
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> psser.pop(3)
'a'
>>> psser
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

```

### How was this patch tested?
Unit tests.

Closes #34052 from xinrong-databricks/cat_pop.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-21 14:11:21 -07:00
Xinrong Meng 33e463ccf9 [SPARK-36769][PYTHON] Improve filter of single-indexed DataFrame
### What changes were proposed in this pull request?
Improve `filter` of single-indexed DataFrame by replacing a long Project with Filter or Join.

### Why are the changes needed?
When the given `items` has too many elements, a long Project is introduced.
We can replace it with `Column.isin` or a join, depending on the length of `items`, for better performance.
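
A rough sketch of the dispatch idea described above (the helper name, threshold, and column handling are illustrative, not the PR's actual code):

```python
from pyspark.sql import SparkSession, functions as F

ISIN_LIMIT = 80  # assumed threshold; mirrors the 'compute.isin_limit' option discussed elsewhere in this log

def filter_by_items(spark: SparkSession, sdf, column: str, items: list):
    if len(items) <= ISIN_LIMIT:
        # short lists: a single Column.isin predicate is cheap
        return sdf.where(F.col(column).isin(items))
    # long lists: join against a small broadcast DataFrame of the wanted values
    items_df = spark.createDataFrame([(v,) for v in items], [column])
    return sdf.join(F.broadcast(items_df), on=column, how="inner")
```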

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33998 from xinrong-databricks/impr_filter.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-21 10:20:15 -07:00
dgd-contributor cc182fe6f6 [SPARK-36785][PYTHON] Fix DataFrame.isin when DataFrame has NaN value
### What changes were proposed in this pull request?
Fix DataFrame.isin when DataFrame has NaN value

### Why are the changes needed?
Fix DataFrame.isin when DataFrame has NaN value

``` python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]

>>> psdf.isin(other)
      a     b     c
0  None  None  True
1  True  None  None
2  None  None  True
3  None  None  None
4  None  True  True
5  None  True  True
6  None  None  True
7  None  None  None
8  None  None  None

>>> psdf.to_pandas().isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```

### Does this PR introduce _any_ user-facing change?
After this PR

``` python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0], "c": [1, 5, 1, 3, 2, 1, 1, 0, 0]},
... )
>>> psdf
     a    b  c
0  NaN  NaN  1
1  2.0  5.0  5
2  3.0  NaN  1
3  4.0  3.0  3
4  5.0  2.0  2
5  6.0  1.0  1
6  7.0  NaN  1
7  8.0  0.0  0
8  NaN  0.0  0
>>> other = [1, 2, None]

>>> psdf.isin(other)
       a      b      c
0  False  False   True
1   True  False  False
2  False  False   True
3  False  False  False
4  False   True   True
5  False   True   True
6  False  False   True
7  False  False  False
8  False  False  False
```

### How was this patch tested?
Unit tests

Closes #34040 from dgd-contributor/SPARK-36785_dataframe.isin_fix.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-20 17:52:51 -07:00
Xinrong Meng 4b61c623b5 [SPARK-36746][PYTHON] Refactor _select_rows_by_iterable in iLocIndexer to use Column.isin
### What changes were proposed in this pull request?
Refactor `_select_rows_by_iterable` in `iLocIndexer` to use `Column.isin`.

### Why are the changes needed?
For better performance.

After a rough benchmark, a long projection performs worse than `Column.isin`, even when the length of the filtering conditions exceeds `compute.isin_limit`.

So we use `Column.isin` instead.
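
For context, a small usage sketch of the code path this touches (the data and positions are illustrative):

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [10, 20, 30, 40, 50]})

# selecting rows by an iterable of positions; internally this now builds a Column.isin filter
psdf.iloc[[0, 2, 4]]
```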

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #33964 from xinrong-databricks/iloc_select.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-20 15:00:10 -07:00
Xinrong Meng 4cf86d33ad [SPARK-36618][PYTHON] Support dropping rows of a single-indexed DataFrame
### What changes were proposed in this pull request?
Support dropping rows of a single-indexed DataFrame.

Dropping rows and columns at the same time is supported in this PR  as well.

### Why are the changes needed?
To increase pandas API coverage.

### Does this PR introduce _any_ user-facing change?
Yes, dropping rows of a single-indexed DataFrame is supported now.

```py
>>> df = ps.DataFrame(np.arange(12).reshape(3, 4), columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11
```
#### From
```py
>>> df.drop([0, 1])
Traceback (most recent call last):
...
KeyError: [(0,), (1,)]

>>> df.drop([0, 1], axis=0)
Traceback (most recent call last):
...
NotImplementedError: Drop currently only works for axis=1

>>> df.drop(1)
Traceback (most recent call last):
...
KeyError: [(1,)]

>>> df.drop(index=1)
Traceback (most recent call last):
...
TypeError: drop() got an unexpected keyword argument 'index'

>>> df.drop(index=[0, 1], columns='A')
Traceback (most recent call last):
...
TypeError: drop() got an unexpected keyword argument 'index'
```
#### To
```py
>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

>>> df.drop([0, 1], axis=0)
   A  B   C   D
2  8  9  10  11

>>> df.drop(1)
   A  B   C   D
0  0  1   2   3
2  8  9  10  11

>>> df.drop(index=1)
   A  B   C   D
0  0  1   2   3
2  8  9  10  11

>>> df.drop(index=[0, 1], columns='A')
   B   C   D
2  9  10  11
```

### How was this patch tested?
Unit tests.

Closes #33929 from xinrong-databricks/frame_drop.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-20 14:50:50 -07:00
Yuto Akutsu 30d17b6333 [SPARK-36683][SQL] Add new built-in SQL functions: SEC and CSC
### What changes were proposed in this pull request?

Add new built-in SQL functions: secant and cosecant, and add them as Scala and Python functions.

### Why are the changes needed?

Cotangent is already supported in Spark SQL, but secant and cosecant are missing, though I believe they would be used as much as cot.
Related Links: [SPARK-20751](https://github.com/apache/spark/pull/17999) [SPARK-36660](https://github.com/apache/spark/pull/33906)

### Does this PR introduce _any_ user-facing change?

Yes, users can now use these functions.
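
A minimal sketch of calling the new functions, assuming they are exposed as SQL `SEC`/`CSC` and as `pyspark.sql.functions.sec`/`csc` (the input column is illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.range(1, 4).withColumn("x", F.col("id") * 0.5)

df.selectExpr("SEC(x)", "CSC(x)").show()   # new SQL built-ins
df.select(F.sec("x"), F.csc("x")).show()   # new Scala/Python wrappers
```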

### How was this patch tested?

Unit tests

Closes #33988 from yutoacts/SPARK-36683.

Authored-by: Yuto Akutsu <yuto.akutsu@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-20 22:38:47 +09:00
Hyukjin Kwon 8d8b4aabc3 [SPARK-36710][PYTHON] Support new typing syntax in function apply APIs in pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes to support the new syntax introduced in https://github.com/apache/spark/pull/33954 in the function apply APIs. Namely, users now can specify the index type and name as below:

```python
import pandas as pd
import pyspark.pandas as ps
def transform(pdf) -> pd.DataFrame[int, [int, int]]:
    pdf['A'] = pdf.id + 1
    return pdf

ps.range(5).koalas.apply_batch(transform)
```

```
   c0  c1
0   0   1
1   1   2
2   2   3
3   3   4
4   4   5
```

```python
import pandas as pd
import pyspark.pandas as ps
def transform(pdf) -> pd.DataFrame[("index", int), [("a", int), ("b", int)]]:
    pdf['A'] = pdf.id * pdf.id
    return pdf

ps.range(5).koalas.apply_batch(transform)
```

```
       a   b
index
0      0   0
1      1   1
2      2   4
3      3   9
4      4  16
```

Again, this syntax remains experimental and is a non-standard approach outside of standard Python typing. We should migrate to proper typing once pandas supports it, as NumPy did with `numpy.typing`.

### Why are the changes needed?

The rationale is described in https://github.com/apache/spark/pull/33954: to avoid unnecessary computation for the default index or schema inference.

### Does this PR introduce _any_ user-facing change?

Yes, this PR affects the following APIs:

- `DataFrame.apply(..., axis=1)`
- `DataFrame.groupby.apply(...)`
- `DataFrame.pandas_on_spark.transform_batch(...)`
- `DataFrame.pandas_on_spark.apply_batch(...)`

Now they can specify the index type with the new syntax below:

```
DataFrame[index_type, [type, ...]]
DataFrame[(index_name, index_type), [(name, type), ...]]
DataFrame[dtype instance, dtypes instance]
DataFrame[(index_name, index_type), zip(names, types)]
```

### How was this patch tested?

Manually tested, and unittests were added.

Closes #34007 from HyukjinKwon/SPARK-36710.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-20 10:37:06 +09:00
dgd-contributor 32b8512912 [SPARK-36762][PYTHON] Fix Series.isin when Series has NaN values
### What changes were proposed in this pull request?
Fix Series.isin when Series has NaN values

### Why are the changes needed?
Fix Series.isin when Series has NaN values
``` python
>>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
>>> psser = ps.from_pandas(pser)
>>> pser.isin([1, 3, 5, None])
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool
>>> psser.isin([1, 3, 5, None])
0    None
1    True
2    None
3    True
4    None
5    True
6    None
7    None
8    None
dtype: object
```

### Does this PR introduce _any_ user-facing change?
After this PR
``` python
>>> pser = pd.Series([None, 5, None, 3, 2, 1, None, 0, 0])
>>> psser = ps.from_pandas(pser)
>>> psser.isin([1, 3, 5, None])
0    False
1     True
2    False
3     True
4    False
5     True
6    False
7    False
8    False
dtype: bool

```

### How was this patch tested?
unit tests

Closes #34005 from dgd-contributor/SPARK-36762_fix_series.isin_when_values_have_NaN.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-17 17:48:15 -07:00
dgd-contributor 8f895e9e96 [SPARK-36779][PYTHON] Fix when list of data type tuples has len = 1
### What changes were proposed in this pull request?

Fix when list of data type tuples has len = 1

### Why are the changes needed?
Fix when list of data type tuples has len = 1

``` python
>>> ps.DataFrame[("a", int), [int]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, int]

>>> ps.DataFrame[("a", int), [("b", int)]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dgd/spark/python/pyspark/pandas/frame.py", line 11998, in __class_getitem__
    return create_tuple_for_frame_type(params)
  File "/Users/dgd/spark/python/pyspark/pandas/typedef/typehints.py", line 685, in create_tuple_for_frame_type
    return Tuple[extract_types(params)]
  File "/Users/dgd/spark/python/pyspark/pandas/typedef/typehints.py", line 755, in extract_types
    return (index_type,) + extract_types(data_types)
  File "/Users/dgd/spark/python/pyspark/pandas/typedef/typehints.py", line 770, in extract_types
    raise TypeError(
TypeError: Type hints should be specified as one of:
  - DataFrame[type, type, ...]
  - DataFrame[name: type, name: type, ...]
  - DataFrame[dtypes instance]
  - DataFrame[zip(names, types)]
  - DataFrame[index_type, [type, ...]]
  - DataFrame[(index_name, index_type), [(name, type), ...]]
  - DataFrame[dtype instance, dtypes instance]
  - DataFrame[(index_name, index_type), zip(names, types)]
However, got [('b', <class 'int'>)].
```

### Does this PR introduce _any_ user-facing change?

After:
``` python
>>> ps.DataFrame[("a", int), [("b", int)]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType]

```

### How was this patch tested?
exist test

Closes #34019 from dgd-contributor/fix_when_list_of_tuple_data_type_have_len=1.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-17 09:27:46 +09:00
dgd-contributor c15072cc73 [SPARK-36722][PYTHON] Fix Series.update with another in same frame
### What changes were proposed in this pull request?
Fix Series.update with another in same frame

also add test for update series in diff frame

### Why are the changes needed?
Fix Series.update with another in same frame

Pandas behavior:
``` python
>>> pdf = pd.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> pdf
     a    b
0  NaN  NaN
1  2.0  5.0
2  3.0  NaN
3  4.0  3.0
4  5.0  2.0
5  6.0  1.0
6  7.0  NaN
7  8.0  0.0
8  NaN  0.0
>>> pdf.a.update(pdf.b)
>>> pdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
```

### Does this PR introduce _any_ user-facing change?
Before
```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in update
    combined = combine_frames(self._psdf, other._psdf, how="leftouter")
  File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in combine_frames
    assert not same_anchor(
AssertionError: We don't need to combine. `this` and `that` are same.
>>>
```

After
```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
>>> psdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
>>>
```

### How was this patch tested?
unit tests

Closes #33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-15 11:08:01 -07:00
Leona Yoda 0666f5c003 [SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R
### What changes were proposed in this pull request?

octet_length: calculates the byte length of strings.
bit_length: calculates the bit length of strings.
These two string-related functions are implemented only in Spark SQL, not in the Scala, Python, and R APIs.

### Why are the changes needed?

These functions would be useful for users handling multi-byte characters, who mainly work with Scala, Python, or R.

### Does this PR introduce _any_ user-facing change?

Yes. Users can call the octet_length/bit_length APIs from Scala (DataFrame), Python, and R.
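
A minimal sketch of the new wrappers, assuming they are exposed as `pyspark.sql.functions.octet_length`/`bit_length` (the sample strings are illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("abc",), ("あいう",)], ["s"])

# octet_length counts UTF-8 bytes, bit_length counts bits: "あいう" -> 9 bytes / 72 bits
df.select(F.octet_length("s"), F.bit_length("s")).show()
```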

### How was this patch tested?

unit tests

Closes #33992 from yoda-mon/add-bit-octet-length.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-15 16:27:13 +09:00
Hyukjin Kwon 0aaf86b520 [SPARK-36709][PYTHON] Support new syntax for specifying index type and name in pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes new syntax to specify the index type and name in pandas API on Spark. This is a base work for SPARK-36707.

More specifically, users now can use the type hints when typing as below:

```
pd.DataFrame[int, [int, int]]
pd.DataFrame[pdf.index.dtype, pdf.dtypes]
pd.DataFrame[("index", int), [("id", int), ("A", int)]]
pd.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
```

Note that the types of `[("id", int), ("A", int)]` or  `("index", int)` are matched to how you provide a compound NumPy type (see also https://numpy.org/doc/stable/user/basics.rec.html#introduction).

Therefore, the syntax will be:

**Without index:**

```
pd.DataFrame[type, type, ...]
pd.DataFrame[name: type, name: type, ...]
pd.DataFrame[dtypes instance]
pd.DataFrame[zip(names, types)]
```

(New) **With index:**

```
pd.DataFrame[index_type, [type, ...]]
pd.DataFrame[(index_name, index_type), [(name, type), ...]]
pd.DataFrame[dtype instance, dtypes instance]
pd.DataFrame[(index_name, index_type), zip(names, types)]
```

### Why are the changes needed?

Currently, there is no way to specify the type hint for the index type - the type hints are converted to the return type of pandas UDFs internally. Therefore, we always attach a default index, which degrades performance:

```python
>>> def transform(pdf) -> pd.DataFrame[int, int]:
...     pdf['A'] = pdf.id + 1
...     return pdf
...
>>> ks.range(5).koalas.apply_batch(transform)
```

```
   c0  c1
0   0   1
1   1   2
2   2   3
3   3   4
4   4   5
```

The [default index](https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type) (for the first column that looks unnamed) is attached when the type hint is specified. For better performance, we should have a way to work around this; see also https://github.com/apache/spark/pull/33954#issuecomment-917742920 and [Specify the index column in conversion from Spark DataFrame to Koalas DataFrame](https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#specify-the-index-column-in-conversion-from-spark-dataframe-to-koalas-dataframe).

Note that this still remains experimental because Python itself doesn't yet support this kind of typing out of the box. Once pandas completes typing support like NumPy did with `numpy.typing`, we should implement a Koalas typing package and migrate to it, leveraging pandas' typing approach.

### Does this PR introduce _any_ user-facing change?

No, this PR does not yet affect any user-facing behavior in theory.

### How was this patch tested?

Unittests were added.

Closes #33954 from HyukjinKwon/SPARK-36709.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-15 10:13:33 +09:00
Xinrong Meng 1ed671cec6 [SPARK-36748][PYTHON] Introduce the 'compute.isin_limit' option
### What changes were proposed in this pull request?
Introduce the 'compute.isin_limit' option, with the default value of 80.

### Why are the changes needed?
`Column.isin(list)` doesn't perform well when the given `list` is large, as described in https://issues.apache.org/jira/browse/SPARK-33383.
Thus, 'compute.isin_limit' is introduced to constrain the usage of `Column.isin(list)` in the code base.
If the length of the `list` is above `'compute.isin_limit'`, a broadcast join is used instead for better performance.

#### Why is the default value 80?
After reproducing the benchmark mentioned in https://issues.apache.org/jira/browse/SPARK-33383,

| length of filtering list | isin time /ms| broadcast DF time / ms|
| :---:   | :-: | :-: |
| 200 | 69411 | 39296 |
| 100 | 43074 | 40087 |
| 80 | 35592 | 40350 |
| 50 | 28134 | 37847 |

We can see that when the length of the filtering list is <= 80, the `isin` approach performs better than the `broadcast DF` approach.

### Does this PR introduce _any_ user-facing change?
Users may read/write the value of `'compute.isin_limit'` as follows
```py
>>> ps.get_option('compute.isin_limit')
80

>>> ps.set_option('compute.isin_limit', 10)
>>> ps.get_option('compute.isin_limit')
10

>>> ps.set_option('compute.isin_limit', -1)
...
ValueError: 'compute.isin_limit' should be greater than or equal to 0.

>>> ps.reset_option('compute.isin_limit')
>>> ps.get_option('compute.isin_limit')
80
```

### How was this patch tested?
Manual test.

Closes #33982 from xinrong-databricks/new_option.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-14 11:37:35 +09:00
dgd-contributor f8657d1924 [SPARK-36653][PYTHON] Implement Series.__xor__ and Series.__rxor__
### What changes were proposed in this pull request?
Implement Series.\_\_xor__ and Series.\_\_rxor__

### Why are the changes needed?
Follow pandas

### Does this PR introduce _any_ user-facing change?
Yes, user can use
``` python
psdf = ps.DataFrame([[11, 11], [1, 2]])
psdf[0] ^ psdf[1]
```

### How was this patch tested?
unit tests

Closes #33911 from dgd-contributor/SPARK-36653_Implement_Series._xor_.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-13 15:09:22 -07:00
itholic 0227793f9e [SPARK-36689][PYTHON] Cleanup the deprecated APIs and raise proper warning message
### What changes were proposed in this pull request?

This PR proposes cleaning up the deprecated APIs in `missing/*.py` and raising a proper warning message for the deprecated APIs, as pandas does.

It also removes the checks for pandas < 1.0, since we now only focus on following the behavior of the latest pandas.

### Why are the changes needed?

We should follow the deprecation of APIs of latest pandas.

### Does this PR introduce _any_ user-facing change?

Now these APIs raise a proper message pointing to the alternatives for deprecated functions, as pandas does.

### How was this patch tested?

Ran `dev/lint-python` and manually check the pandas API documents one by one.

Closes #33931 from itholic/SPARK-36689.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 19:50:29 +09:00
Xinrong Meng 33bb7b39e9 [SPARK-36697][PYTHON] Fix dropping all columns of a DataFrame
### What changes were proposed in this pull request?
Fix dropping all columns of a DataFrame

### Why are the changes needed?
When dropping all columns of a pandas-on-Spark DataFrame, a ValueError is raised,
whereas pandas returns an empty DataFrame preserving the index.
We should follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes.

From
```py
>>> psdf = ps.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
>>> psdf
   x  y  z
0  1  3  5
1  2  4  6

>>> psdf.drop(['x', 'y', 'z'])
Traceback (most recent call last):
...
ValueError: not enough values to unpack (expected 2, got 0)

```
To
```py
>>> psdf = ps.DataFrame({"x": [1, 2], "y": [3, 4], "z": [5, 6]})
>>> psdf
   x  y  z
0  1  3  5
1  2  4  6

>>> psdf.drop(['x', 'y', 'z'])
Empty DataFrame
Columns: []
Index: [0, 1]
```

### How was this patch tested?
Unit tests.

Closes #33938 from xinrong-databricks/frame_drop_col.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 09:59:42 +09:00
Hyukjin Kwon 34f80ef313 [SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark
### What changes were proposed in this pull request?

This PR adds:
- the support of `TimestampNTZType` in pandas API on Spark.
- the support of Py4J handling of `spark.sql.timestampType` configuration

### Why are the changes needed?

To complete `TimestampNTZ` support.

In more details:

- ([#33876](https://github.com/apache/spark/pull/33876)) For `TimestampNTZType` in Spark SQL at the PySpark level, we can successfully ser/de `TimestampNTZType` instances to naive `datetime` (see also https://docs.python.org/3/library/datetime.html#aware-and-naive-objects). How to interpret a naive `datetime` is up to the program, e.g. whether to treat it as local time or UTC, although some Python built-in APIs assume local time in general (see also https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp):

    > Because naive datetime objects are treated by many datetime methods as local times ...

  semantically it is legitimate to assume:
    - that a naive `datetime` is mapped to `TimestampNTZType` (unknown timezone).
    - if you want to handle it as if it were in the local timezone, that interpretation matches `TimestampType` (local time).

- ([#33875](https://github.com/apache/spark/pull/33875)) For `TimestampNTZType` in Arrow, they provide the same semantic (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs#L240-L278):
    - `Timestamp(..., timezone=sparkLocalTimezone)` ->  `TimestampType`
    - `Timestamp(..., timezone=null)` ->  `TimestampNTZType`

- (this PR) For `TimestampNTZType` in pandas API on Spark, it follows Python side in general - pandas implements APIs based on the assumption of time (e.g., naive `datetime` is a local time or a UTC time).

    One example is that pandas allows converting these naive `datetime` values as if they were in UTC by default:

    ```python
    >>> pd.Series(datetime.datetime(1970, 1, 1)).astype("int")
    0    0
    ```

    whereas in Spark:

    ```python
    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +------+
    |    _1|
    +------+
    |-32400|
    +------+

    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +---+
    | _1|
    +---+
    |  0|
    +---+
    ```

    In contrast, some APIs like `pandas.Timestamp.fromtimestamp` assume they are local times:

    ```python
    >>> pd.Timestamp.fromtimestamp(pd.Series(datetime(1970, 1, 1, 0, 0, 0)).astype("int").iloc[0])
    Timestamp('1970-01-01 09:00:00')
    ```

    For native Python, users can decide how to interpret a naive `datetime`, so it's fine. The problem is that the pandas-API-on-Spark case would require two implementations of the same pandas behavior, for `TimestampType` and `TimestampNTZType` respectively, which might be non-trivial overhead and work.

    As far as I know, pandas API on Spark has not yet implemented such ambiguous APIs so they are left as future work.

### Does this PR introduce _any_ user-facing change?

Yes, now pandas API on Spark can handle `TimestampNTZType`.

```python
import datetime
spark.createDataFrame([(datetime.datetime.now(),)], schema="dt timestamp_ntz").to_pandas_on_spark()
```

```
                          dt
0 2021-08-31 19:58:55.024410
```

This PR also adds the support of Py4J handling with `spark.sql.timestampType` configuration:

```python
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP '2021-09-03 19:34:03.949998''>
```
```python
>>> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP_NTZ '2021-09-03 19:34:24.864632''>
```

### How was this patch tested?

Unittests were added.

Closes #33877 from HyukjinKwon/SPARK-36625.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 09:57:38 +09:00
itholic 71dbd03fbe [SPARK-36531][SPARK-36515][PYTHON] Improve test coverage for data_type_ops/* and groupby
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark data types & GroupBy code base, which is written in `data_type_ops/*.py` and `groupby.py` separately.

This PR did the following to improve coverage:
- Add unittest for untested code
- Fix unittests that were not testing properly
- Remove unused code

**NOTE**: This PR does not only include test-only updates; for example, it includes fixing `astype` for binary ops.

Given the following pandas-on-Spark Series:
```python
>>> psser
0    [49]
1    [50]
2    [51]
dtype: object
```

before:
```python
>>> psser.astype(bool)
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve 'CAST(`0` AS BOOLEAN)' due to data type mismatch: cannot cast binary to boolean;
...
```

after:
```python
>>> psser.astype(bool)
0    True
1    True
2    True
dtype: bool
```

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33850 from itholic/SPARK-36531.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-08 10:22:52 +09:00
Hyukjin Kwon c6f3a13087 [SPARK-36626][PYTHON][FOLLOW-UP] Use datetime.tzinfo instead of datetime.tzname()
### What changes were proposed in this pull request?

This PR is a small followup of https://github.com/apache/spark/pull/33876, which proposes to use `datetime.tzinfo` instead of `datetime.tzname` to check whether timezone information is provided or not.

This way is consistent with other places such as:

9c5bcac61e/python/pyspark/sql/types.py (L182)

9c5bcac61e/python/pyspark/sql/types.py (L1662)

### Why are the changes needed?

In some cases, `datetime.tzname` can raise an exception (https://docs.python.org/3/library/datetime.html#datetime.datetime.tzname):

> ... raises an exception if the latter doesn’t return None or a string object,

I was able to reproduce this in Jenkins with setting `spark.sql.timestampType` to `TIMESTAMP_NTZ` by default:

```
======================================================================
ERROR: test_time_with_timezone (pyspark.sql.tests.test_serde.SerdeTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder/python/pyspark/sql/tests/test_serde.py", line 92, in test_time_with_timezone
...
  File "/usr/lib/pypy3/lib-python/3/datetime.py", line 979, in tzname
    raise NotImplementedError("tzinfo subclass must override tzname()")
NotImplementedError: tzinfo subclass must override tzname()
```

### Does this PR introduce _any_ user-facing change?

No to end users, because it has not been released yet.
This is rather a safeguard to prevent potential breakage.

### How was this patch tested?

Manually tested.

Closes #33918 from HyukjinKwon/SPARK-36626-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-06 17:16:52 +02:00
Yuto Akutsu db95960f4b [SPARK-36660][SQL] Add cot as Scala and Python functions
### What changes were proposed in this pull request?

Add cotangent support by Dataframe operations (e.g. `df.select(cot($"col"))`).

### Why are the changes needed?

Cotangent has been supported by Spark SQL since 2.3.0, but it cannot be called via DataFrame operations.

### Does this PR introduce _any_ user-facing change?

Yes, users can now call the cotangent function by Dataframe operations.
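
A minimal sketch of the Python side, assuming the wrapper is exposed as `pyspark.sql.functions.cot` (the input column is illustrative):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# cot(x) = cos(x) / sin(x)
spark.range(1, 4).select(F.cot(F.col("id").cast("double"))).show()
```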

### How was this patch tested?

unit tests.

Closes #33906 from yutoacts/SPARK-36660.

Lead-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-06 13:38:18 +09:00
Xinrong Meng 019203caec [SPARK-36655][PYTHON] Add versionadded for API added in Spark 3.3.0
### What changes were proposed in this pull request?

Add `versionadded` for API added in Spark 3.3.0: DataFrame.combine_first.

### Why are the changes needed?
That documents the version of Spark which added the described API.

### Does this PR introduce _any_ user-facing change?
No user-facing behavior change. Only the documentation of the affected API now shows when it was introduced.

### How was this patch tested?
Manual test.

Closes #33901 from xinrong-databricks/version.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-03 11:22:53 -07:00
dgd-contributor 9b262e722d [SPARK-36401][PYTHON] Implement Series.cov
### What changes were proposed in this pull request?

Implement Series.cov

### Why are the changes needed?

That is supported in pandas. We should support that as well.

### Does this PR introduce _any_ user-facing change?

Yes. Series.cov can be used.

```python
>>> from pyspark.pandas.config import set_option, reset_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> s1 = ps.Series([0.90010907, 0.13484424, 0.62036035])
>>> s2 = ps.Series([0.12528585, 0.26962463, 0.51111198])
>>> s1.cov(s2)
-0.016857626527158744
>>> reset_option("compute.ops_on_diff_frames")
```

### How was this patch tested?

Unit tests

Closes #33752 from dgd-contributor/SPARK-36401_Implement_Series.cov.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-03 10:41:27 -07:00
itholic 61b3223f47 [SPARK-36609][PYTHON] Add errors argument for ps.to_numeric
### What changes were proposed in this pull request?

This PR proposes to support `errors` argument for `ps.to_numeric` such as pandas does.

Note that we don't support `errors='ignore'` when the `arg` is a pandas-on-Spark Series for now.

### Why are the changes needed?

We should match the behavior to pandas' as much as possible.

Also in the [recent blog post](https://medium.com/chuck.connell.3/pandas-on-databricks-via-koalas-a-review-9876b0a92541), the author pointed out we're missing this feature.

It seems like the kind of feature that is commonly used in data science.

### Does this PR introduce _any_ user-facing change?

Now the `errors` argument is available for `ps.to_numeric`.
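
A minimal sketch of the new argument (the values are illustrative; the semantics mirror `pandas.to_numeric`):

```python
import pyspark.pandas as ps

psser = ps.Series(["1", "2", "not_a_number"])

# invalid parsing becomes NaN with errors="coerce"; errors="raise" keeps the failing behavior
ps.to_numeric(psser, errors="coerce")
# errors="ignore" is not supported when the input is a pandas-on-Spark Series (per this PR)
```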

### How was this patch tested?

Unittests.

Closes #33882 from itholic/SPARK-36609.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-03 21:33:30 +09:00
Cary Lee 37f5ab07fa [SPARK-36617][PYTHON] Fix type hints for approxQuantile to support multi-column version
### What changes were proposed in this pull request?
Update both `DataFrame.approxQuantile` and `DataFrameStatFunctions.approxQuantile` to support overloaded definitions when multiple columns are supplied.

### Why are the changes needed?
The current type hints don't support the multi-column signature, a form that was added in Spark 2.2 (see [the approxQuantile docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.approxQuantile.html).) This change was also introduced to pyspark-stubs (https://github.com/zero323/pyspark-stubs/pull/552). zero323 asked me to open a PR for the upstream change.

### Does this PR introduce _any_ user-facing change?
This change only affects type hints - it brings the `approxQuantile` type hints up to date with the actual code.
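
For reference, a small sketch of the multi-column call whose type hints this fixes (the data is illustrative):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, 10.0), (2, 20.0), (3, 30.0)], ["a", "b"])

# one list of quantiles per requested column, e.g. [[2.0], [20.0]]
quantiles = df.approxQuantile(["a", "b"], [0.5], 0.0)
```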

### How was this patch tested?
Ran `./dev/lint-python`.

Closes #33880 from carylee/master.

Authored-by: Cary Lee <cary@amperity.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-09-02 15:02:40 +02:00
Hyukjin Kwon 9c5bcac61e [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
### What changes were proposed in this pull request?

This PR proposes to implement `TimestampNTZType` support in PySpark's `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.

### Why are the changes needed?

To complete `TimestampNTZType` support.

### Does this PR introduce _any_ user-facing change?

Yes.

- Users now can use `TimestampNTZType` type in `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.

- If `spark.sql.timestampType` is configured to `TIMESTAMP_NTZ`, PySpark will infer the `datetime` without timezone as `TimestampNTZType`. If it has a timezone, it will be inferred as `TimestampType` in `SparkSession.createDataFrame`.

    - If `TimestampType` and `TimestampNTZType` conflict during merging inferred schema, `TimestampType` has a higher precedence.

- If the type is `TimestampNTZType`, it is treated internally as having an unknown timezone, computed with UTC (same as the JVM side), and not localized externally (see the sketch after this list).
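
A minimal sketch of the inference path described above (the timestamp value is illustrative; this assumes a Spark build with `TIMESTAMP_NTZ` support):

```python
import datetime
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

# a naive datetime (no tzinfo) is now inferred as TimestampNTZType
df = spark.createDataFrame([(datetime.datetime(2021, 9, 1, 12, 0),)], ["ts"])
df.printSchema()
df.toPandas()  # comes back as naive pandas datetimes, without localization
```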

### How was this patch tested?

Manually tested and unittests were added.

Closes #33876 from HyukjinKwon/SPARK-36626.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-02 14:00:27 +09:00
Xinrong Meng c6a2021fec [SPARK-36399][PYTHON] Implement DataFrame.combine_first
### What changes were proposed in this pull request?
Implement `DataFrame.combine_first`.

The PR is based on https://github.com/databricks/koalas/pull/1950. Thanks AishwaryaKalloli for the prototype.

### Why are the changes needed?
Updating null elements with the value in the same location in another DataFrame is a common use case.
That is supported in pandas. We should support that as well.

### Does this PR introduce _any_ user-facing change?
Yes. `DataFrame.combine_first` can be used.

```py
>>> ps.set_option("compute.ops_on_diff_frames", True)
>>> df1 = ps.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = ps.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2).sort_index()
     A    B
0  1.0  3.0
1  0.0  4.0

# Null values still persist if the location of that null value does not exist in other

>>> df1 = ps.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = ps.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2).sort_index()
     A    B    C
0  NaN  4.0  NaN
1  0.0  3.0  1.0
2  NaN  3.0  1.0
>>> ps.reset_option("compute.ops_on_diff_frames")
```

### How was this patch tested?
Unit tests.

Closes #33714 from xinrong-databricks/df_combine_first.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-31 19:50:16 -07:00
Huaxin Gao 15e42b4442 [SPARK-36578][ML] UnivariateFeatureSelector API doc improvement
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`

### Why are the changes needed?
make the doc look better

### Does this PR introduce _any_ user-facing change?
yes, API doc change

### How was this patch tested?
Manually checked

Closes #33855 from huaxingao/ml_doc.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-08-26 21:16:49 -07:00
itholic fe486185c4 [SPARK-36537][PYTHON] Revisit disabled tests for CategoricalDtype
### What changes were proposed in this pull request?

This PR proposes to re-enable the tests that were disabled due to behavior differences with pandas 1.3.

- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3 and seems to have a bug, so we manually created the expected results and test against them.
- Fixed the `GroupBy.transform` since it doesn't work properly for `CategoricalDtype`.

### Why are the changes needed?

We should enable the tests as much as possible even if pandas has a bug.

And we should follow the behavior of latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, `GroupBy.transform` now follow the behavior of latest pandas.

### How was this patch tested?

Unittests.

Closes #33817 from itholic/SPARK-36537.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 17:43:49 +09:00
itholic 97e7d6e667 [SPARK-36505][PYTHON] Improve test coverage for frame.py
### What changes were proposed in this pull request?

This PR proposes improving test coverage for pandas-on-Spark DataFrame code base, which is written in `frame.py`.

This PR did the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing

### Why are the changes needed?

To make the project healthier by improving coverage.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittest.

Closes #33833 from itholic/SPARK-36505.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-26 17:43:00 +09:00
Hyukjin Kwon 93cec49212 [SPARK-36559][SQL][PYTHON] Create plans dedicated to distributed-sequence index for optimization
### What changes were proposed in this pull request?

This PR proposes to move the distributed-sequence index implementation into a SQL plan to leverage optimizations such as column pruning.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```

**Before:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
      +- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
         +- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
            +- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
               +- Project [id#37L]
                  +- Filter atleastnnonnulls(1, id#37L)
                     +- Scan ExistingRDD[__index_level_0__#36L,id#37L]
                        # ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```

**After:**

```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
   +- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
      +- HashAggregate(keys=[id#258L], functions=[count(1)])
         +- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
            +- Filter atleastnnonnulls(1, id#258L)
               +- Range (0, 10, step=1, splits=16)
                  # ^^^ Removed the Spark job execution for `zipWithIndex`
```

### Why are the changes needed?

To leverage optimization of SQL engine and avoid unnecessary shuffle to create default index.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unittests were added. Also, this PR will test all unittests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.

Closes #33807 from HyukjinKwon/SPARK-36559.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-25 10:02:53 +09:00
Gengliang Wang de932f51ce Revert "[SPARK-34415][ML] Randomization in hyperparameter optimization"
### What changes were proposed in this pull request?

Revert 397b843890 and 5a48eb8d00

### Why are the changes needed?

As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch 3.2 and fix it on the master branch later.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Ci tests

Closes #33819 from gengliangwang/revert-SPARK-34415.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-24 13:38:14 -07:00
Hyukjin Kwon fa53aa06d1 [SPARK-36560][PYTHON][INFRA] Deflake PySpark coverage job
### What changes were proposed in this pull request?

This PR proposes to increase timeouts for:
- `pyspark.sql.tests.test_streaming.StreamingTests.test_parameter_accuracy`
- `pyspark.mllib.tests.test_streaming_algorithms.StreamingLogisticRegressionWithSGDTests. test_parameter_accuracy`

to deflake PySpark coverage build:

- https://github.com/apache/spark/runs/3392972609?check_suite_focus=true
- https://github.com/apache/spark/runs/3388727798?check_suite_focus=true
- https://github.com/apache/spark/runs/3359880048?check_suite_focus=true
- https://github.com/apache/spark/runs/3338876122?check_suite_focus=true

### Why are the changes needed?

To have more stable PySpark coverage report: https://app.codecov.io/gh/apache/spark

### Does this PR introduce _any_ user-facing change?

Spark developers will be able to see more stable results in https://app.codecov.io/gh/apache/spark

### How was this patch tested?

GitHub Actions' scheduled jobs will test them out.

Closes #33808 from HyukjinKwon/SPARK-36560.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-24 11:08:43 +09:00
Xinrong Meng 0b6af464dc [SPARK-36470][PYTHON] Implement CategoricalIndex.map and DatetimeIndex.map
### What changes were proposed in this pull request?
Implement `CategoricalIndex.map` and `DatetimeIndex.map`

`MultiIndex.map` cannot be implemented in the same way as the `map` of other indexes. It should be taken care of separately if necessary.

### Why are the changes needed?
Mapping values using input correspondence is a common operation that is supported in pandas. We shall support that as well.

### Does this PR introduce _any_ user-facing change?
Yes. `CategoricalIndex.map` and `DatetimeIndex.map` can be used now.

- CategoricalIndex.map

```py
>>> idx = ps.CategoricalIndex(['a', 'b', 'c'])
>>> idx
CategoricalIndex(['a', 'b', 'c'], categories=['a', 'b', 'c'], ordered=False, dtype='category')

>>> idx.map(lambda x: x.upper())
CategoricalIndex(['A', 'B', 'C'],  categories=['A', 'B', 'C'], ordered=False, dtype='category')

>>> pser = pd.Series([1, 2, 3], index=pd.CategoricalIndex(['a', 'b', 'c'], ordered=True))
>>> idx.map(pser)
CategoricalIndex([1, 2, 3], categories=[1, 2, 3], ordered=True, dtype='category')

>>> idx.map({'a': 'first', 'b': 'second', 'c': 'third'})
CategoricalIndex(['first', 'second', 'third'], categories=['first', 'second', 'third'], ordered=False, dtype='category')
```

- DatetimeIndex.map

```py
>>> pidx = pd.date_range(start="2020-08-08", end="2020-08-10")
>>> psidx = ps.from_pandas(pidx)

>>> mapper_dict = {
...   datetime.datetime(2020, 8, 8): datetime.datetime(2021, 8, 8),
...   datetime.datetime(2020, 8, 9): datetime.datetime(2021, 8, 9),
... }
>>> psidx.map(mapper_dict)
DatetimeIndex(['2021-08-08', '2021-08-09', 'NaT'], dtype='datetime64[ns]', freq=None)

>>> mapper_pser = pd.Series([1, 2, 3], index=pidx)
>>> psidx.map(mapper_pser)
Int64Index([1, 2, 3], dtype='int64')
>>> psidx
DatetimeIndex(['2020-08-08', '2020-08-09', '2020-08-10'], dtype='datetime64[ns]', freq=None)

>>> psidx.map(lambda x: x.strftime("%B %d, %Y, %r"))
Index(['August 08, 2020, 12:00:00 AM', 'August 09, 2020, 12:00:00 AM',
       'August 10, 2020, 12:00:00 AM'],
      dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33756 from xinrong-databricks/other_indexes_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-23 10:08:40 +09:00
itholic f2e593bcf1 [SPARK-36368][PYTHON] Fix CategoricalOps.astype to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes to fix the behavior of `astype` for `CategoricalDtype` to follow pandas 1.3.

**Before:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['b', 'c', 'a']
```

**After:**
```python
>>> pcat
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']

>>> pcat.astype(CategoricalDtype(["b", "c", "a"]))
0    a
1    b
2    c
dtype: category
Categories (3, object): ['a', 'b', 'c']  # CategoricalDtype is not updated if dtype is the same
```

`CategoricalDtype` is treated as the same `dtype` if the unique values are the same.

```python
>>> pcat1 = pser.astype(CategoricalDtype(["b", "c", "a"]))
>>> pcat2 = pser.astype(CategoricalDtype(["a", "b", "c"]))
>>> pcat1.dtype == pcat2.dtype
True
```

### Why are the changes needed?

We should follow the latest pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior is changed as example in the PR description.

### How was this patch tested?

Unittest

Closes #33757 from itholic/SPARK-36368.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-18 11:38:59 -07:00
itholic c91ae544fd [SPARK-36388][SPARK-36386][PYTHON][FOLLOWUP] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
### What changes were proposed in this pull request?

This PR is followup for https://github.com/apache/spark/pull/33646 to add missing tests.

### Why are the changes needed?

Some tests are missing

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unittest

Closes #33776 from itholic/SPARK-36388-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-18 11:17:01 -07:00
Takuya UESHIN 7fb8ea319e [SPARK-36370][PYTHON][FOLLOWUP] Use LooseVersion instead of pkg_resources.parse_version
### What changes were proposed in this pull request?

This is a follow-up of #33687.

Use `LooseVersion` instead of `pkg_resources.parse_version`.

### Why are the changes needed?

In the previous PR, `pkg_resources.parse_version` was used, but we should use `LooseVersion` instead to be consistent in the code base.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33768 from ueshin/issues/SPARK-36370/LooseVersion.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-18 10:36:09 +09:00
Cedric-Magnan 964dfe254f [SPARK-36370][PYTHON] _builtin_table directly imported from pandas instead of being redefined
### What changes were proposed in this pull request?
This PR suggests refactoring the way `_builtin_table` is defined in the `python/pyspark/pandas/groupby.py` module.
pandas has recently refactored `_builtin_table`: it is now part of the `pandas.core.common` module instead of being an attribute of the `pandas.core.base.SelectionMixin` class.
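
A hedged sketch of the version-guarded import such a change implies (the exact version boundary and code in the PR may differ; pandas 1.3 is assumed here based on the surrounding commits):

```python
from distutils.version import LooseVersion

import pandas as pd

if LooseVersion(pd.__version__) >= LooseVersion("1.3.0"):
    # newer pandas exposes the table in pandas.core.common (per this PR's description)
    from pandas.core.common import _builtin_table
else:
    # older pandas kept it as an attribute of SelectionMixin
    from pandas.core.base import SelectionMixin
    _builtin_table = SelectionMixin._builtin_table
```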

### Why are the changes needed?
This change is not strictly needed, but the current implementation redefines this table within PySpark, so any change to this table in the pandas library would also need to be replicated in the PySpark repository.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran the following command successfully :
```sh
python/run-tests --testnames 'pyspark.pandas.tests.test_groupby'
```
Tests passed in 327 seconds

Closes #33687 from Cedric-Magnan/_builtin_table_from_pandas.

Authored-by: Cedric-Magnan <cedric.magnan@artefact.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-17 10:46:49 -07:00
itholic c0441bb7e8 [SPARK-36387][PYTHON] Fix Series.astype from datetime to nullable string
### What changes were proposed in this pull request?

This PR proposes to fix `Series.astype` when converting datetime type to StringDtype, to match the behavior of pandas 1.3.

In pandas < 1.3,
```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                    NaT
Name: datetime, dtype: string
```

This is changed to

```python
>>> pd.Series(["2020-10-27 00:00:01", None], name="datetime").astype("string")
0    2020-10-27 00:00:01
1                   <NA>
Name: datetime, dtype: string
```

in pandas >= 1.3, so we follow the behavior of latest pandas.

### Why are the changes needed?

Because pandas-on-Spark always follow the behavior of latest pandas.

### Does this PR introduce _any_ user-facing change?

Yes, the behavior is changed to latest pandas when converting datetime to nullable string (StringDtype)

### How was this patch tested?

Unittest passed

Closes #33735 from itholic/SPARK-36387.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-17 10:29:16 -07:00
Xinrong Meng 4dcd746025 [SPARK-36469][PYTHON] Implement Index.map
### What changes were proposed in this pull request?
Implement `Index.map`.

The PR is based on https://github.com/databricks/koalas/pull/2136. Thanks awdavidson for the prototype.

`map` of CategoricalIndex and DatetimeIndex will be implemented in separate PRs.

### Why are the changes needed?
Mapping values using input correspondence (a dict, Series, or function) is supported in pandas as [Index.map](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Index.map.html).
We shall also support that.

### Does this PR introduce _any_ user-facing change?
Yes. `Index.map` is available now.

```py
>>> psidx = ps.Index([1, 2, 3])

>>> psidx.map({1: "one", 2: "two", 3: "three"})
Index(['one', 'two', 'three'], dtype='object')

>>> psidx.map(lambda id: "{id} + 1".format(id=id))
Index(['1 + 1', '2 + 1', '3 + 1'], dtype='object')

>>> pser = pd.Series(["one", "two", "three"], index=[1, 2, 3])
>>> psidx.map(pser)
Index(['one', 'two', 'three'], dtype='object')
```

### How was this patch tested?
Unit tests.

Closes #33694 from xinrong-databricks/index_map.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-08-16 11:06:10 -07:00
Liang-Chi Hsieh 8b8d91cf64 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session window is currently a static value. To support more complex usage, it is better to support a dynamic gap duration that is determined by looking at the current data. For example, in our use case, we may have a different gap depending on a certain column in the input rows.
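
A hedged sketch of what a dynamic gap could look like, assuming `session_window` accepts a column expression for the gap as described above; the streaming source and the "eventType" logic below are hypothetical, for illustration only.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical streaming source; "eventType" is an assumed column for illustration.
events = spark.readStream.format("rate").load().withColumnRenamed("value", "eventType")

# The gap duration is computed per row from the data instead of being a single static value.
sessions = events.groupBy(
    F.session_window(
        F.col("timestamp"),
        F.when(F.col("eventType") % 2 == 0, F.lit("5 seconds")).otherwise(F.lit("10 seconds")),
    )
).count()
```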

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:00 +09:00
itholic b8508f4876 [SPARK-36388][SPARK-36386][PYTHON] Fix DataFrame groupby-rolling and groupby-expanding to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes to fix `RollingGroupBy` and `ExpandingGroupBy` to follow latest pandas behavior.

`RollingGroupBy` and `ExpandingGroupBy` no longer return the grouped-by column in the values as of pandas 1.3.

Before:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       A    B
A
1 0  NaN  NaN
  1  2.0  1.0
2 2  NaN  NaN
3 3  NaN  NaN
```

After:
```python
>>> df = pd.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
>>> df.groupby("A").rolling(2).sum()
       B
A
1 0  NaN
  1  1.0
2 2  NaN
3 3  NaN
```

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the result of `RollingGroupBy` and `ExpandingGroupBy` is changed as described above.
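
For reference, a short pandas-on-Spark sketch of the same call; the data mirrors the pandas example above.

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"A": [1, 1, 2, 3], "B": [0, 1, 2, 3]})
# With this change, the grouped-by column "A" is no longer returned in the values,
# matching pandas >= 1.3.
print(psdf.groupby("A").rolling(2).sum())
```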

### How was this patch tested?

Unit tests.

Closes #33646 from itholic/SPARK-36388.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-10 10:12:52 +09:00
itholic a9f371c247 [SPARK-36369][PYTHON] Fix Index.union to follow pandas 1.3
### What changes were proposed in this pull request?

This PR proposes fixing `Index.union` to follow the behavior of pandas 1.3.

Before:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2], dtype='int64')
```

After:
```python
>>> ps_idx1 = ps.Index([1, 1, 1, 1, 1, 2, 2])
>>> ps_idx2 = ps.Index([1, 1, 2, 2, 2, 2, 2])
>>> ps_idx1.union(ps_idx2)
Int64Index([1, 1, 1, 1, 1, 2, 2, 2, 2, 2], dtype='int64')
```

This bug is fixed in https://github.com/pandas-dev/pandas/issues/36289.

### Why are the changes needed?

We should follow the behavior of pandas as much as possible.

### Does this PR introduce _any_ user-facing change?

Yes, the result will change for some cases that have duplicate values.

### How was this patch tested?

Unit test.

Closes #33634 from itholic/SPARK-36369.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-09 11:10:01 +09:00
Weichen Xu f9f6c0d350 [SPARK-36425][PYSPARK][ML] Support CrossValidatorModel get standard deviation of metrics for each paramMap
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>

### What changes were proposed in this pull request?
Support getting the standard deviation of metrics for each paramMap from CrossValidatorModel.

### Why are the changes needed?
So that in mlflow autologging, we can log the standard deviation of metrics, which is useful.

### Does this PR introduce _any_ user-facing change?
Yes.
A public attribute `stdMetrics` is added to `CrossValidatorModel`; it contains the standard deviation of metrics for each paramMap.
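
A hedged usage sketch; the estimator, grid, and dataset below are illustrative assumptions, not part of this change.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Tiny illustrative dataset with alternating labels.
train_df = spark.createDataFrame(
    [(Vectors.dense([float(i)]), float(i % 2)) for i in range(20)],
    ["features", "label"],
)

lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
cv = CrossValidator(
    estimator=lr,
    estimatorParamMaps=grid,
    evaluator=BinaryClassificationEvaluator(),
    numFolds=2,
)
cv_model = cv.fit(train_df)

print(cv_model.avgMetrics)  # mean metric per paramMap (existing attribute)
print(cv_model.stdMetrics)  # standard deviation per paramMap (the new attribute)
```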

### How was this patch tested?
Unit test.

Closes #33652 from WeichenXu123/add_std_metric.

Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-09 10:08:52 +09:00
Kousuke Saruta 856c9a58f8 [SPARK-36173][CORE][PYTHON][FOLLOWUP] Add type hint for TaskContext.cpus
### What changes were proposed in this pull request?

This PR adds a type hint for `TaskContext.cpus`, which was added in SPARK-36173 (#33385).
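
A minimal sketch of the kind of stub entry this refers to; the actual contents of the `.pyi` file may differ.

```python
# Type-stub style annotation for the new API (illustrative only).
class TaskContext:
    def cpus(self) -> int: ...
```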

### Why are the changes needed?

To comply with Project Zen.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Confirmed the type hint works with IntelliJ IDEA.

Closes #33645 from sarutak/taskcontext-pyi.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-06 10:56:10 +09:00
Wu, Xiaochang f6e6d1157a [SPARK-36173][CORE] Support getting CPU number in TaskContext
In stage-level resource scheduling, the allocated 3rd party resources can be obtained in TaskContext using the resources() interface; however, there is no API to get how many CPUs are allocated to the task. This adds a cpus() interface to TaskContext to complement resources(). Although the task CPU request can be obtained from the resource profile, it is more convenient to get it inside the task code without needing to pass the profile from the driver side to the executor side.

### What changes were proposed in this pull request?
Add cpus() interface in TaskContext and modify relevant code.

### Why are the changes needed?
TaskContext has resources() to get the allocated 3rd party resources, but there is no API to get the number of CPUs allocated to the task.

### Does this PR introduce _any_ user-facing change?
Yes, a cpus() interface is added to TaskContext.
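
A hedged usage sketch, assuming a local SparkContext for illustration:

```python
from pyspark import SparkContext, TaskContext

sc = SparkContext.getOrCreate()

def report_cpus(_):
    # cpus() is the new API returning the number of CPUs allocated to this task.
    return TaskContext.get().cpus()

print(sc.parallelize(range(4), 2).map(report_cpus).collect())
```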

### How was this patch tested?
Unit tests

Closes #33385 from xwu99/taskcontext-cpus.

Lead-authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Co-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-08-04 21:14:01 -05:00
itholic 3d72c20e64 [SPARK-35811][PYTHON][FOLLOWUP] Deprecate DataFrame.to_spark_io
### What changes were proposed in this pull request?

This PR is followup for https://github.com/apache/spark/pull/32964, to improve the warning message.

### Why are the changes needed?

To improve the warning message.

### Does this PR introduce _any_ user-facing change?

The warning is changed from "Deprecated in 3.2, Use `spark.to_spark_io` instead." to "Deprecated in 3.2, Use `DataFrame.spark.to_spark_io` instead."
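
For context, a hedged sketch of the two call paths the warning refers to; the output path is an assumed example.

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"a": [1, 2, 3]})

# Deprecated entry point: emits the improved warning above.
psdf.to_spark_io("/tmp/example_out", format="parquet")

# Preferred entry point going forward.
psdf.spark.to_spark_io("/tmp/example_out", format="parquet")
```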

### How was this patch tested?

Manually ran `dev/lint-python`.

Closes #33631 from itholic/SPARK-35811-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-04 16:20:29 +09:00