Commit graph

2932 commits

Author SHA1 Message Date
Liang-Chi Hsieh 3aa933b162 [SPARK-36465][SS] Dynamic gap duration in session window
### What changes were proposed in this pull request?

This patch supports dynamic gap duration in session window.

### Why are the changes needed?

The gap duration used in session window for now is a static value. To support more complex usage, it is better to support dynamic gap duration which determines the gap duration by looking at the current data. For example, in our usecase, we may have different gap by looking at the certain column in the input rows.

### Does this PR introduce _any_ user-facing change?

Yes, users can specify dynamic gap duration.

### How was this patch tested?

Modified existing tests and new test.

Closes #33691 from viirya/dynamic-session-window-gap.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit 8b8d91cf64)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-08-16 11:06:16 +09:00
itholic f7ab2bfc8c [SPARK-35811][PYTHON][FOLLOWUP] Deprecate DataFrame.to_spark_io
### What changes were proposed in this pull request?

This PR is followup for https://github.com/apache/spark/pull/32964, to improve the warning message.

### Why are the changes needed?

To improve the warning message.

### Does this PR introduce _any_ user-facing change?

The warning is changed from "Deprecated in 3.2, Use `spark.to_spark_io` instead." to "Deprecated in 3.2, Use `DataFrame.spark.to_spark_io` instead."

### How was this patch tested?

Manually run `dev/lint-python`

Closes #33631 from itholic/SPARK-35811-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3d72c20e64)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-04 16:20:37 +09:00
Xinrong Meng c22a25b76a [SPARK-36192][PYTHON] Better error messages for DataTypeOps against lists
### What changes were proposed in this pull request?
Better error messages for DataTypeOps against lists.

### Why are the changes needed?
Currently, DataTypeOps against lists throw a Py4JJavaError, we shall throw a TypeError with proper messages instead.

### Does this PR introduce _any_ user-facing change?
Yes. A TypeError message will be showed rather than a Py4JJavaError.

From:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
py4j.protocol.Py4JJavaError: An error occurred while calling o107.gt.
: java.lang.RuntimeException: Unsupported literal type class java.util.ArrayList [3, 2, 1]
...
```

To:
```py
>>> import pyspark.pandas as ps
>>> ps.Series([1, 2, 3]) > [3, 2, 1]
Traceback (most recent call last):
...
TypeError: The operation can not be applied to list.
```

### How was this patch tested?
Unit tests.

Closes #33581 from xinrong-databricks/data_type_ops_list.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8ca11fe39f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 16:25:59 +09:00
Takuya UESHIN 71e4c56974 [SPARK-36367][3.2][PYTHON] Partially backport to avoid unexpected error with pandas 1.3
### What changes were proposed in this pull request?

Partially backport from #33598 to avoid unexpected error caused by pandas 1.3.

### Why are the changes needed?

If uses tries to use pandas 1.3 as the underlying pandas, it will raise unexpected errors caused by removed APIs or behavior change.

Note that pandas API on Spark 3.2 will still follow the pandas 1.2 behavior.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33614 from ueshin/issues/SPARK-36367/3.2/partially_backport.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-03 14:03:35 +09:00
Linhong Liu e26cb968bd [SPARK-36224][SQL] Use Void as the type name of NullType
### What changes were proposed in this pull request?
Change the `NullType.simpleString` to "void" to set "void" as the formal type name of `NullType`

### Why are the changes needed?
This PR is intended to address the type name discussion in PR #28833. Here are the reasons:
1. The type name of NullType is displayed everywhere, e.g. schema string, error message, document. Hence it's not possible to hide it from users, we have to choose a proper name
2. The "void" is widely used as the type name of "NULL", e.g. Hive, pgSQL
3. Changing to "void" can enable the round trip of `toDDL`/`fromDDL` for NullType. (i.e. make `from_json(col, schema.toDDL)`) work

### Does this PR introduce _any_ user-facing change?
Yes, the type name of "NULL" is changed from "null" to "void". for example:
```
scala> sql("select null as a, 1 as b").schema.catalogString
res5: String = struct<a:void,b:int>
```

### How was this patch tested?
existing test cases

Closes #33437 from linhongliu-db/SPARK-36224-void-type-name.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 2f700773c2)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-02 23:20:11 +08:00
Hyukjin Kwon 310cd8eef1 [SPARK-36092][INFRA][BUILD][PYTHON] Migrate to GitHub Actions with Codecov from Jenkins
This PR proposes to migrate Coverage report from Jenkins to GitHub Actions by setting a dailly cron job.

For some background, currently PySpark code coverage is being reported in this specific Jenkins job: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.7/

Because of the security issue between [Codecov service](https://app.codecov.io/gh/) and Jenkins machines, we had to work around by manually hosting a coverage site via GitHub pages, see also https://spark-test.github.io/pyspark-coverage-site/ by spark-test account (which is shared to only subset of PMC members).

Since we now run the build via GitHub Actions, we can leverage [Codecov plugin](https://github.com/codecov/codecov-action), and remove the workaround we used.

Virtually no. Coverage site (UI) might change but the information it holds should be virtually the same.

I manually tested:
- Scheduled run: https://github.com/HyukjinKwon/spark/actions/runs/1082261484
- Coverage report: 73f0291a7d/python/pyspark
- Run against a PR: https://github.com/HyukjinKwon/spark/actions/runs/1082367175

Closes #33591 from HyukjinKwon/SPARK-36092.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c0d1860f25)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 21:38:39 +09:00
Yikun Jiang 5548cd5a7f [SPARK-35976][PYTHON] Adjust astype method for ExtensionDtype in pandas API on Spark
### What changes were proposed in this pull request?
This patch set value to `<NA>` (pd.NA) in BooleanExtensionOps and StringExtensionOps.

### Why are the changes needed?
The pandas behavior:
```python
>>> pd.Series([True, False, None], dtype="boolean").astype(str).tolist()
['True', 'False', '<NA>']
>>> pd.Series(['s1', 's2', None], dtype="string").astype(str).tolist()
['1', '2', '<NA>']
```

pandas on spark
```python
>>> import pandas as pd
>>> from pyspark import pandas as ps

# Before
>>> ps.from_pandas(pd.Series([True, False, None], dtype="boolean")).astype(str).tolist()
['True', 'False', 'None']
>>> ps.from_pandas(pd.Series(['s1', 's2', None], dtype="string")).astype(str).tolist()
['True', 'False', 'None']

# After
>>> ps.from_pandas(pd.Series([True, False, None], dtype="boolean")).astype(str).tolist()
['True', 'False', '<NA>']
>>> ps.from_pandas(pd.Series(['s1', 's2', None], dtype="string")).astype(str).tolist()
['s1', 's2', '<NA>']
```

See more in [SPARK-35976](https://issues.apache.org/jira/browse/SPARK-35976)

### Does this PR introduce _any_ user-facing change?
Yes, return `<NA>` when None to follow the pandas behavior

### How was this patch tested?
Change the ut to cover this scenario.

Closes #33585 from Yikun/SPARK-35976.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f04e991e6a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:37:35 +09:00
Takuya UESHIN 5d37a77d44 [SPARK-36365][PYTHON] Remove old workarounds related to null ordering
### What changes were proposed in this pull request?

Remove old workarounds related to null ordering.

### Why are the changes needed?

In pandas-on-Spark, there are still some remaining places to call `Column._jc.(asc|desc)_nulls_(first|last)` as a workaround from Koalas to support Spark 2.3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified a couple of tests and existing tests.

Closes #33597 from ueshin/issues/SPARK-36365/nulls_first_last.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 90d31dfcb7)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-01 10:33:43 +09:00
Hyukjin Kwon 187aa6ab7f [SPARK-36338][PYTHON][FOLLOW-UP] Keep the original default value as 'sequence' in default index in pandas on Spark
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/33570, which mistakenly changed the default value of the default index

### Why are the changes needed?

It was mistakenly changed. It was changed to check if the tests actually pass but I forgot to change it back.

### Does this PR introduce _any_ user-facing change?

No, it's not related yet. It fixes up the mistake of the default value mistakenly changed.
(Changed default value makes the test flaky because of the order affected by extra shuffle)

### How was this patch tested?

Manually tested.

Closes #33596 from HyukjinKwon/SPARK-36338-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 74a6b9d23b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-31 08:31:19 +09:00
Takuya UESHIN a4dcda1794 [SPARK-36350][PYTHON] Move some logic related to F.nanvl to DataTypeOps
### What changes were proposed in this pull request?

Move some logic related to `F.nanvl` to `DataTypeOps`.

### Why are the changes needed?

There are several places to branch by `FloatType` or `DoubleType` to use `F.nanvl` but `DataTypeOps` should handle it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33582 from ueshin/issues/SPARK-36350/nan_to_null.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 895e3f5e2a)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-30 11:20:01 -07:00
Hyukjin Kwon fee87f13d1 [SPARK-36338][PYTHON][SQL] Move distributed-sequence implementation to Scala side
### What changes were proposed in this pull request?

This PR proposes to implement `distributed-sequence` index in Scala side.

### Why are the changes needed?

- Avoid unnecessary (de)serialization
- Keep the nullability in the input DataFrame when `distributed-sequence` is enabled. During the serialization, all fields are being nullable for now (see https://github.com/apache/spark/pull/32775#discussion_r645882104)

### Does this PR introduce _any_ user-facing change?

No to end users since pandas API on Spark is not released yet.

```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(1).spark.print_schema()
```

Before:

```
root
 |-- id: long (nullable = true)
```

After:

```
root
 |-- id: long (nullable = false)
```

### How was this patch tested?

Manually tested, and existing tests should cover them.

Closes #33570 from HyukjinKwon/SPARK-36338.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c6140d4d0a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 22:29:31 +09:00
Hyukjin Kwon 9cd370894b [SPARK-36254][PYTHON][FOLLOW-UP] Skip mlflow related tests in pandas on Spark
### What changes were proposed in this pull request?

This PR is a partial revert of https://github.com/apache/spark/pull/33567 that keeps the logic to skip mlflow related tests if that's not installed.

### Why are the changes needed?

It's consistent with other libraries, e.g) PyArrow.
It also fixes up the potential dev breakage (see also https://github.com/apache/spark/pull/33567#issuecomment-889841829)

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

This is a partial revert. CI should test it out too.

Closes #33589 from HyukjinKwon/SPARK-36254.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit dd2ca0aee2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 22:28:29 +09:00
itholic a9c5b1a5c8 [SPARK-36254][INFRA][PYTHON] Install mlflow in Github Actions CI
### What changes were proposed in this pull request?

This PR proposes adding a Python package, `mlflow` and `sklearn` to enable the MLflow test in pandas API on Spark.

### Why are the changes needed?

To enable the MLflow test in pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

No, it's test-only

### How was this patch tested?

Manually test on local, with `python/run-tests --testnames pyspark.pandas.mlflow`.

Closes #33567 from itholic/SPARK-36254.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit abce61f3fd)
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-30 00:04:59 -07:00
itholic 26b8297fa3 [SPARK-35806][PYTHON][FOLLOW-UP] Mapping the mode argument to pandas in DataFrame.to_csv
### What changes were proposed in this pull request?

This PR is follow-up for https://github.com/apache/spark/pull/33414 to support the more options for `mode` argument for all APIs that has `mode` argument, not only `DataFrame.to_csv`.

### Why are the changes needed?

To keep the usage consistency for the arguments that have same name.

### Does this PR introduce _any_ user-facing change?

More options is available for all APIs that has `mode` argument, same as `DataFrame.to_csv`

### How was this patch tested?

Manually test on local

Closes #33569 from itholic/SPARK-35085-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 94cb2bbbc2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-30 12:48:36 +09:00
Takuya UESHIN 359dfd7ec2 [SPARK-36333][PYTHON] Reuse isnull where the null check is needed
### What changes were proposed in this pull request?

Reuse `IndexOpsMixin.isnull()` where the null check is needed.

### Why are the changes needed?

There are some places where we can reuse `IndexOpsMixin.isnull()` instead of directly using Spark `Column`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33562 from ueshin/issues/SPARK-36333/reuse_isnull.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 07ed82be0b)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-29 15:33:25 -07:00
Samuel Moseley e42324982d [SPARK-36161][PYTHON] Add type check on dropDuplicates pyspark function
### What changes were proposed in this pull request?
Improve the error message for wrong type when calling dropDuplicates in pyspark.

### Why are the changes needed?
The current error message is cryptic and can be unclear to less experienced users.

### Does this PR introduce _any_ user-facing change?
Yes, it adds a type error for when a user gives the wrong type to dropDuplicates

### How was this patch tested?
There is currently no testing for error messages in pyspark dataframe functions

Closes #33364 from sammyjmoseley/sm/add-type-checking-for-drop-duplicates.

Lead-authored-by: Samuel Moseley <smoseley@palantir.com>
Co-authored-by: Sammy Moseley <moseley.sammy@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a07df1acc6)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-29 19:11:57 +09:00
Xinrong Meng 999cf81653 [SPARK-36190][PYTHON] Improve the rest of DataTypeOps tests by avoiding joins
### What changes were proposed in this pull request?
Improve the rest of DataTypeOps tests by avoiding joins.

### Why are the changes needed?
bool, string, numeric DataTypeOps tests have been improved by avoiding joins.
We should improve the rest of the DataTypeOps tests in the same way.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33546 from xinrong-databricks/test_no_join.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 9c5cb99d6e)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 15:53:51 -07:00
Xinrong Meng bdf1570911 [SPARK-36143][PYTHON] Adjust astype of fractional Series with missing values to follow pandas
### What changes were proposed in this pull request?
Adjust `astype` of fractional Series with missing values to follow pandas.

Non-goal: Adjust the issue of `astype` of Decimal Series with missing values to follow pandas.

### Why are the changes needed?
`astype` of fractional Series with missing values doesn't behave the same as pandas, for example, float Series returns itself when `astype` integer, while a ValueError is raised in pandas.

We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes.

From:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
0    1.0
1    2.0
2    NaN
dtype: float64

```

To:
```py
>>> import numpy as np
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, np.nan])
>>> psser.astype(int)
Traceback (most recent call last):
...
ValueError: Cannot convert fractions with missing values to integer

```

### How was this patch tested?
Unit tests.

Closes #33466 from xinrong-databricks/extension_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 01213095e2)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-28 11:26:58 -07:00
Takuya UESHIN c9af94ecb4 [SPARK-36320][PYTHON] Fix Series/Index.copy() to drop extra columns
### What changes were proposed in this pull request?

Fix `Series`/`Index.copy()` to drop extra columns.

### Why are the changes needed?

Currently `Series`/`Index.copy()` keeps the copy of the anchor DataFrame which holds unnecessary columns.
We can drop those when `Series`/`Index.copy()`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33549 from ueshin/issues/SPARK-36320/index_ops_copy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 3c76a924ce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 18:40:03 +09:00
Takuya UESHIN 0e9e737a84 [SPARK-36310][PYTHON] Fix IndexOpsMixin.hasnans to use isnull().any()
### What changes were proposed in this pull request?

Fix `IndexOpsMixin.hasnans` to use `IndexOpsMixin.isnull().any()`.

### Why are the changes needed?

`IndexOpsMixin.hasnans` has a potential issue to cause `a window function inside an aggregate function` error.
Also it returns a wrong value when the `Series`/`Index` is empty.

```py
>>> ps.Series([]).hasnans
None
```

whereas:

```py
>>> pd.Series([]).hasnans
False
```

`IndexOpsMixin.any()` is safe for both cases.

### Does this PR introduce _any_ user-facing change?

`IndexOpsMixin.hasnans` will return `False` when empty.

### How was this patch tested?

Added some tests.

Closes #33547 from ueshin/issues/SPARK-36310/hasnan.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit bcc595c112)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-28 09:21:22 +09:00
Luran He 8a3b1cd811
[SPARK-36211][PYTHON] Correct typing of udf return value
The following code should type-check:

```python3
import uuid

import pyspark.sql.functions as F

my_udf = F.udf(lambda: str(uuid.uuid4())).asNondeterministic()
```

### What changes were proposed in this pull request?

The `udf` function should return a more specific type.

### Why are the changes needed?

Right now, `mypy` will throw spurious errors, such as for the code given above.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

This was not tested. Sorry, I am not very familiar with this repo -- are there any typing tests?

Closes #33399 from luranhe/patch-1.

Lead-authored-by: Luran He <luranjhe@gmail.com>
Co-authored-by: Luran He <luran.he@compass.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
(cherry picked from commit ede1bc6b51)
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
2021-07-27 09:09:11 +02:00
Leona 6027137928 [SPARK-36288][DOCS][PYTHON] Update API usage on pyspark pandas documents
### What changes were proposed in this pull request?

Update api usage examples on PySpark pandas API documents.

### Why are the changes needed?

If users try to use PySpark pandas API from the document, they will see some API deprication warnings.
It is kind for users to update those documents to avoid confusion.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

```
make html
```

Closes #33519 from yoda-mon/update-pyspark-configurations.

Authored-by: Leona <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 9a47483f74)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:31:00 +09:00
Takuya UESHIN f278f771e6 [SPARK-36267][PYTHON] Clean up CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Clean up `CategoricalAccessor` and `CategoricalIndex`.

- Clean up the classes
- Add deprecation warnings
- Clean up the docs

### Why are the changes needed?

To finalize the series of PRs for `CategoricalAccessor` and `CategoricalIndex`, we should clean up the classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33528 from ueshin/issues/SPARK-36267/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit c40d9d46f1)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:17:26 +09:00
Yikun Jiang 139536c3ed [SPARK-36142][PYTHON] Follow Pandas when pow between fractional series with Na and bool literal
### What changes were proposed in this pull request?

Set the result to 1 when the exp with 0(or False).

### Why are the changes needed?
Currently, exponentiation between fractional series and bools is not consistent with pandas' behavior.
```
 >>> pser = pd.Series([1, 2, np.nan], dtype=float)
 >>> psser = ps.from_pandas(pser)
 >>> pser ** False
 0 1.0
 1 1.0
 2 1.0
 dtype: float64
 >>> psser ** False
 0 1.0
 1 1.0
 2 NaN
 dtype: float64
```
We ought to adjust that.

See more in [SPARK-36142](https://issues.apache.org/jira/browse/SPARK-36142)

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a user-facing change, resulting in a different result for pow between fractional Series with missing values and bool literal, the results follow pandas behavior.

### How was this patch tested?
- Add test_pow_with_float_nan ut
- Exsiting test in test_pow

Closes #33521 from Yikun/SPARK-36142.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d52c2de08b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-27 12:07:10 +09:00
Xinrong Meng 1641812e97 [SPARK-36260][PYTHON] Add set_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?
Add set_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
set_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes, users will be able to use `set_categories`.

### How was this patch tested?
Unit tests.

Closes #33506 from xinrong-databricks/set_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 55971b70fe)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-26 17:12:45 -07:00
Dominik Gehl 29c0b1e7b1 [SPARK-36225][PYTHON][DOCS] Use DataFrame in python docstrings
### What changes were proposed in this pull request?
Changing references to Dataset in python docstrings to DataFrame

### Why are the changes needed?
no Dataset class in pyspark

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Doc change only

Closes #33438 from dominikgehl/feature/SPARK-36225.

Lead-authored-by: Dominik Gehl <dog@open.ch>
Co-authored-by: Dominik Gehl <gehl@fastmail.fm>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit ae1c20ee0d)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:58:19 +09:00
Takuya UESHIN c1434b1928 [SPARK-36279][INFRA][PYTHON] Fix lint-python to work with Python 3.9
### What changes were proposed in this pull request?

Fix `lint-python` to pick `PYTHON_EXECUTABLE` from the environment variable first to switch the Python and explicitly specify `PYTHON_EXECUTABLE` to use `python3.9` in CI.

### Why are the changes needed?

Currently `lint-python` uses `python3`, but it's not the one we expect in CI.
As a result, `black` check is not working.

```
The python3 -m black command was not found. Skipping black checks for now.
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The `black` check in `lint-python` should work.

Closes #33507 from ueshin/issues/SPARK-36279/lint-python.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 663cbdfbe5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-24 16:49:51 +09:00
Xinrong Meng 44cfce8548 [SPARK-36274][PYTHON] Fix equality comparison of unordered Categoricals
### What changes were proposed in this pull request?
Fix equality comparison of unordered Categoricals.

### Why are the changes needed?
Codes of a Categorical Series are used for Series equality comparison. However, that doesn't apply to unordered Categoricals, where the same value can have different codes in two same categories in a different order.

So we should map codes to value respectively and then compare the equality of value.

### Does this PR introduce _any_ user-facing change?
Yes.
From:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0     True
1     True
2     True
3    False
dtype: bool
```

To:
```py
>>> psser1 = ps.Series(pd.Categorical(list("abca")))
>>> psser2 = ps.Series(pd.Categorical(list("bcaa"), categories=list("bca")))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     (psser1 == psser2).sort_index()
...
0    False
1    False
2    False
3     True
dtype: bool
```

### How was this patch tested?
Unit tests.

Closes #33497 from xinrong-databricks/cat_bug.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 85adc2ff60)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 18:31:18 -07:00
Takuya UESHIN ab5224c45b [SPARK-36264][PYTHON] Add reorder_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `reorder_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `reorder_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `reorder_categories`.

### How was this patch tested?

Added some tests.

Closes #33499 from ueshin/issues/SPARK-36264/reorder_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit e12bc4d31d)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-23 17:19:32 -07:00
Dominik Gehl bcd2812fd2 [SPARK-36226][PYTHON][DOCS] Improve python docstring links to other classes
### What changes were proposed in this pull request?
additional links to other classes in python documentation

### Why are the changes needed?
python docstring syntax wasn't fully used everywhere

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Documentation change only

Closes #33440 from dominikgehl/feature/python-docstrings.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 701756ac95)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 19:18:00 +09:00
Takuya UESHIN 4abc1d389e [SPARK-36261][PYTHON] Add remove_unused_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `remove_unused_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `remove_unused_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `remove_unused_categories`.

### How was this patch tested?

Added some tests.

Closes #33485 from ueshin/issues/SPARK-36261/remove_unused_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 2fe12a7520)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 14:05:09 +09:00
Xinrong Meng 37e5a10477 [SPARK-36248][PYTHON] Add rename_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?
Add rename_categories to CategoricalAccessor and CategoricalIndex.

### Why are the changes needed?
rename_categories is supported in pandas CategoricalAccessor and CategoricalIndex. We ought to follow pandas.

### Does this PR introduce _any_ user-facing change?
Yes. `rename_categories` is supported in pandas API on Spark now.

```py
# CategoricalIndex
>>> psser = ps.CategoricalIndex(["a", "a", "b"])
>>> psser.rename_categories([0, 1])
CategoricalIndex([0, 0, 1], categories=[0, 1], ordered=False, dtype='category')
>>> psser.rename_categories({'a': 'A', 'c': 'C'})
CategoricalIndex(['A', 'A', 'b'], categories=['A', 'b'], ordered=False, dtype='category')
>>> psser.rename_categories(lambda x: x.upper())
CategoricalIndex(['A', 'A', 'B'], categories=['A', 'B'], ordered=False, dtype='category')

# CategoricalAccessor
>>> s = ps.Series(["a", "a", "b"], dtype="category")
>>> s.cat.rename_categories([0, 1])
0    0
1    0
2    1
dtype: category
Categories (2, int64): [0, 1]
>>> s.cat.rename_categories({'a': 'A', 'c': 'C'})
0    A
1    A
2    b
dtype: category
Categories (2, object): ['A', 'b']
>>> s.cat.rename_categories(lambda x: x.upper())
0    A
1    A
2    B
dtype: category
Categories (2, object): ['A', 'B']
```

### How was this patch tested?
Unit tests.

Closes #33471 from xinrong-databricks/category_rename_categories.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 8b3d84bb7e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 12:26:35 +09:00
Xinrong Meng aeab18edd7 [SPARK-36189][PYTHON] Improve bool, string, numeric DataTypeOps tests by avoiding joins
### What changes were proposed in this pull request?
Improve bool, string, numeric DataTypeOps tests by avoiding joins.

Previously, bool, string, numeric DataTypeOps tests are conducted between two different Series.
After the PR, bool, string, numeric DataTypeOps tests should perform on a single DataFrame.

### Why are the changes needed?
A considerable number of DataTypeOps tests have operations on different Series, so joining is needed, which takes a long time.
We shall avoid joins for a shorter test duration.

The majority of joins happen in bool, string, numeric DataTypeOps tests, so we improve them first.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33402 from xinrong-databricks/datatypeops_diffframe.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 75fd1f5b82)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 12:20:45 +09:00
Takuya UESHIN 9a004ae12d [SPARK-36265][PYTHON] Use __getitem__ instead of getItem to suppress warnings
### What changes were proposed in this pull request?

Use `Column.__getitem__` instead of `Column.getItem` to suppress warnings.

### Why are the changes needed?

In pandas API on Spark code base, there are some places using `Column.getItem` with `Column` object, but it shows a deprecation warning.

### Does this PR introduce _any_ user-facing change?

Yes, users won't see the warnings anymore.

- before

```py
>>> s = ps.Series(list("abbccc"), dtype="category")
>>> s.astype(str)
/path/to/spark/python/pyspark/sql/column.py:322: FutureWarning: A column as 'key' in getItem is deprecated as of Spark 3.0, and will not be supported in the future release. Use `column[key]` or `column.key` syntax instead.
  warnings.warn(
0    a
1    b
2    b
3    c
4    c
5    c
dtype: object
```

- after

```py
>>> s = ps.Series(list("abbccc"), dtype="category")
>>> s.astype(str)
0    a
1    b
2    b
3    c
4    c
5    c
dtype: object
```

### How was this patch tested?

Existing tests.

Closes #33486 from ueshin/issues/SPARK-36265/getitem.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a76a087f7f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-23 11:27:42 +09:00
itholic 94479a23c1 [SPARK-36239][PYTHON][DOCS] Remove some APIs from documentation
### What changes were proposed in this pull request?

This PR proposes removing some APIs from pandas-on-Spark documentation.

Because they can be easily workaround via Spark DataFrame or Column functions, so they might be removed In the future.

### Why are the changes needed?

Because we don't want to expose some functions as a public API.

### Does this PR introduce _any_ user-facing change?

The APIs such as `(Series|Index).spark.data_type`, `(Series|Index).spark.nullable`, `DataFrame.spark.schema`, `DataFrame.spark.print_schema`, `DataFrame.pandas_on_spark.attach_id_column`, `DataFrame.spark.checkpoint`, `DataFrame.spark.localcheckpoint` and `DataFrame.spark.explain` is removed in the documentation.

### How was this patch tested?

Manually build the documents.

Closes #33458 from itholic/SPARK-36239.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 86471ad668)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 19:46:49 +09:00
itholic 156ed24c52 [SPARK-35810][PYTHON][FOLLWUP] Deprecate ps.broadcast API
### What changes were proposed in this pull request?

This PR follows up #33379 to fix build error in Sphinx

### Why are the changes needed?

The Sphinx build is failed with missing newline in docstring

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually test the Sphinx build

Closes #33479 from itholic/SPARK-35810-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d1a037a27c)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:10:14 +09:00
itholic 3a18864c5f [SPARK-35809][PYTHON] Add index_col argument for ps.sql
### What changes were proposed in this pull request?

This PR proposes adding an argument `index_col` for `ps.sql` function, to preserve the index when users want.

NOTE that the `reset_index()` have to be performed before using `ps.sql` with `index_col`.

```python
>>> psdf
   A  B
a  1  4
b  2  5
c  3  6
>>> psdf_reset_index = psdf.reset_index()
>>> ps.sql("SELECT * from {psdf_reset_index} WHERE A > 1", index_col="index")
       A  B
index
b      2  5
c      3  6
```

Otherwise, the index is always lost.

```python
>>> ps.sql("SELECT * from {psdf} WHERE A > 1")
   A  B
0  2  5
1  3  6
```

### Why are the changes needed?

Index is one of the key object for the existing pandas users, so we should provide the way to keep the index after computing the `ps.sql`.

### Does this PR introduce _any_ user-facing change?

Yes, the new argument is added.

### How was this patch tested?

Add a unit test and manually check the build pass.

Closes #33450 from itholic/SPARK-35809.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 6578f0b135)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:08:42 +09:00
Takuya UESHIN 0e94e42cd3 [SPARK-36249][PYTHON] Add remove_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `remove_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `remove_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `remove_categories`.

### How was this patch tested?

Added some tests.

Closes #33474 from ueshin/issues/SPARK-36249/remove_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit a3c7ae18e2)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 17:06:25 +09:00
Takuya UESHIN f83a9ec2fd [SPARK-36214][PYTHON] Add add_categories to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `add_categories` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `add_categories` in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `add_categories`.

### How was this patch tested?

Added some tests.

Closes #33470 from ueshin/issues/SPARK-36214/add_categories.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit dcc0aaa3ef)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-21 22:34:15 -07:00
Hyukjin Kwon c42866e627 [SPARK-36253][PYTHON][DOCS] Add versionadded to the top of pandas-on-Spark package
### What changes were proposed in this pull request?

This PR adds the version that added pandas API on Spark in PySpark documentation.

### Why are the changes needed?

To document the version added.

### Does this PR introduce _any_ user-facing change?

No to end user. Spark 3.2 is not released yet.

### How was this patch tested?

Linter and documentation build.

Closes #33473 from HyukjinKwon/SPARK-36253.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit f3e29574d9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-22 14:21:53 +09:00
Takuya UESHIN 24095bfb07 [SPARK-36188][PYTHON] Add categories setter to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add categories setter to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement categories setter in `CategoricalAccessor` and `CategoricalIndex`.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use categories setter.

### How was this patch tested?

Added some tests.

Closes #33448 from ueshin/issues/SPARK-36188/categories_setter.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit d506815a92)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-21 11:31:46 -07:00
Takuya UESHIN a3a13da26c [SPARK-36186][PYTHON] Add as_ordered/as_unordered to CategoricalAccessor and CategoricalIndex
### What changes were proposed in this pull request?

Add `as_ordered`/`as_unordered` to `CategoricalAccessor` and `CategoricalIndex`.

### Why are the changes needed?

We should implement `as_ordered`/`as_unordered` in `CategoricalAccessor` and `CategoricalIndex` yet.

### Does this PR introduce _any_ user-facing change?

Yes, users will be able to use `as_ordered`/`as_unordered`.

### How was this patch tested?

Added some tests.

Closes #33400 from ueshin/issues/SPARK-36186/as_ordered_unordered.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 376fadc89c)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-20 18:24:09 -07:00
Hyukjin Kwon 9d461501b9 [SPARK-36216][PYTHON][TESTS] Increase timeout for StreamingLinearRegressionWithTests. test_parameter_convergence
### What changes were proposed in this pull request?

Test is flaky (https://github.com/apache/spark/runs/3109815586):

```
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 391, in test_parameter_convergence
    eventually(condition, catch_assertions=True)
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 91, in eventually
    raise lastValue
  File "/__w/spark/spark/python/pyspark/testing/utils.py", line 82, in eventually
    lastValue = condition()
  File "/__w/spark/spark/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 387, in condition
    self.assertEqual(len(model_weights), len(batches))
AssertionError: 9 != 10
```

Should probably increase timeout

### Why are the changes needed?

To avoid flakiness in the test.

### Does this PR introduce _any_ user-facing change?

Nope, dev-only.

### How was this patch tested?

CI should test it out.

Closes #33427 from HyukjinKwon/SPARK-36216.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit d6b974f8ce)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 13:17:13 +09:00
Takuya UESHIN 55c9dbd4d2 [SPARK-36167][PYTHON][3.2] Revisit more InternalField managements
### What changes were proposed in this pull request?

This is a backport of #33377.

Revisit and manage `InternalField` in more places.

### Why are the changes needed?

There are other places we can manage `InternalField`, and we can keep extension dtypes or `CategoricalDtype`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some tests.

Closes #33384 from ueshin/issues/SPARK-36167/3.2/internal_field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-20 09:30:35 +09:00
Xinrong Meng 48fadee158 [SPARK-36127][PYTHON] Support comparison between a Categorical and a scalar
### What changes were proposed in this pull request?
Support comparison between a Categorical and a scalar.
There are 3 main changes:
- Modify `==` and `!=` from comparing **codes** of the Categorical to the scalar to comparing **actual values** of the Categorical to the scalar.
- Support `<`, `<=`, `>`, `>=` between a Categorical and a scalar.
- TypeError message fix.

### Why are the changes needed?
pandas supports comparison between a Categorical and a scalar, we should follow pandas' behaviors.

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0     True
1    False
2    False
dtype: bool
>>> psser <= 1
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

After:
```py
>>> import pyspark.pandas as ps
>>> import pandas as pd
>>> from pandas.api.types import CategoricalDtype
>>> pser = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))
>>> psser = ps.from_pandas(pser)
>>> psser == 2
0    False
1     True
2    False
dtype: bool
>>> psser <= 1
0    True
1    True
2    True
dtype: bool

```

### How was this patch tested?
Unit tests.

Closes #33373 from xinrong-databricks/categorical_eq.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 8dd43351d5)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-19 15:06:56 -07:00
itholic 8d58211b9d [SPARK-35806][PYTHON] Mapping the mode argument to pandas in DataFrame.to_csv
### What changes were proposed in this pull request?

The `DataFrame.to_csv` has `mode` arguments both in pandas and pandas API on Spark.

However, pandas allows the string "w", "w+", "a", "a+" where as pandas-on-Spark allows "append", "overwrite", "ignore", "error" or "errorifexists".

We should map them while `mode` can still accept the existing parameters("append", "overwrite", "ignore", "error" or "errorifexists") as well.

### Why are the changes needed?

APIs in pandas-on-Spark should follows the behavior of pandas for preventing the existing pandas code break.

### Does this PR introduce _any_ user-facing change?

`DataFrame.to_csv` now can accept "w", "w+", "a", "a+" as well, same as pandas.

### How was this patch tested?

Add the unit test and manually write the file with the new acceptable strings.

Closes #33414 from itholic/SPARK-35806.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 2f42afc53a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:58:19 +09:00
Dominik Gehl b80ceb552d [SPARK-36181][PYTHON] Update pyspark sql readwriter documentation
### What changes were proposed in this pull request?
Updating the pyspark sql readwriter documentation to the level of detail provided by the scala documentation

### Why are the changes needed?
documentation clarity

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33394 from dominikgehl/feature/SPARK-36181.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 2ef8ced27a)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:51:24 +09:00
Dominik Gehl e7a210e5ed [SPARK-36178][PYTHON] List pyspark.sql.catalog APIs in documentation
### What changes were proposed in this pull request?
The pyspark.sql.catalog APIs were missing from the documentation. PR fixes this omission.

### Why are the changes needed?
Documentation consistency

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Documentation change only.

Closes #33392 from dominikgehl/feature/SPARK-36178.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fe4db74da4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 19:49:22 +09:00
itholic 80a9644372 [SPARK-35810][PYTHON] Deprecate ps.broadcast API
### What changes were proposed in this pull request?

The `broadcast` functions in `pyspark.pandas` is duplicated to `DataFrame.spark.hint` with `"broadcast"`.

```python
# The below 2 lines are the same
df.spark.hint("broadcast")
ps.broadcast(df)
```

So, we should remove `broadcast` in the future, and show deprecation warning for now.

### Why are the changes needed?

For deduplication of functions

### Does this PR introduce _any_ user-facing change?

They see the deprecation warning when using `broadcast` in `pyspark.pandas`.

```python
>>> ps.broadcast(df)
FutureWarning: `broadcast` has been deprecated and will be removed in a future version. use `DataFrame.spark.hint` with 'broadcast' for `name` parameter instead.
  warnings.warn(
```

### How was this patch tested?

Manually check the warning message and see the build passed.

Closes #33379 from itholic/SPARK-35810.

Lead-authored-by: itholic <haejoon.lee@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Co-authored-by: Haejoon Lee <44108233+itholic@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 67e6120a85)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-19 10:45:16 +09:00
Dominik Gehl d69861ab6f [SPARK-36160][PYTHON][DOCS] Clarifying documentation for pyspark sql/column
Adapting documentation of `between`, `getField`, `dropFields` and `cast` to the corresponding scala doc

Documentation clarity

No.

Only documentation change

Closes #33369 from dominikgehl/feature/SPARK-36160.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 2d8d7b4aae)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-16 21:41:57 +09:00
Jungtaek Lim 4bfcdf38cf [SPARK-34893][SS] Support session window natively
Introduction: this PR is the last part of SPARK-10816 (EventTime based sessionization (session window)). Please refer #31937 to see the overall view of the code change. (Note that code diff could be diverged a bit.)

### What changes were proposed in this pull request?

This PR proposes to support native session window. Please refer the comments/design doc in SPARK-10816 for more details on the rationalization and design (could be outdated a bit compared to the PR).

The definition of the boundary of "session window" is [the timestamp of start event ~ the timestamp of last event + gap duration). That said, unlike time window, session window is a dynamic window which can expand if new input row is added to the session. To handle expansion of session window, Spark defines session window per input row, and "merge" windows if they can be merged (boundaries are overlapped).

This PR leverages two different approaches on merging session windows:

1. merging session windows with Spark's aggregation logic (a variant of sort aggregation)
2. updating session window for all rows bound to the same session, and applying aggregation logic afterwards

First one is preferable as it outperforms compared to the second one, though it can be only used if merging session window can be applied altogether with aggregation. It is not applicable on all the cases, so second one is used to cover the remaining cases.

This PR also applies the optimization on merging input rows and existing sessions with retaining the order (group keys + start timestamp of session window), leveraging the fact the number of existing sessions per group key won't be huge.

The state format is versioned, so that we can bring a new state format if we find a better one.

### Why are the changes needed?

For now, to deal with sessionization, Spark requires end users to play with (flat)MapGroupsWithState directly which has a couple of major drawbacks:

1. (flat)MapGroupsWithState is lower level API and end users have to code everything in details for defining session window and merging windows
2. built-in aggregate functions cannot be used and end users have to deal with aggregation by themselves
3. (flat)MapGroupsWithState is only available in Scala/Java.

With native support of session window, end users simply use "session_window" like they use "window" for tumbling/sliding window, and leverage built-in aggregate functions as well as UDAFs to simply define aggregations.

Quoting the query example from test suite:

```
    val inputData = MemoryStream[(String, Long)]

    // Split the lines into words, treat words as sessionId of events
    val events = inputData.toDF()
      .select($"_1".as("value"), $"_2".as("timestamp"))
      .withColumn("eventTime", $"timestamp".cast("timestamp"))
      .selectExpr("explode(split(value, ' ')) AS sessionId", "eventTime")
      .withWatermark("eventTime", "30 seconds")

    val sessionUpdates = events
      .groupBy(session_window($"eventTime", "10 seconds") as 'session, 'sessionId)
      .agg(count("*").as("numEvents"))
      .selectExpr("sessionId", "CAST(session.start AS LONG)", "CAST(session.end AS LONG)",
        "CAST(session.end AS LONG) - CAST(session.start AS LONG) AS durationMs",
        "numEvents")
```

which is same as StructuredSessionization (native session window is shorter and clearer even ignoring model classes).

39542bb81f/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredSessionization.scala (L66-L105)

(Worth noting that the code in StructuredSessionization only works with processing time. The code doesn't consider old event can update the start time of old session.)

### Does this PR introduce _any_ user-facing change?

Yes. This PR brings the new feature to support session window on both batch and streaming query, which adds a new function "session_window" which usage is similar with "window".

### How was this patch tested?

New test suites. Also tested with benchmark code.

Closes #33081 from HeartSaVioR/SPARK-34893-SPARK-10816-PR-31570-part-5.

Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
(cherry picked from commit f2bf8b051b)
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-07-16 20:38:35 +09:00
Hyukjin Kwon e9cc81151d [SPARK-36146][PYTHON][INFRA][TESTS] Upgrade Python version from 3.6 to 3.9 in GitHub Actions' linter/docs
This PR proposes to use Python 3.9 in documentation and linter at GitHub Actions. This PR also contains the fixes for mypy check (introduced by Python 3.9 upgrade)

```
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:64: error: Name "np.ndarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:91: error: Name "np.recarray" is not defined
python/pyspark/sql/pandas/_typing/protocols/frame.pyi:165: error: Name "np.ndarray" is not defined
python/pyspark/pandas/categorical.py:82: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/categorical.py:109: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/ml/linalg/__init__.pyi:184: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/ml/linalg/__init__.pyi:217: error: Return type "ndarray[Any, Any]" of "toArray" incompatible with return type "NoReturn" in supertype "Matrix"
python/pyspark/pandas/typedef/typehints.py:163: error: Module has no attribute "bool"; maybe "bool_" or "bool8"?
python/pyspark/pandas/typedef/typehints.py:174: error: Module has no attribute "float"; maybe "float_", "cfloat", or "float96"?
python/pyspark/pandas/typedef/typehints.py:180: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/ml.py:81: error: Value of type variable "_DTypeScalar_co" of "dtype" cannot be "object"
python/pyspark/pandas/indexing.py:1649: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/indexing.py:1656: error: Module has no attribute "int"; maybe "uint", "rint", or "intp"?
python/pyspark/pandas/frame.py:4969: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4969: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:4970: error: Function "numpy.array" is not valid as a type
python/pyspark/pandas/frame.py:4970: note: Perhaps you need "Callable[...]" or a callback protocol?
python/pyspark/pandas/frame.py:7402: error: "List[Any]" has no attribute "tolist"
python/pyspark/pandas/series.py:1030: error: Module has no attribute "_NoValue"
python/pyspark/pandas/series.py:1031: error: Module has no attribute "_NoValue"
python/pyspark/pandas/indexes/category.py:159: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/indexes/category.py:180: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/namespace.py:2036: error: Argument 1 to "column_name" has incompatible type "float"; expected "str"
python/pyspark/pandas/mlflow.py:59: error: Incompatible types in assignment (expression has type "Type[floating[Any]]", variable has type "str")
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/data_type_ops/categorical_ops.py:43: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "ordered"
python/pyspark/pandas/data_type_ops/categorical_ops.py:56: error: Item "dtype[Any]" of "Union[dtype[Any], Any]" has no attribute "categories"
python/pyspark/pandas/tests/test_typedef.py:70: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:77: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:85: error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py💯 error: Name "np.float" is not defined
python/pyspark/pandas/tests/test_typedef.py:108: error: Name "np.float" is not defined
python/pyspark/mllib/clustering.pyi:152: error: Incompatible types in assignment (expression has type "ndarray[Any, Any]", base class "KMeansModel" defined the type as "List[ndarray[Any, Any]]")
python/pyspark/mllib/classification.pyi:93: error: Signature of "predict" incompatible with supertype "LinearClassificationModel"
Found 32 errors in 15 files (checked 315 source files)
1
```

Python 3.6 is deprecated at SPARK-35938

No. Maybe static analysis, etc. by some type hints but they are really non-breaking..

I manually checked by GitHub Actions build in forked repository.

Closes #33356 from HyukjinKwon/SPARK-36146.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
(cherry picked from commit a71dd6af2f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-16 11:41:53 +09:00
Dominik Gehl 3a09024636 [SPARK-36154][DOCS] Documenting week and quarter as valid formats in pyspark sql/functions trunc
### What changes were proposed in this pull request?
Added missing documentation of week and quarter as valid formats to pyspark sql/functions trunc

### Why are the changes needed?
Pyspark documentation and scala documentation didn't mentioned the same supported formats

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Only documentation change

Closes #33359 from dominikgehl/feature/SPARK-36154.

Authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 802f632a28)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-07-15 16:51:25 +03:00
Xinrong Meng ca8a3f2e23 [SPARK-36125][PYTHON] Implement non-equality comparison operators between two Categoricals
### What changes were proposed in this pull request?
Implement non-equality comparison operators between two Categoricals.
Non-goal: supporting Scalar input will be a follow-up task.

### Why are the changes needed?
pandas supports non-equality comparisons between two Categoricals. We should follow that.

### Does this PR introduce _any_ user-facing change?
Yes. No `NotImplementedError` for `<`, `<=`, `>`, `>=` operators between two Categoricals. An example is shown as below:

From:
```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> psser = ps.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> other_psser = ps.Series([2, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     psser <= other_psser
...
Traceback (most recent call last):
...
NotImplementedError: <= can not be applied to categoricals.
```

To:
```py
>>> import pyspark.pandas as ps
>>> from pandas.api.types import CategoricalDtype
>>> psser = ps.Series([1, 2, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> other_psser = ps.Series([2, 1, 3]).astype(CategoricalDtype([3, 2, 1], ordered=True))
>>> with ps.option_context("compute.ops_on_diff_frames", True):
...     psser <= other_psser
...
0    False
1     True
2     True
dtype: bool
```
### How was this patch tested?
Unit tests.

Closes #33331 from xinrong-databricks/categorical_compare.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 0cb120f390)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-14 14:01:23 -07:00
Kousuke Saruta 9e802f25aa [SPARK-36104][PYTHON][FOLLOWUP] Remove unused import "typing.cast"
### What changes were proposed in this pull request?

This is a followup PR for SPARK-36104 (#33307) and removes unused import `typing.cast`.
After that change, Python linter fails.
```
   ./dev/lint-python
  shell: sh -e {0}
  env:
    LC_ALL: C.UTF-8
    LANG: C.UTF-8
    pythonLocation: /__t/Python/3.6.13/x64
    LD_LIBRARY_PATH: /__t/Python/3.6.13/x64/lib
starting python compilation test...
python compilation succeeded.

starting black test...
black checks passed.

starting pycodestyle test...
pycodestyle checks passed.

starting flake8 test...
flake8 checks failed:
./python/pyspark/pandas/data_type_ops/num_ops.py:19:1: F401 'typing.cast' imported but unused
from typing import cast, Any, Union
^
1     F401 'typing.cast' imported but unused
```

### Why are the changes needed?

To recover CI.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CI.

Closes #33315 from sarutak/followup-SPARK-36104.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 47fd3173a5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 13:13:44 +09:00
Xinrong Meng eae79dd31b [SPARK-36104][PYTHON] Manage InternalField in DataTypeOps.neg/abs
### What changes were proposed in this pull request?
Manage InternalField for DataTypeOps.neg/abs.

### Why are the changes needed?
The spark data type and nullability must be the same as the original when DataTypeOps.neg/abs.
We should manage InternalField for this case.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33307 from xinrong-databricks/internalField.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 5afc27f899)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 12:07:14 +09:00
Takuya UESHIN 8d9758ee46 [SPARK-36103][PYTHON] Manage InternalField in DataTypeOps.invert
### What changes were proposed in this pull request?

Properly set `InternalField` for `DataTypeOps.invert`.

### Why are the changes needed?

The spark data type and nullability must be the same as the original when `DataTypeOps.invert`.
We should manage `InternalField` for this case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33306 from ueshin/issues/SPARK-36103/invert.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e2021daafb)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-13 09:22:37 +09:00
Xinrong Meng 606a99c01e [SPARK-36003][PYTHON] Implement unary operator invert of integral ps.Series/Index
### What changes were proposed in this pull request?
Implement unary operator `invert` of integral ps.Series/Index.

### Why are the changes needed?
Currently, unary operator `invert` of integral ps.Series/Index is not supported. We ought to implement that following pandas' behaviors.

### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> ~psser
Traceback (most recent call last):
...
NotImplementedError: Unary ~ can not be applied to integrals.
```

After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> ~psser
0   -2
1   -3
2   -4
dtype: int64
```

### How was this patch tested?
Unit tests.

Closes #33285 from xinrong-databricks/numeric_invert.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit badb0393d4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-12 15:10:37 +09:00
Takuya UESHIN 455c8922e2 [SPARK-36064][PYTHON] Manage InternalField more in DataTypeOps
### What changes were proposed in this pull request?

Properly set `InternalField` more in `DataTypeOps`.

### Why are the changes needed?

There are more places in `DataTypeOps` where we can manage `InternalField`.
We should manage `InternalField` for these cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33275 from ueshin/issues/SPARK-36064/fields.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 95e6c6e3e9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-12 11:55:20 +09:00
Xinrong Meng 862178b2a0 [SPARK-36035][PYTHON] Adjust test_astype, test_neg for old pandas versions
### What changes were proposed in this pull request?
Adjust `test_astype`, `test_neg`  for old pandas versions.

### Why are the changes needed?
There are issues in old pandas versions that fail tests in pandas API on Spark. We ought to adjust `test_astype` and `test_neg` for old pandas versions.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests. Please refer to https://github.com/apache/spark/pull/33272 for test results with pandas 1.0.1.

Closes #33250 from xinrong-databricks/SPARK-36035.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 698c4ec16b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 17:24:33 +09:00
Yikun Jiang fd277dc036 [SPARK-36002][PYTHON] Consolidate tests for data-type-based operations of decimal Series
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops

- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)

### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:

- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py

We'd better merge test_decimal_ops into test_num_ops.

See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
unittests passed

Closes #33206 from Yikun/SPARK-36002.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fdc50f4452)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 14:08:23 +09:00
Xinrong Meng cb9bd5f455 [SPARK-36001][PYTHON] Assume result's index to be disordered in tests with operations on different Series
### What changes were proposed in this pull request?
For tests with operations on different Series, sort index of results before comparing them with pandas.

### Why are the changes needed?
We have many tests with operations on different Series in `spark/python/pyspark/pandas/tests/data_type_ops/` that assume the result's index to be sorted and then compare to the pandas' behavior.

The assumption on the result's index ordering is wrong since Spark DataFrame join is used internally and the order is not preserved if the data being in different partitions.

So we should assume the result to be disordered and sort the index of such results before comparing them with pandas.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.

Closes #33274 from xinrong-databricks/datatypeops_testdiffframe.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit af81ad0d7e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 12:42:58 +09:00
Takuya UESHIN 55111cafd1 [SPARK-36062][PYTHON] Try to capture faulthanlder when a Python worker crashes
### What changes were proposed in this pull request?

Try to capture the error message from the `faulthandler` when the Python worker crashes.

### Why are the changes needed?

Currently, we just see an error message saying `"exited unexpectedly (crashed)"` when the UDFs causes the Python worker to crash by like segmentation fault.
We should take advantage of [`faulthandler`](https://docs.python.org/3/library/faulthandler.html) and try to capture the error message from the `faulthandler`.

### Does this PR introduce _any_ user-facing change?

Yes, when a Spark config `spark.python.worker.faulthandler.enabled` is `true`, the stack trace will be seen in the error message when the Python worker crashes.

```py
>>> def f():
...   import ctypes
...   ctypes.string_at(0)
...
>>> sc.parallelize([1]).map(lambda x: f()).count()
```

```
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault

Current thread 0x000000010965b5c0 (most recent call first):
  File "/.../ctypes/__init__.py", line 525 in string_at
  File "<stdin>", line 3 in f
  File "<stdin>", line 1 in <lambda>
...
```

### How was this patch tested?

Added some tests, and manually.

Closes #33273 from ueshin/issues/SPARK-36062/faulthandler.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 115b8a180f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-09 11:31:00 +09:00
Xinrong Meng b0cd00b062 [SPARK-35340][PYTHON] Standardize TypeError messages for unsupported basic operations
### What changes were proposed in this pull request?
The PR is proposed to standardize TypeError messages for unsupported basic operations by:
- Capitalize the first letter
- Leverage TypeError messages defined in `pyspark/pandas/data_type_ops/base.py`
- Take advantage of the utility `is_valid_operand_for_numeric_arithmetic` to save duplicated TypeError messages

Related unit tests should be adjusted as well.

### Why are the changes needed?
Inconsistent TypeError messages are shown for unsupported data-type-based basic operations.

Take addition's TypeError messages for example:
- addition can not be applied to given types.
- string addition can only be applied to string series or literals.

Standardizing TypeError messages would improve user experience and reduce maintenance costs.

### Does this PR introduce _any_ user-facing change?
No user-facing behavior change. Only TypeError messages are modified.

### How was this patch tested?

Unit tests.

Closes #33237 from xinrong-databricks/datatypeops_err.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 819c482498)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-08 12:28:00 -07:00
Xinrong Meng 61bfdf0c03 [SPARK-35615][PYTHON] Make unary and comparison operators data-type-based
### What changes were proposed in this pull request?
Make unary and comparison operators data-type-based. Refactored operators include:
- Unary operators: `__neg__`, `__abs__`, `__invert__`,
- Comparison operators: `>`, `>=`, `<`, `<=`, `==`, `!=`

Non-goal: Tasks below are inspired during the development of this PR.
[[SPARK-35997] Implement comparison operators for CategoricalDtype in pandas API on Spark](https://issues.apache.org/jira/browse/SPARK-35997)
[[SPARK-36000] Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled](https://issues.apache.org/jira/browse/SPARK-36000)
[[SPARK-36001] Assume result's index to be disordered in tests with operations on different Series](https://issues.apache.org/jira/browse/SPARK-36001)
[[SPARK-36002] Consolidate tests for data-type-based operations of decimal Series](https://issues.apache.org/jira/browse/SPARK-36002)
[[SPARK-36003] Implement unary operator `invert` of numeric ps.Series/Index](https://issues.apache.org/jira/browse/SPARK-36003)

### Why are the changes needed?

We have been refactoring basic operators to be data-type-based for readability, flexibility, and extensibility.
Unary and comparison operators are still not data-type-based yet. We should fill the gaps.

### Does this PR introduce _any_ user-facing change?

Yes.

- Better error messages. For example,

Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```
After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
TypeError: Unary - can not be applied to binaries.
>>>
```
- Support unary `-` of `bool` Series. For example,

Before:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```

After:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
0    False
1     True
2    False
dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #33162 from xinrong-databricks/datatypeops_refactor.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 6e4e04f2a1)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-07 13:47:04 -07:00
Hyukjin Kwon 9cf1db33c7 [SPARK-35684][INFRA][PYTHON] Bump up mypy version in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to bump up the mypy version to 0.910 which is the latest.

### Why are the changes needed?

To catch the type hint mistakes better in PySpark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

GitHub Actions should test it out.

Closes #33223 from HyukjinKwon/SPARK-35684.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 16c195ccfb)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-07 13:26:41 +09:00
Tomas Pereira de Vasconcelos e8991266c8 [SPARK-35986][PYSPARK] Fix type hint for RDD.histogram's buckets
Fix the type hint for `pyspark.rdd .RDD.histogram`'s `buckets` argument

The current type hint is incomplete.
![image](https://user-images.githubusercontent.com/17701527/124248180-df7fd580-db22-11eb-8391-ba0bb51d689b.png)
From `pyspark.rdd .RDD.histogram`'s source:
```python
if isinstance(buckets, int):
    ...
elif isinstance(buckets, (list, tuple)):
    ...
else:
    raise TypeError("buckets should be a list or tuple or number(int or long)")
```

Fixed the warning displayed above.

Fixed warning above with this change.

Closes #33185 from tpvasconcelos/master.

Authored-by: Tomas Pereira de Vasconcelos <tomasvasconcelos1@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 495d234c6e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-04 10:24:55 +09:00
Takuya UESHIN fcc9e66c9b [SPARK-35981][PYTHON][TEST][3.2] Use check_exact=False to loosen the check precision
### What changes were proposed in this pull request?

This is a cherry-pick of #33179.

We should use `check_exact=False` because the value check in `StatsTest.test_cov_corr_meta` is too strict.

### Why are the changes needed?

In some environment, the precision could be different in pandas' `DataFrame.corr` function and the test `StatsTest.test_cov_corr_meta` fails.

```
AssertionError: DataFrame.iloc[:, 0] (column name="a") are different
DataFrame.iloc[:, 0] (column name="a") values are different (14.28571 %)
[index]: [a, b, c, d, e, f, g]
[left]:  [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
[right]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.807406715958909e-17]
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified tests should still pass.

Closes #33193 from ueshin/issuse/SPARK-35981/3.2/corr.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-07-02 14:08:50 -07:00
Wenchen Fan c1d8178817 [SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing
### What changes were proposed in this pull request?

By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions.

However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless.

This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.

### Why are the changes needed?

AQE is default on now, we should make the perf better in the default case.

### Does this PR introduce _any_ user-facing change?

yes, a new config.

### How was this patch tested?

new tests

Closes #33172 from cloud-fan/aqe2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0c9c8ff569)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-07-02 16:07:46 +08:00
Xinrong Meng 95d94948c5 [SPARK-35339][PYTHON] Improve unit tests for data-type-based basic operations
### What changes were proposed in this pull request?

Improve unit tests for data-type-based basic operations by:
- removing redundant test cases
- adding `astype` test for ExtensionDtypes

### Why are the changes needed?

Some test cases for basic operations are duplicated after introducing data-type-based basic operations. The PR is proposed to remove redundant test cases.
`astype` is not tested for ExtensionDtypes, which will be adjusted in this PR as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #33095 from xinrong-databricks/datatypeops_test.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-01 17:37:32 -07:00
Takuya UESHIN a98c8ae57d [SPARK-35944][PYTHON] Introduce Name and Label type aliases
### What changes were proposed in this pull request?

Introduce `Name` and `Label` type aliases to distinguish what is expected instead of `Any` or `Union[Any, Tuple]`.

- `Label`: `Tuple[Any, ...]`
  Internal expression for name-like metadata, like `index_names`, `column_labels`, and `column_label_names` in `InternalFrame`, and similar internal structures.
- `Name`: `Union[Any, Label]`
  External expression for user-facing names, which can be scalar values or tuples.

### Why are the changes needed?

Currently `Any` or `Union[Any, Tuple]` is used for name-like types, but type aliases should be used to distinguish what is expected clearly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33159 from ueshin/issues/SPARK-35944/name_and_label.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-01 09:40:07 +09:00
Xinrong Meng 5ad12611ec [SPARK-35938][PYTHON] Add deprecation warning for Python 3.6
### What changes were proposed in this pull request?

Add deprecation warning for Python 3.6.

### Why are the changes needed?

According to https://endoflife.date/python, Python 3.6 will be EOL on 23 Dec, 2021.
We should prepare for the deprecation of Python 3.6 support in Spark in advance.

### Does this PR introduce _any_ user-facing change?

N/A.

### How was this patch tested?

Manual tests.

Closes #33139 from xinrong-databricks/deprecate3.6_warn.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-01 09:32:25 +09:00
Hyukjin Kwon 8d28839689 [SPARK-35946][PYTHON] Respect Py4J server in InheritableThread API
### What changes were proposed in this pull request?

Currently ,we sets the environment variable `PYSPARK_PIN_THREAD` at the client side of `InhertiableThread` API for Py4J (`python/pyspark/util.py`). If the Py4J gateway is created somewhere else (e.g., Zeppelin, etc), it could introduce a breakage at:

```python
from pyspark import SparkContext
jvm = SparkContext._jvm
thread_connection = jvm._gateway_client.get_thread_connection()
# `AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'` (non-pinned thread mode)
# `get_thread_connection` is only in 'ClientServer' (pinned thread mode)
```

This PR proposes to check the given gateway created, and do the pinned thread mode behaviour accordingly so we can avoid any breakage when Py4J server/gateway is created separately from somewhere else without a pinned thread mode.

### Why are the changes needed?

To avoid any potential breakage.

### Does this PR introduce _any_ user-facing change?

No, the change happened only in the master (fdd7ca5f4e).

### How was this patch tested?

This is actually a partial revert of fdd7ca5f4e. As long as the existing tests pass, I guess we're all good.

I also manually tested to make doubly sure:

**Before**:

```python
>>> from pyspark import InheritableThread, inheritable_thread_target
>>> InheritableThread(lambda: 1).start()
>>> inheritable_thread_target(lambda: 1)()
Traceback (most recent call last):
  File "/.../python3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/.../python3.8/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/.../spark/python/pyspark/util.py", line 361, in copy_local_properties
    InheritableThread._clean_py4j_conn_for_current_thread()
  File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread
    thread_connection = jvm._gateway_client.get_thread_connection()
AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/util.py", line 324, in wrapped
    InheritableThread._clean_py4j_conn_for_current_thread()
  File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread
    thread_connection = jvm._gateway_client.get_thread_connection()
AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'
```

**After**:

```python
>>> from pyspark import InheritableThread, inheritable_thread_target
>>> InheritableThread(lambda: 1).start()
>>> inheritable_thread_target(lambda: 1)()
1
```

Closes #33147 from HyukjinKwon/SPARK-35946.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-29 22:18:54 -07:00
Takuya UESHIN 0a838dcd71 [SPARK-35943][PYTHON] Introduce Axis type alias
### What changes were proposed in this pull request?

Introduces `Axis` type alias for `axis` argument to be consistent.

### Why are the changes needed?

There are many places to use `axis` argument. We should define `Axis` type alias and reuse it to be consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33144 from ueshin/issues/SPARK-35943/axis.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 10:46:59 +09:00
itholic 28a201a442 [SPARK-35873][PYTHON] Cleanup the version logic from the pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes removing the legacy Koalas version from pandas API on Spark package.

And also remove the Python version check logic since now pandas-on-Spark should follow the PySpark's Python version.

### Why are the changes needed?

Since Koalas is ported into PySpark, we don't need to keep the version logic for Koalas.

### Does this PR introduce _any_ user-facing change?

Now the legacy Koalas user should follow the version from PySpark.

### How was this patch tested?

Manually built the package and see it's successfully done.

Closes #33128 from itholic/SPARK-35873.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 10:01:51 +09:00
Takuya UESHIN 1f6e2f55d7 Revert "[SPARK-35721][PYTHON] Path level discover for python unittests"
This reverts commit 5db51efa1a.
2021-06-29 12:08:09 -07:00
Takuya UESHIN 2702fb9af0 [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark
### What changes were proposed in this pull request?

Cleaning up the type hints in pandas-on-Spark.

- Use a single file `_typing.py` for type variables or aliases
- Rename `IndexOpsLike` to `SeriesOrIndex`.
- Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively
- Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]`

### Why are the changes needed?

This is a cleanup for the mypy check stuff series.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33117 from ueshin/issues/SPARK-35859/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-29 10:52:24 -07:00
Yikun Jiang 5db51efa1a [SPARK-35721][PYTHON] Path level discover for python unittests
### What changes were proposed in this pull request?
Add path level discover for python unittests.

### Why are the changes needed?
Now we need to specify the python test cases by manually when we add a new testcase. Sometime, we forgot to add the testcase to module list, the testcase would not be executed.

Such as:
- pyspark-core pyspark.tests.test_pin_thread

Thus we need some auto-discover way to find all testcase rather than specified every case by manually.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKL

Closes #32867 from Yikun/SPARK_DISCOVER_TEST.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-29 17:56:13 +09:00
Xinrong Meng 5f0113e3a6 [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark
### What changes were proposed in this pull request?

The PR is proposed to support creating a Column of numpy literal value in pandas-on-Spark. It consists of three changes mainly:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.

```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method

Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161).
- To complete mappings between all kinds of numpy literals and Spark data types should be a followup task.

### Why are the changes needed?

Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value.
So `lit` function defined in `pyspark.pandas.spark.functions`  is adjusted in order to support that in pandas-on-Spark.

### Does this PR introduce _any_ user-facing change?

Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```

After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #32955 from xinrong-databricks/datatypeops_literal.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-28 19:03:42 -07:00
Takuya UESHIN 8c401beb80 [SPARK-35901][PYTHON] Refine type hints in pyspark.pandas.window
### What changes were proposed in this pull request?

Refines type hints in `pyspark.pandas.window`.

Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`.

### Why are the changes needed?

We can use more strict type hints for functions in pyspark.pandas.window using the generic way.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33097 from ueshin/issues/SPARK-35901/window.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 12:23:32 +09:00
itholic 03e6de2abe [SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame
### What changes were proposed in this pull request?

This PR proposes move `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and added the related tests to the PySpark DataFrame tests.

### Why are the changes needed?

Because now the Koalas is ported into PySpark, so we don't need to Spark auto-patch anymore.
And also `to_pandas_on_spark` is belongs to the pandas-on-Spark DataFrame doesn't look make sense.

### Does this PR introduce _any_ user-facing change?

No, it's kinda internal refactoring stuff.

### How was this patch tested?

Added the related tests and manually check they're passed.

Closes #33054 from itholic/SPARK-35605.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 11:47:09 +09:00
Takuya UESHIN a9ebfc5374 [SPARK-35466][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.data_type_ops.*
### What changes were proposed in this pull request?

Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-25 18:16:25 -07:00
Takuya UESHIN 6497ac3585 [SPARK-35471][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.frame
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/frame.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33073 from ueshin/issues/SPARK-35471/disallow_untyped_defs_frame.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-25 14:41:58 +09:00
Takuya UESHIN cfcfbca965 [SPARK-35476][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.series
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/series.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33045 from ueshin/issues/SPARK-35476/disallow_untyped_defs_series.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 19:32:33 +09:00
Hyukjin Kwon 5a7686a393 [SPARK-35301][PYTHON][DOCS] Document migration guide from Koalas to pandas APIs on Spark
### What changes were proposed in this pull request?

This PR proposes to add a migration guide for legacy Koalas users in pandas API on Spark.

### Why are the changes needed?

For easier migration.

### Does this PR introduce _any_ user-facing change?

Yes, this adds a new page for migration from Koalas.

### How was this patch tested?

Manually built the docs and checked manually.

Closes #33050 from HyukjinKwon/SPARK-35301.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 17:58:09 +09:00
itholic 92ddef7cfb [SPARK-35696][PYTHON][DOCS][FOLLOW-UP] Fix underline for title in FAQ to remove warnings
### What changes were proposed in this pull request?

This PR follow-up for SPARK-35696 to fix incorrect underline in the documents to remove warnings.

### Why are the changes needed?

We should build the docs without any incorrect documentation style

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually build docs and see the warning is removed

Closes #33052 from itholic/SPARK-35696-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 15:20:13 +09:00
itholic 712ed87faa [SPARK-35696][PYTHON][DOCS] Refine the code examples in pandas-on-Spark documentation
### What changes were proposed in this pull request?

This PR proposes to refine the code examples for pandas-on-Spark since some of them still follows the naming for Koalas.

For example,

```python
kdf = ks.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
```

should be refined to

```python
psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
```

Also fixed the several remaining Koalas stuffs in FAQ

### Why are the changes needed?

Because we don't want to use the name "Koalas" in the Apache Spark anymore.

### Does this PR introduce _any_ user-facing change?

Yes, the examples in the documentation will be changed with refined names.

### How was this patch tested?

Manually built the docs and check one by one.

Closes #33017 from itholic/SPARK-35696.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 14:48:13 +09:00
Ruifeng Zheng 37f70422b5 [SPARK-35678][ML][FOLLOWUP] Revert changes in ANN
### What changes were proposed in this pull request?
revert changes related to ANN

### Why are the changes needed?
using the new `softmax` may cause flaky failure

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
reverted testsuite

Closes #33049 from zhengruifeng/revert_softmax_ann.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 14:02:28 +09:00
Ruifeng Zheng a66738823c [SPARK-35678][ML][FOLLOWUP] softmax support offset and step
### What changes were proposed in this pull request?
softmax support offset and step, then we can use it in ANN and NB

### Why are the changes needed?
to simplify impl

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing testsuite

Closes #32991 from zhengruifeng/softmax_support_offset_step.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
2021-06-23 21:03:18 -05:00
Hyukjin Kwon be9089731a [SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes to fix:
- the Binder integration of pandas API on Spark, and merge them together with the existing PySpark one.
- update quickstart of pandas API on Spark, and make it working

The notebooks can be easily reviewed here:

https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb

Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

- To show the working examples of quickstart to end users.
- To allow users to try out the examples without installation easily.

### Does this PR introduce _any_ user-facing change?

No to end users because the existing quickstart of pandas API on Spark is not released yet.

### How was this patch tested?

I manually tested it by uploading built Spark distribution to Binder. See 3bc15310a0

Closes #33041 from HyukjinKwon/SPARK-35588-2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 10:17:22 +09:00
Yikun Jiang 4824c53398 [SPARK-35812][PYTHON] Throw ValueError if version and timestamp are used together in to_delta
### What changes were proposed in this pull request?

Throw ValueError if version and timestamp are used together in to_delta

### Why are the changes needed?
read_delta has arguments named `version` and `timestamp`, but they cannot be used together.

We should raise the proper error message when they are used together.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #33023 from Yikun/SPARK-35812.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-23 19:04:45 +09:00
Takuya UESHIN 68b54b702c [SPARK-35473][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.groupby
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/groupby.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33032 from ueshin/issues/SPARK-35473/disallow_untyped_defs_groupby.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-23 09:51:33 +09:00
Takuya UESHIN c418803df7 [SPARK-35847][PYTHON] Manage InternalField in DataTypeOps.isnull
### What changes were proposed in this pull request?

Properly set `InternalField` for `DataTypeOps.isnull`.

### Why are the changes needed?

The result of `DataTypeOps.isnull` must always be non-nullable boolean.
We should manage `InternalField` for this case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some more tests.

Closes #33005 from ueshin/issues/SPARK-35847/isnull_field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-22 12:54:01 -07:00
Yikun Jiang 1c26433f1d [SPARK-35849][PYTHON] Make astype method data-type-based for DecimalOps
### What changes were proposed in this pull request?
Make DecimalOps astype data-type-based.

See more in:
https://github.com/apache/spark/pull/32821#issuecomment-861119905

### Why are the changes needed?
Make DecimalOps astype data-type-based.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test NumOpsTest.test_astype in pyspark/pandas/tests/data_type_ops/test_num_ops.py

Closes #33009 from Yikun/SPARK-35849.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-22 10:41:22 -07:00
Hyukjin Kwon 27046582e4 [SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section
### What changes were proposed in this pull request?

This PR revise the installation to describe `pip install pyspark[pandas_on_spark]` and removes pandas-on-Spark installation and videos/blogposts.

### Why are the changes needed?

pandas-on-Spark installation is merged to PySpark installation pages. For videos/blogposts, now this is named pandas API on Spark. Old Koalas blogposts and videos are obsolete.

### Does this PR introduce _any_ user-facing change?

To end users, no because the docs are not released yet.

### How was this patch tested?

I manually built the docs and checked the output

Closes #33018 from HyukjinKwon/SPARK-35645.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-22 09:36:27 -07:00
Takuya UESHIN a8fdb98ecb [SPARK-35470][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.base
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/base.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32968 from ueshin/issues/SPARK-35470/disallow_untyped_defs_base.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-22 11:25:16 +09:00
Xinrong Meng 6ca56b01dc [SPARK-35614][PYTHON] Make the conversion to pandas data-type-based for ExtensionDtypes
### What changes were proposed in this pull request?

We propose to
- introduce the Ops class for ExtensionDtypes: `IntegralExtensionOps`, `FractionalExtensionOps`, `StringExtensionOps`
- make the "conversion to pandas" data-type-based for ExtensionDtypes

Non-goal: same arithmetic operation of ExtensionDtypes have different result dtypes between pandas and pandas API on Spark. That should be adjusted in a separated PR if needed.

### Why are the changes needed?

The conversion to pandas includes logic for checking ExtensionDtypes data types and behaving accordingly.
That makes code hard to change or maintain.

Since we have DataTypeOps defined, we are able to dispatch the specific conversion logic to the `ExtensionOps` classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32910 from xinrong-databricks/datatypeops_pd_ext.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-21 13:19:55 -07:00
Hyukjin Kwon 248fda3ead [SPARK-35834][PYTHON] Use the same cleanup logic as Py4J in inheritable thread API
### What changes were proposed in this pull request?

This PR fixes the cleanup logic in inheritable thread API by following Py4J cleanup logic at https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278.

Currently the tests that use `inheritable_thread_target` are flaky (https://github.com/apache/spark/runs/2870944288):

```
======================================================================
ERROR [71.813s]: test_save_load_pipeline_estimator (pyspark.ml.tests.test_tuning.CrossValidatorTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, in test_save_load_pipeline_estimator
    self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
  File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, in _run_test_save_load_pipeline_estimator
    cvModel2 = crossval2.fit(training)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
    return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
    bestModel = est.fit(dataset, epm[bestIndex])
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
    return self.copy(params)._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
    model = stage.fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
    return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
    model = stage.fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
    return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in _fit
    models = pool.map(inheritable_thread_target(trainSingleClass), range(numClasses))
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
    InheritableThread._clean_py4j_conn_for_current_thread()
  File "/__w/spark/spark/python/pyspark/util.py", line 389, in _clean_py4j_conn_for_current_thread
    del connections[i]
IndexError: deque index out of range

----------------------------------------------------------------------
```

This seems to be because the connection deque `jvm._gateway_client.deque` is accessed, and modified by other threads. Therefore, the number of threads could be changed in the middle. Using `SparkContext._lock` doesn't protect because the deque can be updated for every Java instance access in Py4J.

This PR proposes to use the atomic `deque.remove` in the problematic dequeue alone with try-catch on `ValueError` in case it's [deleted by Py4J](https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278).

### Why are the changes needed?

To fix the flakiness in the tests, and avoid possible breakage in user application by using this API.

### Does this PR introduce _any_ user-facing change?

If users were dependent on InheritableThread with pinned thread mode on, they might have faced such issues intermittently. This PR fixes it.

### How was this patch tested?

Manually tested. CI should test it out too.

Closes #32989 from HyukjinKwon/SPARK-35834.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-21 12:00:16 +09:00
Kevin Su 653be9d774 [SPARK-35811][PYTHON] Deprecate DataFrame.to_spark_io
### What changes were proposed in this pull request?

Deprecate the `DataFrame.to_spark_io`

### Why are the changes needed?

We should deprecate the `DataFrame.to_spark_io` since it's duplicated with `DataFrame.spark.to_spark_io`, and it's not existed in pandas.

### Does this PR introduce _any_ user-facing change?

Yes, users will get warning while using `DataFrame.to_spark_io` api.

### How was this patch tested?

Pass the CIs

Closes #32964 from pingsutw/SPARK-35811.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-21 10:43:34 +09:00
Hyukjin Kwon 6d309914df [SPARK-35303][SPARK-35498][PYTHON][FOLLOW-UP] Copy local properties when starting the thread, and use inheritable thread in the current codebase
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/32429 and https://github.com/apache/spark/pull/32644.
I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together.

This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at https://github.com/apache/spark/pull/32429)
- Copy local properties when `start` at `InheritableThread` is called to mimic JVM behaviour. Previously it was copied when `InheritableThread` instance was created (related to #32644).
- https://github.com/apache/spark/pull/32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.

### Why are the changes needed?

To mimic the JVM behaviour about thread lifecycle

### Does this PR introduce _any_ user-facing change?

Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled.
In this case, the local properties will be copied when starting the thread instead of defining the `InheritableThread` object.
This is a small difference that wouldn't likely affect end users.

### How was this patch tested?

Existing tests should cover this.

Closes #32962 from HyukjinKwon/SPARK-35498-SPARK-35303.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-20 11:48:38 +09:00