Commit graph

79 commits

Author SHA1 Message Date
Xinrong Meng 95d94948c5 [SPARK-35339][PYTHON] Improve unit tests for data-type-based basic operations
### What changes were proposed in this pull request?

Improve unit tests for data-type-based basic operations by:
- removing redundant test cases
- adding `astype` test for ExtensionDtypes

### Why are the changes needed?

Some test cases for basic operations are duplicated after introducing data-type-based basic operations. The PR is proposed to remove redundant test cases.
`astype` is not tested for ExtensionDtypes, which will be adjusted in this PR as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #33095 from xinrong-databricks/datatypeops_test.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-07-01 17:37:32 -07:00
Takuya UESHIN a98c8ae57d [SPARK-35944][PYTHON] Introduce Name and Label type aliases
### What changes were proposed in this pull request?

Introduce `Name` and `Label` type aliases to distinguish what is expected instead of `Any` or `Union[Any, Tuple]`.

- `Label`: `Tuple[Any, ...]`
  Internal expression for name-like metadata, like `index_names`, `column_labels`, and `column_label_names` in `InternalFrame`, and similar internal structures.
- `Name`: `Union[Any, Label]`
  External expression for user-facing names, which can be scalar values or tuples.

### Why are the changes needed?

Currently `Any` or `Union[Any, Tuple]` is used for name-like types, but type aliases should be used to distinguish what is expected clearly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33159 from ueshin/issues/SPARK-35944/name_and_label.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-07-01 09:40:07 +09:00
Takuya UESHIN 0a838dcd71 [SPARK-35943][PYTHON] Introduce Axis type alias
### What changes were proposed in this pull request?

Introduces `Axis` type alias for `axis` argument to be consistent.

### Why are the changes needed?

There are many places to use `axis` argument. We should define `Axis` type alias and reuse it to be consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33144 from ueshin/issues/SPARK-35943/axis.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 10:46:59 +09:00
itholic 28a201a442 [SPARK-35873][PYTHON] Cleanup the version logic from the pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes removing the legacy Koalas version from pandas API on Spark package.

And also remove the Python version check logic since now pandas-on-Spark should follow the PySpark's Python version.

### Why are the changes needed?

Since Koalas is ported into PySpark, we don't need to keep the version logic for Koalas.

### Does this PR introduce _any_ user-facing change?

Now the legacy Koalas user should follow the version from PySpark.

### How was this patch tested?

Manually built the package and see it's successfully done.

Closes #33128 from itholic/SPARK-35873.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-30 10:01:51 +09:00
Takuya UESHIN 1f6e2f55d7 Revert "[SPARK-35721][PYTHON] Path level discover for python unittests"
This reverts commit 5db51efa1a.
2021-06-29 12:08:09 -07:00
Takuya UESHIN 2702fb9af0 [SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark
### What changes were proposed in this pull request?

Cleaning up the type hints in pandas-on-Spark.

- Use a single file `_typing.py` for type variables or aliases
- Rename `IndexOpsLike` to `SeriesOrIndex`.
- Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively
- Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]`

### Why are the changes needed?

This is a cleanup for the mypy check stuff series.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33117 from ueshin/issues/SPARK-35859/cleanup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-29 10:52:24 -07:00
Yikun Jiang 5db51efa1a [SPARK-35721][PYTHON] Path level discover for python unittests
### What changes were proposed in this pull request?
Add path level discover for python unittests.

### Why are the changes needed?
Now we need to specify the python test cases by manually when we add a new testcase. Sometime, we forgot to add the testcase to module list, the testcase would not be executed.

Such as:
- pyspark-core pyspark.tests.test_pin_thread

Thus we need some auto-discover way to find all testcase rather than specified every case by manually.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKL

Closes #32867 from Yikun/SPARK_DISCOVER_TEST.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-29 17:56:13 +09:00
Xinrong Meng 5f0113e3a6 [SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark
### What changes were proposed in this pull request?

The PR is proposed to support creating a Column of numpy literal value in pandas-on-Spark. It consists of three changes mainly:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.

```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method

Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161).
- To complete mappings between all kinds of numpy literals and Spark data types should be a followup task.

### Why are the changes needed?

Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value.
So `lit` function defined in `pyspark.pandas.spark.functions`  is adjusted in order to support that in pandas-on-Spark.

### Does this PR introduce _any_ user-facing change?

Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```

After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0     True
1     True
2    False
3    False
4    False
Name: source, dtype: bool
```

### How was this patch tested?

Unit tests.

Closes #32955 from xinrong-databricks/datatypeops_literal.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-28 19:03:42 -07:00
Takuya UESHIN 8c401beb80 [SPARK-35901][PYTHON] Refine type hints in pyspark.pandas.window
### What changes were proposed in this pull request?

Refines type hints in `pyspark.pandas.window`.

Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`.

### Why are the changes needed?

We can use more strict type hints for functions in pyspark.pandas.window using the generic way.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #33097 from ueshin/issues/SPARK-35901/window.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 12:23:32 +09:00
itholic 03e6de2abe [SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame
### What changes were proposed in this pull request?

This PR proposes move `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and added the related tests to the PySpark DataFrame tests.

### Why are the changes needed?

Because now the Koalas is ported into PySpark, so we don't need to Spark auto-patch anymore.
And also `to_pandas_on_spark` is belongs to the pandas-on-Spark DataFrame doesn't look make sense.

### Does this PR introduce _any_ user-facing change?

No, it's kinda internal refactoring stuff.

### How was this patch tested?

Added the related tests and manually check they're passed.

Closes #33054 from itholic/SPARK-35605.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-28 11:47:09 +09:00
Takuya UESHIN a9ebfc5374 [SPARK-35466][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.data_type_ops.*
### What changes were proposed in this pull request?

Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-25 18:16:25 -07:00
Takuya UESHIN 6497ac3585 [SPARK-35471][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.frame
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/frame.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33073 from ueshin/issues/SPARK-35471/disallow_untyped_defs_frame.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-25 14:41:58 +09:00
Takuya UESHIN cfcfbca965 [SPARK-35476][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.series
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/series.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33045 from ueshin/issues/SPARK-35476/disallow_untyped_defs_series.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 19:32:33 +09:00
Yikun Jiang 4824c53398 [SPARK-35812][PYTHON] Throw ValueError if version and timestamp are used together in to_delta
### What changes were proposed in this pull request?

Throw ValueError if version and timestamp are used together in to_delta

### Why are the changes needed?
read_delta has arguments named `version` and `timestamp`, but they cannot be used together.

We should raise the proper error message when they are used together.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
UT

Closes #33023 from Yikun/SPARK-35812.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-23 19:04:45 +09:00
Takuya UESHIN 68b54b702c [SPARK-35473][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.groupby
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/groupby.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33032 from ueshin/issues/SPARK-35473/disallow_untyped_defs_groupby.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-23 09:51:33 +09:00
Takuya UESHIN c418803df7 [SPARK-35847][PYTHON] Manage InternalField in DataTypeOps.isnull
### What changes were proposed in this pull request?

Properly set `InternalField` for `DataTypeOps.isnull`.

### Why are the changes needed?

The result of `DataTypeOps.isnull` must always be non-nullable boolean.
We should manage `InternalField` for this case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some more tests.

Closes #33005 from ueshin/issues/SPARK-35847/isnull_field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-22 12:54:01 -07:00
Yikun Jiang 1c26433f1d [SPARK-35849][PYTHON] Make astype method data-type-based for DecimalOps
### What changes were proposed in this pull request?
Make DecimalOps astype data-type-based.

See more in:
https://github.com/apache/spark/pull/32821#issuecomment-861119905

### Why are the changes needed?
Make DecimalOps astype data-type-based.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test NumOpsTest.test_astype in pyspark/pandas/tests/data_type_ops/test_num_ops.py

Closes #33009 from Yikun/SPARK-35849.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-22 10:41:22 -07:00
Takuya UESHIN a8fdb98ecb [SPARK-35470][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.base
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/base.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32968 from ueshin/issues/SPARK-35470/disallow_untyped_defs_base.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-22 11:25:16 +09:00
Xinrong Meng 6ca56b01dc [SPARK-35614][PYTHON] Make the conversion to pandas data-type-based for ExtensionDtypes
### What changes were proposed in this pull request?

We propose to
- introduce the Ops class for ExtensionDtypes: `IntegralExtensionOps`, `FractionalExtensionOps`, `StringExtensionOps`
- make the "conversion to pandas" data-type-based for ExtensionDtypes

Non-goal: same arithmetic operation of ExtensionDtypes have different result dtypes between pandas and pandas API on Spark. That should be adjusted in a separated PR if needed.

### Why are the changes needed?

The conversion to pandas includes logic for checking ExtensionDtypes data types and behaving accordingly.
That makes code hard to change or maintain.

Since we have DataTypeOps defined, we are able to dispatch the specific conversion logic to the `ExtensionOps` classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32910 from xinrong-databricks/datatypeops_pd_ext.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-21 13:19:55 -07:00
Kevin Su 653be9d774 [SPARK-35811][PYTHON] Deprecate DataFrame.to_spark_io
### What changes were proposed in this pull request?

Deprecate the `DataFrame.to_spark_io`

### Why are the changes needed?

We should deprecate the `DataFrame.to_spark_io` since it's duplicated with `DataFrame.spark.to_spark_io`, and it's not existed in pandas.

### Does this PR introduce _any_ user-facing change?

Yes, users will get warning while using `DataFrame.to_spark_io` api.

### How was this patch tested?

Pass the CIs

Closes #32964 from pingsutw/SPARK-35811.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-21 10:43:34 +09:00
Takuya UESHIN 1589d32732 [SPARK-35472][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.generic
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/generic.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32957 from ueshin/issues/SPARK-35472/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-20 11:48:01 +09:00
Yikun Jiang b7df75a777 [SPARK-35708][PYTHON][TEST] Add BaseTest for DataTypeOps
### What changes were proposed in this pull request?
This patch adds DataTypeOps test to check the ops is loaded as expected.

### Why are the changes needed?
When complete https://github.com/apache/spark/pull/32821, I found there are no test for DataTypeOps. There were many logic when DataTypeOps loaded, it's better to add the test to make sure interface stable.

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
test passed.

Closes #32859 from Yikun/SPARK-XXXXX1.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 18:54:50 -07:00
Takuya UESHIN c879510d2f [SPARK-35478][PYTHON][FOLLOWUP] Fix Jenkins' linter
### What changes were proposed in this pull request?

This is a follow-up of #32886 to fix the Jenkins' linter.

### Why are the changes needed?

The PR #32886 was mistakenly merged before Jenkins' linter passes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #32965 from ueshin/issues/SPARK-35478/fup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-18 13:52:54 -07:00
Kevin Su 3fb044e043 [SPARK-35478][PYTHON] Enable disallow_untyped_defs mypy check for pyspark.pandas.window
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/window.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on the Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32886 from pingsutw/SPARK-35478.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 11:21:33 -07:00
Yikun Jiang f84a720fe3 [SPARK-35342][PYTHON] Introduce DecimalOps and make isnull method data-type-based
### What changes were proposed in this pull request?
- Introduce a DecimalOps for DecimalType
- Make `isnull` method data-type-based

### Why are the changes needed?
Now DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (as https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference here is caused by DecimalType could not have NaN.

https://issues.apache.org/jira/browse/SPARK-35342

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- New added DecimalOpsTest passed
- Existing NumOpsTest passed

Closes #32821 from Yikun/SPARK-35342.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 10:44:35 -07:00
Takuya UESHIN 2f537a838a [SPARK-35469][PYTHON] Fix disallow_untyped_defs mypy checks
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/accessors.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32956 from ueshin/issues/SPARK-35469/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-18 20:43:59 +09:00
itholic b9aeeb4e6c [SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side
### What changes were proposed in this pull request?

This PR fix the wrong behavior of `Index.difference` in pandas APIs on Spark, based on the comment https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007
- it couldn't handle the case properly when `self` is `Index` or `MultiIndex` and `other` is `MultiIndex` or `Index`.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```
- it's collecting the all data into the driver side when the other is list-like objects, especially when the `other` is distributed object such as Series which is very dangerous.

And added the related test cases.

### Why are the changes needed?

To correct the incompatible behavior with pandas, and to prevent the case which potentially cause the OOM easily.

```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
MultiIndex([('a', 'x', 1),
            ('b', 'z', 2),
            ('k', 'z', 3)],
           )
```

And now it only using the for loop when the `other` is only the case `list`, `set` or `dict`.

### Does this PR introduce _any_ user-facing change?

Yes, the previous bug is fixed as described in the above code examples.

### How was this patch tested?

Manually tested with linter and unittest in local, and it might be passed on CI.

Closes #32853 from itholic/SPARK-35683.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 14:18:54 +09:00
Takuya UESHIN 2a56cc36ca [SPARK-35761][PYTHON] Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings
### What changes were proposed in this pull request?

Modify the `pandas_udf` usage to use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings.

### Why are the changes needed?

The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings.
We should use type-annotation based `pandas_udf` or avoid specifying udf types.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32913 from ueshin/issues/SPARK-35761/suppress_warnings.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 11:17:56 +09:00
Hyukjin Kwon 95f36e76c6 [SPARK-35750][PYTHON][DOCS] Rename "pandas APIs on Spark" to "pandas API on Spark"
### What changes were proposed in this pull request?

This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface).

### Why are the changes needed?

To make it sound more natural.

### Does this PR introduce _any_ user-facing change?

It fixes a typo in the unreleased changes.

### How was this patch tested?

N/A

Closes #32903 from HyukjinKwon/SPARK-34885.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 10:01:04 +09:00
Xinrong Meng 03756618fc [SPARK-35616][PYTHON] Make astype method data-type-based
### What changes were proposed in this pull request?

Make `astype` method data-type-based.

**Non-goal: Match pandas' `astype` TypeErrors.**
Currently, `astype` throws TypeError error messages only when the destination type is not recognized. However, for some destination types that don't make sense to the specific type of  Series/Index, for example, `numeric Series/Index → bytes`, we don't have proper TypeError error messages.
Since the goal of the PR is refactoring mainly, the above issue might be resolved later if needed.

### Why are the changes needed?

There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32847 from xinrong-databricks/datatypeops_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-14 16:33:15 -07:00
Hyukjin Kwon 76e08a8e3d [SPARK-35738][PYTHON] Support 'y' properly in DataFrame with non-numeric columns with plots
### What changes were proposed in this pull request?

This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172.

```python
ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200)
```

**Before:**

```
pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).;
'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))]
+- Project [a#1L, b#2, c#3L]
   +- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L]
      +- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false
```

**After:**

```python
Figure({
    'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}',
              'name': 'a',
...
```

### Why are the changes needed?

To match the behaviour with panadas' and allow users to set `x` and `y` in the DataFrame with non-numeric columns.

### Does this PR introduce _any_ user-facing change?

No to end users since the changes is not released yet. Yes to dev as described before.

### How was this patch tested?

Manually tested, added a test and tested in notebooks:

![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png)

![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png)

Closes #32884 from HyukjinKwon/fix-hist-plot.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-12 14:36:46 +09:00
Takuya UESHIN 4d21b94d13 [SPARK-35475][PYTHON] Fix disallow_untyped_defs mypy checks
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-11 11:07:11 -07:00
Kevin Su cadd3a0588 [SPARK-35474] Enable disallow_untyped_defs mypy check for pyspark.pandas.indexing
### What changes were proposed in this pull request?

Adds more type annotations in the file:
`python/pyspark/pandas/spark/indexing.py`
and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.
`./dev/lint-python`

Closes #32738 from pingsutw/SPARK-35474.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-09 22:35:12 -07:00
Xinrong Meng e9d60156c4 [SPARK-35705][PYTHON] Adjust pandas-on-spark test_groupby_multiindex_columns test for different pandas versions
### What changes were proposed in this pull request?

Adjust pandas-on-spark test_groupby_multiindex_columns test in order to pass with different pandas versions.

### Why are the changes needed?

pandas had introduced bugs as below:

- For pandas 1.1.3 and 1.1.4
Type error: only integer scalar arrays can be converted to a scalar index

- For pandas < 1.0.4
Type error: Can only tuple-index with a MultiIndex

We ought to adjust `test_groupby_multiindex_columns` tests by comparing with a predefined return value, rather than comparing with the pandas return value in the pandas versions above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32851 from xinrong-databricks/SPARK-35705.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-10 10:36:19 +09:00
Xinrong Meng 3c66c11aa6 [SPARK-35601][PYTHON] Complete arithmetic operators involving bool literals, Series, and Index
### What changes were proposed in this pull request?

Completing arithmetic operators involving bool literals, Series, and Index consists of two main tasks:
- Support arithmetic operations against bool literals
- Support operators (+, *) between bool Series/Indexes.

### Why are the changes needed?

Arithmetic operators involving bool literals, Series, and Index are incomplete now.
We ought to match pandas' behaviors.

### Does this PR introduce _any_ user-facing change?

Yes.

Newly supported operations example:
```py
>>> ps.Series([1, 2, 3]) + True
0    2
1    3
2    4
dtype: int64
>>> ps.Series([1, 2, 3]) + False
0    1
1    2
2    3
dtype: int64
>>> ps.Series([True, False, True]) + True
0    True
1    True
2    True
dtype: bool
>>> ps.Series([True, False, True]) + False
0     True
1    False
2     True
dtype: bool
>>> ps.Series([True, False, True]) * True
0     True
1    False
2     True
dtype: bool
>>> ps.Series([True, False, True]) * False
0    False
1    False
2    False
dtype: bool
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> ps.Series([True, True, False]) + ps.Series([True, False, True])
0    True
1    True
2    True
dtype: bool
>>> ps.Series([True, True, False]) * ps.Series([True, False, True])
0     True
1    False
2    False
dtype: bool
```
Before the change, operations above are not supported, raising a TypeError such as
```py
>>> ps.Series([True, False, True]) + True
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans and the given type.
>>> ps.Series([True, False, True]) + False
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans and the given type.
```

### How was this patch tested?

Unit tests.

Closes #32785 from xinrong-databricks/datatypeops_arith_bool.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-09 15:13:03 -07:00
Takuya UESHIN 04418e18d7 [SPARK-35638][PYTHON] Introduce InternalField to manage dtypes and StructFields
### What changes were proposed in this pull request?

Introduces `InternalField` to manage dtypes and `StructField`s.

`InternalFrame` is already managing dtypes, but when it checks the Spark's data types, column names, and nullabilities, it tries to run the analysis phase each time it needs, which will cause a performance issue.

It will use `InternalField` class which stores the retrieved Spark's data types, column names, and nullabilities, and reuse them. Also, in case those can be known, just update and reuse them without asking Spark.

### Why are the changes needed?

Currently there are some performance issues in the pandas-on-Spark layer.

One of them is accessing Java DataFrame and run analysis phase too many times, especially just for retrieving the current column names or data types.

We should reduce the amount of unnecessary access.

### Does this PR introduce _any_ user-facing change?

Improves the performance in pandas-on-Spark layer:

```py
df = ps.read_parquet("/path/to/test.parquet")  # contains ~75 columns
df = df[(df["col"] > 0) & (df["col"] < 10000)]
```

Before the PR, it took about **2.15 sec** and after **1.15 sec**.

### How was this patch tested?

Existing tests.

Closes #32775 from ueshin/issues/SPARK-35638/field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-08 11:57:28 +09:00
Xinrong Meng dfd8a8dc67 [SPARK-35341][PYTHON] Introduce BooleanExtensionOps
### What changes were proposed in this pull request?

- Introduce BooleanExtensionOps in order to make boolean operators `and` and `or` data-type-based.
- Improve error messages for operators `and` and `or`.

### Why are the changes needed?

Boolean operators __and__, __or__, __rand__, and __ror__ should be data-type-based

BooleanExtensionDtypes processes these boolean operators differently from bool, so BooleanExtensionOps is introduced.

These boolean operators themselves are also bitwise operators, which should be able to apply to other data types classes later. However, this is not the goal of this PR.

### Does this PR introduce _any_ user-facing change?

Yes. Error messages for operators `and` and `or` are improved.
Before:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).;
'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))]
+- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L]
   +- LogicalRDD [__index_level_0__#8L, 0#9], false

```

After:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
TypeError: Bitwise or can not be applied to categoricals.
```

### How was this patch tested?

Unit tests.

Closes #32698 from xinrong-databricks/datatypeops_extension.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-07 15:43:52 -07:00
Xinrong Meng 04a8d2cbcf [SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes
### What changes were proposed in this pull request?

Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614.

### Why are the changes needed?

The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32592 from xinrong-databricks/datatypeop_pd_conversion.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-07 13:12:12 -07:00
itholic b8740a1d1e [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes
### What changes were proposed in this pull request?

This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis.

By executing the `./dev/reformat-python` in the spark home directory, all the code of the pandas API on Spark is fixed according to the static analysis rules.

### Why are the changes needed?

This can be reduces the cost of static analysis during development.

It has been used continuously for about a year in the Koalas project and its convenience has been proven.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

Manually reformat the pandas API on Spark codes by running the `./dev/reformat-python`, and checked the `./dev/lint-python` is passed.

Closes #32779 from itholic/SPARK-35499.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-06 17:30:07 -07:00
Hyukjin Kwon 3d158f9c91 [SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation
### What changes were proposed in this pull request?

This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports almost as is except these differences:

- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation

Other then that,

- Excluded `python/docs/build/html` in linter
- Fixed GA dependency installataion

### Why are the changes needed?

To document pandas APIs on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new documentations.

### How was this patch tested?

Manually built the docs and checked the output.

Closes #32726 from HyukjinKwon/SPARK-35587.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 11:11:09 +09:00
itholic 0ad5ae54b2 [SPARK-35539][PYTHON] Restore to_koalas to keep the backward compatibility
### What changes were proposed in this pull request?

This PR proposes restoring `to_koalas` to keep the backward compatibility, with throwing deprecated warning.

### Why are the changes needed?

If we remove `to_koalas`, the existing Koalas codes that include `to_koalas` wouldn't work.

### Does this PR introduce _any_ user-facing change?

No. It's restoring the existing functionality.

### How was this patch tested?

Manually tested in local.

```shell
>>> sdf.to_koalas()
.../spark/python/pyspark/pandas/frame.py:4550: FutureWarning: DataFrame.to_koalas is deprecated as of DataFrame.to_pandas_on_spark. Please use the API instead.
  warnings.warn(
```

Closes #32729 from itholic/SPARK-35539.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-02 10:39:24 +09:00
Xinrong Meng 0ac5c16177 [SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin
### What changes were proposed in this pull request?

Support arithmetic operations against bool IndexOpsMixin.

### Why are the changes needed?

Existing binary operations of bool IndexOpsMixin in Koalas do not match pandas’ behaviors.

pandas take True as 1, False as 0 when dealing with numeric values, numeric collections, and numeric Series/Index; whereas Koalas raises an AnalysisException no matter what the binary operation is.

We aim to match pandas' behaviors.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> 1 + psser
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> ps.Series([1, 2, 3]) + psser
Traceback (most recent call last):
...
TypeError: addition can not be applied to given types.
```

After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
0    2
1    2
2    1
dtype: int64
>>> 1 + psser
0    2
1    2
2    1
dtype: int64
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
0    2
1    3
2    3
dtype: int64
>>> ps.Series([1, 2, 3]) + psser
0    2
1    3
2    3
dtype: int64

```

### How was this patch tested?

Unit tests.

Closes #32611 from xinrong-databricks/datatypeop_arith_bool.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-01 10:57:12 -07:00
itholic 7e2717333b [SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor
### What changes were proposed in this pull request?

This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor".

### Why are the changes needed?

Because we don't use name "Koalas" anymore, rather use "Pandas API on Spark".

So, the related code bases are all need to be changed.

### Does this PR introduce _any_ user-facing change?

Yes, the usage of pandas API on Spark accessor is changed from `df.koalas.[...]`. to `df.pandas_on_spark.[...]`.

**Note:** `df.koalas.[...]` is still available but with deprecated warnings.

### How was this patch tested?

Manually tested in local and checked one by one.

Closes #32674 from itholic/SPARK-35453.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-01 10:33:10 +09:00
Hyukjin Kwon 7eb74482a7 [SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true
### What changes were proposed in this pull request?

This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at https://github.com/apache/spark/pull/32657.

Seems like this is because of the latest NumPy's behaviour change, see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`.

pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy.

For the time being, I propose to exclude boolean case alone in percentile/quartile test case

### Why are the changes needed?

To keep the test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

I roughly locally tested. But it should pass in CI.

Closes #32690 from HyukjinKwon/SPARK-35510.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-28 17:35:01 +09:00
Xinrong Meng 79a2a46cdb [SPARK-35098][PYTHON] Re-enable pandas-on-Spark test cases
### What changes were proposed in this pull request?

Re-enable some pandas-on-Spark test cases.

### Why are the changes needed?

pandas version in GitHub Actions is upgraded now so we can re-enable  some pandas-on-Spark test cases.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32682 from xinrong-databricks/enable_tests.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-27 12:33:30 +09:00
Takuya UESHIN d6d3209c2f [SPARK-35537][PYTHON] Introduce a util function spark_column_equals
### What changes were proposed in this pull request?

Introduce a util function `spark_column_equals` to check the underlying expressions of columns are the same or not.

### Why are the changes needed?

In pandas on Spark, there are some places checking the underlying expressions of columns are the same or not, but it's done one-by-one.
We should introduce a util function for it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

The existing tests.

Closes #32680 from ueshin/issues/SPARK-35537/spark_column_equals.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-27 12:14:43 +09:00
Xinrong Meng 8cc7232ffa [SPARK-35522][PYTHON] Introduce BinaryOps for BinaryType
### What changes were proposed in this pull request?

BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it.

### Why are the changes needed?

The data-type-based-operations class should be set for each individual data type, including BinaryType.
In addition, BinaryType has its special way of addition, which means concatenation.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> psser + b'1'
Traceback (most recent call last):
...
TypeError: Type object was not understood.

```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
0    [49, 49]
1    [50, 50]
2    [51, 51]
dtype: object
>>> psser + b'1'
0    [49, 49]
1    [50, 49]
2    [51, 49]
dtype: object
```

### How was this patch tested?

Unit tests.

Closes #32665 from xinrong-databricks/datatypeops_binary.

Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-26 14:30:24 -07:00
Xinrong Meng 266608d50e [SPARK-35452][PYTHON] Introduce ArrayOps, MapOps and StructOps
### What changes were proposed in this pull request?

The PR is proposed to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately.

### Why are the changes needed?

StructType, ArrayType, and MapType are not accepted by DataTypeOps now.

We should handle these complex types. Among them:

- ArrayType supports concatenation: for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]), as concatenation.

- StructOps will be helpful to make to/from pandas conversion data-type-based.

### Does this PR introduce _any_ user-facing change?

Yes.

Before the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```

After the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0    [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0    [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```

### How was this patch tested?

Unit tests.

Closes #32626 from xinrong-databricks/datatypeop_complex.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-26 10:40:01 -07:00
Hyukjin Kwon e47e615c0e [SPARK-35506][PYTHON][INFRA] Run tests with Python 3.9 in GitHub Actions
### What changes were proposed in this pull request?

This PR enables GitHub Actions to test PySpark with Python 3.9.

### Why are the changes needed?

To verify the support of Python 3.9.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Existing tests should cover.

Closes #32657 from HyukjinKwon/SPARK-35506.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-05-26 09:25:51 +09:00
Takuya UESHIN d67d73b708 [SPARK-35505][PYTHON] Remove APIs which have been deprecated in Koalas
### What changes were proposed in this pull request?

Removes APIs which have been deprecated in Koalas.

### Why are the changes needed?

There are some APIs that have been deprecated in Koalas. We shouldn't have those in pandas APIs on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, the APIs deprecated in Koalas will be no longer available.

### How was this patch tested?

Modified some tests which use the deprecated APIs, and the other existing tests should pass.

Closes #32656 from ueshin/issues/SPARK-35505/remove_deprecated.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-05-25 11:16:27 -07:00