### What changes were proposed in this pull request?
Improve unit tests for data-type-based basic operations by:
- removing redundant test cases
- adding `astype` test for ExtensionDtypes
### Why are the changes needed?
Some test cases for basic operations are duplicated after introducing data-type-based basic operations. The PR is proposed to remove redundant test cases.
`astype` is not tested for ExtensionDtypes, which will be adjusted in this PR as well.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#33095 from xinrong-databricks/datatypeops_test.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Introduce `Name` and `Label` type aliases to distinguish what is expected instead of `Any` or `Union[Any, Tuple]`.
- `Label`: `Tuple[Any, ...]`
Internal expression for name-like metadata, like `index_names`, `column_labels`, and `column_label_names` in `InternalFrame`, and similar internal structures.
- `Name`: `Union[Any, Label]`
External expression for user-facing names, which can be scalar values or tuples.
### Why are the changes needed?
Currently `Any` or `Union[Any, Tuple]` is used for name-like types, but type aliases should be used to distinguish what is expected clearly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33159 from ueshin/issues/SPARK-35944/name_and_label.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduces `Axis` type alias for `axis` argument to be consistent.
### Why are the changes needed?
There are many places to use `axis` argument. We should define `Axis` type alias and reuse it to be consistent.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33144 from ueshin/issues/SPARK-35943/axis.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes removing the legacy Koalas version from pandas API on Spark package.
And also remove the Python version check logic since now pandas-on-Spark should follow the PySpark's Python version.
### Why are the changes needed?
Since Koalas is ported into PySpark, we don't need to keep the version logic for Koalas.
### Does this PR introduce _any_ user-facing change?
Now the legacy Koalas user should follow the version from PySpark.
### How was this patch tested?
Manually built the package and see it's successfully done.
Closes#33128 from itholic/SPARK-35873.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Cleaning up the type hints in pandas-on-Spark.
- Use a single file `_typing.py` for type variables or aliases
- Rename `IndexOpsLike` to `SeriesOrIndex`.
- Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively
- Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]`
### Why are the changes needed?
This is a cleanup for the mypy check stuff series.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33117 from ueshin/issues/SPARK-35859/cleanup.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Add path level discover for python unittests.
### Why are the changes needed?
Now we need to specify the python test cases by manually when we add a new testcase. Sometime, we forgot to add the testcase to module list, the testcase would not be executed.
Such as:
- pyspark-core pyspark.tests.test_pin_thread
Thus we need some auto-discover way to find all testcase rather than specified every case by manually.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
for g in sorted(m.python_test_goals):
print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKLCloses#32867 from Yikun/SPARK_DISCOVER_TEST.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The PR is proposed to support creating a Column of numpy literal value in pandas-on-Spark. It consists of three changes mainly:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.
```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method
Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161).
- To complete mappings between all kinds of numpy literals and Spark data types should be a followup task.
### Why are the changes needed?
Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value.
So `lit` function defined in `pyspark.pandas.spark.functions` is adjusted in order to support that in pandas-on-Spark.
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```
After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0 True
1 True
2 False
3 False
4 False
Name: source, dtype: bool
```
### How was this patch tested?
Unit tests.
Closes#32955 from xinrong-databricks/datatypeops_literal.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Refines type hints in `pyspark.pandas.window`.
Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`.
### Why are the changes needed?
We can use more strict type hints for functions in pyspark.pandas.window using the generic way.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33097 from ueshin/issues/SPARK-35901/window.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes move `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and added the related tests to the PySpark DataFrame tests.
### Why are the changes needed?
Because now the Koalas is ported into PySpark, so we don't need to Spark auto-patch anymore.
And also `to_pandas_on_spark` is belongs to the pandas-on-Spark DataFrame doesn't look make sense.
### Does this PR introduce _any_ user-facing change?
No, it's kinda internal refactoring stuff.
### How was this patch tested?
Added the related tests and manually check they're passed.
Closes#33054 from itholic/SPARK-35605.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/frame.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#33073 from ueshin/issues/SPARK-35471/disallow_untyped_defs_frame.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/series.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#33045 from ueshin/issues/SPARK-35476/disallow_untyped_defs_series.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Throw ValueError if version and timestamp are used together in to_delta
### Why are the changes needed?
read_delta has arguments named `version` and `timestamp`, but they cannot be used together.
We should raise the proper error message when they are used together.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#33023 from Yikun/SPARK-35812.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/groupby.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#33032 from ueshin/issues/SPARK-35473/disallow_untyped_defs_groupby.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Properly set `InternalField` for `DataTypeOps.isnull`.
### Why are the changes needed?
The result of `DataTypeOps.isnull` must always be non-nullable boolean.
We should manage `InternalField` for this case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added some more tests.
Closes#33005 from ueshin/issues/SPARK-35847/isnull_field.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Make DecimalOps astype data-type-based.
See more in:
https://github.com/apache/spark/pull/32821#issuecomment-861119905
### Why are the changes needed?
Make DecimalOps astype data-type-based.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test NumOpsTest.test_astype in pyspark/pandas/tests/data_type_ops/test_num_ops.py
Closes#33009 from Yikun/SPARK-35849.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/base.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32968 from ueshin/issues/SPARK-35470/disallow_untyped_defs_base.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
We propose to
- introduce the Ops class for ExtensionDtypes: `IntegralExtensionOps`, `FractionalExtensionOps`, `StringExtensionOps`
- make the "conversion to pandas" data-type-based for ExtensionDtypes
Non-goal: same arithmetic operation of ExtensionDtypes have different result dtypes between pandas and pandas API on Spark. That should be adjusted in a separated PR if needed.
### Why are the changes needed?
The conversion to pandas includes logic for checking ExtensionDtypes data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have DataTypeOps defined, we are able to dispatch the specific conversion logic to the `ExtensionOps` classes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32910 from xinrong-databricks/datatypeops_pd_ext.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Deprecate the `DataFrame.to_spark_io`
### Why are the changes needed?
We should deprecate the `DataFrame.to_spark_io` since it's duplicated with `DataFrame.spark.to_spark_io`, and it's not existed in pandas.
### Does this PR introduce _any_ user-facing change?
Yes, users will get warning while using `DataFrame.to_spark_io` api.
### How was this patch tested?
Pass the CIs
Closes#32964 from pingsutw/SPARK-35811.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/generic.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32957 from ueshin/issues/SPARK-35472/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch adds DataTypeOps test to check the ops is loaded as expected.
### Why are the changes needed?
When complete https://github.com/apache/spark/pull/32821, I found there are no test for DataTypeOps. There were many logic when DataTypeOps loaded, it's better to add the test to make sure interface stable.
### Does this PR introduce _any_ user-facing change?
No, test only
### How was this patch tested?
test passed.
Closes#32859 from Yikun/SPARK-XXXXX1.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up of #32886 to fix the Jenkins' linter.
### Why are the changes needed?
The PR #32886 was mistakenly merged before Jenkins' linter passes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes#32965 from ueshin/issues/SPARK-35478/fup.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/window.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on the Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32886 from pingsutw/SPARK-35478.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
- Introduce a DecimalOps for DecimalType
- Make `isnull` method data-type-based
### Why are the changes needed?
Now DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (as https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference here is caused by DecimalType could not have NaN.
https://issues.apache.org/jira/browse/SPARK-35342
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- New added DecimalOpsTest passed
- Existing NumOpsTest passed
Closes#32821 from Yikun/SPARK-35342.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/accessors.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32956 from ueshin/issues/SPARK-35469/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fix the wrong behavior of `Index.difference` in pandas APIs on Spark, based on the comment https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007
- it couldn't handle the case properly when `self` is `Index` or `MultiIndex` and `other` is `MultiIndex` or `Index`.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```
- it's collecting the all data into the driver side when the other is list-like objects, especially when the `other` is distributed object such as Series which is very dangerous.
And added the related test cases.
### Why are the changes needed?
To correct the incompatible behavior with pandas, and to prevent the case which potentially cause the OOM easily.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
MultiIndex([('a', 'x', 1),
('b', 'z', 2),
('k', 'z', 3)],
)
```
And now it only using the for loop when the `other` is only the case `list`, `set` or `dict`.
### Does this PR introduce _any_ user-facing change?
Yes, the previous bug is fixed as described in the above code examples.
### How was this patch tested?
Manually tested with linter and unittest in local, and it might be passed on CI.
Closes#32853 from itholic/SPARK-35683.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Modify the `pandas_udf` usage to use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings.
### Why are the changes needed?
The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings.
We should use type-annotation based `pandas_udf` or avoid specifying udf types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32913 from ueshin/issues/SPARK-35761/suppress_warnings.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface).
### Why are the changes needed?
To make it sound more natural.
### Does this PR introduce _any_ user-facing change?
It fixes a typo in the unreleased changes.
### How was this patch tested?
N/A
Closes#32903 from HyukjinKwon/SPARK-34885.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make `astype` method data-type-based.
**Non-goal: Match pandas' `astype` TypeErrors.**
Currently, `astype` throws TypeError error messages only when the destination type is not recognized. However, for some destination types that don't make sense to the specific type of Series/Index, for example, `numeric Series/Index → bytes`, we don't have proper TypeError error messages.
Since the goal of the PR is refactoring mainly, the above issue might be resolved later if needed.
### Why are the changes needed?
There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32847 from xinrong-databricks/datatypeops_astype.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172.
```python
ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200)
```
**Before:**
```
pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).;
'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))]
+- Project [a#1L, b#2, c#3L]
+- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L]
+- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false
```
**After:**
```python
Figure({
'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}',
'name': 'a',
...
```
### Why are the changes needed?
To match the behaviour with panadas' and allow users to set `x` and `y` in the DataFrame with non-numeric columns.
### Does this PR introduce _any_ user-facing change?
No to end users since the changes is not released yet. Yes to dev as described before.
### How was this patch tested?
Manually tested, added a test and tested in notebooks:
![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png)
![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png)
Closes#32884 from HyukjinKwon/fix-hist-plot.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file:
`python/pyspark/pandas/spark/indexing.py`
and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
`./dev/lint-python`
Closes#32738 from pingsutw/SPARK-35474.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adjust pandas-on-spark test_groupby_multiindex_columns test in order to pass with different pandas versions.
### Why are the changes needed?
pandas had introduced bugs as below:
- For pandas 1.1.3 and 1.1.4
Type error: only integer scalar arrays can be converted to a scalar index
- For pandas < 1.0.4
Type error: Can only tuple-index with a MultiIndex
We ought to adjust `test_groupby_multiindex_columns` tests by comparing with a predefined return value, rather than comparing with the pandas return value in the pandas versions above.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32851 from xinrong-databricks/SPARK-35705.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduces `InternalField` to manage dtypes and `StructField`s.
`InternalFrame` is already managing dtypes, but when it checks the Spark's data types, column names, and nullabilities, it tries to run the analysis phase each time it needs, which will cause a performance issue.
It will use `InternalField` class which stores the retrieved Spark's data types, column names, and nullabilities, and reuse them. Also, in case those can be known, just update and reuse them without asking Spark.
### Why are the changes needed?
Currently there are some performance issues in the pandas-on-Spark layer.
One of them is accessing Java DataFrame and run analysis phase too many times, especially just for retrieving the current column names or data types.
We should reduce the amount of unnecessary access.
### Does this PR introduce _any_ user-facing change?
Improves the performance in pandas-on-Spark layer:
```py
df = ps.read_parquet("/path/to/test.parquet") # contains ~75 columns
df = df[(df["col"] > 0) & (df["col"] < 10000)]
```
Before the PR, it took about **2.15 sec** and after **1.15 sec**.
### How was this patch tested?
Existing tests.
Closes#32775 from ueshin/issues/SPARK-35638/field.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Introduce BooleanExtensionOps in order to make boolean operators `and` and `or` data-type-based.
- Improve error messages for operators `and` and `or`.
### Why are the changes needed?
Boolean operators __and__, __or__, __rand__, and __ror__ should be data-type-based
BooleanExtensionDtypes processes these boolean operators differently from bool, so BooleanExtensionOps is introduced.
These boolean operators themselves are also bitwise operators, which should be able to apply to other data types classes later. However, this is not the goal of this PR.
### Does this PR introduce _any_ user-facing change?
Yes. Error messages for operators `and` and `or` are improved.
Before:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).;
'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))]
+- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L]
+- LogicalRDD [__index_level_0__#8L, 0#9], false
```
After:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
TypeError: Bitwise or can not be applied to categoricals.
```
### How was this patch tested?
Unit tests.
Closes#32698 from xinrong-databricks/datatypeops_extension.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614.
### Why are the changes needed?
The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32592 from xinrong-databricks/datatypeop_pd_conversion.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis.
By executing the `./dev/reformat-python` in the spark home directory, all the code of the pandas API on Spark is fixed according to the static analysis rules.
### Why are the changes needed?
This can be reduces the cost of static analysis during development.
It has been used continuously for about a year in the Koalas project and its convenience has been proven.
### Does this PR introduce _any_ user-facing change?
No, it's dev-only.
### How was this patch tested?
Manually reformat the pandas API on Spark codes by running the `./dev/reformat-python`, and checked the `./dev/lint-python` is passed.
Closes#32779 from itholic/SPARK-35499.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports almost as is except these differences:
- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation
Other then that,
- Excluded `python/docs/build/html` in linter
- Fixed GA dependency installataion
### Why are the changes needed?
To document pandas APIs on Spark.
### Does this PR introduce _any_ user-facing change?
Yes, it adds new documentations.
### How was this patch tested?
Manually built the docs and checked the output.
Closes#32726 from HyukjinKwon/SPARK-35587.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes restoring `to_koalas` to keep the backward compatibility, with throwing deprecated warning.
### Why are the changes needed?
If we remove `to_koalas`, the existing Koalas codes that include `to_koalas` wouldn't work.
### Does this PR introduce _any_ user-facing change?
No. It's restoring the existing functionality.
### How was this patch tested?
Manually tested in local.
```shell
>>> sdf.to_koalas()
.../spark/python/pyspark/pandas/frame.py:4550: FutureWarning: DataFrame.to_koalas is deprecated as of DataFrame.to_pandas_on_spark. Please use the API instead.
warnings.warn(
```
Closes#32729 from itholic/SPARK-35539.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support arithmetic operations against bool IndexOpsMixin.
### Why are the changes needed?
Existing binary operations of bool IndexOpsMixin in Koalas do not match pandas’ behaviors.
pandas take True as 1, False as 0 when dealing with numeric values, numeric collections, and numeric Series/Index; whereas Koalas raises an AnalysisException no matter what the binary operation is.
We aim to match pandas' behaviors.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> 1 + psser
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> ps.Series([1, 2, 3]) + psser
Traceback (most recent call last):
...
TypeError: addition can not be applied to given types.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
0 2
1 2
2 1
dtype: int64
>>> 1 + psser
0 2
1 2
2 1
dtype: int64
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
0 2
1 3
2 3
dtype: int64
>>> ps.Series([1, 2, 3]) + psser
0 2
1 3
2 3
dtype: int64
```
### How was this patch tested?
Unit tests.
Closes#32611 from xinrong-databricks/datatypeop_arith_bool.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor".
### Why are the changes needed?
Because we don't use name "Koalas" anymore, rather use "Pandas API on Spark".
So, the related code bases are all need to be changed.
### Does this PR introduce _any_ user-facing change?
Yes, the usage of pandas API on Spark accessor is changed from `df.koalas.[...]`. to `df.pandas_on_spark.[...]`.
**Note:** `df.koalas.[...]` is still available but with deprecated warnings.
### How was this patch tested?
Manually tested in local and checked one by one.
Closes#32674 from itholic/SPARK-35453.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at https://github.com/apache/spark/pull/32657.
Seems like this is because of the latest NumPy's behaviour change, see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`.
pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy.
For the time being, I propose to exclude boolean case alone in percentile/quartile test case
### Why are the changes needed?
To keep the test coverage.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
I roughly locally tested. But it should pass in CI.
Closes#32690 from HyukjinKwon/SPARK-35510.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Re-enable some pandas-on-Spark test cases.
### Why are the changes needed?
pandas version in GitHub Actions is upgraded now so we can re-enable some pandas-on-Spark test cases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32682 from xinrong-databricks/enable_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduce a util function `spark_column_equals` to check the underlying expressions of columns are the same or not.
### Why are the changes needed?
In pandas on Spark, there are some places checking the underlying expressions of columns are the same or not, but it's done one-by-one.
We should introduce a util function for it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The existing tests.
Closes#32680 from ueshin/issues/SPARK-35537/spark_column_equals.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it.
### Why are the changes needed?
The data-type-based-operations class should be set for each individual data type, including BinaryType.
In addition, BinaryType has its special way of addition, which means concatenation.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> psser + b'1'
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
0 [49, 49]
1 [50, 50]
2 [51, 51]
dtype: object
>>> psser + b'1'
0 [49, 49]
1 [50, 49]
2 [51, 49]
dtype: object
```
### How was this patch tested?
Unit tests.
Closes#32665 from xinrong-databricks/datatypeops_binary.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
The PR is proposed to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately.
### Why are the changes needed?
StructType, ArrayType, and MapType are not accepted by DataTypeOps now.
We should handle these complex types. Among them:
- ArrayType supports concatenation: for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]), as concatenation.
- StructOps will be helpful to make to/from pandas conversion data-type-based.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0 [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0 [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```
### How was this patch tested?
Unit tests.
Closes#32626 from xinrong-databricks/datatypeop_complex.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR enables GitHub Actions to test PySpark with Python 3.9.
### Why are the changes needed?
To verify the support of Python 3.9.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Existing tests should cover.
Closes#32657 from HyukjinKwon/SPARK-35506.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Removes APIs which have been deprecated in Koalas.
### Why are the changes needed?
There are some APIs that have been deprecated in Koalas. We shouldn't have those in pandas APIs on Spark.
### Does this PR introduce _any_ user-facing change?
Yes, the APIs deprecated in Koalas will be no longer available.
### How was this patch tested?
Modified some tests which use the deprecated APIs, and the other existing tests should pass.
Closes#32656 from ueshin/issues/SPARK-35505/remove_deprecated.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>