Commit graph

2800 commits

Author SHA1 Message Date
Takuya UESHIN 6497ac3585 [SPARK-35471][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.frame
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/frame.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33073 from ueshin/issues/SPARK-35471/disallow_untyped_defs_frame.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-25 14:41:58 +09:00
Takuya UESHIN cfcfbca965 [SPARK-35476][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.series
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/series.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33045 from ueshin/issues/SPARK-35476/disallow_untyped_defs_series.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 19:32:33 +09:00
Hyukjin Kwon 5a7686a393 [SPARK-35301][PYTHON][DOCS] Document migration guide from Koalas to pandas APIs on Spark
### What changes were proposed in this pull request?

This PR proposes to add a migration guide for legacy Koalas users in pandas API on Spark.

### Why are the changes needed?

For easier migration.

### Does this PR introduce _any_ user-facing change?

Yes, this adds a new page for migration from Koalas.

### How was this patch tested?

Manually built the docs and checked the output.

Closes #33050 from HyukjinKwon/SPARK-35301.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 17:58:09 +09:00
itholic 92ddef7cfb [SPARK-35696][PYTHON][DOCS][FOLLOW-UP] Fix underline for title in FAQ to remove warnings
### What changes were proposed in this pull request?

This PR is a follow-up of SPARK-35696 that fixes an incorrect title underline in the documents to remove build warnings.

### Why are the changes needed?

We should be able to build the docs without any incorrect documentation style.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually built the docs and verified that the warning is removed.

Closes #33052 from itholic/SPARK-35696-followup.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 15:20:13 +09:00
itholic 712ed87faa [SPARK-35696][PYTHON][DOCS] Refine the code examples in pandas-on-Spark documentation
### What changes were proposed in this pull request?

This PR proposes to refine the code examples for pandas-on-Spark since some of them still follow the Koalas naming.

For example,

```python
kdf = ks.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
```

should be refined to

```python
psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
```

Also fixed several remaining Koalas references in the FAQ.

### Why are the changes needed?

Because we don't want to use the name "Koalas" in Apache Spark anymore.

### Does this PR introduce _any_ user-facing change?

Yes, the examples in the documentation are updated with the refined names.

### How was this patch tested?

Manually built the docs and checked them one by one.

Closes #33017 from itholic/SPARK-35696.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 14:48:13 +09:00
Ruifeng Zheng 37f70422b5 [SPARK-35678][ML][FOLLOWUP] Revert changes in ANN
### What changes were proposed in this pull request?
Reverts the changes related to ANN.

### Why are the changes needed?
Using the new `softmax` may cause flaky test failures.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Reverted the test suite.

Closes #33049 from zhengruifeng/revert_softmax_ann.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 14:02:28 +09:00
Ruifeng Zheng a66738823c [SPARK-35678][ML][FOLLOWUP] softmax support offset and step
### What changes were proposed in this pull request?
Make `softmax` support offset and step, so that it can be reused in ANN and NB.
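
A rough sketch of the idea in Python (the actual implementation is in Scala; the function name, parameters, and NumPy usage here are illustrative assumptions): softmax is applied in place to a strided slice of a flat array, so callers can run it over their own memory layouts.

```python
import numpy as np

def softmax_slice(arr, n, offset=0, step=1):
    """Apply softmax in place to arr[offset], arr[offset + step], ... (n values)."""
    idx = offset + step * np.arange(n)
    v = arr[idx]
    v = np.exp(v - v.max())  # subtract the max for numerical stability
    arr[idx] = v / v.sum()

a = np.array([1.0, 9.0, 2.0, 9.0, 3.0, 9.0])
softmax_slice(a, n=3, offset=0, step=2)  # softmax over a[0], a[2], a[4]
```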

### Why are the changes needed?
To simplify the implementation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suite.

Closes #32991 from zhengruifeng/softmax_support_offset_step.

Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
2021-06-23 21:03:18 -05:00
Hyukjin Kwon be9089731a [SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes to:
- fix the Binder integration of pandas API on Spark, and merge it with the existing PySpark one.
- update the quickstart of pandas API on Spark, and make it work.

The notebooks can be easily reviewed here:

https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb

Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html

### Why are the changes needed?

- To show working quickstart examples to end users.
- To allow users to easily try out the examples without installation.

### Does this PR introduce _any_ user-facing change?

No to end users because the existing quickstart of pandas API on Spark is not released yet.

### How was this patch tested?

I manually tested it by uploading a built Spark distribution to Binder. See 3bc15310a0

Closes #33041 from HyukjinKwon/SPARK-35588-2.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-24 10:17:22 +09:00
Yikun Jiang 4824c53398 [SPARK-35812][PYTHON] Throw ValueError if version and timestamp are used together in to_delta
### What changes were proposed in this pull request?

Throw a `ValueError` if `version` and `timestamp` are used together in `to_delta`.

### Why are the changes needed?
read_delta has arguments named `version` and `timestamp`, but they cannot be used together.

We should raise the proper error message when they are used together.
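
A minimal sketch of the kind of validation being added (the signature and message here are assumed, not the exact pandas-on-Spark code):

```python
def to_delta(path, version=None, timestamp=None, **options):
    # Assumed sketch: the two time-travel options are mutually exclusive.
    if version is not None and timestamp is not None:
        raise ValueError("version and timestamp cannot be used together.")
    # ... proceed using whichever option was given

to_delta("/tmp/delta", version=0)                       # OK
# to_delta("/tmp/delta", version=0, timestamp="2021")   # raises ValueError
```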

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes #33023 from Yikun/SPARK-35812.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-23 19:04:45 +09:00
Takuya UESHIN 68b54b702c [SPARK-35473][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.groupby
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/groupby.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #33032 from ueshin/issues/SPARK-35473/disallow_untyped_defs_groupby.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-23 09:51:33 +09:00
Takuya UESHIN c418803df7 [SPARK-35847][PYTHON] Manage InternalField in DataTypeOps.isnull
### What changes were proposed in this pull request?

Properly set `InternalField` for `DataTypeOps.isnull`.

### Why are the changes needed?

The result of `DataTypeOps.isnull` must always be a non-nullable boolean.
We should manage `InternalField` for this case.
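
A rough sketch of the idea (simplified; the actual `pyspark.pandas` internals and signatures may differ): instead of letting Spark re-infer the result schema, the ops hand back a field pinned to non-nullable boolean.

```python
# Assumed sketch: pin the result field of isnull to non-nullable boolean
# rather than re-running Spark's analysis phase to infer it.
import numpy as np
from pyspark.sql.types import BooleanType

def isnull(self, index_ops):
    return index_ops._with_new_scol(
        index_ops.spark.column.isNull(),
        field=index_ops._internal.data_fields[0].copy(
            dtype=np.dtype("bool"), spark_type=BooleanType(), nullable=False
        ),
    )
```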

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some more tests.

Closes #33005 from ueshin/issues/SPARK-35847/isnull_field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-22 12:54:01 -07:00
Yikun Jiang 1c26433f1d [SPARK-35849][PYTHON] Make astype method data-type-based for DecimalOps
### What changes were proposed in this pull request?
Make DecimalOps astype data-type-based.

See more in:
https://github.com/apache/spark/pull/32821#issuecomment-861119905

### Why are the changes needed?
Make DecimalOps astype data-type-based.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test NumOpsTest.test_astype in pyspark/pandas/tests/data_type_ops/test_num_ops.py

Closes #33009 from Yikun/SPARK-35849.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-22 10:41:22 -07:00
Hyukjin Kwon 27046582e4 [SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section
### What changes were proposed in this pull request?

This PR revises the installation page to describe `pip install pyspark[pandas_on_spark]`, and removes the separate pandas-on-Spark installation page and the videos/blog posts page.

### Why are the changes needed?

The pandas-on-Spark installation is merged into the PySpark installation pages. As for videos/blog posts, the project is now named pandas API on Spark, so the old Koalas blog posts and videos are obsolete.

### Does this PR introduce _any_ user-facing change?

To end users, no, because the docs are not released yet.

### How was this patch tested?

I manually built the docs and checked the output

Closes #33018 from HyukjinKwon/SPARK-35645.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-22 09:36:27 -07:00
Takuya UESHIN a8fdb98ecb [SPARK-35470][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.base
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/base.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32968 from ueshin/issues/SPARK-35470/disallow_untyped_defs_base.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-22 11:25:16 +09:00
Xinrong Meng 6ca56b01dc [SPARK-35614][PYTHON] Make the conversion to pandas data-type-based for ExtensionDtypes
### What changes were proposed in this pull request?

We propose to
- introduce Ops classes for ExtensionDtypes: `IntegralExtensionOps`, `FractionalExtensionOps`, `StringExtensionOps`
- make the "conversion to pandas" data-type-based for ExtensionDtypes

Non-goal: the same arithmetic operations on ExtensionDtypes can have different result dtypes between pandas and pandas API on Spark. That should be adjusted in a separate PR if needed.

### Why are the changes needed?

The conversion to pandas includes logic for checking ExtensionDtype data types and behaving accordingly, which makes the code hard to change or maintain.

Since we have DataTypeOps defined, we are able to dispatch the specific conversion logic to the `ExtensionOps` classes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32910 from xinrong-databricks/datatypeops_pd_ext.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-21 13:19:55 -07:00
Hyukjin Kwon 248fda3ead [SPARK-35834][PYTHON] Use the same cleanup logic as Py4J in inheritable thread API
### What changes were proposed in this pull request?

This PR fixes the cleanup logic in inheritable thread API by following Py4J cleanup logic at https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278.

Currently the tests that use `inheritable_thread_target` are flaky (https://github.com/apache/spark/runs/2870944288):

```
======================================================================
ERROR [71.813s]: test_save_load_pipeline_estimator (pyspark.ml.tests.test_tuning.CrossValidatorTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, in test_save_load_pipeline_estimator
    self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
  File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, in _run_test_save_load_pipeline_estimator
    cvModel2 = crossval2.fit(training)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
    return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
    bestModel = est.fit(dataset, epm[bestIndex])
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
    return self.copy(params)._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
    model = stage.fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
    return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
    model = stage.fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
    return self._fit(dataset)
  File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in _fit
    models = pool.map(inheritable_thread_target(trainSingleClass), range(numClasses))
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
    InheritableThread._clean_py4j_conn_for_current_thread()
  File "/__w/spark/spark/python/pyspark/util.py", line 389, in _clean_py4j_conn_for_current_thread
    del connections[i]
IndexError: deque index out of range

----------------------------------------------------------------------
```

This seems to be because the connection deque `jvm._gateway_client.deque` is accessed and modified by other threads, so the number of connections in it can change mid-iteration. Using `SparkContext._lock` doesn't protect against this because the deque can be updated on every Java instance access in Py4J.

This PR proposes to use the atomic `deque.remove` on the problematic deque, together with a try/except on `ValueError` in case the connection has already been [deleted by Py4J](https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278).
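
A minimal sketch of the fix (variable names assumed): replace the index-based `del connections[i]`, which races with concurrent mutation of the deque, with the atomic `deque.remove`, tolerating the case where Py4J already removed the connection.

```python
from collections import deque

def clean_connection(connections: deque, connection) -> None:
    try:
        # deque.remove is a single atomic operation, unlike finding an
        # index and then deleting it in two separate, racy steps.
        connections.remove(connection)
    except ValueError:
        # Already removed by Py4J's own cleanup; nothing left to do.
        pass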

### Why are the changes needed?

To fix the flakiness in the tests, and to avoid possible breakage in user applications that use this API.

### Does this PR introduce _any_ user-facing change?

If users were dependent on InheritableThread with pinned thread mode on, they might have faced such issues intermittently. This PR fixes it.

### How was this patch tested?

Manually tested. CI should test it out too.

Closes #32989 from HyukjinKwon/SPARK-35834.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-21 12:00:16 +09:00
Kevin Su 653be9d774 [SPARK-35811][PYTHON] Deprecate DataFrame.to_spark_io
### What changes were proposed in this pull request?

Deprecate `DataFrame.to_spark_io`.

### Why are the changes needed?

We should deprecate `DataFrame.to_spark_io` since it duplicates `DataFrame.spark.to_spark_io` and does not exist in pandas.
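
Roughly, the deprecation looks like this (a sketch; the exact warning class and message are assumptions):

```python
import warnings

def to_spark_io(self, path=None, **options):
    # Assumed shape of the deprecation; the actual warning class and
    # message in pyspark.pandas may differ.
    warnings.warn(
        "DataFrame.to_spark_io is deprecated. Use DataFrame.spark.to_spark_io instead.",
        FutureWarning,
    )
    return self.spark.to_spark_io(path, **options)
```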

### Does this PR introduce _any_ user-facing change?

Yes, users will get a warning when using the `DataFrame.to_spark_io` API.

### How was this patch tested?

Pass the CIs

Closes #32964 from pingsutw/SPARK-35811.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-21 10:43:34 +09:00
Hyukjin Kwon 6d309914df [SPARK-35303][SPARK-35498][PYTHON][FOLLOW-UP] Copy local properties when starting the thread, and use inheritable thread in the current codebase
### What changes were proposed in this pull request?

This PR is a followup of https://github.com/apache/spark/pull/32429 and https://github.com/apache/spark/pull/32644.
I was thinking about creating separate PRs but decided to include everything in this one because the changes share the same context and should be easier to review together.

This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent a potential resource leak (since we enabled pinned thread mode by default at https://github.com/apache/spark/pull/32429)
- Copy local properties when `InheritableThread.start` is called, to mimic the JVM behaviour; previously they were copied when the `InheritableThread` instance was created (related to #32644). See the sketch after this list.
- https://github.com/apache/spark/pull/32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.
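
A simplified sketch of the second bullet (illustrative only, not the actual `pyspark.util` code; the class and attribute names are assumed): the parent's local properties are captured in `start()` rather than in `__init__`, mirroring how the JVM inherits thread-locals when a child thread starts.

```python
import threading

class InheritableThreadSketch(threading.Thread):
    """Illustrative only; pyspark.InheritableThread differs in detail."""

    def __init__(self, target, sc, **kwargs):
        self._sc = sc
        super().__init__(target=target, **kwargs)

    def start(self):
        # Capture the parent's local properties at start() time rather
        # than at construction time. The wrapped target would then set
        # these properties inside the child thread before running.
        self._props = self._sc._jsc.sc().getLocalProperties().clone()
        return super().start()
```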

### Why are the changes needed?

To mimic the JVM behaviour about thread lifecycle

### Does this PR introduce _any_ user-facing change?

Ideally no. One possible case is when users use `InheritableThread` with pinned thread mode enabled:
the local properties will now be copied when the thread starts instead of when the `InheritableThread` object is defined.
This is a small difference that is unlikely to affect end users.

### How was this patch tested?

Existing tests should cover this.

Closes #32962 from HyukjinKwon/SPARK-35498-SPARK-35303.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-20 11:48:38 +09:00
Takuya UESHIN 1589d32732 [SPARK-35472][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.generic
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/generic.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32957 from ueshin/issues/SPARK-35472/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-20 11:48:01 +09:00
Yikun Jiang b7df75a777 [SPARK-35708][PYTHON][TEST] Add BaseTest for DataTypeOps
### What changes were proposed in this pull request?
This patch adds a DataTypeOps test to check that the ops are loaded as expected.

### Why are the changes needed?
While completing https://github.com/apache/spark/pull/32821, I found there were no tests for DataTypeOps. There is a lot of logic involved when DataTypeOps is loaded, so it's better to add tests to make sure the interface stays stable.

### Does this PR introduce _any_ user-facing change?
No, test only

### How was this patch tested?
The added tests passed.

Closes #32859 from Yikun/SPARK-XXXXX1.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 18:54:50 -07:00
Takuya UESHIN c879510d2f [SPARK-35478][PYTHON][FOLLOWUP] Fix Jenkins' linter
### What changes were proposed in this pull request?

This is a follow-up of #32886 to fix the Jenkins' linter.

### Why are the changes needed?

The PR #32886 was mistakenly merged before the Jenkins linter passed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Closes #32965 from ueshin/issues/SPARK-35478/fup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-06-18 13:52:54 -07:00
Kevin Su 3fb044e043 [SPARK-35478][PYTHON] Enable disallow_untyped_defs mypy check for pyspark.pandas.window
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/window.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32886 from pingsutw/SPARK-35478.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 11:21:33 -07:00
Yikun Jiang f84a720fe3 [SPARK-35342][PYTHON] Introduce DecimalOps and make isnull method data-type-based
### What changes were proposed in this pull request?
- Introduce `DecimalOps` for `DecimalType`
- Make the `isnull` method data-type-based

### Why are the changes needed?
Currently DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (see https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference is caused by the fact that DecimalType cannot hold NaN.

https://issues.apache.org/jira/browse/SPARK-35342
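
The behavioral difference boils down to roughly this (a sketch; the helper names are assumed):

```python
from pyspark.sql import functions as F

def fractional_isnull(col):
    # Float/double columns can hold NaN, which pandas treats as missing.
    return col.isNull() | F.isnan(col)

def decimal_isnull(col):
    # DecimalType cannot hold NaN, so a plain null check suffices.
    return col.isNull()
```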

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- The newly added DecimalOpsTest passed
- The existing NumOpsTest passed

Closes #32821 from Yikun/SPARK-35342.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-18 10:44:35 -07:00
Takuya UESHIN 2f537a838a [SPARK-35469][PYTHON] Fix disallow_untyped_defs mypy checks
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/accessors.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32956 from ueshin/issues/SPARK-35469/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-18 20:43:59 +09:00
HyukjinKwon 41af409b7b [SPARK-35303][PYTHON] Enable pinned thread mode by default
### What changes were proposed in this pull request?

PySpark added pinned thread mode in https://github.com/apache/spark/pull/24898 to sync Python threads to JVM threads. Previously, one JVM thread could be reused by multiple Python threads, which ends up with a messed-up inheritance hierarchy (such as thread locals), especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default.
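
Pinned thread mode had been opt-in via the `PYSPARK_PIN_THREAD` environment variable; this change flips the default. Something like the following (illustrative) restores the legacy behavior:

```python
import os

# Must be set before the JVM is launched (i.e., before creating the
# SparkContext). "false" opts back out of pinned thread mode, so JVM
# threads may again be shared/reused across Python threads.
os.environ["PYSPARK_PIN_THREAD"] = "false"
```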

### Why are the changes needed?

To correctly support parallel job submission and management.

### Does this PR introduce _any_ user-facing change?

Yes, now each Python thread is mapped to a JVM thread one-to-one.

### How was this patch tested?

Existing tests should cover it.

Closes #32429 from HyukjinKwon/SPARK-35303.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-18 12:02:29 +09:00
Hyukjin Kwon 94bdbec380 [SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section
### What changes were proposed in this pull request?

This PR proposes to merge contents and remove obsolete pages in the Development section, especially about pandas API on Spark.

Some were removed, and some were merged into the existing PySpark guides. I will inline some comments in the PR to make the review easier.

### Why are the changes needed?

To guide developers on the code base of pandas API on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, it updates the user-facing documentation.

### How was this patch tested?

Manually built the docs and checked.

Closes #32926 from HyukjinKwon/SPARK-35644.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-17 13:35:20 +09:00
itholic b9aeeb4e6c [SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side
### What changes were proposed in this pull request?

This PR fixes the wrong behavior of `Index.difference` in pandas API on Spark, based on the comments https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007:
- It couldn't handle the case properly when `self` is an `Index`/`MultiIndex` and `other` is a `MultiIndex`/`Index`.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1.difference(idx1)
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```
- It collected all the data to the driver side when `other` is a list-like object, which is very dangerous especially when `other` is a distributed object such as a Series.

Also added the related test cases.

### Why are the changes needed?

To correct the behavior that is incompatible with pandas, and to prevent cases that could easily cause OOM.

```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1.difference(idx1)
MultiIndex([('a', 'x', 1),
            ('b', 'z', 2),
            ('k', 'z', 3)],
           )
```

Now the for loop is only used when `other` is a `list`, `set`, or `dict`.

### Does this PR introduce _any_ user-facing change?

Yes, the previous bug is fixed as described in the above code examples.

### How was this patch tested?

Manually tested with the linter and unit tests locally; it should pass on CI as well.

Closes #32853 from itholic/SPARK-35683.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 14:18:54 +09:00
Takuya UESHIN 2a56cc36ca [SPARK-35761][PYTHON] Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings
### What changes were proposed in this pull request?

Modify the `pandas_udf` usages to use the type-annotation based `pandas_udf` or to avoid specifying UDF types, in order to suppress warnings.

### Why are the changes needed?

The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings.
We should use the type-annotation based `pandas_udf` or avoid specifying UDF types.
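
For example, the migration is roughly:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType

# Outdated style: specifying the UDF type emits a deprecation warning.
@pandas_udf("double", PandasUDFType.SCALAR)
def plus_one(v):
    return v + 1

# Preferred: type-annotation based pandas_udf, no UDF type argument.
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1
```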

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32913 from ueshin/issues/SPARK-35761/suppress_warnings.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 11:17:56 +09:00
Hyukjin Kwon 95f36e76c6 [SPARK-35750][PYTHON][DOCS] Rename "pandas APIs on Spark" to "pandas API on Spark"
### What changes were proposed in this pull request?

This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark", which is more natural (since API stands for Application Programming Interface).

### Why are the changes needed?

To make it sound more natural.

### Does this PR introduce _any_ user-facing change?

It fixes a typo in the unreleased changes.

### How was this patch tested?

N/A

Closes #32903 from HyukjinKwon/SPARK-34885.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 10:01:04 +09:00
Takuya UESHIN ef7545b788 [SPARK-35759][PYTHON] Remove the upperbound for numpy for pandas-on-Spark
### What changes were proposed in this pull request?

Removes the upper bound for numpy for pandas-on-Spark.

### Why are the changes needed?

We can remove the upper bound for numpy for pandas-on-Spark because it currently works well on the CI with numpy 1.20.3.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #32908 from ueshin/issues/SPARK-35759/numpy.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-15 09:59:05 +09:00
Xinrong Meng 03756618fc [SPARK-35616][PYTHON] Make astype method data-type-based
### What changes were proposed in this pull request?

Make `astype` method data-type-based.

**Non-goal: matching pandas' `astype` TypeErrors.**
Currently, `astype` throws TypeError messages only when the destination type is not recognized. However, for some destination types that don't make sense for the specific type of Series/Index (for example, numeric Series/Index → bytes), we don't have proper TypeError messages.
Since the goal of this PR is mainly refactoring, the above issue might be resolved later if needed.

### Why are the changes needed?

There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses have been introduced, we should refactor `astype` to make it data-type-based. In this way, the code is cleaner, more maintainable, and more flexible.
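
Illustratively, the refactor replaces one big chain of type checks with per-type dispatch (the class and method shapes below are hypothetical, not the actual `DataTypeOps` internals):

```python
class BaseOps:
    def astype(self, value, dtype):
        raise TypeError(f"astype('{dtype}') is not supported for {type(self).__name__}")

class NumericOps(BaseOps):
    def astype(self, value, dtype):
        if dtype == "int":
            return int(value)
        if dtype == "float":
            return float(value)
        return super().astype(value, dtype)

class StringOps(BaseOps):
    def astype(self, value, dtype):
        if dtype == "str":
            return str(value)
        return super().astype(value, dtype)

# Each Ops subclass owns only the conversions valid for its own type.
print(NumericOps().astype("3", "int"))  # 3
```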

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32847 from xinrong-databricks/datatypeops_astype.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-14 16:33:15 -07:00
Hyukjin Kwon 76e08a8e3d [SPARK-35738][PYTHON] Support 'y' properly in DataFrame with non-numeric columns with plots
### What changes were proposed in this pull request?

This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172.

```python
ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200)
```

**Before:**

```
pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).;
'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))]
+- Project [a#1L, b#2, c#3L]
   +- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L]
      +- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false
```

**After:**

```python
Figure({
    'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}',
              'name': 'a',
...
```

### Why are the changes needed?

To match pandas' behaviour and allow users to set `x` and `y` in a DataFrame with non-numeric columns.

### Does this PR introduce _any_ user-facing change?

No to end users, since the changes are not released yet. Yes to devs, as described above.

### How was this patch tested?

Manually tested, added a test and tested in notebooks:

![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png)

![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png)

Closes #32884 from HyukjinKwon/fix-hist-plot.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-12 14:36:46 +09:00
Takuya UESHIN 4d21b94d13 [SPARK-35475][PYTHON] Fix disallow_untyped_defs mypy checks
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.

Closes #32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-11 11:07:11 -07:00
itholic ebe529e8e1 [SPARK-35591][PYTHON][DOCS] Rename "Koalas" to "pandas API on Spark" in the documents
### What changes were proposed in this pull request?

This PR proposes changing the name "Koalas" to "Pandas APIs on Spark" in the documents.

### Why are the changes needed?

Since we don't use the name "Koalas" anymore.

We should use "Pandas APIs on Spark" instead.

### Does this PR introduce _any_ user-facing change?

Yes, the name "Koalas" is renamed to "Pandas APIs on Spark" in the documents.

### How was this patch tested?

Manually built the docs and checked one by one.

Closes #32835 from itholic/SPARK-35591.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-11 20:42:38 +09:00
Kevin Su cadd3a0588 [SPARK-35474] Enable disallow_untyped_defs mypy check for pyspark.pandas.indexing
### What changes were proposed in this pull request?

Adds more type annotations in the file `python/pyspark/pandas/indexing.py` and fixes the mypy check failures.

### Why are the changes needed?

We should enable more disallow_untyped_defs mypy checks.

### Does this PR introduce _any_ user-facing change?

Yes.
This PR adds more type annotations in the pandas API on Spark module, which can affect how users interact with development tools.

### How was this patch tested?

The mypy check with a new configuration and existing tests should pass.
`./dev/lint-python`

Closes #32738 from pingsutw/SPARK-35474.

Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-09 22:35:12 -07:00
Xinrong Meng e9d60156c4 [SPARK-35705][PYTHON] Adjust pandas-on-spark test_groupby_multiindex_columns test for different pandas versions
### What changes were proposed in this pull request?

Adjust the pandas-on-Spark test_groupby_multiindex_columns test so that it passes with different pandas versions.

### Why are the changes needed?

pandas had introduced bugs, as below:

- For pandas 1.1.3 and 1.1.4:
`TypeError: only integer scalar arrays can be converted to a scalar index`

- For pandas < 1.0.4:
`TypeError: Can only tuple-index with a MultiIndex`

We ought to adjust the `test_groupby_multiindex_columns` tests by comparing against a predefined return value, rather than against the pandas return value, for the pandas versions above.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32851 from xinrong-databricks/SPARK-35705.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-10 10:36:19 +09:00
Xinrong Meng 3c66c11aa6 [SPARK-35601][PYTHON] Complete arithmetic operators involving bool literals, Series, and Index
### What changes were proposed in this pull request?

Completing arithmetic operators involving bool literals, Series, and Index consists of two main tasks:
- Support arithmetic operations against bool literals
- Support operators (+, *) between bool Series/Indexes.

### Why are the changes needed?

Arithmetic operators involving bool literals, Series, and Index are incomplete now.
We ought to match pandas' behaviors.

### Does this PR introduce _any_ user-facing change?

Yes.

Newly supported operations example:
```py
>>> ps.Series([1, 2, 3]) + True
0    2
1    3
2    4
dtype: int64
>>> ps.Series([1, 2, 3]) + False
0    1
1    2
2    3
dtype: int64
>>> ps.Series([True, False, True]) + True
0    True
1    True
2    True
dtype: bool
>>> ps.Series([True, False, True]) + False
0     True
1    False
2     True
dtype: bool
>>> ps.Series([True, False, True]) * True
0     True
1    False
2     True
dtype: bool
>>> ps.Series([True, False, True]) * False
0    False
1    False
2    False
dtype: bool
>>> ps.set_option('compute.ops_on_diff_frames', True)
>>> ps.Series([True, True, False]) + ps.Series([True, False, True])
0    True
1    True
2    True
dtype: bool
>>> ps.Series([True, True, False]) * ps.Series([True, False, True])
0     True
1    False
2    False
dtype: bool
```
Before the change, the operations above were not supported, raising a TypeError such as
```py
>>> ps.Series([True, False, True]) + True
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans and the given type.
>>> ps.Series([True, False, True]) + False
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans and the given type.
```

### How was this patch tested?

Unit tests.

Closes #32785 from xinrong-databricks/datatypeops_arith_bool.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-09 15:13:03 -07:00
Hyukjin Kwon afff42178c [SPARK-35647][PYTHON][DOCS] Restructure User Guide in PySpark documentation
### What changes were proposed in this pull request?

This PR proposes to restructure User Guide in PySpark documentation for pandas APIs on Spark.

**Before**

![Screen Shot 2021-06-08 at 8 47 41 PM](https://user-images.githubusercontent.com/6477701/121179493-cb85e280-c89a-11eb-8b93-552ebe7cd0a8.png)

**After**

![Screen Shot 2021-06-08 at 8 46 58 PM](https://user-images.githubusercontent.com/6477701/121179419-b3ae5e80-c89a-11eb-82a0-6dabbf1de12d.png)

Note that I mostly just moved the contents around, except for minor changes:
- Removing some questions in the FAQ that don't make sense for Apache Spark
- Renaming a subtitle "Working with pandas and PySpark" to "From/to pandas and PySpark DataFrames"

The renaming of Koalas to either pandas-on-Spark or pandas API on Spark will be done in SPARK-35591.

### Why are the changes needed?

For better readability.

### Does this PR introduce _any_ user-facing change?

Yes, it restructures the documentation as shown above.

### How was this patch tested?

I manually built the docs and tested.

Closes #32820 from HyukjinKwon/SPARK-35647.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-09 12:13:25 +09:00
liuqi e79dd89cf6 [SPARK-35512][PYTHON] Fix OverflowError(cannot convert float infinity to integer) in partitionBy function
### What changes were proposed in this pull request?
Limit the batch size in `add_shuffle_key` in the `partitionBy` function to fix `OverflowError: cannot convert float infinity to integer`.

### Why are the changes needed?
It's not easy to write a UT, but some simple code can explain the bug.
* Original code
```python
def add_shuffle_key(split, iterator):

    buckets = defaultdict(list)
    c, batch = 0, min(10 * numPartitions, 1000)

    for k, v in iterator:
        buckets[partitionFunc(k) % numPartitions].append((k, v))
        c += 1

        # check used memory and avg size of chunk of objects
        if (c % 1000 == 0 and get_used_memory() > limit
                or c > batch):
            n, size = len(buckets), 0
            for split in list(buckets.keys()):
                yield pack_long(split)
                d = outputSerializer.dumps(buckets[split])
                del buckets[split]
                yield d
                size += len(d)

            avg = int(size / n) >> 20
            # let 1M < avg < 10M
            if avg < 1:
                batch *= 1.5
            elif avg > 10:
                batch = max(int(batch / 1.5), 1)
            c = 0
```
If `get_used_memory() > limit` is always `True` and `avg < 1` is always `True`, the variable `batch` will grow towards infinity. Then `batch = max(int(batch / 1.5), 1)` may raise `OverflowError` once `avg > 10` at some point.
* Sample code to reproduce the bug
```python
import sys

limit = 100
used_memory = 200
numPartitions = 64
c, batch = 0, min(10 * numPartitions, 1000)

while True:
    c += 1
    if (c % 1000 == 0 and used_memory > limit or c > batch):
        batch = batch * 1.5
        d = max(int(batch / 1.5), 1)
        print(c, batch)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
It's not easy to write a UT; here is sample code to test the fix:
```python
import sys

limit = 100
used_memory = 200
numPartitions = 64
c, batch = 0, min(10 * numPartitions, 1000)

while True:
    c += 1
    if (c % 1000 == 0 and used_memory > limit or c > batch):
        batch = min(sys.maxsize, batch * 1.5)
        d = max(int(batch / 1.5), 1)
        print(c, batch)
```

Closes #32667 from nolanliou/fix_partitionby.

Authored-by: liuqi <nolan.liou@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-09 10:57:27 +09:00
Hyukjin Kwon 921abc51cf [SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout
### What changes were proposed in this pull request?

This PR proposes to restructure the API reference files according to the layout introduced in https://github.com/apache/spark/pull/32799. Now the pandas APIs on Spark are under a separate directory at the same level as other modules such as Spark SQL.

```bash
tree reference
```

**Before:**

```
reference
├── index.rst
├── ps_extensions.rst
├── ps_frame.rst
├── ps_general_functions.rst
├── ps_groupby.rst
├── ps_indexing.rst
├── ps_io.rst
├── ps_ml.rst
├── ps_series.rst
├── ps_window.rst
├── pyspark.ml.rst
├── pyspark.mllib.rst
├── pyspark.pandas.rst
├── pyspark.resource.rst
├── pyspark.rst
├── pyspark.sql.rst
├── pyspark.ss.rst
└── pyspark.streaming.rst
```

**After:**

```
reference
├── index.rst
├── pyspark.ml.rst
├── pyspark.mllib.rst
├── pyspark.pandas
│   ├── extensions.rst
│   ├── frame.rst
│   ├── general_functions.rst
│   ├── groupby.rst
│   ├── index.rst
│   ├── indexing.rst
│   ├── io.rst
│   ├── ml.rst
│   ├── series.rst
│   └── window.rst
├── pyspark.resource.rst
├── pyspark.rst
├── pyspark.sql.rst
├── pyspark.ss.rst
└── pyspark.streaming.rst
```

### Why are the changes needed?

To make the directory structure easier to follow.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually built and tested the docs.

Closes #32812 from HyukjinKwon/SPARK-35646-followup.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-08 19:01:56 +09:00
Takuya UESHIN 04418e18d7 [SPARK-35638][PYTHON] Introduce InternalField to manage dtypes and StructFields
### What changes were proposed in this pull request?

Introduces `InternalField` to manage dtypes and `StructField`s.

`InternalFrame` already manages dtypes, but when it checks Spark's data types, column names, and nullabilities, it tries to run the analysis phase each time it needs them, which causes a performance issue.

It will use the `InternalField` class, which stores the retrieved Spark data types, column names, and nullabilities, and reuses them. Also, when those are known, it just updates and reuses them without asking Spark.

### Why are the changes needed?

Currently there are some performance issues in the pandas-on-Spark layer.

One of them is accessing the Java DataFrame and running the analysis phase too many times, especially just to retrieve the current column names or data types.

We should reduce the amount of unnecessary access.
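
Conceptually (a simplified sketch; the real class has more fields and logic), `InternalField` caches the pieces of schema information that would otherwise require an analysis round-trip to Spark:

```python
from dataclasses import dataclass
from typing import Optional
from pyspark.sql.types import StructField

@dataclass(frozen=True)
class InternalFieldSketch:
    # pandas/NumPy dtype of the column.
    dtype: object
    # Cached Spark name, data type, and nullability; None means
    # "unknown, ask Spark (triggering analysis) when first needed".
    struct_field: Optional[StructField] = None
```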

### Does this PR introduce _any_ user-facing change?

Improves the performance of the pandas-on-Spark layer:

```py
df = ps.read_parquet("/path/to/test.parquet")  # contains ~75 columns
df = df[(df["col"] > 0) & (df["col"] < 10000)]
```

Before the PR, it took about **2.15 sec** and after **1.15 sec**.

### How was this patch tested?

Existing tests.

Closes #32775 from ueshin/issues/SPARK-35638/field.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-08 11:57:28 +09:00
Xinrong Meng dfd8a8dc67 [SPARK-35341][PYTHON] Introduce BooleanExtensionOps
### What changes were proposed in this pull request?

- Introduce `BooleanExtensionOps` in order to make the boolean operators `and` and `or` data-type-based.
- Improve the error messages for the operators `and` and `or`.

### Why are the changes needed?

The boolean operators `__and__`, `__or__`, `__rand__`, and `__ror__` should be data-type-based.

`BooleanExtensionDtype`s process these boolean operators differently from `bool`, so `BooleanExtensionOps` is introduced.

These boolean operators are themselves also bitwise operators, which should be applicable to other data type classes later. However, that is not the goal of this PR.

### Does this PR introduce _any_ user-facing change?

Yes. Error messages for operators `and` and `or` are improved.
Before:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).;
'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))]
+- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L]
   +- LogicalRDD [__index_level_0__#8L, 0#9], false

```

After:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
TypeError: Bitwise or can not be applied to categoricals.
```

### How was this patch tested?

Unit tests.

Closes #32698 from xinrong-databricks/datatypeops_extension.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-07 15:43:52 -07:00
Xinrong Meng 04a8d2cbcf [SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes
### What changes were proposed in this pull request?

Make the conversion from/to pandas (for non-ExtensionDtypes) data-type-based.
NOTE: an Ops class per ExtensionDtype and its data-type-based from/to pandas conversion will be implemented in a separate PR, as https://issues.apache.org/jira/browse/SPARK-35614.

### Why are the changes needed?

The conversion from/to pandas includes logic for checking data types and behaving accordingly, which makes the code hard to change or maintain.
Since we have introduced an Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #32592 from xinrong-databricks/datatypeop_pd_conversion.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-06-07 13:12:12 -07:00
Hyukjin Kwon 7ce7aa4758 [SPARK-35646][PYTHON][DOCS] Relocate pandas-on-Spark API references in documentation
### What changes were proposed in this pull request?

This PR proposes to change from:

![Screen Shot 2021-06-07 at 1 40 47 PM](https://user-images.githubusercontent.com/6477701/120960027-fc302400-c795-11eb-96fb-73ac1d8277fe.png)

to:

![Screen Shot 2021-06-07 at 1 41 19 PM](https://user-images.githubusercontent.com/6477701/120960074-0fdb8a80-c796-11eb-87ec-69a30692fdfe.png)

### Why are the changes needed?

pandas API on Spark (pandas-on-Spark) is, in the end, a package in PySpark, so it has to be documented at the same level as other packages (e.g., Spark SQL).

### Does this PR introduce _any_ user-facing change?

Yes, it changes the structure of the docs. To end users, no, as it's only in the development branch.

### How was this patch tested?

Manually tested as above.

Closes #32799 from HyukjinKwon/SPARK-35646.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-07 16:37:58 +09:00
Xinrong Meng 50f7686de9 [SPARK-35599][PYTHON] Adjust check_exact parameter for older pd.testing
### What changes were proposed in this pull request?

Adjust the `check_exact` parameter for non-numeric columns to ensure pandas-on-Spark tests pass with all pandas versions.

### Why are the changes needed?

`pd.testing` utils are used in pandas-on-Spark tests.
Due to https://github.com/pandas-dev/pandas/issues/35446, `check_exact=True` for non-numeric columns doesn't work with older `pd.testing` utils, e.g. `assert_series_equal`. We want to adjust that to ensure pandas-on-Spark tests pass with all pandas versions.
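
In the test utilities, this amounts to something like the following (a sketch; the exact version cutoff is assumed):

```python
from distutils.version import LooseVersion
import pandas as pd
from pandas.testing import assert_series_equal

left = pd.Series(["a", "b", "c"])
right = pd.Series(["a", "b", "c"])

if LooseVersion(pd.__version__) < LooseVersion("1.1"):
    # Older pandas mishandles check_exact=True for non-numeric columns
    # (pandas-dev/pandas#35446), so fall back to an approximate check.
    assert_series_equal(left, right, check_exact=False)
else:
    assert_series_equal(left, right, check_exact=True)
```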

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #32772 from xinrong-databricks/test_util.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-07 11:12:49 +09:00
itholic b8740a1d1e [SPARK-35499][PYTHON] Apply black to pandas API on Spark codes
### What changes were proposed in this pull request?

This PR proposes applying `black` to the pandas API on Spark code, to improve static analysis.

By executing `./dev/reformat-python` in the Spark home directory, all the pandas API on Spark code is fixed according to the static analysis rules.

### Why are the changes needed?

This reduces the cost of static analysis during development.

It has been used continuously for about a year in the Koalas project, and its convenience has been proven.

### Does this PR introduce _any_ user-facing change?

No, it's dev-only.

### How was this patch tested?

Manually reformatted the pandas API on Spark code by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes.

Closes #32779 from itholic/SPARK-35499.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-06-06 17:30:07 -07:00
Keerthan Vasist f2c0a049a6 [SPARK-35643][PYTHON] Fix ambiguous reference in functions.py column()
### What changes were proposed in this pull request?
In functions.py, a function `def column(col)` was added. There is also another function in the same file, `def col(col)`. This leads to some ambiguity about whether the parameter or the function is being referred to. In PySpark 3.1.2, this leads to `TypeError: 'str' object is not callable` when the function `column(col)` is called: the highest preference is given to the string variable in scope, as opposed to the function `col` in the file, as intended.

This PR fixes that ambiguity by changing the parameter name to `col_like`. I have filed this as an issue on JIRA: https://issues.apache.org/jira/browse/SPARK-35643.
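
A standalone illustration of the shadowing (simplified; not the actual `functions.py` bodies):

```python
def col(name):
    return f"Column<{name}>"

def column(col):
    # The parameter `col` (a str) shadows the module-level function,
    # so this call raises TypeError: 'str' object is not callable.
    return col(col)

def column_fixed(col_like):
    # With the parameter renamed, `col` resolves to the function again.
    return col(col_like)

print(column_fixed("a"))  # Column<a>
```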

### Why are the changes needed?
In PySpark 3.1.2, we see `TypeError: 'str' object is not callable` when the `column()` function is called. This PR fixes that error.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I don't believe this patch needs additional testing.

Closes #32771 from keerthanvasist/col.

Lead-authored-by: Keerthan Vasist <kvasist@amazon.com>
Co-authored-by: keerthanvasist <kvasist@amazon.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-05 12:40:39 +09:00
Hyukjin Kwon 3d158f9c91 [SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation
### What changes were proposed in this pull request?

This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports them almost as-is, except for these differences:

- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation

Other than that,

- Excluded `python/docs/build/html` in the linter
- Fixed GA dependency installation

### Why are the changes needed?

To document pandas APIs on Spark.

### Does this PR introduce _any_ user-facing change?

Yes, it adds new documentations.

### How was this patch tested?

Manually built the docs and checked the output.

Closes #32726 from HyukjinKwon/SPARK-35587.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-04 11:11:09 +09:00
itholic 2658bc590f [SPARK-35081][DOCS] Add Data Source Option links to missing documents
### What changes were proposed in this pull request?

This PR proposes adding the missing links to the Data Source Option page for related functions such as `to_csv`, `to_json`, `from_csv`, `from_json`, `schema_of_csv`, and `schema_of_json`.

- Before
<img width="797" alt="Screen Shot 2021-06-03 at 11 39 17 AM" src="https://user-images.githubusercontent.com/44108233/120578877-7b092200-c461-11eb-9e24-bd5349445c66.png">

- After
<img width="776" alt="Screen Shot 2021-06-03 at 11 59 14 AM" src="https://user-images.githubusercontent.com/44108233/120579868-29fa2d80-c463-11eb-9329-bd6c8f068f5b.png">

### Why are the changes needed?

To provide users with the available options in detail, via the proper documentation links.

### Does this PR introduce _any_ user-facing change?

Yes, links to the Data Source Options page are added to the API documentation, as shown in the screen captures above.

### How was this patch tested?

Manually built the docs and checked them one by one.

Closes #32762 from itholic/SPARK-35081.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-03 13:52:46 +09:00
itholic 48252bac95 [SPARK-35583][DOCS] Move JDBC data source options from Python and Scala into a single page
### What changes were proposed in this pull request?

This PR proposes to move the JDBC data source options from the Python, Scala, and Java documentation into a single page.

### Why are the changes needed?

So far, the documentation for JDBC data source options has been spread across separate pages for each language API. However, this makes managing the many options inconvenient, so it is more efficient to manage all options on a single page and provide a link to that page from each language's API documentation.

### Does this PR introduce _any_ user-facing change?

Yes, the documents will look as shown below after this change:

- "JDBC To Other Databases" page
<img width="803" alt="Screen Shot 2021-06-02 at 11 34 14 AM" src="https://user-images.githubusercontent.com/44108233/120415520-a115c000-c396-11eb-9663-9e666e08ed2b.png">

- Python
![Screen Shot 2021-06-01 at 2 57 40 PM](https://user-images.githubusercontent.com/44108233/120273628-ba146780-c2e9-11eb-96a8-11bd25415197.png)

- Scala
![Screen Shot 2021-06-01 at 2 57 03 PM](https://user-images.githubusercontent.com/44108233/120273567-a2d57a00-c2e9-11eb-9788-ea58028ca0a6.png)

- Java
![Screen Shot 2021-06-01 at 2 58 27 PM](https://user-images.githubusercontent.com/44108233/120273722-d912f980-c2e9-11eb-83b3-e09992d8c582.png)

### How was this patch tested?

Manually built the docs and confirmed the pages.

Closes #32723 from itholic/SPARK-35583.

Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-06-02 14:21:16 +09:00