### What changes were proposed in this pull request?
This is a followup PR for SPARK-36104 (#33307) and removes unused import `typing.cast`.
After that change, Python linter fails.
```
./dev/lint-python
shell: sh -e {0}
env:
LC_ALL: C.UTF-8
LANG: C.UTF-8
pythonLocation: /__t/Python/3.6.13/x64
LD_LIBRARY_PATH: /__t/Python/3.6.13/x64/lib
starting python compilation test...
python compilation succeeded.
starting black test...
black checks passed.
starting pycodestyle test...
pycodestyle checks passed.
starting flake8 test...
flake8 checks failed:
./python/pyspark/pandas/data_type_ops/num_ops.py:19:1: F401 'typing.cast' imported but unused
from typing import cast, Any, Union
^
1 F401 'typing.cast' imported but unused
```
### Why are the changes needed?
To recover CI.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
CI.
Closes#33315 from sarutak/followup-SPARK-36104.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 47fd3173a5)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Manage InternalField for DataTypeOps.neg/abs.
### Why are the changes needed?
The spark data type and nullability must be the same as the original when DataTypeOps.neg/abs.
We should manage InternalField for this case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#33307 from xinrong-databricks/internalField.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 5afc27f899)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Properly set `InternalField` for `DataTypeOps.invert`.
### Why are the changes needed?
The spark data type and nullability must be the same as the original when `DataTypeOps.invert`.
We should manage `InternalField` for this case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33306 from ueshin/issues/SPARK-36103/invert.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit e2021daafb)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Implement unary operator `invert` of integral ps.Series/Index.
### Why are the changes needed?
Currently, unary operator `invert` of integral ps.Series/Index is not supported. We ought to implement that following pandas' behaviors.
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> ~psser
Traceback (most recent call last):
...
NotImplementedError: Unary ~ can not be applied to integrals.
```
After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([1, 2, 3])
>>> ~psser
0 -2
1 -3
2 -4
dtype: int64
```
### How was this patch tested?
Unit tests.
Closes#33285 from xinrong-databricks/numeric_invert.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit badb0393d4)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Properly set `InternalField` more in `DataTypeOps`.
### Why are the changes needed?
There are more places in `DataTypeOps` where we can manage `InternalField`.
We should manage `InternalField` for these cases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33275 from ueshin/issues/SPARK-36064/fields.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 95e6c6e3e9)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adjust `test_astype`, `test_neg` for old pandas versions.
### Why are the changes needed?
There are issues in old pandas versions that fail tests in pandas API on Spark. We ought to adjust `test_astype` and `test_neg` for old pandas versions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests. Please refer to https://github.com/apache/spark/pull/33272 for test results with pandas 1.0.1.
Closes#33250 from xinrong-databricks/SPARK-36035.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 698c4ec16b)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Merge test_decimal_ops into test_num_ops
- merge test_isnull() into test_num_ops.test_isnull()
- remove test_datatype_ops(), which already covered in 11fcbc73cb/python/pyspark/pandas/tests/data_type_ops/test_base.py (L58-L59)
### Why are the changes needed?
Tests for data-type-based operations of decimal Series are in two places:
- python/pyspark/pandas/tests/data_type_ops/test_decimal_ops.py
- python/pyspark/pandas/tests/data_type_ops/test_num_ops.py
We'd better merge test_decimal_ops into test_num_ops.
See also [SPARK-36002](https://issues.apache.org/jira/browse/SPARK-36002) .
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
unittests passed
Closes#33206 from Yikun/SPARK-36002.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit fdc50f4452)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
For tests with operations on different Series, sort index of results before comparing them with pandas.
### Why are the changes needed?
We have many tests with operations on different Series in `spark/python/pyspark/pandas/tests/data_type_ops/` that assume the result's index to be sorted and then compare to the pandas' behavior.
The assumption on the result's index ordering is wrong since Spark DataFrame join is used internally and the order is not preserved if the data being in different partitions.
So we should assume the result to be disordered and sort the index of such results before comparing them with pandas.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#33274 from xinrong-databricks/datatypeops_testdiffframe.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit af81ad0d7e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Try to capture the error message from the `faulthandler` when the Python worker crashes.
### Why are the changes needed?
Currently, we just see an error message saying `"exited unexpectedly (crashed)"` when the UDFs causes the Python worker to crash by like segmentation fault.
We should take advantage of [`faulthandler`](https://docs.python.org/3/library/faulthandler.html) and try to capture the error message from the `faulthandler`.
### Does this PR introduce _any_ user-facing change?
Yes, when a Spark config `spark.python.worker.faulthandler.enabled` is `true`, the stack trace will be seen in the error message when the Python worker crashes.
```py
>>> def f():
... import ctypes
... ctypes.string_at(0)
...
>>> sc.parallelize([1]).map(lambda x: f()).count()
```
```
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed): Fatal Python error: Segmentation fault
Current thread 0x000000010965b5c0 (most recent call first):
File "/.../ctypes/__init__.py", line 525 in string_at
File "<stdin>", line 3 in f
File "<stdin>", line 1 in <lambda>
...
```
### How was this patch tested?
Added some tests, and manually.
Closes#33273 from ueshin/issues/SPARK-36062/faulthandler.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 115b8a180f)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The PR is proposed to standardize TypeError messages for unsupported basic operations by:
- Capitalize the first letter
- Leverage TypeError messages defined in `pyspark/pandas/data_type_ops/base.py`
- Take advantage of the utility `is_valid_operand_for_numeric_arithmetic` to save duplicated TypeError messages
Related unit tests should be adjusted as well.
### Why are the changes needed?
Inconsistent TypeError messages are shown for unsupported data-type-based basic operations.
Take addition's TypeError messages for example:
- addition can not be applied to given types.
- string addition can only be applied to string series or literals.
Standardizing TypeError messages would improve user experience and reduce maintenance costs.
### Does this PR introduce _any_ user-facing change?
No user-facing behavior change. Only TypeError messages are modified.
### How was this patch tested?
Unit tests.
Closes#33237 from xinrong-databricks/datatypeops_err.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 819c482498)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Make unary and comparison operators data-type-based. Refactored operators include:
- Unary operators: `__neg__`, `__abs__`, `__invert__`,
- Comparison operators: `>`, `>=`, `<`, `<=`, `==`, `!=`
Non-goal: Tasks below are inspired during the development of this PR.
[[SPARK-35997] Implement comparison operators for CategoricalDtype in pandas API on Spark](https://issues.apache.org/jira/browse/SPARK-35997)
[[SPARK-36000] Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled](https://issues.apache.org/jira/browse/SPARK-36000)
[[SPARK-36001] Assume result's index to be disordered in tests with operations on different Series](https://issues.apache.org/jira/browse/SPARK-36001)
[[SPARK-36002] Consolidate tests for data-type-based operations of decimal Series](https://issues.apache.org/jira/browse/SPARK-36002)
[[SPARK-36003] Implement unary operator `invert` of numeric ps.Series/Index](https://issues.apache.org/jira/browse/SPARK-36003)
### Why are the changes needed?
We have been refactoring basic operators to be data-type-based for readability, flexibility, and extensibility.
Unary and comparison operators are still not data-type-based yet. We should fill the gaps.
### Does this PR introduce _any_ user-facing change?
Yes.
- Better error messages. For example,
Before:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```
After:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b"2", b"3", b"4"])
>>> -psser
Traceback (most recent call last):
...
TypeError: Unary - can not be applied to binaries.
>>>
```
- Support unary `-` of `bool` Series. For example,
Before:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ...
```
After:
```py
>>> psser = ps.Series([True, False, True])
>>> -psser
0 False
1 True
2 False
dtype: bool
```
### How was this patch tested?
Unit tests.
Closes#33162 from xinrong-databricks/datatypeops_refactor.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
(cherry picked from commit 6e4e04f2a1)
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to bump up the mypy version to 0.910 which is the latest.
### Why are the changes needed?
To catch the type hint mistakes better in PySpark.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
GitHub Actions should test it out.
Closes#33223 from HyukjinKwon/SPARK-35684.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 16c195ccfb)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
Fix the type hint for `pyspark.rdd .RDD.histogram`'s `buckets` argument
The current type hint is incomplete.
![image](https://user-images.githubusercontent.com/17701527/124248180-df7fd580-db22-11eb-8391-ba0bb51d689b.png)
From `pyspark.rdd .RDD.histogram`'s source:
```python
if isinstance(buckets, int):
...
elif isinstance(buckets, (list, tuple)):
...
else:
raise TypeError("buckets should be a list or tuple or number(int or long)")
```
Fixed the warning displayed above.
Fixed warning above with this change.
Closes#33185 from tpvasconcelos/master.
Authored-by: Tomas Pereira de Vasconcelos <tomasvasconcelos1@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
(cherry picked from commit 495d234c6e)
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a cherry-pick of #33179.
We should use `check_exact=False` because the value check in `StatsTest.test_cov_corr_meta` is too strict.
### Why are the changes needed?
In some environment, the precision could be different in pandas' `DataFrame.corr` function and the test `StatsTest.test_cov_corr_meta` fails.
```
AssertionError: DataFrame.iloc[:, 0] (column name="a") are different
DataFrame.iloc[:, 0] (column name="a") values are different (14.28571 %)
[index]: [a, b, c, d, e, f, g]
[left]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
[right]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.807406715958909e-17]
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified tests should still pass.
Closes#33193 from ueshin/issuse/SPARK-35981/3.2/corr.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions.
However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless.
This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.
### Why are the changes needed?
AQE is default on now, we should make the perf better in the default case.
### Does this PR introduce _any_ user-facing change?
yes, a new config.
### How was this patch tested?
new tests
Closes#33172 from cloud-fan/aqe2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
(cherry picked from commit 0c9c8ff569)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Improve unit tests for data-type-based basic operations by:
- removing redundant test cases
- adding `astype` test for ExtensionDtypes
### Why are the changes needed?
Some test cases for basic operations are duplicated after introducing data-type-based basic operations. The PR is proposed to remove redundant test cases.
`astype` is not tested for ExtensionDtypes, which will be adjusted in this PR as well.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#33095 from xinrong-databricks/datatypeops_test.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Introduce `Name` and `Label` type aliases to distinguish what is expected instead of `Any` or `Union[Any, Tuple]`.
- `Label`: `Tuple[Any, ...]`
Internal expression for name-like metadata, like `index_names`, `column_labels`, and `column_label_names` in `InternalFrame`, and similar internal structures.
- `Name`: `Union[Any, Label]`
External expression for user-facing names, which can be scalar values or tuples.
### Why are the changes needed?
Currently `Any` or `Union[Any, Tuple]` is used for name-like types, but type aliases should be used to distinguish what is expected clearly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33159 from ueshin/issues/SPARK-35944/name_and_label.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add deprecation warning for Python 3.6.
### Why are the changes needed?
According to https://endoflife.date/python, Python 3.6 will be EOL on 23 Dec, 2021.
We should prepare for the deprecation of Python 3.6 support in Spark in advance.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Manual tests.
Closes#33139 from xinrong-databricks/deprecate3.6_warn.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently ,we sets the environment variable `PYSPARK_PIN_THREAD` at the client side of `InhertiableThread` API for Py4J (`python/pyspark/util.py`). If the Py4J gateway is created somewhere else (e.g., Zeppelin, etc), it could introduce a breakage at:
```python
from pyspark import SparkContext
jvm = SparkContext._jvm
thread_connection = jvm._gateway_client.get_thread_connection()
# `AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'` (non-pinned thread mode)
# `get_thread_connection` is only in 'ClientServer' (pinned thread mode)
```
This PR proposes to check the given gateway created, and do the pinned thread mode behaviour accordingly so we can avoid any breakage when Py4J server/gateway is created separately from somewhere else without a pinned thread mode.
### Why are the changes needed?
To avoid any potential breakage.
### Does this PR introduce _any_ user-facing change?
No, the change happened only in the master (fdd7ca5f4e).
### How was this patch tested?
This is actually a partial revert of fdd7ca5f4e. As long as the existing tests pass, I guess we're all good.
I also manually tested to make doubly sure:
**Before**:
```python
>>> from pyspark import InheritableThread, inheritable_thread_target
>>> InheritableThread(lambda: 1).start()
>>> inheritable_thread_target(lambda: 1)()
Traceback (most recent call last):
File "/.../python3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/.../python3.8/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/.../spark/python/pyspark/util.py", line 361, in copy_local_properties
InheritableThread._clean_py4j_conn_for_current_thread()
File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread
thread_connection = jvm._gateway_client.get_thread_connection()
AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/util.py", line 324, in wrapped
InheritableThread._clean_py4j_conn_for_current_thread()
File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread
thread_connection = jvm._gateway_client.get_thread_connection()
AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'
```
**After**:
```python
>>> from pyspark import InheritableThread, inheritable_thread_target
>>> InheritableThread(lambda: 1).start()
>>> inheritable_thread_target(lambda: 1)()
1
```
Closes#33147 from HyukjinKwon/SPARK-35946.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Introduces `Axis` type alias for `axis` argument to be consistent.
### Why are the changes needed?
There are many places to use `axis` argument. We should define `Axis` type alias and reuse it to be consistent.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33144 from ueshin/issues/SPARK-35943/axis.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes removing the legacy Koalas version from pandas API on Spark package.
And also remove the Python version check logic since now pandas-on-Spark should follow the PySpark's Python version.
### Why are the changes needed?
Since Koalas is ported into PySpark, we don't need to keep the version logic for Koalas.
### Does this PR introduce _any_ user-facing change?
Now the legacy Koalas user should follow the version from PySpark.
### How was this patch tested?
Manually built the package and see it's successfully done.
Closes#33128 from itholic/SPARK-35873.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Cleaning up the type hints in pandas-on-Spark.
- Use a single file `_typing.py` for type variables or aliases
- Rename `IndexOpsLike` to `SeriesOrIndex`.
- Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively
- Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]`
### Why are the changes needed?
This is a cleanup for the mypy check stuff series.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33117 from ueshin/issues/SPARK-35859/cleanup.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Add path level discover for python unittests.
### Why are the changes needed?
Now we need to specify the python test cases by manually when we add a new testcase. Sometime, we forgot to add the testcase to module list, the testcase would not be executed.
Such as:
- pyspark-core pyspark.tests.test_pin_thread
Thus we need some auto-discover way to find all testcase rather than specified every case by manually.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add below code in end of `dev/sparktestsupport/modules.py`
```python
for m in sorted(all_modules):
for g in sorted(m.python_test_goals):
print(m.name, g)
```
Compare the result before and after:
https://www.diffchecker.com/iO3FvhKLCloses#32867 from Yikun/SPARK_DISCOVER_TEST.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The PR is proposed to support creating a Column of numpy literal value in pandas-on-Spark. It consists of three changes mainly:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.
```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals.
- Enable numpy literals input in `isin` method
Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161).
- To complete mappings between all kinds of numpy literals and Spark data types should be a followup task.
### Why are the changes needed?
Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value.
So `lit` function defined in `pyspark.pandas.spark.functions` is adjusted in order to support that in pandas-on-Spark.
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```
After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0 True
1 True
2 False
3 False
4 False
Name: source, dtype: bool
```
### How was this patch tested?
Unit tests.
Closes#32955 from xinrong-databricks/datatypeops_literal.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Refines type hints in `pyspark.pandas.window`.
Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`.
### Why are the changes needed?
We can use more strict type hints for functions in pyspark.pandas.window using the generic way.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#33097 from ueshin/issues/SPARK-35901/window.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes move `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and added the related tests to the PySpark DataFrame tests.
### Why are the changes needed?
Because now the Koalas is ported into PySpark, so we don't need to Spark auto-patch anymore.
And also `to_pandas_on_spark` is belongs to the pandas-on-Spark DataFrame doesn't look make sense.
### Does this PR introduce _any_ user-facing change?
No, it's kinda internal refactoring stuff.
### How was this patch tested?
Added the related tests and manually check they're passed.
Closes#33054 from itholic/SPARK-35605.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/frame.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#33073 from ueshin/issues/SPARK-35471/disallow_untyped_defs_frame.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/series.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#33045 from ueshin/issues/SPARK-35476/disallow_untyped_defs_series.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add a migration guide for legacy Koalas users in pandas API on Spark.
### Why are the changes needed?
For easier migration.
### Does this PR introduce _any_ user-facing change?
Yes, this adds a new page for migration from Koalas.
### How was this patch tested?
Manually built the docs and checked manually.
Closes#33050 from HyukjinKwon/SPARK-35301.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR follow-up for SPARK-35696 to fix incorrect underline in the documents to remove warnings.
### Why are the changes needed?
We should build the docs without any incorrect documentation style
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually build docs and see the warning is removed
Closes#33052 from itholic/SPARK-35696-followup.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to refine the code examples for pandas-on-Spark since some of them still follows the naming for Koalas.
For example,
```python
kdf = ks.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
```
should be refined to
```python
psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
```
Also fixed the several remaining Koalas stuffs in FAQ
### Why are the changes needed?
Because we don't want to use the name "Koalas" in the Apache Spark anymore.
### Does this PR introduce _any_ user-facing change?
Yes, the examples in the documentation will be changed with refined names.
### How was this patch tested?
Manually built the docs and check one by one.
Closes#33017 from itholic/SPARK-35696.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
revert changes related to ANN
### Why are the changes needed?
using the new `softmax` may cause flaky failure
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
reverted testsuite
Closes#33049 from zhengruifeng/revert_softmax_ann.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
softmax support offset and step, then we can use it in ANN and NB
### Why are the changes needed?
to simplify impl
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuite
Closes#32991 from zhengruifeng/softmax_support_offset_step.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
This PR proposes to fix:
- the Binder integration of pandas API on Spark, and merge them together with the existing PySpark one.
- update quickstart of pandas API on Spark, and make it working
The notebooks can be easily reviewed here:
https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb
Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html
### Why are the changes needed?
- To show the working examples of quickstart to end users.
- To allow users to try out the examples without installation easily.
### Does this PR introduce _any_ user-facing change?
No to end users because the existing quickstart of pandas API on Spark is not released yet.
### How was this patch tested?
I manually tested it by uploading built Spark distribution to Binder. See 3bc15310a0Closes#33041 from HyukjinKwon/SPARK-35588-2.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Throw ValueError if version and timestamp are used together in to_delta
### Why are the changes needed?
read_delta has arguments named `version` and `timestamp`, but they cannot be used together.
We should raise the proper error message when they are used together.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT
Closes#33023 from Yikun/SPARK-35812.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/groupby.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#33032 from ueshin/issues/SPARK-35473/disallow_untyped_defs_groupby.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Properly set `InternalField` for `DataTypeOps.isnull`.
### Why are the changes needed?
The result of `DataTypeOps.isnull` must always be non-nullable boolean.
We should manage `InternalField` for this case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added some more tests.
Closes#33005 from ueshin/issues/SPARK-35847/isnull_field.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Make DecimalOps astype data-type-based.
See more in:
https://github.com/apache/spark/pull/32821#issuecomment-861119905
### Why are the changes needed?
Make DecimalOps astype data-type-based.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test NumOpsTest.test_astype in pyspark/pandas/tests/data_type_ops/test_num_ops.py
Closes#33009 from Yikun/SPARK-35849.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR revise the installation to describe `pip install pyspark[pandas_on_spark]` and removes pandas-on-Spark installation and videos/blogposts.
### Why are the changes needed?
pandas-on-Spark installation is merged to PySpark installation pages. For videos/blogposts, now this is named pandas API on Spark. Old Koalas blogposts and videos are obsolete.
### Does this PR introduce _any_ user-facing change?
To end users, no because the docs are not released yet.
### How was this patch tested?
I manually built the docs and checked the output
Closes#33018 from HyukjinKwon/SPARK-35645.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/base.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32968 from ueshin/issues/SPARK-35470/disallow_untyped_defs_base.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
We propose to
- introduce the Ops class for ExtensionDtypes: `IntegralExtensionOps`, `FractionalExtensionOps`, `StringExtensionOps`
- make the "conversion to pandas" data-type-based for ExtensionDtypes
Non-goal: same arithmetic operation of ExtensionDtypes have different result dtypes between pandas and pandas API on Spark. That should be adjusted in a separated PR if needed.
### Why are the changes needed?
The conversion to pandas includes logic for checking ExtensionDtypes data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have DataTypeOps defined, we are able to dispatch the specific conversion logic to the `ExtensionOps` classes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32910 from xinrong-databricks/datatypeops_pd_ext.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR fixes the cleanup logic in inheritable thread API by following Py4J cleanup logic at https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278.
Currently the tests that use `inheritable_thread_target` are flaky (https://github.com/apache/spark/runs/2870944288):
```
======================================================================
ERROR [71.813s]: test_save_load_pipeline_estimator (pyspark.ml.tests.test_tuning.CrossValidatorTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, in test_save_load_pipeline_estimator
self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, in _run_test_save_load_pipeline_estimator
cvModel2 = crossval2.fit(training)
File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
bestModel = est.fit(dataset, epm[bestIndex])
File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
return self.copy(params)._fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
model = stage.fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
model = stage.fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in _fit
models = pool.map(inheritable_thread_target(trainSingleClass), range(numClasses))
File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
InheritableThread._clean_py4j_conn_for_current_thread()
File "/__w/spark/spark/python/pyspark/util.py", line 389, in _clean_py4j_conn_for_current_thread
del connections[i]
IndexError: deque index out of range
----------------------------------------------------------------------
```
This seems to be because the connection deque `jvm._gateway_client.deque` is accessed, and modified by other threads. Therefore, the number of threads could be changed in the middle. Using `SparkContext._lock` doesn't protect because the deque can be updated for every Java instance access in Py4J.
This PR proposes to use the atomic `deque.remove` in the problematic dequeue alone with try-catch on `ValueError` in case it's [deleted by Py4J](https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278).
### Why are the changes needed?
To fix the flakiness in the tests, and avoid possible breakage in user application by using this API.
### Does this PR introduce _any_ user-facing change?
If users were dependent on InheritableThread with pinned thread mode on, they might have faced such issues intermittently. This PR fixes it.
### How was this patch tested?
Manually tested. CI should test it out too.
Closes#32989 from HyukjinKwon/SPARK-35834.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Deprecate the `DataFrame.to_spark_io`
### Why are the changes needed?
We should deprecate the `DataFrame.to_spark_io` since it's duplicated with `DataFrame.spark.to_spark_io`, and it's not existed in pandas.
### Does this PR introduce _any_ user-facing change?
Yes, users will get warning while using `DataFrame.to_spark_io` api.
### How was this patch tested?
Pass the CIs
Closes#32964 from pingsutw/SPARK-35811.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/32429 and https://github.com/apache/spark/pull/32644.
I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together.
This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at https://github.com/apache/spark/pull/32429)
- Copy local properties when `start` at `InheritableThread` is called to mimic JVM behaviour. Previously it was copied when `InheritableThread` instance was created (related to #32644).
- https://github.com/apache/spark/pull/32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.
### Why are the changes needed?
To mimic the JVM behaviour about thread lifecycle
### Does this PR introduce _any_ user-facing change?
Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled.
In this case, the local properties will be copied when starting the thread instead of defining the `InheritableThread` object.
This is a small difference that wouldn't likely affect end users.
### How was this patch tested?
Existing tests should cover this.
Closes#32962 from HyukjinKwon/SPARK-35498-SPARK-35303.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/generic.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32957 from ueshin/issues/SPARK-35472/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch adds DataTypeOps test to check the ops is loaded as expected.
### Why are the changes needed?
When complete https://github.com/apache/spark/pull/32821, I found there are no test for DataTypeOps. There were many logic when DataTypeOps loaded, it's better to add the test to make sure interface stable.
### Does this PR introduce _any_ user-facing change?
No, test only
### How was this patch tested?
test passed.
Closes#32859 from Yikun/SPARK-XXXXX1.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up of #32886 to fix the Jenkins' linter.
### Why are the changes needed?
The PR #32886 was mistakenly merged before Jenkins' linter passes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes#32965 from ueshin/issues/SPARK-35478/fup.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/window.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on the Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32886 from pingsutw/SPARK-35478.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
- Introduce a DecimalOps for DecimalType
- Make `isnull` method data-type-based
### Why are the changes needed?
Now DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (as https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference here is caused by DecimalType could not have NaN.
https://issues.apache.org/jira/browse/SPARK-35342
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- New added DecimalOpsTest passed
- Existing NumOpsTest passed
Closes#32821 from Yikun/SPARK-35342.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/accessors.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32956 from ueshin/issues/SPARK-35469/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
PySpark added pinned thread mode at https://github.com/apache/spark/pull/24898 to sync Python thread to JVM thread. Previously, one JVM thread could be reused which ends up with messed inheritance hierarchy such as thread local especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default.
### Why are the changes needed?
To correctly support parallel job submission and management.
### Does this PR introduce _any_ user-facing change?
Yes, now Python thread is mapped to JVM thread one to one.
### How was this patch tested?
Existing tests should cover it.
Closes#32429 from HyukjinKwon/SPARK-35303.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to merge contents and remove obsolete pages in Development section, especially about pandas API on Spark.
Some were removed, and some were merged to the existing PySpark guides. I will inline some comments in the PRs to make the review easier.
### Why are the changes needed?
To guide developers on the code base of pandas API on Spark.
### Does this PR introduce _any_ user-facing change?
Yes, it updates the user-facing documentation.
### How was this patch tested?
Manually built the docs and checked.
Closes#32926 from HyukjinKwon/SPARK-35644.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fix the wrong behavior of `Index.difference` in pandas APIs on Spark, based on the comment https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007
- it couldn't handle the case properly when `self` is `Index` or `MultiIndex` and `other` is `MultiIndex` or `Index`.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```
- it's collecting the all data into the driver side when the other is list-like objects, especially when the `other` is distributed object such as Series which is very dangerous.
And added the related test cases.
### Why are the changes needed?
To correct the incompatible behavior with pandas, and to prevent the case which potentially cause the OOM easily.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
MultiIndex([('a', 'x', 1),
('b', 'z', 2),
('k', 'z', 3)],
)
```
And now it only using the for loop when the `other` is only the case `list`, `set` or `dict`.
### Does this PR introduce _any_ user-facing change?
Yes, the previous bug is fixed as described in the above code examples.
### How was this patch tested?
Manually tested with linter and unittest in local, and it might be passed on CI.
Closes#32853 from itholic/SPARK-35683.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Modify the `pandas_udf` usage to use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings.
### Why are the changes needed?
The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings.
We should use type-annotation based `pandas_udf` or avoid specifying udf types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32913 from ueshin/issues/SPARK-35761/suppress_warnings.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface).
### Why are the changes needed?
To make it sound more natural.
### Does this PR introduce _any_ user-facing change?
It fixes a typo in the unreleased changes.
### How was this patch tested?
N/A
Closes#32903 from HyukjinKwon/SPARK-34885.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Removes the upperbound for numpy for pandas-on-Spark.
### Why are the changes needed?
We can remove the upper-bound for numpy for pandas-on-Spark because currently it works well on the CI with numpy 1.20.3.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32908 from ueshin/issues/SPARK-35759/numpy.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make `astype` method data-type-based.
**Non-goal: Match pandas' `astype` TypeErrors.**
Currently, `astype` throws TypeError error messages only when the destination type is not recognized. However, for some destination types that don't make sense to the specific type of Series/Index, for example, `numeric Series/Index → bytes`, we don't have proper TypeError error messages.
Since the goal of the PR is refactoring mainly, the above issue might be resolved later if needed.
### Why are the changes needed?
There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32847 from xinrong-databricks/datatypeops_astype.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172.
```python
ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200)
```
**Before:**
```
pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).;
'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))]
+- Project [a#1L, b#2, c#3L]
+- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L]
+- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false
```
**After:**
```python
Figure({
'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}',
'name': 'a',
...
```
### Why are the changes needed?
To match the behaviour with panadas' and allow users to set `x` and `y` in the DataFrame with non-numeric columns.
### Does this PR introduce _any_ user-facing change?
No to end users since the changes is not released yet. Yes to dev as described before.
### How was this patch tested?
Manually tested, added a test and tested in notebooks:
![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png)
![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png)
Closes#32884 from HyukjinKwon/fix-hist-plot.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes the change the name "Koalas" to the "Pandas APIs on Spark" in the documents.
### Why are the changes needed?
Since we don't use the name "Koalas" anymore.
We should use "Pandas APIs on Spark" instead.
### Does this PR introduce _any_ user-facing change?
Yes, the name "Koalas" is renamed to "Pandas APIs on Spark" in the documents.
### How was this patch tested?
Manually built the docs and checked one by one.
Closes#32835 from itholic/SPARK-35591.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file:
`python/pyspark/pandas/spark/indexing.py`
and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
`./dev/lint-python`
Closes#32738 from pingsutw/SPARK-35474.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adjust pandas-on-spark test_groupby_multiindex_columns test in order to pass with different pandas versions.
### Why are the changes needed?
pandas had introduced bugs as below:
- For pandas 1.1.3 and 1.1.4
Type error: only integer scalar arrays can be converted to a scalar index
- For pandas < 1.0.4
Type error: Can only tuple-index with a MultiIndex
We ought to adjust `test_groupby_multiindex_columns` tests by comparing with a predefined return value, rather than comparing with the pandas return value in the pandas versions above.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32851 from xinrong-databricks/SPARK-35705.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to restructure User Guide in PySpark documentation for pandas APIs on Spark.
**Before**
![Screen Shot 2021-06-08 at 8 47 41 PM](https://user-images.githubusercontent.com/6477701/121179493-cb85e280-c89a-11eb-8b93-552ebe7cd0a8.png)
**After**
![Screen Shot 2021-06-08 at 8 46 58 PM](https://user-images.githubusercontent.com/6477701/121179419-b3ae5e80-c89a-11eb-82a0-6dabbf1de12d.png)
Note that I mostly just moved the contents around except minor changes:
- Removing some questions in FAQ that don't make sense in Apache Spark
- Rename a subtitle "Working with pandas and PySpark" to "From/to pandas and PySpark DataFrames"
For renaming Koalas to either pandas-on-Spark or pandas APIs on Spark, it will be done at SPARK-35591
### Why are the changes needed?
For better readability.
### Does this PR introduce _any_ user-facing change?
Yes, it restructures the documentation as shown above.
### How was this patch tested?
I manually built the docs and tested.
Closes#32820 from HyukjinKwon/SPARK-35647.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Limit the batch size for `add_shuffle_key` in `partitionBy` function to fix `OverflowError: cannot convert float infinity to integer`
### Why are the changes needed?
It's not easy to write a UT, but I can use some simple code to explain the bug.
* Original code
```
def add_shuffle_key(split, iterator):
buckets = defaultdict(list)
c, batch = 0, min(10 * numPartitions, 1000)
for k, v in iterator:
buckets[partitionFunc(k) % numPartitions].append((k, v))
c += 1
# check used memory and avg size of chunk of objects
if (c % 1000 == 0 and get_used_memory() > limit
or c > batch):
n, size = len(buckets), 0
for split in list(buckets.keys()):
yield pack_long(split)
d = outputSerializer.dumps(buckets[split])
del buckets[split]
yield d
size += len(d)
avg = int(size / n) >> 20
# let 1M < avg < 10M
if avg < 1:
batch *= 1.5
elif avg > 10:
batch = max(int(batch / 1.5), 1)
c = 0
```
if `get_used_memory() > limit` always is `True` and `avg < 1` always is `True`, the variable `batch` will grow to infinity. then `batch = max(int(batch / 1.5), 1)` may raise `OverflowError` if `avg > 10` at some time.
* sample code to reproduce the bug
```
import sys
limit = 100
used_memory = 200
numPartitions = 64
c, batch = 0, min(10 * numPartitions, 1000)
while True:
c += 1
if (c % 1000 == 0 and used_memory > limit or c > batch):
batch = batch * 1.5
d = max(int(batch / 1.5), 1)
print(c, batch)
```
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
It's not easy to write a UT, there is sample code to test
```
import sys
limit = 100
used_memory = 200
numPartitions = 64
c, batch = 0, min(10 * numPartitions, 1000)
while True:
c += 1
if (c % 1000 == 0 and used_memory > limit or c > batch):
batch = min(sys.maxsize, batch * 1.5)
d = max(int(batch / 1.5), 1)
print(c, batch)
```
Closes#32667 from nolanliou/fix_partitionby.
Authored-by: liuqi <nolan.liou@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to restructure API files according to the layout, see https://github.com/apache/spark/pull/32799. Now the pandas APIs on Spark are under a separate directory which is same level as other modules such as Spark SQL.
```bash
tree reference
```
**Before:**
```
reference
├── index.rst
├── ps_extensions.rst
├── ps_frame.rst
├── ps_general_functions.rst
├── ps_groupby.rst
├── ps_indexing.rst
├── ps_io.rst
├── ps_ml.rst
├── ps_series.rst
├── ps_window.rst
├── pyspark.ml.rst
├── pyspark.mllib.rst
├── pyspark.pandas.rst
├── pyspark.resource.rst
├── pyspark.rst
├── pyspark.sql.rst
├── pyspark.ss.rst
└── pyspark.streaming.rst
```
**After:**
```
reference
├── index.rst
├── pyspark.ml.rst
├── pyspark.mllib.rst
├── pyspark.pandas
│ ├── extensions.rst
│ ├── frame.rst
│ ├── general_functions.rst
│ ├── groupby.rst
│ ├── index.rst
│ ├── indexing.rst
│ ├── io.rst
│ ├── ml.rst
│ ├── series.rst
│ └── window.rst
├── pyspark.resource.rst
├── pyspark.rst
├── pyspark.sql.rst
├── pyspark.ss.rst
└── pyspark.streaming.rst
```
### Why are the changes needed?
To make the directory structure easier to follow.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually built and tested the docs.
Closes#32812 from HyukjinKwon/SPARK-35646-followup.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduces `InternalField` to manage dtypes and `StructField`s.
`InternalFrame` is already managing dtypes, but when it checks the Spark's data types, column names, and nullabilities, it tries to run the analysis phase each time it needs, which will cause a performance issue.
It will use `InternalField` class which stores the retrieved Spark's data types, column names, and nullabilities, and reuse them. Also, in case those can be known, just update and reuse them without asking Spark.
### Why are the changes needed?
Currently there are some performance issues in the pandas-on-Spark layer.
One of them is accessing Java DataFrame and run analysis phase too many times, especially just for retrieving the current column names or data types.
We should reduce the amount of unnecessary access.
### Does this PR introduce _any_ user-facing change?
Improves the performance in pandas-on-Spark layer:
```py
df = ps.read_parquet("/path/to/test.parquet") # contains ~75 columns
df = df[(df["col"] > 0) & (df["col"] < 10000)]
```
Before the PR, it took about **2.15 sec** and after **1.15 sec**.
### How was this patch tested?
Existing tests.
Closes#32775 from ueshin/issues/SPARK-35638/field.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Introduce BooleanExtensionOps in order to make boolean operators `and` and `or` data-type-based.
- Improve error messages for operators `and` and `or`.
### Why are the changes needed?
Boolean operators __and__, __or__, __rand__, and __ror__ should be data-type-based
BooleanExtensionDtypes processes these boolean operators differently from bool, so BooleanExtensionOps is introduced.
These boolean operators themselves are also bitwise operators, which should be able to apply to other data types classes later. However, this is not the goal of this PR.
### Does this PR introduce _any_ user-facing change?
Yes. Error messages for operators `and` and `or` are improved.
Before:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).;
'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))]
+- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L]
+- LogicalRDD [__index_level_0__#8L, 0#9], false
```
After:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
TypeError: Bitwise or can not be applied to categoricals.
```
### How was this patch tested?
Unit tests.
Closes#32698 from xinrong-databricks/datatypeops_extension.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614.
### Why are the changes needed?
The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32592 from xinrong-databricks/datatypeop_pd_conversion.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to change from:
![Screen Shot 2021-06-07 at 1 40 47 PM](https://user-images.githubusercontent.com/6477701/120960027-fc302400-c795-11eb-96fb-73ac1d8277fe.png)
to:
![Screen Shot 2021-06-07 at 1 41 19 PM](https://user-images.githubusercontent.com/6477701/120960074-0fdb8a80-c796-11eb-87ec-69a30692fdfe.png)
### Why are the changes needed?
pandas APIs on Spark (pandas on Spark) is a package in PySpark in the end. So it has to be documented in the same level with other packages (e.g., Spark SQL).
### Does this PR introduce _any_ user-facing change?
Yes, it changes the structure of the docs. To end users, no as it's only in development branch.
### How was this patch tested?
Manually tested as above.
Closes#32799 from HyukjinKwon/SPARK-35646.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adjust the `check_exact` parameter for non-numeric columns to ensure pandas-on-Spark tests passed with all pandas versions.
### Why are the changes needed?
`pd.testing` utils are utilized in pandas-on-Spark tests.
Due to https://github.com/pandas-dev/pandas/issues/35446, `check_exact=True` for non-numeric columns doesn't work for older pd.testing utils, e.g. `assert_series_equal`. We wanted to adjust that to ensure pandas-on-Spark tests pass for all pandas versions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#32772 from xinrong-databricks/test_util.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis.
By executing the `./dev/reformat-python` in the spark home directory, all the code of the pandas API on Spark is fixed according to the static analysis rules.
### Why are the changes needed?
This can be reduces the cost of static analysis during development.
It has been used continuously for about a year in the Koalas project and its convenience has been proven.
### Does this PR introduce _any_ user-facing change?
No, it's dev-only.
### How was this patch tested?
Manually reformat the pandas API on Spark codes by running the `./dev/reformat-python`, and checked the `./dev/lint-python` is passed.
Closes#32779 from itholic/SPARK-35499.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
In functions.py, there is a function added `def column(col)`. There is also another method in the same file `def col(col)`. This leads to some ambiguity on whether the parameter is being referred to or the function. In pyspark 3.1.2, this leads to `TypeError: 'str' object is not callable` when the function `column(col)` is called - the highest preference is given to the string variable in scope as opposed to the function `col `in the file as intended.
This PR fixes that ambiguity by changing the variable name to `col_like`. I have filed this as an issue on JIRA here - https://issues.apache.org/jira/browse/SPARK-35643.
### Why are the changes needed?
In pyspark 3.1.2, we see `TypeError: 'str' object is not callable` when `column()` function is called. This Pr fixes that error.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I don't believe this patch needs additional testing.
Closes#32771 from keerthanvasist/col.
Lead-authored-by: Keerthan Vasist <kvasist@amazon.com>
Co-authored-by: keerthanvasist <kvasist@amazon.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports almost as is except these differences:
- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation
Other then that,
- Excluded `python/docs/build/html` in linter
- Fixed GA dependency installataion
### Why are the changes needed?
To document pandas APIs on Spark.
### Does this PR introduce _any_ user-facing change?
Yes, it adds new documentations.
### How was this patch tested?
Manually built the docs and checked the output.
Closes#32726 from HyukjinKwon/SPARK-35587.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes adding the missing link to Data Source Option page, for related functions such as `to_csv`, `to_json`, `from_csv`, `from_json`, `schema_of_csv`, `schema_of_json`.
- Before
<img width="797" alt="Screen Shot 2021-06-03 at 11 39 17 AM" src="https://user-images.githubusercontent.com/44108233/120578877-7b092200-c461-11eb-9e24-bd5349445c66.png">
- After
<img width="776" alt="Screen Shot 2021-06-03 at 11 59 14 AM" src="https://user-images.githubusercontent.com/44108233/120579868-29fa2d80-c463-11eb-9329-bd6c8f068f5b.png">
### Why are the changes needed?
To provide users available options in detail with the proper documentation link.
### Does this PR introduce _any_ user-facing change?
Yes, the link to Data Source Options page is added to the API documentations, as shown in the above screen capture.
### How was this patch tested?
Manually built the docs and checked one by one.
Closes#32762 from itholic/SPARK-35081.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes restoring `to_koalas` to keep the backward compatibility, with throwing deprecated warning.
### Why are the changes needed?
If we remove `to_koalas`, the existing Koalas codes that include `to_koalas` wouldn't work.
### Does this PR introduce _any_ user-facing change?
No. It's restoring the existing functionality.
### How was this patch tested?
Manually tested in local.
```shell
>>> sdf.to_koalas()
.../spark/python/pyspark/pandas/frame.py:4550: FutureWarning: DataFrame.to_koalas is deprecated as of DataFrame.to_pandas_on_spark. Please use the API instead.
warnings.warn(
```
Closes#32729 from itholic/SPARK-35539.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support arithmetic operations against bool IndexOpsMixin.
### Why are the changes needed?
Existing binary operations of bool IndexOpsMixin in Koalas do not match pandas’ behaviors.
pandas take True as 1, False as 0 when dealing with numeric values, numeric collections, and numeric Series/Index; whereas Koalas raises an AnalysisException no matter what the binary operation is.
We aim to match pandas' behaviors.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> 1 + psser
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> ps.Series([1, 2, 3]) + psser
Traceback (most recent call last):
...
TypeError: addition can not be applied to given types.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
0 2
1 2
2 1
dtype: int64
>>> 1 + psser
0 2
1 2
2 1
dtype: int64
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
0 2
1 3
2 3
dtype: int64
>>> ps.Series([1, 2, 3]) + psser
0 2
1 3
2 3
dtype: int64
```
### How was this patch tested?
Unit tests.
Closes#32611 from xinrong-databricks/datatypeop_arith_bool.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR aims to move `# noqa` in the Python docstring to the proper place so that hide them from the official documents.
### Why are the changes needed?
If we don't move `# noqa` to the proper place, it is exposed in the middle of the docstring, and it looks a bit wired as below:
<img width="613" alt="Screen Shot 2021-06-01 at 3 17 52 PM" src="https://user-images.githubusercontent.com/44108233/120275617-91da3800-c2ec-11eb-9778-16c5fe789418.png">
### Does this PR introduce _any_ user-facing change?
Yes, the `# noqa` is no more shown in the documents as below:
<img width="609" alt="Screen Shot 2021-06-01 at 3 21 00 PM" src="https://user-images.githubusercontent.com/44108233/120275927-fbf2dd00-c2ec-11eb-950d-346af2745711.png">
### How was this patch tested?
Manually build docs and check.
Closes#32728 from itholic/SPARK-35582.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor".
### Why are the changes needed?
Because we don't use name "Koalas" anymore, rather use "Pandas API on Spark".
So, the related code bases are all need to be changed.
### Does this PR introduce _any_ user-facing change?
Yes, the usage of pandas API on Spark accessor is changed from `df.koalas.[...]`. to `df.pandas_on_spark.[...]`.
**Note:** `df.koalas.[...]` is still available but with deprecated warnings.
### How was this patch tested?
Manually tested in local and checked one by one.
Closes#32674 from itholic/SPARK-35453.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at https://github.com/apache/spark/pull/32657.
Seems like this is because of the latest NumPy's behaviour change, see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`.
pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy.
For the time being, I propose to exclude boolean case alone in percentile/quartile test case
### Why are the changes needed?
To keep the test coverage.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
I roughly locally tested. But it should pass in CI.
Closes#32690 from HyukjinKwon/SPARK-35510.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Re-enable some pandas-on-Spark test cases.
### Why are the changes needed?
pandas version in GitHub Actions is upgraded now so we can re-enable some pandas-on-Spark test cases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes#32682 from xinrong-databricks/enable_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduce a util function `spark_column_equals` to check the underlying expressions of columns are the same or not.
### Why are the changes needed?
In pandas on Spark, there are some places checking the underlying expressions of columns are the same or not, but it's done one-by-one.
We should introduce a util function for it.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The existing tests.
Closes#32680 from ueshin/issues/SPARK-35537/spark_column_equals.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it.
### Why are the changes needed?
The data-type-based-operations class should be set for each individual data type, including BinaryType.
In addition, BinaryType has its special way of addition, which means concatenation.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> psser + b'1'
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
0 [49, 49]
1 [50, 50]
2 [51, 51]
dtype: object
>>> psser + b'1'
0 [49, 49]
1 [50, 49]
2 [51, 49]
dtype: object
```
### How was this patch tested?
Unit tests.
Closes#32665 from xinrong-databricks/datatypeops_binary.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
The PR is proposed to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately.
### Why are the changes needed?
StructType, ArrayType, and MapType are not accepted by DataTypeOps now.
We should handle these complex types. Among them:
- ArrayType supports concatenation: for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]), as concatenation.
- StructOps will be helpful to make to/from pandas conversion data-type-based.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0 [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0 [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```
### How was this patch tested?
Unit tests.
Closes#32626 from xinrong-databricks/datatypeop_complex.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to use a proper built-in exceptions instead of the plain `Exception` in Python.
While I am here, I fixed another minor issue at `DataFrams.schema` together:
```diff
- except AttributeError as e:
- raise Exception(
- "Unable to parse datatype from schema. %s" % e)
+ except Exception as e:
+ raise ValueError(
+ "Unable to parse datatype from schema. %s" % e) from e
```
Now it catches all exceptions during schema parsing, chains the exception with `ValueError`. Previously it only caught `AttributeError` that does not catch all cases.
### Why are the changes needed?
For users to expect the proper exceptions.
### Does this PR introduce _any_ user-facing change?
Yeah, the exception classes became different but should be compatible because previous exception was plain `Exception` which other exceptions inherit.
### How was this patch tested?
Existing unittests should cover,
Closes#31238Closes#32650 from HyukjinKwon/SPARK-32194.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR enables GitHub Actions to test PySpark with Python 3.9.
### Why are the changes needed?
To verify the support of Python 3.9.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Existing tests should cover.
Closes#32657 from HyukjinKwon/SPARK-35506.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Removes APIs which have been deprecated in Koalas.
### Why are the changes needed?
There are some APIs that have been deprecated in Koalas. We shouldn't have those in pandas APIs on Spark.
### Does this PR introduce _any_ user-facing change?
Yes, the APIs deprecated in Koalas will be no longer available.
### How was this patch tested?
Modified some tests which use the deprecated APIs, and the other existing tests should pass.
Closes#32656 from ueshin/issues/SPARK-35505/remove_deprecated.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR enables plot tests with plotly
```bash
./python/run-tests --python-executables=python3 --modules=pyspark-pandas
```
**Before**:
```
Traceback (most recent call last):
File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/.../pyspark/pandas/tests/plot/test_frame_plot_plotly.py", line 42, in <module>
plotly_requirement_message + " Or pandas<1.0; pandas<1.0 does not support latest plotly "
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
```
**After**:
```
...
Starting test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly
...
Finished test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly (23s)
...
Tests passed in 1296 seconds
```
### Why are the changes needed?
For test coverage.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
By running the tests.
Closes#32649 from HyukjinKwon/SPARK-35497.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add thread target wrapper API for pyspark pin thread mode.
### Why are the changes needed?
A helper method which make user easier to write threading code under pin thread mode.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual.
Closes#32644 from WeichenXu123/add_thread_target_wrapper_api.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the files:
- `python/pyspark/pandas/spark/accessors.py`
- `python/pyspark/pandas/typedef/typehints.py`
- `python/pyspark/pandas/utils.py`
and fixes the mypy check failures.
### Why are the changes needed?
We should enable more `disallow_untyped_defs` mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32627 from ueshin/issues/SPARK-35467_35468_35477/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Sets up the `mypy` configuration to enable `disallow_untyped_defs` check for pandas APIs on Spark module.
### Why are the changes needed?
Currently many functions in the main codes in pandas APIs on Spark module are still missing type annotations and disabled `mypy` check `disallow_untyped_defs`.
We should add more type annotations and enable the mypy check.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes#32614 from ueshin/issues/SPARK-35465/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
There are still naming related to Koalas in test and function name. This PR addressed them to fit pandas-on-spark.
- kdf -> psdf
- kser -> psser
- kidx -> psidx
- kmidx -> psmidx
- to_koalas() -> to_pandas_on_spark()
### Why are the changes needed?
This is because the name Koalas is no longer used in PySpark.
### Does this PR introduce _any_ user-facing change?
`to_koalas()` function is renamed to `to_pandas_on_spark()`
### How was this patch tested?
Tested in local manually.
After changing the related naming, I checked them one by one.
Closes#32516 from itholic/SPARK-35364.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
The PR is proposed for **pandas APIs on Spark**, in order to separate arithmetic operations shown as below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__`
DataTypeOps and subclasses are introduced.
The existing behaviors of each arithmetic operation should be preserved.
### Why are the changes needed?
Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types.
Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.
Closes#32596 from xinrong-databricks/datatypeop_arith_fix.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR avoids using f-string format that's a new feature in Python 3.6. Although it's legitimate to use this syntax because Apache Spark supports Python 3.6+, this breaks unofficial support of Python 3.5.
This specific f-string format looks something unnecessary, and doesn't look worth enough to remove such unofficial support because of one string format in an error message.
**NOTE** that this PR doesn't mean that we're maintaining Python 3.5 since we dropped. It just looks like too much to remove that unofficial support only because of one string format and error message.
### Why are the changes needed?
To keep unofficial Python 3.5 support
### Does this PR introduce _any_ user-facing change?
Officially nope.
### How was this patch tested?
Ran the linters.
Closes#32598 from HyukjinKwon/SPARK-35408=followup.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The PR is proposed for **pandas APIs on Spark**, in order to separate arithmetic operations shown as below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__`
DataTypeOps and subclasses are introduced.
The existing behaviors of each arithmetic operation should be preserved.
### Why are the changes needed?
Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types.
Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.
Closes#32469 from xinrong-databricks/datatypeop_arith.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`.
### Why are the changes needed?
This function can be only used from SQL for now.
It's good if we can use this function from Scala/Python code as well as SQL.
### Does this PR introduce _any_ user-facing change?
Yes. Users can use this function from Scala and Python.
### How was this patch tested?
New test.
Closes#32566 from sarutak/sentences-function.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/30309 added a configuration (disabled by default) that simplifies the error messages from Python UDFS, which removed internal stacktrace from Python workers:
```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```
**Before**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../python/pyspark/sql/utils.py", line 127, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
**After**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../python/pyspark/sql/utils.py", line 127, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
Note that the traceback (`return f(*args, **kwargs)`) is almost always same - I would say more than 99%. For 1% case, we can guide developers to enable this configuration for further debugging.
In Databricks, it has been enabled for around 6 months, and I have had zero negative feedback on it.
### Why are the changes needed?
To show simplified exception messages to end users.
### Does this PR introduce _any_ user-facing change?
Yes, it will hide the internal Python worker traceback.
### How was this patch tested?
Existing test cases should cover.
Closes#32569 from HyukjinKwon/SPARK-35419.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fixes `mypy` errors and enables `mypy` check for pandas-on-Spark.
### Why are the changes needed?
The `mypy` check for pandas-on-Spark was disabled when the initial porting.
It should be enabled again; otherwise we will miss type checking errors.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The enabled `mypy` check and existing unit tests should pass.
Closes#32540 from ueshin/issues/SPARK-34941/pandas_mypy.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Provide clearer error message tied to the user's Python code if incorrect parameters are passed to `DataFrame.show` rather than the message about a missing JVM method the user is not calling directly.
```
py4j.Py4JException: Method showString([class java.lang.Boolean, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748
```
### Why are the changes needed?
For faster debugging through actionable error message.
### Does this PR introduce _any_ user-facing change?
No change for the correct parameters but different error messages for the parameters triggering an exception.
### How was this patch tested?
- unit test
- manually in PySpark REPL
Closes#32555 from gerashegalov/df_show_validation.
Authored-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add required imports to Pyspark ML examples in CrossValidator, TrainValidationSplit
### Why are the changes needed?
The examples pass doctests because of previous imports, but as they appear in Pyspark documentation, are incomplete. The additional imports are required to make the example work.
### Does this PR introduce _any_ user-facing change?
No, docs only change.
### How was this patch tested?
Existing tests.
Closes#32554 from srowen/TuningImports.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR removes the check of `summary.logLikelihood` in ml/clustering.py - this GMM test is quite flaky. It fails easily e.g., if:
- change number of partitions;
- just change the way to compute the sum of weights;
- change the underlying BLAS impl
Also uses more permissive precision on `Word2Vec` test case.
### Why are the changes needed?
To recover the build and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test cases.
Closes#32533 from zhengruifeng/SPARK_35392_disable_flaky_gmm_test.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the same issue as #32424.
```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
transform(
"numbers",
lambda number: transform(
"letters",
lambda letter: struct(number.alias("n"), letter.alias("l"))
)
)
).alias("zipped")).show(truncate=False)
```
**Before:**
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
**After:**
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```
### Why are the changes needed?
To produce the correct results.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the results to be correct as mentioned above.
### How was this patch tested?
Added a unit test as well as manually.
Closes#32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use full names of modules in `install.rst` when specifying dependencies.
### Why are the changes needed?
Using full names makes it more clear.
In addition, `pandas APIs on Spark` as a new module can start to be recognized by more people.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual verification.
Closes#32427 from xinrong-databricks/nameDoc.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Port Koalas dependencies appropriately to PySpark dependencies.
### Why are the changes needed?
pandas-on-Spark has its own required dependency and optional dependencies.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes#32386 from xinrong-databricks/portDeps.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The parameter **no_implicit_optional** is defined twice in the mypy configuration, [ligne 20](https://github.com/apache/spark/blob/master/python/mypy.ini#L20) and ligne 105.
### Why are the changes needed?
We would like to keep the mypy configuration clean.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This patch can be tested with `dev/lint-python`
Closes#32418 from garawalid/feature/clean-mypy-config.
Authored-by: garawalid <gwalid94@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.
### Why are the changes needed?
To correctly document.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the user-facing documentation.
### How was this patch tested?
I checked them via running linters.
Closes#32423 from HyukjinKwon/SPARK-35250.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR corrects some exception type when the function input params are failed to validate due to TypeError.
In order to convenient to review, there are 3 commits in this PR:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas
### Why are the changes needed?
As suggestion from Python exception doc [1]: "Raised when an operation or function is applied to an object of inappropriate type.", but there are many Value error are raised in some pyspark code, this patch fix them.
[1] https://docs.python.org/3/library/exceptions.html#TypeError
Note that: this patch only addresses the exsiting some wrong raise type for input validation, the input validation decorator/framework which mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176), would be submited in a speparated patch.
### Does this PR introduce _any_ user-facing change?
Yes, code can raise the right TypeError instead of ValueError.
### How was this patch tested?
Existing test case and UT
Closes#32368 from Yikun/SPARK-35176.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch adds a note for aarch64 user to install the specific pyarrow>=4.0.0.
### Why are the changes needed?
The pyarrow aarch64 support is [introduced](https://github.com/apache/arrow/pull/9285) in [PyArrow 4.0.0](https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0), and it has been published 27.Apr.2021.
See more in [SPARK-34979](https://issues.apache.org/jira/browse/SPARK-34979).
### Does this PR introduce _any_ user-facing change?
Yes, this doc can help user install arrow on aarch64.
### How was this patch tested?
doc test passed.
Closes#32363 from Yikun/SPARK-34979.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
There are two types of dense vectors:
* pyspark.ml.linalg.DenseVector
* pyspark.mllib.linalg.DenseVector
In spark-3.1.1, array_to_vector returns instances of pyspark.ml.linalg.DenseVector.
The documentation is ambiguous & can lead to the false conclusion that instances of
pyspark.mllib.linalg.DenseVector will be returned.
Conversion from ml versions to mllib versions can easly be achieved with
mlutils.convertVectorColumnsToML helper.
### What changes were proposed in this pull request?
Make documentation more explicit
### Why are the changes needed?
The documentation is a bit misleading and users can lose time investigating & realizing there are two DenseVector types.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No test were run as only the documentation was changed
Closes#32255 from jlafaye/master.
Authored-by: Julien Lafaye <jlafaye@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1, remove existing agg, and use a new agg supporting virtual centering
2, add related testsuites
### Why are the changes needed?
centering vectors should accelerate convergence, and generate solution more close to R
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
updated testsuites and added testsuites
Closes#32124 from zhengruifeng/svc_agg_refactor.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Removes PySpark version dependent codes from pyspark.pandas test codes.
### Why are the changes needed?
There are several places to check the PySpark version and switch the logic, but now those are not necessary.
We should remove them.
We will do the same thing after we finish porting tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32300 from xinrong-databricks/port.rmv_spark_version_chk_in_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Consolidate PySpark testing utils by removing `python/pyspark/pandas/testing`, and then creating a file `pandasutils` under `python/pyspark/testing` for test utilities used in `pyspark/pandas`.
### Why are the changes needed?
`python/pyspark/pandas/testing` hold test utilites for pandas-on-spark, and `python/pyspark/testing` contain test utilities for pyspark. Consolidating them makes code cleaner and easier to maintain.
Updated import statements are as shown below:
- from pyspark.testing.sqlutils import SQLTestUtils
- from pyspark.testing.pandasutils import PandasOnSparkTestCase, TestUtils
(PandasOnSparkTestCase is the original ReusedSQLTestCase in `python/pyspark/pandas/testing/utils.py`)
Minor improvements include:
- Usage of missing library's requirement_message
- `except ImportError` rather than `except`
- import pyspark.pandas alias as `ps` rather than `pp`
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests under python/pyspark/pandas/tests.
Closes#32177 from xinrong-databricks/port.merge_utils.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.
### Why are the changes needed?
Bugfix
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#32245 from harupy/SPARK-35142.
Authored-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
There are some more changes in Koalas such as [databricks/koalas#2141](c8f803d6be), [databricks/koalas#2143](913d68868d) after the main code porting, this PR is to synchronize those changes with the `pyspark.pandas`.
### Why are the changes needed?
We should port the whole Koalas codes into PySpark and synchronize them.
### Does this PR introduce _any_ user-facing change?
Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring.
### How was this patch tested?
Manually tested in local.
Closes#32197 from itholic/SPARK-34995-fix.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Index unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the Index unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable Index unit tests.
Closes#32139 from xinrong-databricks/port.indexes_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
There are some more changes in Koalas such as [databricks/koalas#2141](c8f803d6be), [databricks/koalas#2143](913d68868d) after the main code porting, this PR is to synchronize those changes with the `pyspark.pandas`.
### Why are the changes needed?
We should port the whole Koalas codes into PySpark and synchronize them.
### Does this PR introduce _any_ user-facing change?
Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring.
### How was this patch tested?
Manually tested in local.
Closes#32154 from itholic/SPARK-34995.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to rename Koalas to pandas-on-Spark in main codes
### Why are the changes needed?
To have the correct name in PySpark. NOTE that the official name in the main documentation will be pandas APIs on Spark to be extra clear. pandas-on-Spark is not the official term.
### Does this PR introduce _any_ user-facing change?
No, it's master-only change. It changes the docstring and class names.
### How was this patch tested?
Manually tested via:
```bash
./python/run-tests --python-executable=python3 --modules pyspark-pandas
```
Closes#32166 from HyukjinKwon/rename-koalas.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas miscellaneous unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable miscellaneous unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable miscellaneous unit tests.
Closes#32152 from xinrong-databricks/port.misc_tests.
Lead-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Co-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch add `__version__` into pyspark.__init__.__all__ to make the `__version__` as exported explicitly, see more in https://github.com/apache/spark/pull/32110#issuecomment-817331896
### Why are the changes needed?
1. make the `__version__` as exported explicitly
2. cleanup `noqa: F401` on `__version`
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Python related CI passed
Closes#32125 from Yikun/SPARK-34629-Follow.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
### What changes were proposed in this pull request?
Removes PySpark version dependent codes from `pyspark.pandas` main codes.
### Why are the changes needed?
There are several places to check the PySpark version and switch the logic, but now those are not necessary.
We should remove them.
We will do the same thing after we finish porting tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32138 from ueshin/issues/SPARK-35039/pyspark_version.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas internal implementation unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the internal implementation unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable internal implementation unit tests.
Closes#32137 from xinrong-databricks/port.test_internal_impl.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas plot unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the plot unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable plot unit tests.
Closes#32151 from xinrong-databricks/port.plot_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a new line to the `lineSep` parameter so that the doc renders correctly.
### Why are the changes needed?
> <img width="608" alt="image" src="https://user-images.githubusercontent.com/8269566/114631408-5c608900-9c71-11eb-8ded-ae1e21ae48b2.png">
The first line of the description is part of the signature and is **bolded**.
### Does this PR introduce _any_ user-facing change?
Yes, it changes how the docs for `pyspark.sql.DataFrameWriter.json` are rendered.
### How was this patch tested?
I didn't test it; I don't have the doc rendering tool chain on my machine, but the change is obvious.
Closes#32153 from AlexMooney/patch-1.
Authored-by: Alex Mooney <alexmooney@fastmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame-related unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not fully tested. We should enable the DataFrame-related unit tests first.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable DataFrame-related unit tests.
Closes#32131 from xinrong-databricks/port.test_dataframe_related.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Series related unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not fully tested. We should enable the Series related unit tests first.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable Series-related unit tests.
Closes#32117 from xinrong-databricks/port.test_series_related.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas operations on different frames unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the operations on different frames unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable operations on different frames unit tests.
Closes#32133 from xinrong-databricks/port.test_ops_on_diff_frames.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix type hints mismatches in pyspark.sql.*
### Why are the changes needed?
There were some mismatches in pyspark.sql.*
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
dev/lint-python passed.
Closes#32122 from Yikun/SPARK-35019.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix:
```python
import pyspark.pandas as pp
```
to
```python
import pyspark.pandas as ps
```
### Why are the changes needed?
`pp` might sound offensive in some contexts.
### Does this PR introduce _any_ user-facing change?
The change is in master only. We'll use `ps` as the short name instead of `pp`.
### How was this patch tested?
The CI in this PR will test it out.
Closes#32108 from LSturtew/renaming_pyspark.pandas.
Authored-by: Luka Sturtewagen <luka.sturtewagen@linkit.nl>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of #32069.
Makes some doctests which could be flaky skip.
### Why are the changes needed?
Some doctests in `pyspark.pandas` module enabled at #32069 could be flaky because the result row order is nondeterministic.
- groupby-apply with UDF which has a return type annotation will lose its index.
- `Index.symmetric_difference` uses `DataFrame.intersect` and `subtract` internally.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32116 from ueshin/issues/SPARK-34972/fix_flaky_tests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds the typehint of pyspark.__version__, which was mentioned in [SPARK-34630](https://issues.apache.org/jira/browse/SPARK-34630).
### Why are the changes needed?
There were some short discussion happened in https://github.com/apache/spark/pull/31823#discussion_r593830911 .
After further deep investigation on [1][2], we can see the `pyspark.__version__` is added by [setup.py](c06758834e/python/setup.py (L201)), it makes `__version__` embedded into pyspark module, that means the `__init__.pyi` is the right place to add the typehint for `__version__`.
So, this patch adds the type hint `__version__` in pyspark/__init__.pyi.
[1] [PEP-396 Module Version Numbers](https://www.python.org/dev/peps/pep-0396/)
[2] https://packaging.python.org/guides/single-sourcing-package-version/
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
1. Disable the ignore_error on
ee7bf7d962/python/mypy.ini (L132)
2. Run mypy:
- Before fix
```shell
(venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark | grep version
python/pyspark/pandas/spark/accessors.py:884: error: Module has no attribute "__version__"
```
- After fix
```shell
(venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark | grep version
```
no output
Closes#32110 from Yikun/SPARK-34629.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame unit test to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested at all. We should enable the DataFrame unit test first.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable the DataFrame unit test.
Closes#32083 from xinrong-databricks/port.test_dataframe.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into PySpark code base (#32036), we should enable doctests on the Spark's infrastructure.
### Why are the changes needed?
Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests separately later.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enabled the whole doctests.
Closes#32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch add the `-fr` argument to `xargs rm`.
### Why are the changes needed?
This cmd is unavailable in basic case. If the find command does not get any search results, the rm command is invoked with an empty argument list, and then we will get a `rm: missing operand` and break, then the coverage report does not generate.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
python/run-tests-with-coverage --testnames pyspark.sql.tests.test_arrow --python-executables=python
The coverage report result is generated without break.
Closes#32064 from Yikun/patch-1.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
As a first step of [SPARK-34849](https://issues.apache.org/jira/browse/SPARK-34849), this PR proposes porting the Koalas main code into PySpark.
This PR contains minimal changes to the existing Koalas code as follows:
1. `databricks.koalas` -> `pyspark.pandas`
2. `from databricks import koalas as ks` -> `from pyspark import pandas as pp`
3. `ks.xxx -> pp.xxx`
Other than them:
1. Added a line to `python/mypy.ini` in order to ignore the mypy test. See related issue at [SPARK-34941](https://issues.apache.org/jira/browse/SPARK-34941).
2. Added a comment to several lines in several files to ignore the flake8 F401. See related issue at [SPARK-34943](https://issues.apache.org/jira/browse/SPARK-34943).
When this PR is merged, all the features that were previously used in [Koalas](https://github.com/databricks/koalas) will be available in PySpark as well.
Users can access to the pandas API in PySpark as below:
```python
>>> from pyspark import pandas as pp
>>> ppdf = pp.DataFrame({"A": [1, 2, 3], "B": [15, 20, 25]})
>>> ppdf
A B
0 1 15
1 2 20
2 3 25
```
The existing "options and settings" in Koalas are also available in the same way:
```python
>>> from pyspark.pandas.config import set_option, reset_option, get_option
>>> ppser1 = pp.Series([1, 2, 3])
>>> ppser2 = pp.Series([3, 4, 5])
>>> ppser1 + ppser2
Traceback (most recent call last):
...
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
>>> set_option("compute.ops_on_diff_frames", True)
>>> ppser1 + ppser2
0 4
1 6
2 8
dtype: int64
```
Please also refer to the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html) and [Options and Settings](https://koalas.readthedocs.io/en/latest/user_guide/options.html) for more detail.
**NOTE** that this PR intentionally ports the main codes of Koalas first almost as are with minimal changes because:
- Koalas project is fairly large. Making some changes together for PySpark will make it difficult to review the individual change.
Koalas dev includes multiple Spark committers who will review. By doing this, the committers will be able to more easily and effectively review and drive the development.
- Koalas tests and documentation require major changes to make it look great together with PySpark whereas main codes do not require.
- We lately froze the Koalas codebase, and plan to work together on the initial porting. By porting the main codes first as are, it unblocks the Koalas dev to work on other items in parallel.
I promise and will make sure on:
- Rename Koalas to PySpark pandas APIs and/or pandas-on-Spark accordingly in documentation, and the docstrings and comments in the main codes.
- Triage APIs to remove that don’t make sense when Koalas is in PySpark
The documentation changes will be tracked in [SPARK-34885](https://issues.apache.org/jira/browse/SPARK-34885), the test code changes will be tracked in [SPARK-34886](https://issues.apache.org/jira/browse/SPARK-34886).
### Why are the changes needed?
Please refer to:
- [[DISCUSS] Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html)
- [[VOTE] SPIP: Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html)
### Does this PR introduce _any_ user-facing change?
Yes, now users can use the pandas APIs on Spark
### How was this patch tested?
Manually tested for exposed major APIs and options as described above.
### Koalas contributors
Koalas would not have been possible without the following contributors:
ueshin
HyukjinKwon
rxin
xinrong-databricks
RainFung
charlesdong1991
harupy
floscha
beobest2
thunterdb
garawalid
LucasG0
shril
deepyaman
gioa
fwani
90jam
thoo
AbdealiJK
abishekganesh72
gliptak
DumbMachine
dvgodoy
stbof
nitlev
hjoo
gatorsmile
tomspur
icexelloss
awdavidson
guyao
akhilputhiry
scook12
patryk-oleniuk
tracek
dennyglee
athena15
gstaubli
WeichenXu123
hsubbaraj
lfdversluis
ktksq
shengjh
margaret-databricks
LSturtew
sllynn
manuzhang
jijosg
sadikovi
Closes#32036 from itholic/SPARK-34890.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR replaces the non-ASCII characters to ASCII characters when possible in PySpark documentation
### Why are the changes needed?
To avoid unnecessarily using other non-ASCII characters which could lead to the issue such as https://github.com/apache/spark/pull/32047 or https://github.com/apache/spark/pull/22782
### Does this PR introduce _any_ user-facing change?
Virtually no.
### How was this patch tested?
Found via (Mac OS):
```bash
# In Spark root directory
cd python
pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .`
```
Closes#32048 from HyukjinKwon/minor-fix.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
As a followup for #29818, document caveats of using the Arrow selfDestruct option in toPandas, which include:
- toPandas() may be slower;
- the resulting dataframe may not support some Pandas operations due to immutable backing arrays.
### Why are the changes needed?
This will hopefully reduce user confusion as with SPARK-34463.
### Does this PR introduce _any_ user-facing change?
Yes - documentation is updated and a config setting description is updated to clearly indicate the config is experimental.
### How was this patch tested?
This is a documentation-only change.
Closes#31738 from lidavidm/spark-34463.
Authored-by: David Li <li.davidm96@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that `quoteIfNeeded` quotes a name only if it contains `.` or ``` ` ```.
This method should quote it if it contains non-word characters.
### Why are the changes needed?
It's a potential bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#31964 from sarutak/fix-quoteIfNeeded.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR implements the missing typehints as per SPARK-34630.
### Why are the changes needed?
To satisfy the aforementioned Jira ticket
### Does this PR introduce _any_ user-facing change?
No, just adding a missing typehint for Project Zen
### How was this patch tested?
No tests needed (just adding a typehint)
Closes#31823 from dannymeijer/feature/SPARK-34630.
Authored-by: Danny Meijer <danny.meijer@nike.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>