### What changes were proposed in this pull request?
By default, AQE sets `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the Spark default parallelism, which is usually quite large. This keeps the parallelism on par with non-AQE, to avoid perf regressions.
However, this usually leads to many small or empty partitions, which hurts performance (although not worse than non-AQE). Users often blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config fairly useless.
This PR adds a new config to set the minimum partition size, to avoid partitions that are too small after coalescing. By default, Spark will not respect the target size and will only respect this minimum partition size, to maximize parallelism and avoid perf regressions in AQE. This PR also adds a boolean config to respect the target size when coalescing partitions; setting it is recommended for better overall performance. Finally, this PR deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config.
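For illustration, here is a hedged sketch of how these options might be set; the config key names are assumptions based on this description, so verify them against your Spark version's AQE documentation:
```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.adaptive.enabled", "true")
    # Assumed key for the new min-partition-size config described above.
    .config("spark.sql.adaptive.coalescePartitions.minPartitionSize", "1MB")
    # Assumed key for the new boolean config; false = respect the target size.
    .config("spark.sql.adaptive.coalescePartitions.parallelismFirst", "false")
    .getOrCreate()
)
```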
### Why are the changes needed?
AQE is on by default now, so we should improve performance in the default case.
### Does this PR introduce _any_ user-facing change?
Yes, a new config.
### How was this patch tested?
New tests.
Closes #33172 from cloud-fan/aqe2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Improve unit tests for data-type-based basic operations by:
- removing redundant test cases
- adding `astype` test for ExtensionDtypes
### Why are the changes needed?
Some test cases for basic operations are duplicated after introducing data-type-based basic operations. The PR is proposed to remove redundant test cases.
`astype` is not tested for ExtensionDtypes, which will be adjusted in this PR as well.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #33095 from xinrong-databricks/datatypeops_test.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Introduce `Name` and `Label` type aliases to distinguish what is expected instead of `Any` or `Union[Any, Tuple]`.
- `Label`: `Tuple[Any, ...]`
Internal expression for name-like metadata, like `index_names`, `column_labels`, and `column_label_names` in `InternalFrame`, and similar internal structures.
- `Name`: `Union[Any, Label]`
External expression for user-facing names, which can be scalar values or tuples.
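A minimal sketch of these aliases as described above (the real definitions presumably live in the pandas-on-Spark typing module and may differ in detail):
```python
from typing import Any, Tuple, Union

# Internal: labels are always normalized to tuples,
# e.g. ("a",) for a plain column or ("a", "b") under a MultiIndex.
Label = Tuple[Any, ...]

# External: user-facing names may be scalars or tuples.
Name = Union[Any, Label]
```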
### Why are the changes needed?
Currently `Any` or `Union[Any, Tuple]` is used for name-like types, but type aliases should be used to distinguish what is expected clearly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33159 from ueshin/issues/SPARK-35944/name_and_label.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add deprecation warning for Python 3.6.
### Why are the changes needed?
According to https://endoflife.date/python, Python 3.6 will be EOL on 23 Dec, 2021.
We should prepare for the deprecation of Python 3.6 support in Spark in advance.
### Does this PR introduce _any_ user-facing change?
N/A.
### How was this patch tested?
Manual tests.
Closes #33139 from xinrong-databricks/deprecate3.6_warn.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, we set the environment variable `PYSPARK_PIN_THREAD` on the client side of the `InheritableThread` API for Py4J (`python/pyspark/util.py`). If the Py4J gateway is created somewhere else (e.g., Zeppelin), it could introduce a breakage at:
```python
from pyspark import SparkContext
jvm = SparkContext._jvm
thread_connection = jvm._gateway_client.get_thread_connection()
# `AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'` (non-pinned thread mode)
# `get_thread_connection` is only in 'ClientServer' (pinned thread mode)
```
This PR proposes to check how the given gateway was created, and apply the pinned thread mode behaviour accordingly, so we can avoid any breakage when the Py4J server/gateway is created separately somewhere else without pinned thread mode.
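A sketch of such a check; the helper name here is hypothetical, but `ClientServer` is the Py4J class used for pinned thread mode:
```python
from py4j.clientserver import ClientServer

def _is_pinned_thread_mode(gateway) -> bool:
    # Only the ClientServer gateway (pinned thread mode) provides
    # per-thread connections; a plain JavaGateway/GatewayClient does not.
    return isinstance(gateway, ClientServer)
```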
### Why are the changes needed?
To avoid any potential breakage.
### Does this PR introduce _any_ user-facing change?
No, the change happened only in the master (fdd7ca5f4e).
### How was this patch tested?
This is actually a partial revert of fdd7ca5f4e. As long as the existing tests pass, I guess we're all good.
I also manually tested to make doubly sure:
**Before**:
```python
>>> from pyspark import InheritableThread, inheritable_thread_target
>>> InheritableThread(lambda: 1).start()
>>> inheritable_thread_target(lambda: 1)()
Traceback (most recent call last):
File "/.../python3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/.../python3.8/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/.../spark/python/pyspark/util.py", line 361, in copy_local_properties
InheritableThread._clean_py4j_conn_for_current_thread()
File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread
thread_connection = jvm._gateway_client.get_thread_connection()
AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/util.py", line 324, in wrapped
InheritableThread._clean_py4j_conn_for_current_thread()
File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread
thread_connection = jvm._gateway_client.get_thread_connection()
AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'
```
**After**:
```python
>>> from pyspark import InheritableThread, inheritable_thread_target
>>> InheritableThread(lambda: 1).start()
>>> inheritable_thread_target(lambda: 1)()
1
```
Closes #33147 from HyukjinKwon/SPARK-35946.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Introduces `Axis` type alias for `axis` argument to be consistent.
### Why are the changes needed?
There are many places that use the `axis` argument. We should define an `Axis` type alias and reuse it for consistency.
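A sketch of such an alias (the actual definition may differ):
```python
from typing import Union

# axis=0/"index" for row-wise, axis=1/"columns" for column-wise operations.
Axis = Union[int, str]
```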
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33144 from ueshin/issues/SPARK-35943/axis.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes removing the legacy Koalas version from the pandas API on Spark package.
It also removes the Python version check logic, since pandas-on-Spark now follows PySpark's Python version.
### Why are the changes needed?
Since Koalas is ported into PySpark, we don't need to keep the version logic for Koalas.
### Does this PR introduce _any_ user-facing change?
Now legacy Koalas users should follow the version from PySpark.
### How was this patch tested?
Manually built the package and verified that it succeeds.
Closes #33128 from itholic/SPARK-35873.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Cleaning up the type hints in pandas-on-Spark.
- Use a single file `_typing.py` for type variables or aliases
- Rename `IndexOpsLike` to `SeriesOrIndex`.
- Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively
- Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]`
### Why are the changes needed?
This is a cleanup as part of the series of mypy check fixes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33117 from ueshin/issues/SPARK-35859/cleanup.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Add path-level discovery for Python unit tests.
### Why are the changes needed?
Currently we need to specify Python test cases manually when we add a new test case. Sometimes we forget to add the test case to the module list, and then it is never executed.
Such as:
- pyspark-core pyspark.tests.test_pin_thread
Thus we need an auto-discovery mechanism to find all test cases rather than specifying every case manually.
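One plausible shape for such discovery; this is a sketch, and the helper actually added to `dev/sparktestsupport/modules.py` may differ:
```python
import os
from fnmatch import fnmatch

def discover_test_goals(base_dir):
    """Return dotted module names for every test_*.py file under base_dir."""
    goals = []
    for root, _, files in os.walk(base_dir):
        for name in files:
            if fnmatch(name, "test_*.py"):
                rel = os.path.relpath(os.path.join(root, name[:-3]), base_dir)
                goals.append(rel.replace(os.sep, "."))
    return sorted(goals)
```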
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add the code below at the end of `dev/sparktestsupport/modules.py`:
```python
for m in sorted(all_modules):
    for g in sorted(m.python_test_goals):
        print(m.name, g)
```
Compare the results before and after: https://www.diffchecker.com/iO3FvhKL
Closes #32867 from Yikun/SPARK_DISCOVER_TEST.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR supports creating a Column out of a numpy literal value in pandas-on-Spark. It consists mainly of three changes:
- Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input.
```py
>>> from pyspark.pandas.spark import functions as SF
>>> SF.lit(np.int64(1))
Column<'CAST(1 AS BIGINT)'>
>>> SF.lit(np.int32(1))
Column<'CAST(1 AS INT)'>
>>> SF.lit(np.int8(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.byte(1))
Column<'CAST(1 AS TINYINT)'>
>>> SF.lit(np.float32(1))
Column<'CAST(1.0 AS FLOAT)'>
```
- Substitute `F.lit` with `SF.lit`; that is, use the `lit` function defined in `pyspark.pandas.spark.functions` rather than the one defined in `pyspark.sql.functions`, to allow creating columns out of numpy literals.
- Enable numpy literal input in the `isin` method.
Non-goal:
- Some pandas-on-Spark APIs use PySpark column-related APIs internally, and those APIs don't support numpy literals, so numpy literals are disallowed as input (e.g. the `to_replace` parameter in the `replace` API). This PR doesn't aim to adjust all of them; it adjusts `isin` only, since that is what inspired the PR (see https://github.com/databricks/koalas/issues/2161).
- Completing the mappings between all kinds of numpy literals and Spark data types should be a follow-up task.
### Why are the changes needed?
Spark's `lit` function (defined in `pyspark.sql.functions`) doesn't support creating a Column out of a numpy literal value.
So `lit` function defined in `pyspark.pandas.spark.functions` is adjusted in order to support that in pandas-on-Spark.
### Does this PR introduce _any_ user-facing change?
Yes.
Before:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
Traceback (most recent call last):
...
AttributeError: 'numpy.int64' object has no attribute '_get_object_id'
```
After:
```py
>>> a = ps.DataFrame({'source': [1,2,3,4,5]})
>>> a.source.isin([np.int64(1), np.int64(2)])
0 True
1 True
2 False
3 False
4 False
Name: source, dtype: bool
```
### How was this patch tested?
Unit tests.
Closes #32955 from xinrong-databricks/datatypeops_literal.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Refines type hints in `pyspark.pandas.window`.
Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`.
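A sketch of the generic pattern (class and type-variable names here are assumptions, not the exact pandas-on-Spark internals):
```python
from typing import Generic, TypeVar

FrameLike = TypeVar("FrameLike")

class RollingLike(Generic[FrameLike]):
    # A window parameterized by the frame-like type it was created from,
    # so aggregations on a DataFrame-backed Rolling are typed as DataFrame.
    def sum(self) -> FrameLike:
        ...
```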
### Why are the changes needed?
We can use stricter type hints for functions in `pyspark.pandas.window` by using generics.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #33097 from ueshin/issues/SPARK-35901/window.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes moving the `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and adds the related tests to the PySpark DataFrame tests.
### Why are the changes needed?
Now that Koalas is ported into PySpark, we don't need the Spark auto-patching anymore.
Also, having `to_pandas_on_spark` belong to the pandas-on-Spark DataFrame doesn't make sense.
### Does this PR introduce _any_ user-facing change?
No, it's internal refactoring.
### How was this patch tested?
Added the related tests and manually checked that they pass.
Closes #33054 from itholic/SPARK-35605.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/frame.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #33073 from ueshin/issues/SPARK-35471/disallow_untyped_defs_frame.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/series.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #33045 from ueshin/issues/SPARK-35476/disallow_untyped_defs_series.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert the changes related to ANN.
### Why are the changes needed?
Using the new `softmax` may cause flaky test failures.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Reverted test suite.
Closes #33049 from zhengruifeng/revert_softmax_ann.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make `softmax` support offset and step, so that we can use it in ANN and NB.
### Why are the changes needed?
To simplify the implementation.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suite.
Closes #32991 from zhengruifeng/softmax_support_offset_step.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
Throw `ValueError` if `version` and `timestamp` are used together in `to_delta`.
### Why are the changes needed?
`read_delta` has arguments named `version` and `timestamp`, but they cannot be used together.
We should raise a proper error when they are used together.
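The guard presumably looks something like this (a sketch; the signature is abbreviated and the error message text is assumed):
```python
def read_delta(path, version=None, timestamp=None, **options):
    # version and timestamp are mutually exclusive ways to pin a snapshot.
    if version is not None and timestamp is not None:
        raise ValueError("version and timestamp cannot be used together.")
    ...
```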
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit tests.
Closes #33023 from Yikun/SPARK-35812.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/groupby.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #33032 from ueshin/issues/SPARK-35473/disallow_untyped_defs_groupby.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Properly set `InternalField` for `DataTypeOps.isnull`.
### Why are the changes needed?
The result of `DataTypeOps.isnull` must always be a non-nullable boolean.
We should manage `InternalField` for this case.
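A rough sketch of what managing the field means here; the constructor shape is assumed, not the exact internals:
```python
import numpy as np
from pyspark.pandas.internal import InternalField
from pyspark.sql.types import BooleanType, StructField

# The isnull result is always a non-nullable boolean, so its InternalField
# can be built directly instead of re-running Spark's analysis phase.
field = InternalField(
    dtype=np.dtype("bool"),
    struct_field=StructField("col", BooleanType(), nullable=False),
)
```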
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added some more tests.
Closes #33005 from ueshin/issues/SPARK-35847/isnull_field.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Make DecimalOps astype data-type-based.
See more in:
https://github.com/apache/spark/pull/32821#issuecomment-861119905
### Why are the changes needed?
Make DecimalOps astype data-type-based.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test `NumOpsTest.test_astype` in `pyspark/pandas/tests/data_type_ops/test_num_ops.py`.
Closes #33009 from Yikun/SPARK-35849.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/base.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32968 from ueshin/issues/SPARK-35470/disallow_untyped_defs_base.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
We propose to
- introduce the Ops class for ExtensionDtypes: `IntegralExtensionOps`, `FractionalExtensionOps`, `StringExtensionOps`
- make the "conversion to pandas" data-type-based for ExtensionDtypes
Non-goal: the same arithmetic operations on ExtensionDtypes can have different result dtypes between pandas and pandas API on Spark. That should be adjusted in a separate PR if needed.
### Why are the changes needed?
The conversion to pandas includes logic for checking ExtensionDtypes data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have DataTypeOps defined, we are able to dispatch the specific conversion logic to the `ExtensionOps` classes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32910 from xinrong-databricks/datatypeops_pd_ext.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR fixes the cleanup logic in inheritable thread API by following Py4J cleanup logic at https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278.
Currently the tests that use `inheritable_thread_target` are flaky (https://github.com/apache/spark/runs/2870944288):
```
======================================================================
ERROR [71.813s]: test_save_load_pipeline_estimator (pyspark.ml.tests.test_tuning.CrossValidatorTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, in test_save_load_pipeline_estimator
self._run_test_save_load_pipeline_estimator(DummyLogisticRegression)
File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, in _run_test_save_load_pipeline_estimator
cvModel2 = crossval2.fit(training)
File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit
bestModel = est.fit(dataset, epm[bestIndex])
File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit
return self.copy(params)._fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
model = stage.fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit
model = stage.fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit
return self._fit(dataset)
File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in _fit
models = pool.map(inheritable_thread_target(trainSingleClass), range(numClasses))
File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 266, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 644, in get
raise self._value
File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped
InheritableThread._clean_py4j_conn_for_current_thread()
File "/__w/spark/spark/python/pyspark/util.py", line 389, in _clean_py4j_conn_for_current_thread
del connections[i]
IndexError: deque index out of range
----------------------------------------------------------------------
```
This seems to be because the connection deque `jvm._gateway_client.deque` is accessed and modified by other threads, so the number of connections in the deque can change in the middle of iteration. Using `SparkContext._lock` doesn't protect against this because the deque can be updated on every Java instance access in Py4J.
This PR proposes to use the atomic `deque.remove` in the problematic cleanup, along with a try/except on `ValueError` in case the connection was already [deleted by Py4J](https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278).
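A self-contained sketch of that cleanup pattern (variable and function names assumed):
```python
from collections import deque

def clean_connection(connections: deque, thread_connection) -> None:
    # deque.remove is atomic under the GIL, unlike index-then-delete.
    try:
        connections.remove(thread_connection)
    except ValueError:
        # Already removed from the deque by Py4J's own cleanup; nothing to do.
        pass
```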
### Why are the changes needed?
To fix the flakiness in the tests, and avoid possible breakage in user application by using this API.
### Does this PR introduce _any_ user-facing change?
If users were dependent on InheritableThread with pinned thread mode on, they might have faced such issues intermittently. This PR fixes it.
### How was this patch tested?
Manually tested. CI should test it out too.
Closes #32989 from HyukjinKwon/SPARK-35834.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Deprecate `DataFrame.to_spark_io`.
### Why are the changes needed?
We should deprecate `DataFrame.to_spark_io` since it duplicates `DataFrame.spark.to_spark_io` and does not exist in pandas.
### Does this PR introduce _any_ user-facing change?
Yes, users will get a deprecation warning when using the `DataFrame.to_spark_io` API.
### How was this patch tested?
Pass the CIs
Closes #32964 from pingsutw/SPARK-35811.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/32429 and https://github.com/apache/spark/pull/32644.
I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together.
This PR includes:
- Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at https://github.com/apache/spark/pull/32429)
- Copy local properties when `start` of `InheritableThread` is called, to mimic JVM behaviour. Previously they were copied when the `InheritableThread` instance was created (related to #32644).
- https://github.com/apache/spark/pull/32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default.
### Why are the changes needed?
To mimic the JVM behaviour about thread lifecycle
### Does this PR introduce _any_ user-facing change?
Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled.
In this case, the local properties will be copied when the thread is started instead of when the `InheritableThread` object is defined.
This is a small difference that wouldn't likely affect end users.
### How was this patch tested?
Existing tests should cover this.
Closes #32962 from HyukjinKwon/SPARK-35498-SPARK-35303.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/generic.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32957 from ueshin/issues/SPARK-35472/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch adds a DataTypeOps test to check that the ops are loaded as expected.
### Why are the changes needed?
While completing https://github.com/apache/spark/pull/32821, I found there were no tests for DataTypeOps loading. There is a lot of logic involved when a DataTypeOps is loaded, so it's better to add tests to make sure the interface stays stable.
### Does this PR introduce _any_ user-facing change?
No, tests only.
### How was this patch tested?
Tests passed.
Closes #32859 from Yikun/SPARK-XXXXX1.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up of #32886 to fix the Jenkins' linter.
### Why are the changes needed?
The PR #32886 was mistakenly merged before Jenkins' linter passes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes #32965 from ueshin/issues/SPARK-35478/fup.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/window.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on the Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32886 from pingsutw/SPARK-35478.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
- Introduce a DecimalOps for DecimalType
- Make `isnull` method data-type-based
### Why are the changes needed?
Currently, DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (see https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference here is caused by the fact that DecimalType cannot hold NaN.
https://issues.apache.org/jira/browse/SPARK-35342
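A sketch of the dispatch this enables; the class names follow the DataTypeOps pattern described here, but the method signatures are assumptions, not the exact pandas-on-Spark internals:
```python
from pyspark.sql import functions as F

class FractionalOps:
    def isnull(self, col):
        # float/double: both null and NaN count as missing
        return col.isNull() | F.isnan(col)

class DecimalOps(FractionalOps):
    def isnull(self, col):
        # DecimalType cannot hold NaN, so a plain null check suffices
        return col.isNull()
```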
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Newly added DecimalOpsTest passed
- Existing NumOpsTest passed
Closes #32821 from Yikun/SPARK-35342.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/accessors.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32956 from ueshin/issues/SPARK-35469/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
PySpark added pinned thread mode at https://github.com/apache/spark/pull/24898 to sync Python threads to JVM threads. Previously, one JVM thread could be reused, which ends up with a messed-up inheritance hierarchy (such as thread locals), especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default.
### Why are the changes needed?
To correctly support parallel job submission and management.
### Does this PR introduce _any_ user-facing change?
Yes, now Python thread is mapped to JVM thread one to one.
### How was this patch tested?
Existing tests should cover it.
Closes #32429 from HyukjinKwon/SPARK-35303.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the wrong behavior of `Index.difference` in pandas API on Spark, based on the comments https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007:
- it couldn't properly handle the case when `self` is an `Index` or `MultiIndex` and `other` is a `MultiIndex` or `Index`.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```
- it collected all the data to the driver side when `other` is a list-like object, which is very dangerous especially when `other` is a distributed object such as a Series.
Related test cases were also added.
### Why are the changes needed?
To correct behavior that is incompatible with pandas, and to prevent cases that can easily cause OOM.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> midx1.difference(idx1)
MultiIndex([('a', 'x', 1),
('b', 'z', 2),
('k', 'z', 3)],
)
```
And now the for loop is only used when `other` is a `list`, `set`, or `dict`.
### Does this PR introduce _any_ user-facing change?
Yes, the previous bug is fixed as described in the above code examples.
### How was this patch tested?
Manually tested with the linter and unit tests locally; it should also pass on CI.
Closes #32853 from itholic/SPARK-35683.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Modify the `pandas_udf` usage to use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings.
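For example, the type-annotation based style looks like this (a minimal sketch using the public `pandas_udf` API):
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Return type given as a DDL string; the UDF kind is inferred
# from the type annotations instead of a PandasUDFType argument.
@pandas_udf("double")
def plus_one(s: pd.Series) -> pd.Series:
    return s + 1
```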
### Why are the changes needed?
The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings.
We should use type-annotation based `pandas_udf` or avoid specifying udf types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #32913 from ueshin/issues/SPARK-35761/suppress_warnings.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface).
### Why are the changes needed?
To make it sound more natural.
### Does this PR introduce _any_ user-facing change?
It fixes a typo in the unreleased changes.
### How was this patch tested?
N/A
Closes #32903 from HyukjinKwon/SPARK-34885.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make `astype` method data-type-based.
**Non-goal: Match pandas' `astype` TypeErrors.**
Currently, `astype` throws TypeError messages only when the destination type is not recognized. However, for some destination types that don't make sense for the specific type of Series/Index, for example `numeric Series/Index → bytes`, we don't have proper TypeError messages.
Since the goal of this PR is mainly refactoring, the above issue may be resolved later if needed.
### Why are the changes needed?
There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32847 from xinrong-databricks/datatypeops_astype.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172.
```python
ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200)
```
**Before:**
```
pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).;
'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))]
+- Project [a#1L, b#2, c#3L]
+- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L]
+- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false
```
**After:**
```python
Figure({
'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}',
'name': 'a',
...
```
### Why are the changes needed?
To match the behaviour with pandas' and allow users to set `x` and `y` in a DataFrame with non-numeric columns.
### Does this PR introduce _any_ user-facing change?
No to end users, since the change is not released yet. Yes to devs, as described above.
### How was this patch tested?
Manually tested, added a test and tested in notebooks:
![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png)
![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png)
Closes #32884 from HyukjinKwon/fix-hist-plot.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/spark/indexing.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more disallow_untyped_defs mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
`./dev/lint-python`
Closes #32738 from pingsutw/SPARK-35474.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adjust the pandas-on-Spark `test_groupby_multiindex_columns` test so that it passes with different pandas versions.
### Why are the changes needed?
pandas introduced bugs in certain versions, as below:
- For pandas 1.1.3 and 1.1.4
Type error: only integer scalar arrays can be converted to a scalar index
- For pandas < 1.0.4
Type error: Can only tuple-index with a MultiIndex
We ought to adjust `test_groupby_multiindex_columns` tests by comparing with a predefined return value, rather than comparing with the pandas return value in the pandas versions above.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32851 from xinrong-databricks/SPARK-35705.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Limit the batch size for `add_shuffle_key` in the `partitionBy` function to fix `OverflowError: cannot convert float infinity to integer`.
### Why are the changes needed?
It's not easy to write a UT, but I can use some simple code to explain the bug.
* Original code
```python
def add_shuffle_key(split, iterator):
    buckets = defaultdict(list)
    c, batch = 0, min(10 * numPartitions, 1000)
    for k, v in iterator:
        buckets[partitionFunc(k) % numPartitions].append((k, v))
        c += 1
        # check used memory and avg size of chunk of objects
        if (c % 1000 == 0 and get_used_memory() > limit
                or c > batch):
            n, size = len(buckets), 0
            for split in list(buckets.keys()):
                yield pack_long(split)
                d = outputSerializer.dumps(buckets[split])
                del buckets[split]
                yield d
                size += len(d)
            avg = int(size / n) >> 20
            # let 1M < avg < 10M
            if avg < 1:
                batch *= 1.5
            elif avg > 10:
                batch = max(int(batch / 1.5), 1)
            c = 0
```
If `get_used_memory() > limit` is always `True` and `avg < 1` is always `True`, the variable `batch` will grow without bound; then `batch = max(int(batch / 1.5), 1)` may raise `OverflowError` once `avg > 10`.
* sample code to reproduce the bug
```python
import sys
limit = 100
used_memory = 200
numPartitions = 64
c, batch = 0, min(10 * numPartitions, 1000)
while True:
    c += 1
    if (c % 1000 == 0 and used_memory > limit or c > batch):
        batch = batch * 1.5
        d = max(int(batch / 1.5), 1)
        print(c, batch)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
It's not easy to write a UT; here is sample code to test the fix:
```python
import sys
limit = 100
used_memory = 200
numPartitions = 64
c, batch = 0, min(10 * numPartitions, 1000)
while True:
    c += 1
    if (c % 1000 == 0 and used_memory > limit or c > batch):
        batch = min(sys.maxsize, batch * 1.5)
        d = max(int(batch / 1.5), 1)
        print(c, batch)
```
Closes #32667 from nolanliou/fix_partitionby.
Authored-by: liuqi <nolan.liou@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduces `InternalField` to manage dtypes and `StructField`s.
`InternalFrame` already manages dtypes, but when it checks Spark's data types, column names, and nullabilities, it runs the analysis phase each time it needs them, which causes a performance issue.
It will use the `InternalField` class, which stores the retrieved Spark data types, column names, and nullabilities, and reuses them. Also, when those are already known, it just updates and reuses them without asking Spark.
### Why are the changes needed?
Currently there are some performance issues in the pandas-on-Spark layer.
One of them is accessing the Java DataFrame and running the analysis phase too many times, especially just to retrieve the current column names or data types.
We should reduce the amount of unnecessary access.
### Does this PR introduce _any_ user-facing change?
Improves the performance in pandas-on-Spark layer:
```py
df = ps.read_parquet("/path/to/test.parquet") # contains ~75 columns
df = df[(df["col"] > 0) & (df["col"] < 10000)]
```
Before the PR, it took about **2.15 sec** and after **1.15 sec**.
### How was this patch tested?
Existing tests.
Closes #32775 from ueshin/issues/SPARK-35638/field.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Introduce BooleanExtensionOps in order to make boolean operators `and` and `or` data-type-based.
- Improve error messages for operators `and` and `or`.
### Why are the changes needed?
Boolean operators `__and__`, `__or__`, `__rand__`, and `__ror__` should be data-type-based.
BooleanExtensionDtypes process these boolean operators differently from bool, so BooleanExtensionOps is introduced.
These boolean operators are themselves also bitwise operators, which should be applicable to other data type classes later. However, that is not the goal of this PR.
### Does this PR introduce _any_ user-facing change?
Yes. Error messages for operators `and` and `or` are improved.
Before:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).;
'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))]
+- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L]
+- LogicalRDD [__index_level_0__#8L, 0#9], false
```
After:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
TypeError: Bitwise or can not be applied to categoricals.
```
### How was this patch tested?
Unit tests.
Closes #32698 from xinrong-databricks/datatypeops_extension.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614.
### Why are the changes needed?
The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32592 from xinrong-databricks/datatypeop_pd_conversion.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adjust the `check_exact` parameter for non-numeric columns to ensure pandas-on-Spark tests passed with all pandas versions.
### Why are the changes needed?
`pd.testing` utils are utilized in pandas-on-Spark tests.
Due to https://github.com/pandas-dev/pandas/issues/35446, `check_exact=True` for non-numeric columns doesn't work for older pd.testing utils, e.g. `assert_series_equal`. We wanted to adjust that to ensure pandas-on-Spark tests pass for all pandas versions.
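A sketch of the adjustment (the helper name is hypothetical; the real test utility in pandas-on-Spark may differ):
```python
import pandas.testing as pdt
from pandas.api.types import is_numeric_dtype

def assert_series(left, right):
    # Only insist on exact equality for numeric columns; older pandas
    # versions mishandle check_exact=True on non-numeric data.
    pdt.assert_series_equal(left, right, check_exact=is_numeric_dtype(left.dtype))
```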
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes #32772 from xinrong-databricks/test_util.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes applying `black` to the pandas API on Spark code to improve static analysis.
By executing `./dev/reformat-python` in the Spark home directory, all the pandas API on Spark code is reformatted according to the static analysis rules.
### Why are the changes needed?
This reduces the cost of static analysis during development.
It has been used continuously for about a year in the Koalas project and its convenience has been proven.
### Does this PR introduce _any_ user-facing change?
No, it's dev-only.
### How was this patch tested?
Manually reformatted the pandas API on Spark code by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes.
Closes #32779 from itholic/SPARK-35499.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
In `functions.py`, a function `def column(col)` was added. There is also another function in the same file, `def col(col)`. This creates ambiguity over whether the parameter or the function is being referred to. In PySpark 3.1.2, this leads to `TypeError: 'str' object is not callable` when the `column(col)` function is called: the string variable in scope takes precedence over the function `col` in the file, contrary to what was intended.
This PR fixes that ambiguity by changing the variable name to `col_like`. I have filed this as an issue on JIRA here: https://issues.apache.org/jira/browse/SPARK-35643.
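A minimal reproduction of the shadowing (toy definitions, not the real `pyspark.sql.functions` bodies):
```python
def col(name):
    return "Column<" + name + ">"

def column(col):
    # The parameter `col` shadows the module-level function `col`,
    # so this call raises: TypeError: 'str' object is not callable
    return col(col)

# column("x")  # -> TypeError: 'str' object is not callable
```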
### Why are the changes needed?
In PySpark 3.1.2, we see `TypeError: 'str' object is not callable` when the `column()` function is called. This PR fixes that error.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I don't believe this patch needs additional testing.
Closes #32771 from keerthanvasist/col.
Lead-authored-by: Keerthan Vasist <kvasist@amazon.com>
Co-authored-by: keerthanvasist <kvasist@amazon.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>