ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Xinrong Meng	b0cd00b062	[SPARK-35340][PYTHON] Standardize TypeError messages for unsupported basic operations ### What changes were proposed in this pull request? The PR is proposed to standardize TypeError messages for unsupported basic operations by: - Capitalize the first letter - Leverage TypeError messages defined in `pyspark/pandas/data_type_ops/base.py` - Take advantage of the utility `is_valid_operand_for_numeric_arithmetic` to save duplicated TypeError messages Related unit tests should be adjusted as well. ### Why are the changes needed? Inconsistent TypeError messages are shown for unsupported data-type-based basic operations. Take addition's TypeError messages for example: - addition can not be applied to given types. - string addition can only be applied to string series or literals. Standardizing TypeError messages would improve user experience and reduce maintenance costs. ### Does this PR introduce _any_ user-facing change? No user-facing behavior change. Only TypeError messages are modified. ### How was this patch tested? Unit tests. Closes #33237 from xinrong-databricks/datatypeops_err. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `819c482498`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-08 12:28:00 -07:00
Xinrong Meng	61bfdf0c03	[SPARK-35615][PYTHON] Make unary and comparison operators data-type-based ### What changes were proposed in this pull request? Make unary and comparison operators data-type-based. Refactored operators include: - Unary operators: `__neg__`, `__abs__`, `__invert__`, - Comparison operators: `>`, `>=`, `<`, `<=`, `==`, `!=` Non-goal: Tasks below are inspired during the development of this PR. [[SPARK-35997] Implement comparison operators for CategoricalDtype in pandas API on Spark](https://issues.apache.org/jira/browse/SPARK-35997) [[SPARK-36000] Support creating a ps.Series/Index with `Decimal('NaN')` with Arrow disabled](https://issues.apache.org/jira/browse/SPARK-36000) [[SPARK-36001] Assume result's index to be disordered in tests with operations on different Series](https://issues.apache.org/jira/browse/SPARK-36001) [[SPARK-36002] Consolidate tests for data-type-based operations of decimal Series](https://issues.apache.org/jira/browse/SPARK-36002) [[SPARK-36003] Implement unary operator `invert` of numeric ps.Series/Index](https://issues.apache.org/jira/browse/SPARK-36003) ### Why are the changes needed? We have been refactoring basic operators to be data-type-based for readability, flexibility, and extensibility. Unary and comparison operators are still not data-type-based yet. We should fill the gaps. ### Does this PR introduce _any_ user-facing change? Yes. - Better error messages. For example, Before: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([b"2", b"3", b"4"]) >>> -psser Traceback (most recent call last): ... pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ... ``` After: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([b"2", b"3", b"4"]) >>> -psser Traceback (most recent call last): ... TypeError: Unary - can not be applied to binaries. >>> ``` - Support unary `-` of `bool` Series. 
For example, Before: ```py >>> psser = ps.Series([True, False, True]) >>> -psser Traceback (most recent call last): ... pyspark.sql.utils.AnalysisException: cannot resolve '(- `0`)' due to data type mismatch: ... ``` After: ```py >>> psser = ps.Series([True, False, True]) >>> -psser 0 False 1 True 2 False dtype: bool ``` ### How was this patch tested? Unit tests. Closes #33162 from xinrong-databricks/datatypeops_refactor. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com> (cherry picked from commit `6e4e04f2a1`) Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-07 13:47:04 -07:00
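The data-type-based dispatch described in SPARK-35615 can be sketched in plain Python, with lists standing in for Spark columns. The class names are loosely modelled on those in `pyspark/pandas/data_type_ops/`, but the method bodies and the `ops_for` and `negate` helpers are hypothetical simplifications, not the actual implementation.

```python
from typing import Any, List


class DataTypeOps:
    """Base class: an operator is unsupported unless a subclass overrides it."""

    pretty_name = "objects"

    def neg(self, values: List[Any]) -> List[Any]:
        raise TypeError("Unary - can not be applied to %s." % self.pretty_name)


class BooleanOps(DataTypeOps):
    pretty_name = "booleans"

    def neg(self, values: List[Any]) -> List[Any]:
        # Unary minus on booleans acts as logical negation,
        # matching the "After" output in the commit message.
        return [not v for v in values]


class BinaryOps(DataTypeOps):
    # neg is inherited from DataTypeOps, so it raises TypeError.
    pretty_name = "binaries"


def ops_for(values: List[Any]) -> DataTypeOps:
    """Hypothetical helper: pick the Ops class from the first element's type."""
    first = values[0]
    if isinstance(first, bool):
        return BooleanOps()
    if isinstance(first, bytes):
        return BinaryOps()
    return DataTypeOps()


def negate(values: List[Any]) -> List[Any]:
    # The operator delegates to the Ops object instead of branching on types itself.
    return ops_for(values).neg(values)
```

With this layout, supporting a new data type means adding one Ops subclass rather than editing every operator.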
Hyukjin Kwon	9cf1db33c7	[SPARK-35684][INFRA][PYTHON] Bump up mypy version in GitHub Actions ### What changes were proposed in this pull request? This PR proposes to bump up the mypy version to 0.910 which is the latest. ### Why are the changes needed? To catch the type hint mistakes better in PySpark. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? GitHub Actions should test it out. Closes #33223 from HyukjinKwon/SPARK-35684. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `16c195ccfb`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-07 13:26:41 +09:00
Tomas Pereira de Vasconcelos	e8991266c8	[SPARK-35986][PYSPARK] Fix type hint for RDD.histogram's buckets Fix the type hint for `pyspark.rdd.RDD.histogram`'s `buckets` argument. The current type hint is incomplete. ![image](https://user-images.githubusercontent.com/17701527/124248180-df7fd580-db22-11eb-8391-ba0bb51d689b.png) From `pyspark.rdd.RDD.histogram`'s source: ```python if isinstance(buckets, int): ... elif isinstance(buckets, (list, tuple)): ... else: raise TypeError("buckets should be a list or tuple or number(int or long)") ``` This change fixes the warning displayed above. Closes #33185 from tpvasconcelos/master. Authored-by: Tomas Pereira de Vasconcelos <tomasvasconcelos1@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org> (cherry picked from commit `495d234c6e`) Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-04 10:24:55 +09:00
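A minimal sketch of a type hint that covers all three branches of the `isinstance` checks quoted in SPARK-35986. The commit message does not show the exact annotation used in the patch, so the `Buckets` alias and the `describe_buckets` function are assumptions for illustration only.

```python
from typing import List, Tuple, Union

# Assumed alias: an int (number of buckets) or explicit boundaries as a list or tuple.
Buckets = Union[int, List[float], Tuple[float, ...]]


def describe_buckets(buckets: Buckets) -> str:
    """Hypothetical function mirroring the runtime checks quoted in the commit."""
    if isinstance(buckets, int):
        return "evenly spaced: %d buckets" % buckets
    elif isinstance(buckets, (list, tuple)):
        # N boundaries delimit N - 1 buckets.
        return "explicit boundaries: %d buckets" % (len(buckets) - 1)
    else:
        raise TypeError("buckets should be a list or tuple or number(int or long)")
```

A hint that accepts only one of these forms makes a type checker flag calls that the runtime check would accept.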
Takuya UESHIN	fcc9e66c9b	[SPARK-35981][PYTHON][TEST][3.2] Use check_exact=False to loosen the check precision ### What changes were proposed in this pull request? This is a cherry-pick of #33179. We should use `check_exact=False` because the value check in `StatsTest.test_cov_corr_meta` is too strict. ### Why are the changes needed? In some environment, the precision could be different in pandas' `DataFrame.corr` function and the test `StatsTest.test_cov_corr_meta` fails. ``` AssertionError: DataFrame.iloc[:, 0] (column name="a") are different DataFrame.iloc[:, 0] (column name="a") values are different (14.28571 %) [index]: [a, b, c, d, e, f, g] [left]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0] [right]: [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.807406715958909e-17] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Modified tests should still pass. Closes #33193 from ueshin/issuse/SPARK-35981/3.2/corr. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-07-02 14:08:50 -07:00
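The failure quoted in SPARK-35981 comes from comparing `0.0` with `4.807406715958909e-17` exactly. The standard-library sketch below reproduces the effect with `math.isclose`; the tolerance values are illustrative assumptions and are not taken from the patch or from pandas.

```python
import math

# Values copied from the AssertionError in the commit message.
left = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 0.0]
right = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 4.807406715958909e-17]

# An exact comparison fails because of the last element.
exact_equal = left == right

# A tolerance-based comparison accepts the tiny floating-point difference.
approx_equal = all(
    math.isclose(a, b, rel_tol=1e-5, abs_tol=1e-8) for a, b in zip(left, right)
)

print(exact_equal, approx_equal)
```

An absolute tolerance is needed here because a purely relative tolerance never accepts a nonzero value as close to `0.0`.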
Wenchen Fan	c1d8178817	[SPARK-35968][SQL] Make sure partitions are not too small in AQE partition coalescing ### What changes were proposed in this pull request? By default, AQE will set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to the spark default parallelism, which is usually quite big. This is to keep the parallelism on par with non-AQE, to avoid perf regressions. However, this usually leads to many small/empty partitions, and hurts performance (although not worse than non-AQE). Users usually blindly set `COALESCE_PARTITIONS_MIN_PARTITION_NUM` to 1, which makes this config quite useless. This PR adds a new config to set the min partition size, to avoid too small partitions after coalescing. By default, Spark will not respect the target size, and only respect this min partition size, to maximize the parallelism and avoid perf regression in AQE. This PR also adds a bool config to respect the target size when coalescing partitions, and it's recommended to set it to get better overall performance. This PR also deprecates the `COALESCE_PARTITIONS_MIN_PARTITION_NUM` config. ### Why are the changes needed? AQE is default on now, we should make the perf better in the default case. ### Does this PR introduce _any_ user-facing change? yes, a new config. ### How was this patch tested? new tests Closes #33172 from cloud-fan/aqe2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit `0c9c8ff569`) Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-07-02 16:07:46 +08:00
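To illustrate the idea of a minimum partition size from SPARK-35968, the sketch below greedily merges adjacent partitions until each merged partition reaches a threshold. It is a deliberately simplified stand-in: the function name `coalesce_min_size` is hypothetical, and Spark's actual rule also takes the target size and parallelism into account.

```python
from typing import List


def coalesce_min_size(sizes: List[int], min_size: int) -> List[int]:
    """Merge adjacent partitions so that no result is smaller than min_size
    (unless the total itself is smaller than min_size)."""
    merged: List[int] = []
    current = 0
    for size in sizes:
        current += size
        if current >= min_size:
            merged.append(current)
            current = 0
    if current > 0:
        # A trailing remainder is too small on its own: fold it into the
        # previous partition, or keep it if it is the only one.
        if merged:
            merged[-1] += current
        else:
            merged.append(current)
    return merged
```

Small and empty partitions disappear while the total data size is preserved.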
Xinrong Meng	95d94948c5	[SPARK-35339][PYTHON] Improve unit tests for data-type-based basic operations ### What changes were proposed in this pull request? Improve unit tests for data-type-based basic operations by: - removing redundant test cases - adding `astype` test for ExtensionDtypes ### Why are the changes needed? Some test cases for basic operations are duplicated after introducing data-type-based basic operations. The PR is proposed to remove redundant test cases. `astype` is not tested for ExtensionDtypes, which will be adjusted in this PR as well. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #33095 from xinrong-databricks/datatypeops_test. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-07-01 17:37:32 -07:00
Takuya UESHIN	a98c8ae57d	[SPARK-35944][PYTHON] Introduce Name and Label type aliases ### What changes were proposed in this pull request? Introduce `Name` and `Label` type aliases to distinguish what is expected instead of `Any` or `Union[Any, Tuple]`. - `Label`: `Tuple[Any, ...]` Internal expression for name-like metadata, like `index_names`, `column_labels`, and `column_label_names` in `InternalFrame`, and similar internal structures. - `Name`: `Union[Any, Label]` External expression for user-facing names, which can be scalar values or tuples. ### Why are the changes needed? Currently `Any` or `Union[Any, Tuple]` is used for name-like types, but type aliases should be used to distinguish what is expected clearly. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33159 from ueshin/issues/SPARK-35944/name_and_label. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-01 09:40:07 +09:00
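The two aliases from SPARK-35944 can be written down directly from the definitions in the commit message. The `name_to_label` helper is hypothetical and only shows how a user-facing `Name` might be normalized into an internal `Label`.

```python
from typing import Any, Tuple, Union

# Definitions as stated in the commit message.
Label = Tuple[Any, ...]   # internal expression: always a tuple
Name = Union[Any, Label]  # external expression: a scalar or a tuple


def name_to_label(name: Name) -> Label:
    """Hypothetical helper: wrap scalar names in a 1-tuple, pass tuples through."""
    return name if isinstance(name, tuple) else (name,)
```

Signatures that take `Name` accept whatever the user passes, while those that take `Label` can rely on always receiving a tuple.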
Xinrong Meng	5ad12611ec	[SPARK-35938][PYTHON] Add deprecation warning for Python 3.6 ### What changes were proposed in this pull request? Add deprecation warning for Python 3.6. ### Why are the changes needed? According to https://endoflife.date/python, Python 3.6 will be EOL on 23 Dec, 2021. We should prepare for the deprecation of Python 3.6 support in Spark in advance. ### Does this PR introduce _any_ user-facing change? N/A. ### How was this patch tested? Manual tests. Closes #33139 from xinrong-databricks/deprecate3.6_warn. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-07-01 09:32:25 +09:00
Hyukjin Kwon	8d28839689	[SPARK-35946][PYTHON] Respect Py4J server in InheritableThread API ### What changes were proposed in this pull request? Currently, we set the environment variable `PYSPARK_PIN_THREAD` at the client side of the `InheritableThread` API for Py4J (`python/pyspark/util.py`). If the Py4J gateway is created somewhere else (e.g., Zeppelin), it could introduce a breakage at: ```python from pyspark import SparkContext jvm = SparkContext._jvm thread_connection = jvm._gateway_client.get_thread_connection() # `AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection'` (non-pinned thread mode) # `get_thread_connection` is only in 'ClientServer' (pinned thread mode) ``` This PR proposes to check the given gateway created, and do the pinned thread mode behaviour accordingly so we can avoid any breakage when Py4J server/gateway is created separately from somewhere else without a pinned thread mode. ### Why are the changes needed? To avoid any potential breakage. ### Does this PR introduce _any_ user-facing change? No, the change happened only in the master (`fdd7ca5f4e`). ### How was this patch tested? This is actually a partial revert of `fdd7ca5f4e`. As long as the existing tests pass, I guess we're all good. 
I also manually tested to make doubly sure: Before: ```python >>> from pyspark import InheritableThread, inheritable_thread_target >>> InheritableThread(lambda: 1).start() >>> inheritable_thread_target(lambda: 1)() Traceback (most recent call last): File "/.../python3.8/lib/python3.8/threading.py", line 932, in _bootstrap_inner self.run() File "/.../python3.8/lib/python3.8/threading.py", line 870, in run self._target(self._args, self._kwargs) File "/.../spark/python/pyspark/util.py", line 361, in copy_local_properties InheritableThread._clean_py4j_conn_for_current_thread() File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread thread_connection = jvm._gateway_client.get_thread_connection() AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection' Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/util.py", line 324, in wrapped InheritableThread._clean_py4j_conn_for_current_thread() File "/.../spark/python/pyspark/util.py", line 381, in _clean_py4j_conn_for_current_thread thread_connection = jvm._gateway_client.get_thread_connection() AttributeError: 'GatewayClient' object has no attribute 'get_thread_connection' ``` After*: ```python >>> from pyspark import InheritableThread, inheritable_thread_target >>> InheritableThread(lambda: 1).start() >>> inheritable_thread_target(lambda: 1)() 1 ``` Closes #33147 from HyukjinKwon/SPARK-35946. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-29 22:18:54 -07:00
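A minimal sketch of the guard described in SPARK-35946: inspect the gateway object that actually exists before calling a method only the pinned-thread client provides. The `Fake*` classes and `clean_connection_if_pinned` are hypothetical stand-ins, and `hasattr` is used here for brevity; the actual patch may check the gateway type differently.

```python
class FakeGatewayClient:
    """Stand-in for a non-pinned client: no get_thread_connection."""


class FakeClientServer:
    """Stand-in for a pinned-thread client."""

    def get_thread_connection(self):
        return None


def clean_connection_if_pinned(gateway_client: object) -> bool:
    """Return True if pinned-thread cleanup was attempted, False if skipped."""
    if not hasattr(gateway_client, "get_thread_connection"):
        # Gateway was created elsewhere without pinned thread mode: do nothing.
        return False
    gateway_client.get_thread_connection()
    return True
```

Skipping the cleanup for a non-pinned client avoids the `AttributeError` shown in the "Before" traceback.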
Takuya UESHIN	0a838dcd71	[SPARK-35943][PYTHON] Introduce Axis type alias ### What changes were proposed in this pull request? Introduces `Axis` type alias for `axis` argument to be consistent. ### Why are the changes needed? There are many places to use `axis` argument. We should define `Axis` type alias and reuse it to be consistent. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33144 from ueshin/issues/SPARK-35943/axis. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-30 10:46:59 +09:00
itholic	28a201a442	[SPARK-35873][PYTHON] Cleanup the version logic from the pandas API on Spark ### What changes were proposed in this pull request? This PR proposes removing the legacy Koalas version from pandas API on Spark package. And also remove the Python version check logic since now pandas-on-Spark should follow the PySpark's Python version. ### Why are the changes needed? Since Koalas is ported into PySpark, we don't need to keep the version logic for Koalas. ### Does this PR introduce _any_ user-facing change? Now the legacy Koalas user should follow the version from PySpark. ### How was this patch tested? Manually built the package and see it's successfully done. Closes #33128 from itholic/SPARK-35873. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-30 10:01:51 +09:00
Takuya UESHIN	1f6e2f55d7	Revert "[SPARK-35721][PYTHON] Path level discover for python unittests" This reverts commit `5db51efa1a`.	2021-06-29 12:08:09 -07:00
Takuya UESHIN	2702fb9af0	[SPARK-35859][PYTHON] Cleanup type hints in pandas-on-Spark ### What changes were proposed in this pull request? Cleaning up the type hints in pandas-on-Spark. - Use a single file `_typing.py` for type variables or aliases - Rename `IndexOpsLike` to `SeriesOrIndex`. - Rename `T_Frame` and `T_IndexOps` to `FrameLike` and `IndexOpsLike` respectively - Introduce `DataFrameOrSeries` for `Union[DataFrame, Series]` ### Why are the changes needed? This is a cleanup for the mypy check stuff series. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33117 from ueshin/issues/SPARK-35859/cleanup. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-29 10:52:24 -07:00
Yikun Jiang	5db51efa1a	[SPARK-35721][PYTHON] Path level discover for python unittests ### What changes were proposed in this pull request? Add path level discover for python unittests. ### Why are the changes needed? Currently we need to specify Python test cases manually when we add a new test case. Sometimes we forget to add the test case to the module list, and then it is never executed. For example: - pyspark-core pyspark.tests.test_pin_thread Thus we need an auto-discovery mechanism that finds all test cases rather than specifying every case manually. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add the code below at the end of `dev/sparktestsupport/modules.py` ```python for m in sorted(all_modules): for g in sorted(m.python_test_goals): print(m.name, g) ``` Compare the result before and after: https://www.diffchecker.com/iO3FvhKL Closes #32867 from Yikun/SPARK_DISCOVER_TEST. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-29 17:56:13 +09:00
Xinrong Meng	5f0113e3a6	[SPARK-35344][PYTHON] Support creating a Column of numpy literals in pandas API on Spark ### What changes were proposed in this pull request? The PR is proposed to support creating a Column of numpy literal value in pandas-on-Spark. It consists of three changes mainly: - Enable the `lit` function defined in `pyspark.pandas.spark.functions` to support numpy literals input. ```py >>> from pyspark.pandas.spark import functions as SF >>> SF.lit(np.int64(1)) Column<'CAST(1 AS BIGINT)'> >>> SF.lit(np.int32(1)) Column<'CAST(1 AS INT)'> >>> SF.lit(np.int8(1)) Column<'CAST(1 AS TINYINT)'> >>> SF.lit(np.byte(1)) Column<'CAST(1 AS TINYINT)'> >>> SF.lit(np.float32(1)) Column<'CAST(1.0 AS FLOAT)'> ``` - Substitute `F.lit` by `SF.lit`, that is, use `lit` function defined in `pyspark.pandas.spark.functions` rather than `lit` function defined in `pyspark.sql.functions` to allow creating columns out of numpy literals. - Enable numpy literals input in `isin` method Non-goal: - Some pandas-on-Spark APIs use PySpark column-related APIs internally, and these column-related APIs don't support numpy literals, thus numpy literals are disallowed as input (e.g. `to_replace` parameter in `replace` API). This PR doesn't aim to adjust all of them. This PR adjusts `isin` only, because the PR is inspired by that (as https://github.com/databricks/koalas/issues/2161). - To complete mappings between all kinds of numpy literals and Spark data types should be a followup task. ### Why are the changes needed? Spark (`lit` function defined in `pyspark.sql.functions`) doesn't support creating a Column out of numpy literal value. So `lit` function defined in `pyspark.pandas.spark.functions` is adjusted in order to support that in pandas-on-Spark. ### Does this PR introduce _any_ user-facing change? Yes. Before: ```py >>> a = ps.DataFrame({'source': [1,2,3,4,5]}) >>> a.source.isin([np.int64(1), np.int64(2)]) Traceback (most recent call last): ... 
AttributeError: 'numpy.int64' object has no attribute '_get_object_id' ``` After: ```py >>> a = ps.DataFrame({'source': [1,2,3,4,5]}) >>> a.source.isin([np.int64(1), np.int64(2)]) 0 True 1 True 2 False 3 False 4 False Name: source, dtype: bool ``` ### How was this patch tested? Unit tests. Closes #32955 from xinrong-databricks/datatypeops_literal. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-28 19:03:42 -07:00
Takuya UESHIN	8c401beb80	[SPARK-35901][PYTHON] Refine type hints in pyspark.pandas.window ### What changes were proposed in this pull request? Refines type hints in `pyspark.pandas.window`. Also, some refactoring is included to clean up the type hierarchy of `Rolling` and `Expanding`. ### Why are the changes needed? We can use more strict type hints for functions in pyspark.pandas.window using the generic way. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #33097 from ueshin/issues/SPARK-35901/window. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-28 12:23:32 +09:00
itholic	03e6de2abe	[SPARK-35605][PYTHON] Move to_pandas_on_spark to the Spark DataFrame ### What changes were proposed in this pull request? This PR proposes moving the `to_pandas_on_spark` function from `pyspark.pandas.frame` to `pyspark.sql.dataframe`, and adds the related tests to the PySpark DataFrame tests. ### Why are the changes needed? Now that Koalas is ported into PySpark, we no longer need to auto-patch Spark. Also, it does not make sense for `to_pandas_on_spark` to belong to the pandas-on-Spark DataFrame. ### Does this PR introduce _any_ user-facing change? No, it's an internal refactoring. ### How was this patch tested? Added the related tests and manually checked that they pass. Closes #33054 from itholic/SPARK-35605. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-28 11:47:09 +09:00
Takuya UESHIN	a9ebfc5374	[SPARK-35466][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.data_type_ops.* ### What changes were proposed in this pull request? Adds more type annotations in the files `python/pyspark/pandas/data_type_ops/*.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #33094 from ueshin/issues/SPARK-35466/disallow_untyped_defs_data_ops. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-25 18:16:25 -07:00
Takuya UESHIN	6497ac3585	[SPARK-35471][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.frame ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/frame.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #33073 from ueshin/issues/SPARK-35471/disallow_untyped_defs_frame. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-25 14:41:58 +09:00
Takuya UESHIN	cfcfbca965	[SPARK-35476][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.series ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/series.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #33045 from ueshin/issues/SPARK-35476/disallow_untyped_defs_series. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 19:32:33 +09:00
Hyukjin Kwon	5a7686a393	[SPARK-35301][PYTHON][DOCS] Document migration guide from Koalas to pandas APIs on Spark ### What changes were proposed in this pull request? This PR proposes to add a migration guide for legacy Koalas users in pandas API on Spark. ### Why are the changes needed? For easier migration. ### Does this PR introduce _any_ user-facing change? Yes, this adds a new page for migration from Koalas. ### How was this patch tested? Manually built the docs and checked manually. Closes #33050 from HyukjinKwon/SPARK-35301. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 17:58:09 +09:00
itholic	92ddef7cfb	[SPARK-35696][PYTHON][DOCS][FOLLOW-UP] Fix underline for title in FAQ to remove warnings ### What changes were proposed in this pull request? This PR follow-up for SPARK-35696 to fix incorrect underline in the documents to remove warnings. ### Why are the changes needed? We should build the docs without any incorrect documentation style ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually build docs and see the warning is removed Closes #33052 from itholic/SPARK-35696-followup. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 15:20:13 +09:00
itholic	712ed87faa	[SPARK-35696][PYTHON][DOCS] Refine the code examples in pandas-on-Spark documentation ### What changes were proposed in this pull request? This PR proposes to refine the code examples for pandas-on-Spark since some of them still follow the naming for Koalas. For example, ```python kdf = ks.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) ``` should be refined to ```python psdf = ps.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}) ``` Also fixed several remaining Koalas references in the FAQ. ### Why are the changes needed? Because we don't want to use the name "Koalas" in Apache Spark anymore. ### Does this PR introduce _any_ user-facing change? Yes, the examples in the documentation will be changed with refined names. ### How was this patch tested? Manually built the docs and checked them one by one. Closes #33017 from itholic/SPARK-35696. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 14:48:13 +09:00
Ruifeng Zheng	37f70422b5	[SPARK-35678][ML][FOLLOWUP] Revert changes in ANN ### What changes were proposed in this pull request? revert changes related to ANN ### Why are the changes needed? using the new `softmax` may cause flaky failure ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? reverted testsuite Closes #33049 from zhengruifeng/revert_softmax_ann. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 14:02:28 +09:00
Ruifeng Zheng	a66738823c	[SPARK-35678][ML][FOLLOWUP] softmax support offset and step ### What changes were proposed in this pull request? softmax support offset and step, then we can use it in ANN and NB ### Why are the changes needed? to simplify impl ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuite Closes #32991 from zhengruifeng/softmax_support_offset_step. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Huaxin Gao <huaxin_gao@apple.com>	2021-06-23 21:03:18 -05:00
Hyukjin Kwon	be9089731a	[SPARK-35588][PYTHON][DOCS] Merge Binder integration and quickstart notebook for pandas API on Spark ### What changes were proposed in this pull request? This PR proposes to fix: - the Binder integration of pandas API on Spark, and merge them together with the existing PySpark one. - update quickstart of pandas API on Spark, and make it working The notebooks can be easily reviewed here: https://mybinder.org/v2/gh/HyukjinKwon/spark/SPARK-35588-3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart_ps.ipynb Original page in Koalas: https://koalas.readthedocs.io/en/latest/getting_started/10min.html ### Why are the changes needed? - To show the working examples of quickstart to end users. - To allow users to try out the examples without installation easily. ### Does this PR introduce _any_ user-facing change? No to end users because the existing quickstart of pandas API on Spark is not released yet. ### How was this patch tested? I manually tested it by uploading built Spark distribution to Binder. See `3bc15310a0` Closes #33041 from HyukjinKwon/SPARK-35588-2. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-24 10:17:22 +09:00
Yikun Jiang	4824c53398	[SPARK-35812][PYTHON] Throw ValueError if version and timestamp are used together in to_delta ### What changes were proposed in this pull request? Throw ValueError if version and timestamp are used together in to_delta ### Why are the changes needed? read_delta has arguments named `version` and `timestamp`, but they cannot be used together. We should raise the proper error message when they are used together. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? UT Closes #33023 from Yikun/SPARK-35812. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-23 19:04:45 +09:00
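The validation described in SPARK-35812 amounts to rejecting mutually exclusive arguments up front. In the sketch below the function name `time_travel_options` and the error text are hypothetical; `versionAsOf` and `timestampAsOf` are assumed here to be the reader option keys the two arguments map to.

```python
from typing import Dict, Optional


def time_travel_options(
    version: Optional[str] = None, timestamp: Optional[str] = None
) -> Dict[str, str]:
    """Hypothetical helper: build reader options, rejecting conflicting arguments."""
    if version is not None and timestamp is not None:
        raise ValueError("version and timestamp cannot be used together.")
    options: Dict[str, str] = {}
    if version is not None:
        options["versionAsOf"] = version      # assumed option key
    if timestamp is not None:
        options["timestampAsOf"] = timestamp  # assumed option key
    return options
```

Raising `ValueError` before any I/O gives the caller a clear message instead of a failure deeper in the reader.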
Takuya UESHIN	68b54b702c	[SPARK-35473][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.groupby ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/groupby.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #33032 from ueshin/issues/SPARK-35473/disallow_untyped_defs_groupby. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-23 09:51:33 +09:00
Takuya UESHIN	c418803df7	[SPARK-35847][PYTHON] Manage InternalField in DataTypeOps.isnull ### What changes were proposed in this pull request? Properly set `InternalField` for `DataTypeOps.isnull`. ### Why are the changes needed? The result of `DataTypeOps.isnull` must always be non-nullable boolean. We should manage `InternalField` for this case. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added some more tests. Closes #33005 from ueshin/issues/SPARK-35847/isnull_field. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-22 12:54:01 -07:00
Yikun Jiang	1c26433f1d	[SPARK-35849][PYTHON] Make `astype` method data-type-based for DecimalOps ### What changes were proposed in this pull request? Make DecimalOps astype data-type-based. See more in: https://github.com/apache/spark/pull/32821#issuecomment-861119905 ### Why are the changes needed? Make DecimalOps astype data-type-based. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test NumOpsTest.test_astype in pyspark/pandas/tests/data_type_ops/test_num_ops.py Closes #33009 from Yikun/SPARK-35849. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-22 10:41:22 -07:00
Hyukjin Kwon	27046582e4	[SPARK-35645][PYTHON][DOCS] Merge contents and remove obsolete pages in Getting Started section ### What changes were proposed in this pull request? This PR revises the installation docs to describe `pip install pyspark[pandas_on_spark]` and removes the pandas-on-Spark installation and videos/blogposts pages. ### Why are the changes needed? pandas-on-Spark installation is merged to PySpark installation pages. For videos/blogposts, now this is named pandas API on Spark. Old Koalas blogposts and videos are obsolete. ### Does this PR introduce _any_ user-facing change? To end users, no because the docs are not released yet. ### How was this patch tested? I manually built the docs and checked the output. Closes #33018 from HyukjinKwon/SPARK-35645. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-22 09:36:27 -07:00
Takuya UESHIN	a8fdb98ecb	[SPARK-35470][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.base ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/base.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32968 from ueshin/issues/SPARK-35470/disallow_untyped_defs_base. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-22 11:25:16 +09:00
Xinrong Meng	6ca56b01dc	[SPARK-35614][PYTHON] Make the conversion to pandas data-type-based for ExtensionDtypes ### What changes were proposed in this pull request? We propose to - introduce the Ops classes for ExtensionDtypes: `IntegralExtensionOps`, `FractionalExtensionOps`, `StringExtensionOps` - make the "conversion to pandas" data-type-based for ExtensionDtypes Non-goal: the same arithmetic operations on ExtensionDtypes produce different result dtypes in pandas and in pandas API on Spark. That should be adjusted in a separate PR if needed. ### Why are the changes needed? The conversion to pandas includes logic for checking ExtensionDtypes data types and behaving accordingly. That makes the code hard to change or maintain. Since we have DataTypeOps defined, we are able to dispatch the specific conversion logic to the `ExtensionOps` classes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32910 from xinrong-databricks/datatypeops_pd_ext. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-21 13:19:55 -07:00
Hyukjin Kwon	248fda3ead	[SPARK-35834][PYTHON] Use the same cleanup logic as Py4J in inheritable thread API ### What changes were proposed in this pull request? This PR fixes the cleanup logic in inheritable thread API by following Py4J cleanup logic at https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278. Currently the tests that use `inheritable_thread_target` are flaky (https://github.com/apache/spark/runs/2870944288): ``` ====================================================================== ERROR [71.813s]: test_save_load_pipeline_estimator (pyspark.ml.tests.test_tuning.CrossValidatorTests) ---------------------------------------------------------------------- Traceback (most recent call last): File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 589, in test_save_load_pipeline_estimator self._run_test_save_load_pipeline_estimator(DummyLogisticRegression) File "/__w/spark/spark/python/pyspark/ml/tests/test_tuning.py", line 572, in _run_test_save_load_pipeline_estimator cvModel2 = crossval2.fit(training) File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit return self._fit(dataset) File "/__w/spark/spark/python/pyspark/ml/tuning.py", line 747, in _fit bestModel = est.fit(dataset, epm[bestIndex]) File "/__w/spark/spark/python/pyspark/ml/base.py", line 159, in fit return self.copy(params)._fit(dataset) File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit model = stage.fit(dataset) File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit return self._fit(dataset) File "/__w/spark/spark/python/pyspark/ml/pipeline.py", line 114, in _fit model = stage.fit(dataset) File "/__w/spark/spark/python/pyspark/ml/base.py", line 161, in fit return self._fit(dataset) File "/__w/spark/spark/python/pyspark/ml/classification.py", line 2924, in _fit models = pool.map(inheritable_thread_target(trainSingleClass), range(numClasses)) File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 266, in map return self._map_async(func, iterable, mapstar, chunksize).get() File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 644, in get raise self._value File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 119, in worker result = (True, func(*args, **kwds)) File "/__t/Python/3.6.13/x64/lib/python3.6/multiprocessing/pool.py", line 44, in mapstar return list(map(*args)) File "/__w/spark/spark/python/pyspark/util.py", line 324, in wrapped InheritableThread._clean_py4j_conn_for_current_thread() File "/__w/spark/spark/python/pyspark/util.py", line 389, in _clean_py4j_conn_for_current_thread del connections[i] IndexError: deque index out of range ---------------------------------------------------------------------- ``` This seems to be because the connection deque `jvm._gateway_client.deque` is accessed and modified by other threads. Therefore, the number of threads could be changed in the middle. Using `SparkContext._lock` doesn't protect against this because the deque can be updated for every Java instance access in Py4J. This PR proposes to use the atomic `deque.remove` on the problematic deque, along with a try-except on `ValueError` in case the connection was already [deleted by Py4J](https://github.com/bartdag/py4j/blob/master/py4j-python/src/py4j/clientserver.py#L269-L278). ### Why are the changes needed? To fix the flakiness in the tests, and to avoid possible breakage in user applications that use this API. ### Does this PR introduce _any_ user-facing change? If users were dependent on InheritableThread with pinned thread mode on, they might have faced such issues intermittently. This PR fixes it. ### How was this patch tested? Manually tested. CI should test it out too. Closes #32989 from HyukjinKwon/SPARK-35834. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-21 12:00:16 +09:00
Kevin Su	653be9d774	[SPARK-35811][PYTHON] Deprecate DataFrame.to_spark_io ### What changes were proposed in this pull request? Deprecate `DataFrame.to_spark_io`. ### Why are the changes needed? We should deprecate `DataFrame.to_spark_io` since it duplicates `DataFrame.spark.to_spark_io` and does not exist in pandas. ### Does this PR introduce _any_ user-facing change? Yes, users will get a warning when using the `DataFrame.to_spark_io` API. ### How was this patch tested? Pass the CIs. Closes #32964 from pingsutw/SPARK-35811. Authored-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-21 10:43:34 +09:00
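The deprecation described in SPARK-35811 follows a common pattern: the old method keeps working, emits a warning, and delegates to its replacement. A minimal sketch of that pattern follows; `Frame` and `SparkAccessor` are hypothetical stand-ins rather than the pyspark classes, and the choice of `FutureWarning` is an assumption.

```python
import warnings


class SparkAccessor:
    """Hypothetical stand-in for the `DataFrame.spark` accessor."""

    def to_spark_io(self, path):
        # A real implementation would write data out; this sketch only echoes the path.
        return f"written to {path}"


class Frame:
    """Hypothetical stand-in for a pandas-on-Spark DataFrame."""

    def __init__(self):
        self.spark = SparkAccessor()

    def to_spark_io(self, path):
        # Warn first, then delegate to the accessor method that replaces this one.
        # FutureWarning is an assumed category, not taken from the commit.
        warnings.warn(
            "Frame.to_spark_io is deprecated; use Frame.spark.to_spark_io instead.",
            FutureWarning,
            stacklevel=2,
        )
        return self.spark.to_spark_io(path)
```

`stacklevel=2` makes the warning point at the caller's line rather than at the wrapper itself, which is usually what a user needs in order to find the call to migrate.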
Hyukjin Kwon	6d309914df	[SPARK-35303][SPARK-35498][PYTHON][FOLLOW-UP] Copy local properties when starting the thread, and use inheritable thread in the current codebase ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/32429 and https://github.com/apache/spark/pull/32644. I was thinking about creating separate PRs but decided to include all in this PR because it shares the same context, and should be easier to review together. This PR includes: - Use `InheritableThread` and `inheritable_thread_target` in the current code base to prevent potential resource leak (since we enabled pinned thread mode by default now at https://github.com/apache/spark/pull/32429) - Copy local properties when `start` at `InheritableThread` is called to mimic JVM behaviour. Previously it was copied when `InheritableThread` instance was created (related to #32644). - https://github.com/apache/spark/pull/32429 missed one place at `inheritable_thread_target` (https://github.com/apache/spark/blob/master/python/pyspark/util.py#L308). More specifically, I missed one place that should enable pinned thread mode by default. ### Why are the changes needed? To mimic the JVM behaviour about thread lifecycle ### Does this PR introduce _any_ user-facing change? Ideally no. One possible case is that users use `InheritableThread` with pinned thread mode enabled. In this case, the local properties will be copied when starting the thread instead of defining the `InheritableThread` object. This is a small difference that wouldn't likely affect end users. ### How was this patch tested? Existing tests should cover this. Closes #32962 from HyukjinKwon/SPARK-35498-SPARK-35303. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-20 11:48:38 +09:00
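The difference between copying local properties at construction and copying them at `start` can be illustrated without Spark. The sketch below is an assumption-laden model: `_local`, `set_property`, `get_properties`, and `CopyOnStartThread` are hypothetical names standing in for the JVM-side local properties and for `InheritableThread`.

```python
import threading

# Hypothetical stand-in for the per-thread local properties kept on the JVM side.
_local = threading.local()


def set_property(key, value):
    props = getattr(_local, "props", {})
    props[key] = value
    _local.props = props


def get_properties():
    return dict(getattr(_local, "props", {}))


class CopyOnStartThread(threading.Thread):
    """Snapshots the parent's properties when start() is called, not in __init__."""

    def __init__(self, target):
        super().__init__()
        self._user_target = target
        self._inherited = {}

    def start(self):
        # start() runs in the parent thread, so this reads the parent's properties.
        self._inherited = get_properties()
        super().start()

    def run(self):
        # run() executes in the child thread; install the snapshot there.
        _local.props = dict(self._inherited)
        self._user_target()
```

With this design, a property set between creating the thread object and calling `start()` is visible in the child, which matches the behaviour the commit describes as mimicking the JVM.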
Takuya UESHIN	1589d32732	[SPARK-35472][PYTHON] Fix disallow_untyped_defs mypy checks for pyspark.pandas.generic ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/generic.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32957 from ueshin/issues/SPARK-35472/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-20 11:48:01 +09:00
Yikun Jiang	b7df75a777	[SPARK-35708][PYTHON][TEST] Add BaseTest for DataTypeOps ### What changes were proposed in this pull request? This patch adds a DataTypeOps test to check that the ops are loaded as expected. ### Why are the changes needed? While completing https://github.com/apache/spark/pull/32821, I found there was no test for DataTypeOps. A lot of logic runs when DataTypeOps is loaded, so it's better to add a test to make sure the interface stays stable. ### Does this PR introduce _any_ user-facing change? No, test only. ### How was this patch tested? Tests passed. Closes #32859 from Yikun/SPARK-XXXXX1. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-18 18:54:50 -07:00
Takuya UESHIN	c879510d2f	[SPARK-35478][PYTHON][FOLLOWUP] Fix Jenkins' linter ### What changes were proposed in this pull request? This is a follow-up of #32886 to fix the Jenkins' linter. ### Why are the changes needed? The PR #32886 was mistakenly merged before Jenkins' linter passed. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Closes #32965 from ueshin/issues/SPARK-35478/fup. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-06-18 13:52:54 -07:00
Kevin Su	3fb044e043	[SPARK-35478][PYTHON] Enable disallow_untyped_defs mypy check for pyspark.pandas.window ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/window.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on the Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32886 from pingsutw/SPARK-35478. Authored-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-18 11:21:33 -07:00
Yikun Jiang	f84a720fe3	[SPARK-35342][PYTHON] Introduce DecimalOps and make `isnull` method data-type-based ### What changes were proposed in this pull request? - Introduce a DecimalOps for DecimalType - Make `isnull` method data-type-based ### Why are the changes needed? Currently DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (see https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference arises because DecimalType cannot hold NaN. https://issues.apache.org/jira/browse/SPARK-35342 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Newly added DecimalOpsTest passed - Existing NumOpsTest passed Closes #32821 from Yikun/SPARK-35342. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-18 10:44:35 -07:00
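The reason a separate DecimalOps is useful can be shown with a small, Spark-free sketch: float-like data must treat NaN as missing, while decimal data (per the commit) only needs a null check. The class names mirror the commit, but the implementations and the `ops_for` helper are hypothetical.

```python
import math
from decimal import Decimal


class FractionalOps:
    """Hypothetical ops class for float-like columns, where NaN counts as missing."""

    def isnull(self, values):
        return [v is None or math.isnan(v) for v in values]


class DecimalOps:
    """Hypothetical ops class for decimal columns, which (per the commit) hold no NaN."""

    def isnull(self, values):
        return [v is None for v in values]


def ops_for(values):
    # Dispatch on the first non-null value's type; real code would dispatch on the dtype.
    sample = next((v for v in values if v is not None), None)
    return DecimalOps() if isinstance(sample, Decimal) else FractionalOps()
```

Keeping the NaN check out of the decimal path is the point of the split: each ops class carries only the logic its data type needs.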
Takuya UESHIN	2f537a838a	[SPARK-35469][PYTHON] Fix disallow_untyped_defs mypy checks ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/accessors.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32956 from ueshin/issues/SPARK-35469/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-18 20:43:59 +09:00
HyukjinKwon	41af409b7b	[SPARK-35303][PYTHON] Enable pinned thread mode by default ### What changes were proposed in this pull request? PySpark added pinned thread mode at https://github.com/apache/spark/pull/24898 to sync each Python thread to a JVM thread. Previously, one JVM thread could be reused, which ended up with a broken inheritance hierarchy, for example for thread locals, especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default. ### Why are the changes needed? To correctly support parallel job submission and management. ### Does this PR introduce _any_ user-facing change? Yes, now each Python thread is mapped to a JVM thread one to one. ### How was this patch tested? Existing tests should cover it. Closes #32429 from HyukjinKwon/SPARK-35303. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-18 12:02:29 +09:00
Hyukjin Kwon	94bdbec380	[SPARK-35644][PYTHON][DOCS] Merge contents and remove obsolete pages in Development section ### What changes were proposed in this pull request? This PR proposes to merge contents and remove obsolete pages in Development section, especially about pandas API on Spark. Some were removed, and some were merged to the existing PySpark guides. I will inline some comments in the PRs to make the review easier. ### Why are the changes needed? To guide developers on the code base of pandas API on Spark. ### Does this PR introduce _any_ user-facing change? Yes, it updates the user-facing documentation. ### How was this patch tested? Manually built the docs and checked. Closes #32926 from HyukjinKwon/SPARK-35644. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-17 13:35:20 +09:00
itholic	b9aeeb4e6c	[SPARK-35683][PYTHON] Fix Index.difference to avoid collect 'other' to driver side ### What changes were proposed in this pull request? This PR fixes the wrong behavior of `Index.difference` in pandas APIs on Spark, based on the comments https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007 - It couldn't handle the case properly when `self` is `Index` or `MultiIndex` and `other` is `MultiIndex` or `Index`. ```python >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)]) >>> idx1 = ps.Index([1, 2, 3]) >>> midx1.difference(idx1) pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead. ``` - It collected all the data to the driver side when `other` is a list-like object, which is very dangerous especially when `other` is a distributed object such as a Series. Related test cases are also added. ### Why are the changes needed? To correct the behavior that is incompatible with pandas, and to prevent cases that can easily cause OOM. ```python >>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)]) >>> idx1 = ps.Index([1, 2, 3]) >>> midx1.difference(idx1) MultiIndex([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)], ) ``` Now it uses the for loop only when `other` is a `list`, `set` or `dict`. ### Does this PR introduce _any_ user-facing change? Yes, the previous bug is fixed as described in the above code examples. ### How was this patch tested? Manually tested with the linter and unit tests locally; it should also pass on CI. Closes #32853 from itholic/SPARK-35683. 
Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 14:18:54 +09:00
Takuya UESHIN	2a56cc36ca	[SPARK-35761][PYTHON] Use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings ### What changes were proposed in this pull request? Modify the `pandas_udf` usage to use type-annotation based pandas_udf or avoid specifying udf types to suppress warnings. ### Why are the changes needed? The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings. We should use type-annotation based `pandas_udf` or avoid specifying udf types. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32913 from ueshin/issues/SPARK-35761/suppress_warnings. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 11:17:56 +09:00
Hyukjin Kwon	95f36e76c6	[SPARK-35750][PYTHON][DOCS] Rename "pandas APIs on Spark" to "pandas API on Spark" ### What changes were proposed in this pull request? This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark" which is more natural (since API stands for Application Program Interface). ### Why are the changes needed? To make it sound more natural. ### Does this PR introduce _any_ user-facing change? It fixes a typo in the unreleased changes. ### How was this patch tested? N/A Closes #32903 from HyukjinKwon/SPARK-34885. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 10:01:04 +09:00
Takuya UESHIN	ef7545b788	[SPARK-35759][PYTHON] Remove the upperbound for numpy for pandas-on-Spark ### What changes were proposed in this pull request? Removes the upperbound for numpy for pandas-on-Spark. ### Why are the changes needed? We can remove the upper-bound for numpy for pandas-on-Spark because currently it works well on the CI with numpy 1.20.3. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32908 from ueshin/issues/SPARK-35759/numpy. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-15 09:59:05 +09:00
Xinrong Meng	03756618fc	[SPARK-35616][PYTHON] Make `astype` method data-type-based ### What changes were proposed in this pull request? Make `astype` method data-type-based. Non-goal: Match pandas' `astype` TypeErrors. Currently, `astype` throws TypeError error messages only when the destination type is not recognized. However, for some destination types that don't make sense to the specific type of Series/Index, for example, `numeric Series/Index → bytes`, we don't have proper TypeError error messages. Since the goal of the PR is refactoring mainly, the above issue might be resolved later if needed. ### Why are the changes needed? There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32847 from xinrong-databricks/datatypeops_astype. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-14 16:33:15 -07:00
Hyukjin Kwon	76e08a8e3d	[SPARK-35738][PYTHON] Support 'y' properly in DataFrame with non-numeric columns with plots ### What changes were proposed in this pull request? This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172. ```python ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200) ``` Before: ``` pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).; 'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))] +- Project [a#1L, b#2, c#3L] +- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L] +- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false ``` After: ```python Figure({ 'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}', 'name': 'a', ... ``` ### Why are the changes needed? To match pandas' behaviour and allow users to set `x` and `y` in a DataFrame with non-numeric columns. ### Does this PR introduce _any_ user-facing change? No to end users since the change is not released yet. Yes to dev as described before. ### How was this patch tested? Manually tested, added a test and tested in notebooks: ![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png) ![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png) Closes #32884 from HyukjinKwon/fix-hist-plot. 
Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-12 14:36:46 +09:00
Takuya UESHIN	4d21b94d13	[SPARK-35475][PYTHON] Fix disallow_untyped_defs mypy checks ### What changes were proposed in this pull request? Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-11 11:07:11 -07:00
itholic	ebe529e8e1	[SPARK-35591][PYTHON][DOCS] Rename "Koalas" to "pandas API on Spark" in the documents ### What changes were proposed in this pull request? This PR proposes to change the name "Koalas" to "Pandas APIs on Spark" in the documents. ### Why are the changes needed? Since we don't use the name "Koalas" anymore, we should use "Pandas APIs on Spark" instead. ### Does this PR introduce _any_ user-facing change? Yes, the name "Koalas" is renamed to "Pandas APIs on Spark" in the documents. ### How was this patch tested? Manually built the docs and checked one by one. Closes #32835 from itholic/SPARK-35591. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-11 20:42:38 +09:00
Kevin Su	cadd3a0588	[SPARK-35474] Enable disallow_untyped_defs mypy check for pyspark.pandas.indexing ### What changes were proposed in this pull request? Adds more type annotations in the file: `python/pyspark/pandas/spark/indexing.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more disallow_untyped_defs mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. `./dev/lint-python` Closes #32738 from pingsutw/SPARK-35474. Authored-by: Kevin Su <pingsutw@apache.org> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-09 22:35:12 -07:00
Xinrong Meng	e9d60156c4	[SPARK-35705][PYTHON] Adjust pandas-on-spark `test_groupby_multiindex_columns` test for different pandas versions ### What changes were proposed in this pull request? Adjust pandas-on-spark test_groupby_multiindex_columns test in order to pass with different pandas versions. ### Why are the changes needed? pandas had introduced bugs as below: - For pandas 1.1.3 and 1.1.4 Type error: only integer scalar arrays can be converted to a scalar index - For pandas < 1.0.4 Type error: Can only tuple-index with a MultiIndex We ought to adjust `test_groupby_multiindex_columns` tests by comparing with a predefined return value, rather than comparing with the pandas return value in the pandas versions above. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32851 from xinrong-databricks/SPARK-35705. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-10 10:36:19 +09:00
Xinrong Meng	3c66c11aa6	[SPARK-35601][PYTHON] Complete arithmetic operators involving bool literals, Series, and Index ### What changes were proposed in this pull request? Completing arithmetic operators involving bool literals, Series, and Index consists of two main tasks: - Support arithmetic operations against bool literals - Support operators (`+`, `*`) between bool Series/Indexes. ### Why are the changes needed? Arithmetic operators involving bool literals, Series, and Index are incomplete now. We ought to match pandas' behaviors. ### Does this PR introduce _any_ user-facing change? Yes. Newly supported operations example: ```py >>> ps.Series([1, 2, 3]) + True 0 2 1 3 2 4 dtype: int64 >>> ps.Series([1, 2, 3]) + False 0 1 1 2 2 3 dtype: int64 >>> ps.Series([True, False, True]) + True 0 True 1 True 2 True dtype: bool >>> ps.Series([True, False, True]) + False 0 True 1 False 2 True dtype: bool >>> ps.Series([True, False, True]) * True 0 True 1 False 2 True dtype: bool >>> ps.Series([True, False, True]) * False 0 False 1 False 2 False dtype: bool >>> ps.set_option('compute.ops_on_diff_frames', True) >>> ps.Series([True, True, False]) + ps.Series([True, False, True]) 0 True 1 True 2 True dtype: bool >>> ps.Series([True, True, False]) * ps.Series([True, False, True]) 0 True 1 False 2 False dtype: bool ``` Before the change, the operations above were not supported and raised a TypeError such as ```py >>> ps.Series([True, False, True]) + True Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans and the given type. >>> ps.Series([True, False, True]) + False Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans and the given type. ``` ### How was this patch tested? Unit tests. Closes #32785 from xinrong-databricks/datatypeops_arith_bool. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-09 15:13:03 -07:00
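The outputs listed in SPARK-35601 imply a simple element-wise rule: between two booleans, `+` behaves like logical OR and `*` like logical AND, while a boolean combined with a number is treated as 0 or 1. The sketch below models that rule in plain Python; `add` and `mul` are hypothetical helpers, not the pandas-on-Spark implementation.

```python
def add(left, right):
    """Element-wise addition following the semantics shown in the commit's examples."""
    # Both operands boolean: addition acts as logical OR, so the result stays bool.
    if isinstance(left, bool) and isinstance(right, bool):
        return left or right
    # Otherwise a bool operand is treated as the integer 0 or 1.
    return left + right


def mul(left, right):
    """Element-wise multiplication following the same semantics."""
    # Both operands boolean: multiplication acts as logical AND.
    if isinstance(left, bool) and isinstance(right, bool):
        return left and right
    return left * right
```

Applied element by element, these helpers reproduce the values shown in the commit message, for example `[add(v, True) for v in [1, 2, 3]]` and `[mul(v, False) for v in [True, False, True]]`.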
Hyukjin Kwon	afff42178c	[SPARK-35647][PYTHON][DOCS] Restructure User Guide in PySpark documentation ### What changes were proposed in this pull request? This PR proposes to restructure User Guide in PySpark documentation for pandas APIs on Spark. Before ![Screen Shot 2021-06-08 at 8 47 41 PM](https://user-images.githubusercontent.com/6477701/121179493-cb85e280-c89a-11eb-8b93-552ebe7cd0a8.png) After ![Screen Shot 2021-06-08 at 8 46 58 PM](https://user-images.githubusercontent.com/6477701/121179419-b3ae5e80-c89a-11eb-82a0-6dabbf1de12d.png) Note that I mostly just moved the contents around except minor changes: - Removing some questions in FAQ that don't make sense in Apache Spark - Rename a subtitle "Working with pandas and PySpark" to "From/to pandas and PySpark DataFrames" For renaming Koalas to either pandas-on-Spark or pandas APIs on Spark, it will be done at SPARK-35591 ### Why are the changes needed? For better readability. ### Does this PR introduce _any_ user-facing change? Yes, it restructures the documentation as shown above. ### How was this patch tested? I manually built the docs and tested. Closes #32820 from HyukjinKwon/SPARK-35647. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-09 12:13:25 +09:00
liuqi	e79dd89cf6	[SPARK-35512][PYTHON] Fix OverflowError(cannot convert float infinity to integer) in partitionBy function ### What changes were proposed in this pull request? Limit the batch size for `add_shuffle_key` in `partitionBy` function to fix `OverflowError: cannot convert float infinity to integer` ### Why are the changes needed? It's not easy to write a UT, but I can use some simple code to explain the bug. * Original code ``` def add_shuffle_key(split, iterator): buckets = defaultdict(list) c, batch = 0, min(10 * numPartitions, 1000) for k, v in iterator: buckets[partitionFunc(k) % numPartitions].append((k, v)) c += 1 # check used memory and avg size of chunk of objects if (c % 1000 == 0 and get_used_memory() > limit or c > batch): n, size = len(buckets), 0 for split in list(buckets.keys()): yield pack_long(split) d = outputSerializer.dumps(buckets[split]) del buckets[split] yield d size += len(d) avg = int(size / n) >> 20 # let 1M < avg < 10M if avg < 1: batch *= 1.5 elif avg > 10: batch = max(int(batch / 1.5), 1) c = 0 ``` If `get_used_memory() > limit` is always `True` and `avg < 1` is always `True`, the variable `batch` will grow to infinity. Then `batch = max(int(batch / 1.5), 1)` may raise `OverflowError` if `avg > 10` at some point. Sample code to reproduce the bug: ``` import sys limit = 100 used_memory = 200 numPartitions = 64 c, batch = 0, min(10 * numPartitions, 1000) while True: c += 1 if (c % 1000 == 0 and used_memory > limit or c > batch): batch = batch * 1.5 d = max(int(batch / 1.5), 1) print(c, batch) ``` ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? 
It's not easy to write a UT, there is sample code to test ``` import sys limit = 100 used_memory = 200 numPartitions = 64 c, batch = 0, min(10 * numPartitions, 1000) while True: c += 1 if (c % 1000 == 0 and used_memory > limit or c > batch): batch = min(sys.maxsize, batch * 1.5) d = max(int(batch / 1.5), 1) print(c, batch) ``` Closes #32667 from nolanliou/fix_partitionby. Authored-by: liuqi <nolan.liou@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-09 10:57:27 +09:00
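The failure mode in SPARK-35512 can be isolated from the shuffle code. Repeated multiplication by 1.5 silently overflows a Python float to infinity, and only the later `int(...)` conversion raises. A minimal sketch follows, where `grow` is a hypothetical helper that mirrors the `batch * 1.5` step and the `min(sys.maxsize, ...)` cap from the sample code above.

```python
import sys


def grow(batch, steps, cap=None):
    """Multiply batch by 1.5 `steps` times, optionally capping after each step."""
    for _ in range(steps):
        batch = batch * 1.5
        if cap is not None:
            # Same guard as in the commit's sample: never exceed the cap.
            batch = min(cap, batch)
    return batch


unbounded = grow(640, 2000)                  # float overflow yields inf, no error yet
capped = grow(640, 2000, cap=sys.maxsize)    # stays at sys.maxsize once the cap is hit

try:
    int(unbounded / 1.5)
except OverflowError as exc:
    print("unbounded batch fails:", exc)

print("capped batch shrinks safely:", max(int(capped / 1.5), 1))
```

Capping at `sys.maxsize` keeps `batch` finite, so the shrink step `max(int(batch / 1.5), 1)` always has a convertible value to work with.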
Hyukjin Kwon	921abc51cf	[SPARK-35636][PYTHON][DOCS][FOLLOW-UP] Restructure reference API files according to the layout ### What changes were proposed in this pull request? This PR proposes to restructure API files according to the layout, see https://github.com/apache/spark/pull/32799. Now the pandas APIs on Spark are under a separate directory which is same level as other modules such as Spark SQL. ```bash tree reference ``` Before: ``` reference ├── index.rst ├── ps_extensions.rst ├── ps_frame.rst ├── ps_general_functions.rst ├── ps_groupby.rst ├── ps_indexing.rst ├── ps_io.rst ├── ps_ml.rst ├── ps_series.rst ├── ps_window.rst ├── pyspark.ml.rst ├── pyspark.mllib.rst ├── pyspark.pandas.rst ├── pyspark.resource.rst ├── pyspark.rst ├── pyspark.sql.rst ├── pyspark.ss.rst └── pyspark.streaming.rst ``` After: ``` reference ├── index.rst ├── pyspark.ml.rst ├── pyspark.mllib.rst ├── pyspark.pandas │ ├── extensions.rst │ ├── frame.rst │ ├── general_functions.rst │ ├── groupby.rst │ ├── index.rst │ ├── indexing.rst │ ├── io.rst │ ├── ml.rst │ ├── series.rst │ └── window.rst ├── pyspark.resource.rst ├── pyspark.rst ├── pyspark.sql.rst ├── pyspark.ss.rst └── pyspark.streaming.rst ``` ### Why are the changes needed? To make the directory structure easier to follow. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manually built and tested the docs. Closes #32812 from HyukjinKwon/SPARK-35646-followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-08 19:01:56 +09:00
Takuya UESHIN	04418e18d7	[SPARK-35638][PYTHON] Introduce InternalField to manage dtypes and StructFields ### What changes were proposed in this pull request? Introduces `InternalField` to manage dtypes and `StructField`s. `InternalFrame` is already managing dtypes, but when it checks the Spark's data types, column names, and nullabilities, it tries to run the analysis phase each time it needs, which will cause a performance issue. It will use `InternalField` class which stores the retrieved Spark's data types, column names, and nullabilities, and reuse them. Also, in case those can be known, just update and reuse them without asking Spark. ### Why are the changes needed? Currently there are some performance issues in the pandas-on-Spark layer. One of them is accessing Java DataFrame and run analysis phase too many times, especially just for retrieving the current column names or data types. We should reduce the amount of unnecessary access. ### Does this PR introduce _any_ user-facing change? Improves the performance in pandas-on-Spark layer: ```py df = ps.read_parquet("/path/to/test.parquet") # contains ~75 columns df = df[(df["col"] > 0) & (df["col"] < 10000)] ``` Before the PR, it took about 2.15 sec and after 1.15 sec. ### How was this patch tested? Existing tests. Closes #32775 from ueshin/issues/SPARK-35638/field. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-08 11:57:28 +09:00
Xinrong Meng	dfd8a8dc67	[SPARK-35341][PYTHON] Introduce BooleanExtensionOps ### What changes were proposed in this pull request? - Introduce BooleanExtensionOps in order to make boolean operators `and` and `or` data-type-based. - Improve error messages for operators `and` and `or`. ### Why are the changes needed? Boolean operators `__and__`, `__or__`, `__rand__`, and `__ror__` should be data-type-based. BooleanExtensionDtypes process these boolean operators differently from bool, so BooleanExtensionOps is introduced. These boolean operators themselves are also bitwise operators, which should be able to apply to other data type classes later. However, this is not the goal of this PR. ### Does this PR introduce _any_ user-facing change? Yes. Error messages for operators `and` and `or` are improved. Before: ``` >>> psser = ps.Series([1, "x", "y"], dtype="category") >>> psser | True Traceback (most recent call last): ... pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).; 'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))] +- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L] +- LogicalRDD [__index_level_0__#8L, 0#9], false ``` After: ``` >>> psser = ps.Series([1, "x", "y"], dtype="category") >>> psser | True Traceback (most recent call last): ... TypeError: Bitwise or can not be applied to categoricals. ``` ### How was this patch tested? Unit tests. Closes #32698 from xinrong-databricks/datatypeops_extension. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-07 15:43:52 -07:00
Xinrong Meng	04a8d2cbcf	[SPARK-35343][PYTHON] Make the conversion from/to pandas data-type-based for non-ExtensionDtypes ### What changes were proposed in this pull request? Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based. NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614. ### Why are the changes needed? The conversion from/to pandas includes logic for checking data types and behaving accordingly. That makes code hard to change or maintain. Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32592 from xinrong-databricks/datatypeop_pd_conversion. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-07 13:12:12 -07:00
Hyukjin Kwon	7ce7aa4758	[SPARK-35646][PYTHON][DOCS] Relocate pandas-on-Spark API references in documentation ### What changes were proposed in this pull request? This PR proposes to change from: ![Screen Shot 2021-06-07 at 1 40 47 PM](https://user-images.githubusercontent.com/6477701/120960027-fc302400-c795-11eb-96fb-73ac1d8277fe.png) to: ![Screen Shot 2021-06-07 at 1 41 19 PM](https://user-images.githubusercontent.com/6477701/120960074-0fdb8a80-c796-11eb-87ec-69a30692fdfe.png) ### Why are the changes needed? pandas APIs on Spark (pandas on Spark) is a package in PySpark in the end. So it has to be documented in the same level with other packages (e.g., Spark SQL). ### Does this PR introduce _any_ user-facing change? Yes, it changes the structure of the docs. To end users, no as it's only in development branch. ### How was this patch tested? Manually tested as above. Closes #32799 from HyukjinKwon/SPARK-35646. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-07 16:37:58 +09:00
Xinrong Meng	50f7686de9	[SPARK-35599][PYTHON] Adjust `check_exact` parameter for older pd.testing ### What changes were proposed in this pull request? Adjust the `check_exact` parameter for non-numeric columns to ensure pandas-on-Spark tests passed with all pandas versions. ### Why are the changes needed? `pd.testing` utils are utilized in pandas-on-Spark tests. Due to https://github.com/pandas-dev/pandas/issues/35446, `check_exact=True` for non-numeric columns doesn't work for older pd.testing utils, e.g. `assert_series_equal`. We wanted to adjust that to ensure pandas-on-Spark tests pass for all pandas versions. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #32772 from xinrong-databricks/test_util. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-07 11:12:49 +09:00
itholic	b8740a1d1e	[SPARK-35499][PYTHON] Apply black to pandas API on Spark codes ### What changes were proposed in this pull request? This PR proposes applying `black` to the pandas API on Spark code to improve static analysis. By executing `./dev/reformat-python` in the Spark home directory, all the code of the pandas API on Spark is fixed according to the static analysis rules. ### Why are the changes needed? This can reduce the cost of static analysis during development. It has been used continuously for about a year in the Koalas project and its convenience has been proven. ### Does this PR introduce _any_ user-facing change? No, it's dev-only. ### How was this patch tested? Manually reformatted the pandas API on Spark code by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes. Closes #32779 from itholic/SPARK-35499. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2021-06-06 17:30:07 -07:00
Keerthan Vasist	f2c0a049a6	[SPARK-35643][PYTHON] Fix ambiguous reference in functions.py column() ### What changes were proposed in this pull request? In functions.py, there is a function `def column(col)`. There is also another function in the same file, `def col(col)`. This makes it ambiguous whether the name `col` refers to the parameter or to the function. In pyspark 3.1.2, this leads to `TypeError: 'str' object is not callable` when the function `column(col)` is called: the string variable in scope takes precedence over the function `col` in the file, which is not what was intended. This PR fixes that ambiguity by changing the variable name to `col_like`. I have filed this as an issue on JIRA here - https://issues.apache.org/jira/browse/SPARK-35643. ### Why are the changes needed? In pyspark 3.1.2, we see `TypeError: 'str' object is not callable` when the `column()` function is called. This PR fixes that error. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? I don't believe this patch needs additional testing. Closes #32771 from keerthanvasist/col. Lead-authored-by: Keerthan Vasist <kvasist@amazon.com> Co-authored-by: keerthanvasist <kvasist@amazon.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-05 12:40:39 +09:00
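The shadowing problem described in SPARK-35643 can be reproduced without Spark. The sketch below is a stand-alone illustration under stated assumptions: `col` here returns a plain string instead of a Spark `Column`, and `column_buggy` / `column_fixed` are hypothetical names used only to contrast the two parameter names.

```python
def col(name):
    """Stand-in for pyspark.sql.functions.col; returns a string instead of a Column."""
    return "Column<%s>" % name


def column_buggy(col):
    # The parameter `col` shadows the module-level function `col`,
    # so this line tries to call the string argument itself.
    return col(col)


def column_fixed(col_like):
    # With a different parameter name, `col` resolves to the function again.
    return col(col_like)


print(column_fixed("age"))
try:
    column_buggy("age")
except TypeError as e:
    print("TypeError:", e)
```

Renaming the parameter, as the commit does with `col_like`, is enough because Python resolves local names before module-level ones.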
Hyukjin Kwon	3d158f9c91	[SPARK-35587][PYTHON][DOCS] Initial porting of Koalas documentation ### What changes were proposed in this pull request? This PR proposes to port Koalas documentation to PySpark documentation as its initial step. It ports almost as is except these differences: - Renamed import from `databricks.koalas` to `pyspark.pandas`. - Renamed `to_koalas` -> `to_pandas_on_spark` - Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark` - Added a `ps_` prefix in the RST file names of Koalas documentation Other than that, - Excluded `python/docs/build/html` in linter - Fixed GA dependency installation ### Why are the changes needed? To document pandas APIs on Spark. ### Does this PR introduce _any_ user-facing change? Yes, it adds new documentation. ### How was this patch tested? Manually built the docs and checked the output. Closes #32726 from HyukjinKwon/SPARK-35587. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-04 11:11:09 +09:00
itholic	2658bc590f	[SPARK-35081][DOCS] Add Data Source Option links to missing documents ### What changes were proposed in this pull request? This PR proposes adding the missing link to Data Source Option page, for related functions such as `to_csv`, `to_json`, `from_csv`, `from_json`, `schema_of_csv`, `schema_of_json`. - Before <img width="797" alt="Screen Shot 2021-06-03 at 11 39 17 AM" src="https://user-images.githubusercontent.com/44108233/120578877-7b092200-c461-11eb-9e24-bd5349445c66.png"> - After <img width="776" alt="Screen Shot 2021-06-03 at 11 59 14 AM" src="https://user-images.githubusercontent.com/44108233/120579868-29fa2d80-c463-11eb-9329-bd6c8f068f5b.png"> ### Why are the changes needed? To provide users available options in detail with the proper documentation link. ### Does this PR introduce _any_ user-facing change? Yes, the link to Data Source Options page is added to the API documentations, as shown in the above screen capture. ### How was this patch tested? Manually built the docs and checked one by one. Closes #32762 from itholic/SPARK-35081. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-03 13:52:46 +09:00
itholic	48252bac95	[SPARK-35583][DOCS] Move JDBC data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move missing JDBC data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for JDBC data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "JDBC To Other Databases" page <img width="803" alt="Screen Shot 2021-06-02 at 11 34 14 AM" src="https://user-images.githubusercontent.com/44108233/120415520-a115c000-c396-11eb-9663-9e666e08ed2b.png"> - Python ![Screen Shot 2021-06-01 at 2 57 40 PM](https://user-images.githubusercontent.com/44108233/120273628-ba146780-c2e9-11eb-96a8-11bd25415197.png) - Scala ![Screen Shot 2021-06-01 at 2 57 03 PM](https://user-images.githubusercontent.com/44108233/120273567-a2d57a00-c2e9-11eb-9788-ea58028ca0a6.png) - Java ![Screen Shot 2021-06-01 at 2 58 27 PM](https://user-images.githubusercontent.com/44108233/120273722-d912f980-c2e9-11eb-83b3-e09992d8c582.png) ### How was this patch tested? Manually build docs and confirm the page. Closes #32723 from itholic/SPARK-35583. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-02 14:21:16 +09:00
itholic	0ad5ae54b2	[SPARK-35539][PYTHON] Restore to_koalas to keep the backward compatibility ### What changes were proposed in this pull request? This PR proposes restoring `to_koalas` to keep the backward compatibility, with throwing deprecated warning. ### Why are the changes needed? If we remove `to_koalas`, the existing Koalas codes that include `to_koalas` wouldn't work. ### Does this PR introduce _any_ user-facing change? No. It's restoring the existing functionality. ### How was this patch tested? Manually tested in local. ```shell >>> sdf.to_koalas() .../spark/python/pyspark/pandas/frame.py:4550: FutureWarning: DataFrame.to_koalas is deprecated as of DataFrame.to_pandas_on_spark. Please use the API instead. warnings.warn( ``` Closes #32729 from itholic/SPARK-35539. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-02 10:39:24 +09:00
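Keeping an old method name alive while steering users to the new one is commonly done with a thin alias that emits a `FutureWarning`. The sketch below assumes a hypothetical `FrameSketch` class and a simplified warning text; it is not the code in `pyspark/pandas/frame.py`, only the general pattern.

```python
import warnings


class FrameSketch:
    """Hypothetical stand-in for a DataFrame-like class."""

    def to_pandas_on_spark(self):
        return "pandas-on-Spark frame"

    def to_koalas(self):
        # Deprecated alias: warn, then delegate to the new API.
        warnings.warn(
            "DataFrame.to_koalas is deprecated. Use DataFrame.to_pandas_on_spark instead.",
            FutureWarning,
        )
        return self.to_pandas_on_spark()


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = FrameSketch().to_koalas()

print(result, [w.category.__name__ for w in caught])
```

Existing callers keep working because the alias returns exactly what the new method returns; only the warning is added.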
Xinrong Meng	0ac5c16177	[SPARK-35314][PYTHON] Support arithmetic operations against bool IndexOpsMixin ### What changes were proposed in this pull request? Support arithmetic operations against bool IndexOpsMixin. ### Why are the changes needed? Existing binary operations of bool IndexOpsMixin in Koalas do not match pandas’ behaviors. pandas take True as 1, False as 0 when dealing with numeric values, numeric collections, and numeric Series/Index; whereas Koalas raises an AnalysisException no matter what the binary operation is. We aim to match pandas' behaviors. ### Does this PR introduce _any_ user-facing change? Yes. Before the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([True, True, False]) >>> psser + 1 Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> 1 + psser Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> psser + ps.Series([1, 2, 3]) Traceback (most recent call last): ... TypeError: Addition can not be applied to booleans. >>> ps.Series([1, 2, 3]) + psser Traceback (most recent call last): ... TypeError: addition can not be applied to given types. ``` After the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([True, True, False]) >>> psser + 1 0 2 1 2 2 1 dtype: int64 >>> 1 + psser 0 2 1 2 2 1 dtype: int64 >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> psser + ps.Series([1, 2, 3]) 0 2 1 3 2 3 dtype: int64 >>> ps.Series([1, 2, 3]) + psser 0 2 1 3 2 3 dtype: int64 ``` ### How was this patch tested? Unit tests. Closes #32611 from xinrong-databricks/datatypeop_arith_bool. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-06-01 10:57:12 -07:00
itholic	fe09def323	[SPARK-35582][PYTHON][DOCS] Remove # noqa in Python API documents ### What changes were proposed in this pull request? This PR aims to move `# noqa` in the Python docstrings to the proper place so that it is hidden from the official documents. ### Why are the changes needed? If we don't move `# noqa` to the proper place, it is exposed in the middle of the docstring, and it looks a bit weird, as below: <img width="613" alt="Screen Shot 2021-06-01 at 3 17 52 PM" src="https://user-images.githubusercontent.com/44108233/120275617-91da3800-c2ec-11eb-9778-16c5fe789418.png"> ### Does this PR introduce _any_ user-facing change? Yes, the `# noqa` is no longer shown in the documents, as below: <img width="609" alt="Screen Shot 2021-06-01 at 3 21 00 PM" src="https://user-images.githubusercontent.com/44108233/120275927-fbf2dd00-c2ec-11eb-950d-346af2745711.png"> ### How was this patch tested? Manually build docs and check. Closes #32728 from itholic/SPARK-35582. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 15:24:04 +09:00
itholic	73d4f67145	[SPARK-35433][DOCS] Move CSV data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move CSV data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for CSV data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "CSV Files" page <img width="970" alt="Screen Shot 2021-05-27 at 12 35 36 PM" src="https://user-images.githubusercontent.com/44108233/119762269-586a8c80-bee8-11eb-8443-ae5b3c7a685c.png"> - Python <img width="785" alt="Screen Shot 2021-05-25 at 4 12 10 PM" src="https://user-images.githubusercontent.com/44108233/119455390-83cc6a80-bd74-11eb-9156-65785ae27db0.png"> - Scala <img width="718" alt="Screen Shot 2021-05-25 at 4 12 39 PM" src="https://user-images.githubusercontent.com/44108233/119455414-89c24b80-bd74-11eb-9775-aeda549d081e.png"> - Java <img width="667" alt="Screen Shot 2021-05-25 at 4 13 09 PM" src="https://user-images.githubusercontent.com/44108233/119455422-8d55d280-bd74-11eb-97e8-86c1eabeadc2.png"> ### How was this patch tested? Manually build docs and confirm the page. Closes #32658 from itholic/SPARK-35433. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:58:49 +09:00
itholic	7e2717333b	[SPARK-35453][PYTHON] Move Koalas accessor to pandas_on_spark accessor ### What changes were proposed in this pull request? This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor". ### Why are the changes needed? Because we no longer use the name "Koalas" and use "Pandas API on Spark" instead, all the related code needs to be changed. ### Does this PR introduce _any_ user-facing change? Yes, the usage of the pandas API on Spark accessor is changed from `df.koalas.[...]` to `df.pandas_on_spark.[...]`. Note: `df.koalas.[...]` is still available but with deprecation warnings. ### How was this patch tested? Manually tested in local and checked one by one. Closes #32674 from itholic/SPARK-35453. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-06-01 10:33:10 +09:00
Hyukjin Kwon	7eb74482a7	[SPARK-35510][PYTHON] Fix and reenable test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true ### What changes were proposed in this pull request? This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at https://github.com/apache/spark/pull/32657. Seems like this is because of the latest NumPy's behaviour change, see also `https://github.com/numpy/numpy/pull/16273#discussion_r641264085`. pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy. For the time being, I propose to exclude boolean case alone in percentile/quartile test case ### Why are the changes needed? To keep the test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? I roughly locally tested. But it should pass in CI. Closes #32690 from HyukjinKwon/SPARK-35510. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-28 17:35:01 +09:00
Xinrong Meng	79a2a46cdb	[SPARK-35098][PYTHON] Re-enable pandas-on-Spark test cases ### What changes were proposed in this pull request? Re-enable some pandas-on-Spark test cases. ### Why are the changes needed? pandas version in GitHub Actions is upgraded now so we can re-enable some pandas-on-Spark test cases. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests. Closes #32682 from xinrong-databricks/enable_tests. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-27 12:33:30 +09:00
Takuya UESHIN	d6d3209c2f	[SPARK-35537][PYTHON] Introduce a util function spark_column_equals ### What changes were proposed in this pull request? Introduce a util function `spark_column_equals` to check the underlying expressions of columns are the same or not. ### Why are the changes needed? In pandas on Spark, there are some places checking the underlying expressions of columns are the same or not, but it's done one-by-one. We should introduce a util function for it. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The existing tests. Closes #32680 from ueshin/issues/SPARK-35537/spark_column_equals. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-27 12:14:43 +09:00
Xinrong Meng	8cc7232ffa	[SPARK-35522][PYTHON] Introduce BinaryOps for BinaryType ### What changes were proposed in this pull request? BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it. ### Why are the changes needed? The data-type-based-operations class should be set for each individual data type, including BinaryType. In addition, BinaryType has its special way of addition, which means concatenation. ### Does this PR introduce _any_ user-facing change? Yes. Before the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([b'1', b'2', b'3']) >>> psser + psser Traceback (most recent call last): ... TypeError: Type object was not understood. >>> psser + b'1' Traceback (most recent call last): ... TypeError: Type object was not understood. ``` After the change: ```py >>> import pyspark.pandas as ps >>> psser = ps.Series([b'1', b'2', b'3']) >>> psser + psser 0 [49, 49] 1 [50, 50] 2 [51, 51] dtype: object >>> psser + b'1' 0 [49, 49] 1 [50, 49] 2 [51, 49] dtype: object ``` ### How was this patch tested? Unit tests. Closes #32665 from xinrong-databricks/datatypeops_binary. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-26 14:30:24 -07:00
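The "data-type-based" refactoring that recurs in these commits amounts to picking an operations class per data type and letting unsupported combinations raise a uniform `TypeError`. The sketch below is a simplified, Spark-free illustration on Python lists: `OpsSketch`, `BinaryOpsSketch`, and `ops_for` are hypothetical names, and the real classes under `pyspark/pandas/data_type_ops/` are more involved.

```python
class OpsSketch:
    """Fallback operations: every operator is rejected with a uniform message."""

    pretty_name = "objects"

    def add(self, left, right):
        raise TypeError("Addition can not be applied to %s." % self.pretty_name)


class BinaryOpsSketch(OpsSketch):
    """Operations for byte-string values: addition means concatenation."""

    pretty_name = "binaries"

    def add(self, left, right):
        if isinstance(right, bytes):
            # Scalar operand: append it to every element.
            return [value + right for value in left]
        if isinstance(right, list) and all(isinstance(v, bytes) for v in right):
            # Element-wise concatenation of two byte-string lists.
            return [a + b for a, b in zip(left, right)]
        raise TypeError("Concatenation can not be applied to %s and the given type." % self.pretty_name)


def ops_for(values):
    """Choose the operations class from the element type (assumed dispatch rule)."""
    if values and all(isinstance(v, bytes) for v in values):
        return BinaryOpsSketch()
    return OpsSketch()


data = [b"1", b"2", b"3"]
print(ops_for(data).add(data, b"1"))
print(ops_for(data).add(data, data))
```

Keeping the error text in the base class is what makes the messages consistent across types, which is the point of the later standardization commits.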
Xinrong Meng	266608d50e	[SPARK-35452][PYTHON] Introduce ArrayOps, MapOps and StructOps ### What changes were proposed in this pull request? The PR is proposed to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately. ### Why are the changes needed? StructType, ArrayType, and MapType are not accepted by DataTypeOps now. We should handle these complex types. Among them: - ArrayType supports concatenation: for example, ps.Series([[1,2,3]]) + ps.Series([[4,5,6]]) should work the same as pd.Series([[1,2,3]]) + pd.Series([[4,5,6]]), as concatenation. - StructOps will be helpful to make to/from pandas conversion data-type-based. ### Does this PR introduce _any_ user-facing change? Yes. Before the change: ```py >>> import pyspark.pandas as ps >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]]) Traceback (most recent call last): ... TypeError: Type object was not understood. >>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]]) Traceback (most recent call last): ... TypeError: Type object was not understood. >>> ps.Series([[1, 2, 3]]) + ps.Series([['x']]) Traceback (most recent call last): ... TypeError: Type object was not understood. ``` After the change: ```py >>> import pyspark.pandas as ps >>> from pyspark.pandas.config import set_option >>> set_option("compute.ops_on_diff_frames", True) >>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]]) 0 [1.0, 2.0, 3.0, 0.4, 0.5] dtype: object >>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]]) 0 [1, 2, 3, 4, 5] dtype: object >>> ps.Series([[1, 2, 3]]) + ps.Series([['x']]) Traceback (most recent call last): ... TypeError: Concatenation can only be applied to arrays of the same type ``` ### How was this patch tested? Unit tests. Closes #32626 from xinrong-databricks/datatypeop_complex. 
Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-26 10:40:01 -07:00
itholic	79a6b0cc8a	[SPARK-35509][DOCS] Move text data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move text data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for text data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "Text Files" page <img width="823" alt="Screen Shot 2021-05-26 at 3 20 11 PM" src="https://user-images.githubusercontent.com/44108233/119611669-f5202200-be35-11eb-9307-45846949d300.png"> - Python <img width="791" alt="Screen Shot 2021-05-25 at 5 04 26 PM" src="https://user-images.githubusercontent.com/44108233/119462469-b9c11d00-bd7b-11eb-8f19-2ba7b9ceb318.png"> - Scala <img width="683" alt="Screen Shot 2021-05-25 at 5 05 10 PM" src="https://user-images.githubusercontent.com/44108233/119462483-bd54a400-bd7b-11eb-8177-74e4d7035e63.png"> - Java <img width="665" alt="Screen Shot 2021-05-25 at 5 05 36 PM" src="https://user-images.githubusercontent.com/44108233/119462501-bfb6fe00-bd7b-11eb-8161-12c58fabe7e2.png"> ### How was this patch tested? Manually build docs and confirm the page. Closes #32660 from itholic/SPARK-35509. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-26 17:12:49 +09:00
Hyukjin Kwon	20750a3f9e	[SPARK-32194][PYTHON] Use proper exception classes instead of plain Exception ### What changes were proposed in this pull request? This PR proposes to use proper built-in exceptions instead of the plain `Exception` in Python. While I am here, I fixed another minor issue at `DataFrame.schema` together: ```diff - except AttributeError as e: - raise Exception( - "Unable to parse datatype from schema. %s" % e) + except Exception as e: + raise ValueError( + "Unable to parse datatype from schema. %s" % e) from e ``` Now it catches all exceptions during schema parsing and chains the exception with `ValueError`. Previously it only caught `AttributeError`, which does not cover all cases. ### Why are the changes needed? For users to expect the proper exceptions. ### Does this PR introduce _any_ user-facing change? Yeah, the exception classes became different but should be compatible because the previous exception was plain `Exception`, which other exceptions inherit. ### How was this patch tested? Existing unittests should cover. Closes #31238 Closes #32650 from HyukjinKwon/SPARK-32194. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-26 11:54:40 +09:00
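The `raise ... from e` form in the diff above chains the original error onto the new `ValueError`, so the root cause stays visible in the traceback. The sketch below shows the pattern in isolation; `parse_datatype` and its dictionary input are hypothetical and stand in for the real schema parsing.

```python
def parse_datatype(schema):
    """Hypothetical parser: read the 'type' entry and normalize it to lower case."""
    try:
        return schema["type"].lower()
    except Exception as e:
        # Catch any failure (KeyError, TypeError, AttributeError, ...) and
        # re-raise as ValueError while keeping the original as __cause__.
        raise ValueError("Unable to parse datatype from schema. %s" % e) from e


print(parse_datatype({"type": "STRING"}))
try:
    parse_datatype({})
except ValueError as err:
    print(type(err.__cause__).__name__)
```

Catching `Exception` rather than only `AttributeError` means a missing key or a `None` schema also surfaces as a `ValueError`, which is the broader coverage the commit message describes.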
Hyukjin Kwon	e47e615c0e	[SPARK-35506][PYTHON][INFRA] Run tests with Python 3.9 in GitHub Actions ### What changes were proposed in this pull request? This PR enables GitHub Actions to test PySpark with Python 3.9. ### Why are the changes needed? To verify the support of Python 3.9. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Existing tests should cover. Closes #32657 from HyukjinKwon/SPARK-35506. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-26 09:25:51 +09:00
Takuya UESHIN	d67d73b708	[SPARK-35505][PYTHON] Remove APIs which have been deprecated in Koalas ### What changes were proposed in this pull request? Removes APIs which have been deprecated in Koalas. ### Why are the changes needed? There are some APIs that have been deprecated in Koalas. We shouldn't have those in pandas APIs on Spark. ### Does this PR introduce _any_ user-facing change? Yes, the APIs deprecated in Koalas will be no longer available. ### How was this patch tested? Modified some tests which use the deprecated APIs, and the other existing tests should pass. Closes #32656 from ueshin/issues/SPARK-35505/remove_deprecated. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-25 11:16:27 -07:00
Hyukjin Kwon	4a6d844184	[SPARK-35497][PYTHON] Enable plotly tests in pandas-on-Spark ### What changes were proposed in this pull request? This PR enables plot tests with plotly ```bash ./python/run-tests --python-executables=python3 --modules=pyspark-pandas ``` Before: ``` Traceback (most recent call last): File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main return _run_code(code, main_globals, None, File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 87, in _run_code exec(code, run_globals) File "/.../pyspark/pandas/tests/plot/test_frame_plot_plotly.py", line 42, in <module> plotly_requirement_message + " Or pandas<1.0; pandas<1.0 does not support latest plotly " TypeError: unsupported operand type(s) for +: 'NoneType' and 'str' ``` After: ``` ... Starting test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly ... Finished test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly (23s) ... Tests passed in 1296 seconds ``` ### Why are the changes needed? For test coverage. ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? By running the tests. Closes #32649 from HyukjinKwon/SPARK-35497. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-25 12:31:32 +09:00
Weichen Xu	fdd7ca5f4e	[SPARK-35498][PYTHON] Add thread target wrapper API for pyspark pin thread mode ### What changes were proposed in this pull request? Add thread target wrapper API for pyspark pin thread mode. ### Why are the changes needed? A helper method that makes it easier for users to write threading code under pin thread mode. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual. Closes #32644 from WeichenXu123/add_thread_target_wrapper_api. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-25 09:50:22 +09:00
Takuya UESHIN	1b75c2494c	[SPARK-35467][SPARK-35468][SPARK-35477][PYTHON] Fix disallow_untyped_defs mypy checks ### What changes were proposed in this pull request? Adds more type annotations in the files: - `python/pyspark/pandas/spark/accessors.py` - `python/pyspark/pandas/typedef/typehints.py` - `python/pyspark/pandas/utils.py` and fixes the mypy check failures. ### Why are the changes needed? We should enable more `disallow_untyped_defs` mypy checks. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32627 from ueshin/issues/SPARK-35467_35468_35477/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-24 09:31:00 +09:00
Takuya UESHIN	2616d5cc1d	[SPARK-35465][PYTHON] Set up the mypy configuration to enable disallow_untyped_defs check for pandas APIs on Spark module ### What changes were proposed in this pull request? Sets up the `mypy` configuration to enable `disallow_untyped_defs` check for pandas APIs on Spark module. ### Why are the changes needed? Currently many functions in the main codes in pandas APIs on Spark module are still missing type annotations and disabled `mypy` check `disallow_untyped_defs`. We should add more type annotations and enable the mypy check. ### Does this PR introduce _any_ user-facing change? Yes. This PR adds more type annotations in pandas APIs on Spark module, which can impact interaction with development tools for users. ### How was this patch tested? The mypy check with a new configuration and existing tests should pass. Closes #32614 from ueshin/issues/SPARK-35465/disallow_untyped_defs. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-21 11:03:35 -07:00
itholic	d2bdd6595e	[SPARK-35025][SQL][PYTHON][DOCS] Move Parquet data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move Parquet data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for Parquet data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "Parquet Files" page ![Screen Shot 2021-05-21 at 1 35 08 PM](https://user-images.githubusercontent.com/44108233/119082866-e7375f00-ba39-11eb-9ade-a931a5957b34.png) - Python ![Screen Shot 2021-05-21 at 1 38 27 PM](https://user-images.githubusercontent.com/44108233/119082879-eef70380-ba39-11eb-9e8e-ee50eed98dbe.png) - Scala ![Screen Shot 2021-05-21 at 1 36 52 PM](https://user-images.githubusercontent.com/44108233/119082884-f1595d80-ba39-11eb-98d5-966657df65f7.png) - Java ![Screen Shot 2021-05-21 at 1 37 19 PM](https://user-images.githubusercontent.com/44108233/119082888-f4544e00-ba39-11eb-8bf8-47ce78ec0b01.png) ### How was this patch tested? Manually build docs and confirm the page. Closes #32161 from itholic/SPARK-34491. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:05:49 +09:00
itholic	419ddcb2a4	[SPARK-34494][SQL][DOCS] Move JSON data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move JSON data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for JSON data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "JSON Files" page <img width="876" alt="Screen Shot 2021-05-20 at 8 48 27 PM" src="https://user-images.githubusercontent.com/44108233/118973662-ddb3e580-b9ac-11eb-987c-8139aa9c3fe2.png"> - Python <img width="714" alt="Screen Shot 2021-04-16 at 5 04 11 PM" src="https://user-images.githubusercontent.com/44108233/114992491-ca0cef00-9ed5-11eb-9d0f-4de60d8b2516.png"> - Scala <img width="726" alt="Screen Shot 2021-04-16 at 5 04 54 PM" src="https://user-images.githubusercontent.com/44108233/114992594-e315a000-9ed5-11eb-8bd3-af7e568fcfe1.png"> - Java <img width="911" alt="Screen Shot 2021-04-16 at 5 06 11 PM" src="https://user-images.githubusercontent.com/44108233/114992751-10624e00-9ed6-11eb-888c-8668d3c74289.png"> ### How was this patch tested? Manually build docs and confirm the page. Closes #32204 from itholic/SPARK-35081. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:05:13 +09:00
itholic	0fe65b5365	[SPARK-35395][DOCS] Move ORC data source options from Python and Scala into a single page ### What changes were proposed in this pull request? This PR proposes move ORC data source options from Python, Scala and Java into a single page. ### Why are the changes needed? So far, the documentation for ORC data source options is separated into different pages for each language API documents. However, this makes managing many options inconvenient, so it is efficient to manage all options in a single page and provide a link to that page in the API of each language. ### Does this PR introduce _any_ user-facing change? Yes, the documents will be shown below after this change: - "ORC Files" page ![Screen Shot 2021-05-21 at 2 07 14 PM](https://user-images.githubusercontent.com/44108233/119085078-f4564d00-ba3d-11eb-8990-3ba031d809da.png) - Python ![Screen Shot 2021-05-21 at 2 06 46 PM](https://user-images.githubusercontent.com/44108233/119085097-00daa580-ba3e-11eb-8017-ac5a95a7c053.png) - Scala ![Screen Shot 2021-05-21 at 2 06 09 PM](https://user-images.githubusercontent.com/44108233/119085135-164fcf80-ba3e-11eb-9cac-78dded523f38.png) - Java ![Screen Shot 2021-05-21 at 2 06 30 PM](https://user-images.githubusercontent.com/44108233/119085125-118b1b80-ba3e-11eb-9434-f26612d7da13.png) ### How was this patch tested? Manually build docs and confirm the page. Closes #32546 from itholic/SPARK-35395. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-21 18:03:57 +09:00
itholic	6b912e4179	[SPARK-35364][PYTHON] Renaming the existing Koalas related codes ### What changes were proposed in this pull request? There are still naming related to Koalas in test and function name. This PR addressed them to fit pandas-on-spark. - kdf -> psdf - kser -> psser - kidx -> psidx - kmidx -> psmidx - to_koalas() -> to_pandas_on_spark() ### Why are the changes needed? This is because the name Koalas is no longer used in PySpark. ### Does this PR introduce _any_ user-facing change? `to_koalas()` function is renamed to `to_pandas_on_spark()` ### How was this patch tested? Tested in local manually. After changing the related naming, I checked them one by one. Closes #32516 from itholic/SPARK-35364. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-20 15:08:30 -07:00
Xinrong Meng	a970f8505d	[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures ### What changes were proposed in this pull request? The PR is proposed for pandas APIs on Spark, in order to separate arithmetic operations shown as below into data-type-based structures. `__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__, __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__` DataTypeOps and subclasses are introduced. The existing behaviors of each arithmetic operation should be preserved. ### Why are the changes needed? Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types. Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class. Closes #32596 from xinrong-databricks/datatypeop_arith_fix. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-19 19:47:00 -07:00
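The data-type-based structure described in the SPARK-35338 entry above can be pictured with a small sketch in plain Python. This is not the PySpark implementation: `ToySeries`, `_select_ops`, and the list-backed storage are hypothetical stand-ins, and only the class name `DataTypeOps` comes from the commit message. The idea shown is that the base class rejects every operation with a `TypeError`, and each data type overrides only the operations it supports.

```python
class DataTypeOps:
    """Base ops object: every operation is unsupported unless overridden."""

    pretty_name = "objects"

    def add(self, left, right):
        raise TypeError("Addition can not be applied to %s." % self.pretty_name)

    def mul(self, left, right):
        raise TypeError("Multiplication can not be applied to %s." % self.pretty_name)


class NumericOps(DataTypeOps):
    pretty_name = "numerics"

    def add(self, left, right):
        return [x + right for x in left]

    def mul(self, left, right):
        return [x * right for x in left]


class StringOps(DataTypeOps):
    pretty_name = "strings"

    def add(self, left, right):
        if not isinstance(right, str):
            raise TypeError("String addition can only be applied to strings.")
        return [x + right for x in left]


def _select_ops(values):
    # Hypothetical helper: choose the ops object from the element types.
    if all(isinstance(v, str) for v in values):
        return StringOps()
    if all(isinstance(v, (int, float)) for v in values):
        return NumericOps()
    return DataTypeOps()


class ToySeries:
    """Toy stand-in for a Series; operators delegate to the ops object."""

    def __init__(self, values):
        self.values = list(values)
        self.ops = _select_ops(self.values)

    def __add__(self, other):
        return ToySeries(self.ops.add(self.values, other))

    def __mul__(self, other):
        return ToySeries(self.ops.mul(self.values, other))
```

With this layout, supporting a new operation for one data type means overriding a single method in one subclass, rather than adding another branch to a function shared by all types.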
Hyukjin Kwon	7eaabf4df5	[SPARK-35408][PYTHON][FOLLOW-UP] Avoid unnecessary f-string format ### What changes were proposed in this pull request? This PR avoids using the f-string format, a new feature in Python 3.6. Although it's legitimate to use this syntax because Apache Spark supports Python 3.6+, this breaks unofficial support of Python 3.5. This specific f-string looks unnecessary, and it does not seem worthwhile to remove such unofficial support because of one string format in an error message. NOTE that this PR doesn't mean that we're maintaining Python 3.5, since we dropped it. It just looks like too much to remove that unofficial support only because of one string format and error message. ### Why are the changes needed? To keep unofficial Python 3.5 support ### Does this PR introduce _any_ user-facing change? Officially nope. ### How was this patch tested? Ran the linters. Closes #32598 from HyukjinKwon/SPARK-35408=followup. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-20 10:47:31 +09:00
Takuya UESHIN	d44e6c7f10	Revert "[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures" This reverts commit `d1b24d8aba`.	2021-05-19 16:49:47 -07:00
Xinrong Meng	d1b24d8aba	[SPARK-35338][PYTHON] Separate arithmetic operations into data type based structures ### What changes were proposed in this pull request? The PR is proposed for pandas APIs on Spark, in order to separate arithmetic operations shown as below into data-type-based structures. `__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__, __radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__,__rmod__` DataTypeOps and subclasses are introduced. The existing behaviors of each arithmetic operation should be preserved. ### Why are the changes needed? Currently, the same arithmetic operation of all data types is defined in one function, so it’s difficult to extend the behavior change based on the data types. Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing). ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class. Closes #32469 from xinrong-databricks/datatypeop_arith. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-19 15:05:32 -07:00
Kousuke Saruta	9283bebbbd	[SPARK-35418][SQL] Add sentences function to functions.{scala,py} ### What changes were proposed in this pull request? This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`. ### Why are the changes needed? This function can be only used from SQL for now. It's good if we can use this function from Scala/Python code as well as SQL. ### Does this PR introduce _any_ user-facing change? Yes. Users can use this function from Scala and Python. ### How was this patch tested? New test. Closes #32566 from sarutak/sentences-function. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2021-05-19 20:07:28 +09:00
Hyukjin Kwon	747fe7282c	[SPARK-35419][PYTHON] Enable spark.sql.execution.pyspark.udf.simplifiedTraceback.enabled by default ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/30309 added a configuration (disabled by default) that simplifies the error messages from Python UDFs, which removed internal stacktrace from Python workers: ```python from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` Before ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(*args, **kwargs) File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` After ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` Note that the traceback (`return f(*args, **kwargs)`) is almost always the same - I would say more than 99%. For the 1% case, we can guide developers to enable this configuration for further debugging. In Databricks, it has been enabled for around 6 months, and I have had zero negative feedback on it. ### Why are the changes needed? To show simplified exception messages to end users. ### Does this PR introduce _any_ user-facing change? Yes, it will hide the internal Python worker traceback. ### How was this patch tested? Existing test cases should cover. Closes #32569 from HyukjinKwon/SPARK-35419. Authored-by: Hyukjin Kwon <gurwls223@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-18 12:27:09 +09:00
Takuya UESHIN	2a335f2d7d	[SPARK-34941][PYTHON] Fix mypy errors and enable mypy check for pandas-on-Spark ### What changes were proposed in this pull request? Fixes `mypy` errors and enables `mypy` check for pandas-on-Spark. ### Why are the changes needed? The `mypy` check for pandas-on-Spark was disabled when the initial porting. It should be enabled again; otherwise we will miss type checking errors. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The enabled `mypy` check and existing unit tests should pass. Closes #32540 from ueshin/issues/SPARK-34941/pandas_mypy. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-05-17 10:46:59 -07:00
Gera Shegalov	9eb45ecb4f	[SPARK-35408][PYTHON] Improve parameter validation in DataFrame.show ### What changes were proposed in this pull request? Provide clearer error message tied to the user's Python code if incorrect parameters are passed to `DataFrame.show` rather than the message about a missing JVM method the user is not calling directly. ``` py4j.Py4JException: Method showString([class java.lang.Boolean, class java.lang.Integer, class java.lang.Boolean]) does not exist at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318) at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326) at py4j.Gateway.invoke(Gateway.java:274) at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132) at py4j.commands.CallCommand.execute(CallCommand.java:79) at py4j.GatewayConnection.run(GatewayConnection.java:238) at java.lang.Thread.run(Thread.java:748 ``` ### Why are the changes needed? For faster debugging through actionable error message. ### Does this PR introduce _any_ user-facing change? No change for the correct parameters but different error messages for the parameters triggering an exception. ### How was this patch tested? - unit test - manually in PySpark REPL Closes #32555 from gerashegalov/df_show_validation. Authored-by: Gera Shegalov <gera@apache.org> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-17 16:22:46 +09:00
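The kind of up-front validation described in the SPARK-35408 entry above could look roughly like the following sketch. It is written in plain Python and does not reproduce the PySpark source: the function name `validate_show_args`, its return value, and the exact messages are assumptions made for illustration. The point is that wrong argument types fail early with a `TypeError` that names the Python parameter, instead of surfacing later as a missing JVM method.

```python
def validate_show_args(n=20, truncate=True, vertical=False):
    """Check show()-style arguments before anything is sent to the JVM.

    Hypothetical helper: returns (n, truncate_width, vertical), where the
    width is 20 for truncate=True and 0 for truncate=False.
    """
    # bool is a subclass of int, so it has to be rejected explicitly.
    if not isinstance(n, int) or isinstance(n, bool):
        raise TypeError("Parameter 'n' (number of rows) must be an int")
    if not isinstance(vertical, bool):
        raise TypeError("Parameter 'vertical' must be a bool")

    if isinstance(truncate, bool):
        width = 20 if truncate else 0
    elif isinstance(truncate, int):
        width = truncate
    else:
        raise TypeError(
            "Parameter 'truncate' must be a bool or an int, got %r" % (truncate,)
        )
    return n, width, vertical
```

For example, `validate_show_args(True)` raises a `TypeError` about `n`, which is the situation where a user passes `truncate` positionally by mistake.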
Sean Owen	a37cce95c2	[MINOR][DOCS] Add required imports to CV, train validation split Pyspark ML examples ### What changes were proposed in this pull request? Add required imports to Pyspark ML examples in CrossValidator, TrainValidationSplit ### Why are the changes needed? The examples pass doctests because of previous imports, but as they appear in Pyspark documentation, are incomplete. The additional imports are required to make the example work. ### Does this PR introduce _any_ user-facing change? No, docs only change. ### How was this patch tested? Existing tests. Closes #32554 from srowen/TuningImports. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-05-15 08:13:54 -05:00
Ruifeng Zheng	f7704ece40	[SPARK-35392][ML][PYTHON] Fix flaky tests in ml/clustering.py and ml/feature.py ### What changes were proposed in this pull request? This PR removes the check of `summary.logLikelihood` in ml/clustering.py - this GMM test is quite flaky. It fails easily e.g., if: - change number of partitions; - just change the way to compute the sum of weights; - change the underlying BLAS impl Also uses more permissive precision on `Word2Vec` test case. ### Why are the changes needed? To recover the build and tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing test cases. Closes #32533 from zhengruifeng/SPARK_35392_disable_flaky_gmm_test. Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-13 22:23:51 +09:00
Takuya UESHIN	17b59a9970	[SPARK-35382][PYTHON] Fix lambda variable name issues in nested DataFrame functions in Python APIs ### What changes were proposed in this pull request? This PR fixes the same issue as #32424. ```py from pyspark.sql.functions import flatten, struct, transform df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters") df.select(flatten( transform( "numbers", lambda number: transform( "letters", lambda letter: struct(number.alias("n"), letter.alias("l")) ) ) ).alias("zipped")).show(truncate=False) ``` Before: ``` +------------------------------------------------------------------------+ \|zipped \| +------------------------------------------------------------------------+ \|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]\| +------------------------------------------------------------------------+ ``` After: ``` +------------------------------------------------------------------------+ \|zipped \| +------------------------------------------------------------------------+ \|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]\| +------------------------------------------------------------------------+ ``` ### Why are the changes needed? To produce the correct results. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the results to be correct as mentioned above. ### How was this patch tested? Added a unit test as well as manually. Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-13 14:58:01 +09:00
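The "Before" output in the SPARK-35382 entry above is what one would expect if the inner and outer lambdas ended up sharing one variable name, so the inner binding shadowed the outer one. The toy below illustrates that failure mode and one way around it using plain strings. It is only a sketch under that assumption: `make_transform`, `nested`, and the `x_<n>` naming scheme are hypothetical and do not reflect how PySpark builds its expressions.

```python
import itertools


def make_transform(unique_names):
    """Return a toy transform() that renders a lambda expression as text."""
    counter = itertools.count()

    def transform(column, fn):
        # With unique_names=False every lambda variable is called "x",
        # so a nested lambda shadows the variable of the enclosing one.
        name = "x_%d" % next(counter) if unique_names else "x"
        return "transform(%s, %s -> %s)" % (column, name, fn(name))

    return transform


def nested(transform):
    # Mirrors the shape of the example in the commit message.
    return transform(
        "numbers",
        lambda number: transform(
            "letters",
            lambda letter: "struct(%s, %s)" % (number, letter),
        ),
    )


print(nested(make_transform(False)))
print(nested(make_transform(True)))
```

The first call renders `struct(x, x)`, where both fields refer to the inner variable; the second renders `struct(x_0, x_1)`, keeping the outer and inner variables distinct.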
Sean Owen	a189be8754	[MINOR][DOCS] Avoid some python docs where first sentence has "e.g." or similar ### What changes were proposed in this pull request? Avoid some python docs where first sentence has "e.g." or similar as the period causes the docs to show only half of the first sentence as the summary. ### Why are the changes needed? See for example https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.regression.LinearRegressionModel.html?highlight=linearregressionmodel#pyspark.ml.regression.LinearRegressionModel.summary where the method description is clearly truncated. ### Does this PR introduce _any_ user-facing change? Only changes docs. ### How was this patch tested? Manual testing of docs. Closes #32508 from srowen/TruncatedPythonDesc. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>	2021-05-12 10:38:59 +09:00
Xinrong Meng	5ecb112410	[SPARK-35300][PYTHON][DOCS] Standardize module names in install.rst ### What changes were proposed in this pull request? Use full names of modules in `install.rst` when specifying dependencies. ### Why are the changes needed? Using full names makes it more clear. In addition, `pandas APIs on Spark` as a new module can start to be recognized by more people. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual verification. Closes #32427 from xinrong-databricks/nameDoc. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-04 11:02:57 +09:00
Xinrong Meng	120c389b00	[SPARK-34887][PYTHON] Port Koalas dependencies into PySpark ### What changes were proposed in this pull request? Port Koalas dependencies appropriately to PySpark dependencies. ### Why are the changes needed? pandas-on-Spark has its own required dependency and optional dependencies. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #32386 from xinrong-databricks/portDeps. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-04 09:04:23 +09:00
garawalid	176218b6b8	[SPARK-35292][PYTHON] Delete redundant parameter in mypy configuration ### What changes were proposed in this pull request? The parameter no_implicit_optional is defined twice in the mypy configuration, [line 20](https://github.com/apache/spark/blob/master/python/mypy.ini#L20) and line 105. ### Why are the changes needed? We would like to keep the mypy configuration clean. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This patch can be tested with `dev/lint-python` Closes #32418 from garawalid/feature/clean-mypy-config. Authored-by: garawalid <gwalid94@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-04 09:01:34 +09:00
HyukjinKwon	8aaa9e890a	[SPARK-35250][SQL][DOCS] Fix duplicated STOP_AT_DELIMITER to SKIP_VALUE at CSV's unescapedQuoteHandling option documentation ### What changes were proposed in this pull request? This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too. `STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation. ### Why are the changes needed? To correctly document. ### Does this PR introduce _any_ user-facing change? Yes, it fixes the user-facing documentation. ### How was this patch tested? I checked them via running linters. Closes #32423 from HyukjinKwon/SPARK-35250. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-04 08:44:18 +09:00
Yikun Jiang	44b7931936	[SPARK-35176][PYTHON] Standardize input validation error type ### What changes were proposed in this pull request? This PR corrects the exception type raised when function input parameters fail validation because of their type. To make review easier, there are 3 commits in this PR: - Standardize input validation error type on sql - Standardize input validation error type on ml - Standardize input validation error type on pandas ### Why are the changes needed? The Python exception documentation [1] describes TypeError as "Raised when an operation or function is applied to an object of inappropriate type.", but some PySpark code raises ValueError in these cases; this patch fixes them. [1] https://docs.python.org/3/library/exceptions.html#TypeError Note that this patch only addresses the existing incorrect exception types for input validation; the input validation decorator/framework mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176) will be submitted in a separate patch. ### Does this PR introduce _any_ user-facing change? Yes, code can raise the right TypeError instead of ValueError. ### How was this patch tested? Existing test case and UT Closes #32368 from Yikun/SPARK-35176. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-05-03 15:34:24 +09:00
Yikun Jiang	0769049ee1	[SPARK-34979][PYTHON][DOC] Add PyArrow installation note for PySpark aarch64 user ### What changes were proposed in this pull request? This patch adds a note for aarch64 users to install pyarrow>=4.0.0. ### Why are the changes needed? The pyarrow aarch64 support was [introduced](https://github.com/apache/arrow/pull/9285) in [PyArrow 4.0.0](https://github.com/apache/arrow/releases/tag/apache-arrow-4.0.0), which was published on 27 Apr 2021. See more in [SPARK-34979](https://issues.apache.org/jira/browse/SPARK-34979). ### Does this PR introduce _any_ user-facing change? Yes, this doc can help users install arrow on aarch64. ### How was this patch tested? doc test passed. Closes #32363 from Yikun/SPARK-34979. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: hyukjinkwon <gurwls223@apache.org>	2021-04-28 09:56:17 +09:00
Ludovic Henry	5b77ebb57b	[SPARK-35150][ML] Accelerate fallback BLAS with dev.ludovic.netlib ### What changes were proposed in this pull request? Following https://github.com/apache/spark/pull/30810, I've continued looking for ways to accelerate the usage of BLAS in Spark. With this PR, I integrate work done in the [`dev.ludovic.netlib`](https://github.com/luhenry/netlib/) Maven package. The `dev.ludovic.netlib` library wraps the original `com.github.fommil.netlib` library and focus on accelerating the linear algebra routines in use in Spark. When running the `org.apache.spark.ml.linalg.BLASBenchmark` benchmarking suite, I get the results at [1] on an Intel machine. Moreover, this library is thoroughly tested to return the exact same results as the reference implementation. Under the hood, it reimplements the necessary algorithms in pure autovectorization-friendly Java 8, as well as takes advantage of the Vector API and Foreign Linker API introduced in JDK 16 when available. A table summarising which version gets loaded in which case: ``` \| \| BLAS.nativeBLAS \| BLAS.javaBLAS \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| \| with -Pnetlib-lgpl \| 1. dev.ludovic.netlib.blas.NetlibNativeBLAS, a \| 1. dev.ludovic.netlib.blas.VectorizedBLAS \| \| \| wrapper for com.github.fommil:all \| (JDK16+, relies on the Vector API, requires \| \| \| 2. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+, \| `--add-modules=jdk.incubator.vector` on JDK16) \| \| \| relies on the Foreign Linker API, requires \| 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+) \| \| \| `--add-modules=jdk.incubator.foreign \| 3. dev.ludovic.netlib.blas.JavaBLAS \| \| \| -Dforeign.restricted=warn`) \| 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a \| \| \| 3. 
fails to load, falls back to BLAS.javaBLAS in \| wrapper for com.github.fommil:core \| \| \| org.apache.spark.ml.linalg.BLAS \| \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| \| without -Pnetlib-lgpl \| 1. dev.ludovic.netlib.blas.ForeignBLAS (JDK16+, \| 1. dev.ludovic.netlib.blas.VectorizedBLAS \| \| \| relies on the Foreign Linker API, requires \| (JDK16+, relies on the Vector API, requires \| \| \| `--add-modules=jdk.incubator.foreign \| `--add-modules=jdk.incubator.vector` on JDK16) \| \| \| -Dforeign.restricted=warn`) \| 2. dev.ludovic.netlib.blas.Java11BLAS (JDK11+) \| \| \| 2. fails to load, falls back to BLAS.javaBLAS in \| 3. dev.ludovic.netlib.blas.JavaBLAS \| \| \| org.apache.spark.ml.linalg.BLAS \| 4. dev.ludovic.netlib.blas.NetlibF2jBLAS, a \| \| \| \| wrapper for com.github.fommil:core \| \| --------------------- \| -------------------------------------------------- \| -------------------------------------------------- \| ``` ### Why are the changes needed? Accelerates linear algebra operations when the pure-java fallback method is in use. Transparently falls back to native implementation (OpenBLAS, MKL) when available. ### Does this PR introduce _any_ user-facing change? No, all changes are transparent to the user. ### How was this patch tested? The `dev.ludovic.netlib` library has its own test suite [2]. It has also been validated by running the Spark test suite and benchmarking suite. 
[1] Results for `org.apache.spark.ml.linalg.BLASBenchmark`: #### JDK8: ``` [info] OpenJDK 64-Bit Server VM 1.8.0_292-b10 on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.Java8BLAS [info] nativeBLAS = dev.ludovic.netlib.blas.Java8BLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 223 232 8 448.0 2.2 1.0X [info] java 221 228 7 453.0 2.2 1.0X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 122 128 4 821.2 1.2 1.0X [info] java 122 128 4 822.3 1.2 1.0X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 109 112 2 921.4 1.1 1.0X [info] java 70 74 3 1423.5 0.7 1.5X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 98 2 1046.1 1.0 1.0X [info] java 47 49 2 2121.7 0.5 2.0X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 184 195 8 544.3 1.8 1.0X [info] java 185 196 7 539.5 1.9 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
------------------------------------------------------------------------------------------------------------------------ [info] f2j 99 104 4 1011.9 1.0 1.0X [info] java 99 104 4 1010.4 1.0 1.0X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 947.2 1.1 1.0X [info] java 0 0 0 1584.8 0.6 1.7X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 867.4 1.2 1.0X [info] java 1 1 0 865.0 1.2 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 485.9 2.1 1.0X [info] java 1 1 0 486.8 2.1 1.0X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1843.0 0.5 1.0X [info] java 0 0 0 2690.6 0.4 1.5X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1214.7 0.8 1.0X [info] java 0 0 0 2536.8 0.4 2.1X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1895.9 0.5 1.0X [info] java 0 0 0 2961.1 0.3 1.6X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) 
Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1223.4 0.8 1.0X [info] java 0 0 0 3091.4 0.3 2.5X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 560 575 20 1787.1 0.6 1.0X [info] java 226 232 5 4432.4 0.2 2.5X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 570 586 23 1755.2 0.6 1.0X [info] java 227 232 4 4410.1 0.2 2.5X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 863 879 17 1158.4 0.9 1.0X [info] java 227 231 3 4407.9 0.2 3.8X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1282 1305 23 780.0 1.3 1.0X [info] java 227 232 4 4413.4 0.2 5.7X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 538 548 8 1858.6 0.5 1.0X [info] java 221 226 3 4521.1 0.2 2.4X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 549 558 10 1819.9 0.5 1.0X [info] java 222 229 7 4503.5 0.2 2.5X [info] 
[info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 838 852 12 1193.0 0.8 1.0X [info] java 222 229 5 4500.5 0.2 3.8X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 905 919 18 1104.8 0.9 1.0X [info] java 221 228 5 4521.3 0.2 4.1X ``` #### JDK11: ``` [info] OpenJDK 64-Bit Server VM 11.0.11+9-LTS on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.Java11BLAS [info] nativeBLAS = dev.ludovic.netlib.blas.Java11BLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 195 204 10 512.7 2.0 1.0X [info] java 195 202 7 512.4 2.0 1.0X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 108 113 4 923.3 1.1 1.0X [info] java 102 107 4 984.4 1.0 1.1X [info] [info] ddot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 107 110 3 938.1 1.1 1.0X [info] java 69 72 3 1447.1 0.7 1.5X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 98 2 
1046.5 1.0 1.0X [info] java 43 45 2 2317.1 0.4 2.2X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 155 168 8 644.2 1.6 1.0X [info] java 158 169 8 632.8 1.6 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 85 90 4 1178.1 0.8 1.0X [info] java 86 90 4 1167.7 0.9 1.0X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 1182.1 0.8 1.0X [info] java 0 0 0 1432.1 0.7 1.2X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 898.7 1.1 1.0X [info] java 1 1 0 891.5 1.1 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 495.4 2.0 1.0X [info] java 1 1 0 495.7 2.0 1.0X [info] [info] dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2271.6 0.4 1.0X [info] java 0 0 0 3648.1 0.3 1.6X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] 
f2j 1 1 0 1229.3 0.8 1.0X [info] java 0 0 0 2711.3 0.4 2.2X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2677.5 0.4 1.0X [info] java 0 0 0 3288.2 0.3 1.2X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1233.0 0.8 1.0X [info] java 0 0 0 2766.3 0.4 2.2X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 520 536 16 1923.6 0.5 1.0X [info] java 214 221 7 4669.5 0.2 2.4X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 593 612 17 1686.5 0.6 1.0X [info] java 215 219 3 4643.3 0.2 2.8X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 853 870 16 1172.8 0.9 1.0X [info] java 215 218 3 4659.7 0.2 4.0X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1350 1370 23 740.8 1.3 1.0X [info] java 215 219 4 4656.6 0.2 6.3X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] 
------------------------------------------------------------------------------------------------------------------------ [info] f2j 460 468 6 2173.2 0.5 1.0X [info] java 210 213 2 4752.7 0.2 2.2X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 535 544 8 1869.3 0.5 1.0X [info] java 210 215 5 4761.8 0.2 2.5X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 843 853 11 1186.8 0.8 1.0X [info] java 209 214 4 4793.4 0.2 4.0X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 891 904 15 1122.0 0.9 1.0X [info] java 209 214 4 4777.2 0.2 4.3X ``` #### JDK16: ``` [info] OpenJDK 64-Bit Server VM 16+36 on Linux 5.8.0-50-generic [info] Intel(R) Xeon(R) E-2276G CPU 3.80GHz [info] [info] f2jBLAS = dev.ludovic.netlib.blas.NetlibF2jBLAS [info] javaBLAS = dev.ludovic.netlib.blas.VectorizedBLAS [info] nativeBLAS = dev.ludovic.netlib.blas.VectorizedBLAS [info] [info] daxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 194 199 7 515.7 1.9 1.0X [info] java 181 186 3 551.1 1.8 1.1X [info] [info] saxpy: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 109 115 4 915.0 1.1 1.0X [info] java 88 92 3 1138.8 0.9 1.2X [info] [info] ddot: Best 
Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 108 110 2 922.6 1.1 1.0X [info] java 54 56 2 1839.2 0.5 2.0X [info] [info] sdot: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 96 97 2 1046.1 1.0 1.0X [info] java 29 30 1 3393.4 0.3 3.2X [info] [info] dscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 156 165 5 643.0 1.6 1.0X [info] java 150 159 5 667.1 1.5 1.0X [info] [info] sscal: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 85 91 6 1171.0 0.9 1.0X [info] java 75 79 3 1340.6 0.7 1.1X [info] [info] dspmv[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 917.0 1.1 1.0X [info] java 0 0 0 8147.2 0.1 8.9X [info] [info] dspr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 859.3 1.2 1.0X [info] java 1 1 0 859.3 1.2 1.0X [info] [info] dsyr[U]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 482.1 2.1 1.0X [info] java 1 1 0 482.6 2.1 1.0X [info] [info] 
dgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2214.2 0.5 1.0X [info] java 0 0 0 7975.8 0.1 3.6X [info] [info] dgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1231.4 0.8 1.0X [info] java 0 0 0 8680.9 0.1 7.0X [info] [info] sgemv[N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 0 0 0 2684.3 0.4 1.0X [info] java 0 0 0 18527.1 0.1 6.9X [info] [info] sgemv[T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1 1 0 1235.4 0.8 1.0X [info] java 0 0 0 17347.9 0.1 14.0X [info] [info] dgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 530 552 18 1887.5 0.5 1.0X [info] java 58 64 3 17143.9 0.1 9.1X [info] [info] dgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 598 620 17 1671.1 0.6 1.0X [info] java 58 64 3 17196.6 0.1 10.3X [info] [info] dgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 834 847 14 1199.4 0.8 1.0X [info] 
java 57 63 4 17486.9 0.1 14.6X [info] [info] dgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 1338 1366 22 747.3 1.3 1.0X [info] java 58 63 3 17356.6 0.1 23.2X [info] [info] sgemm[N,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 489 501 9 2045.5 0.5 1.0X [info] java 36 38 2 27721.9 0.0 13.6X [info] [info] sgemm[N,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 478 488 9 2094.0 0.5 1.0X [info] java 36 38 2 27813.2 0.0 13.3X [info] [info] sgemm[T,N]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 825 837 10 1211.6 0.8 1.0X [info] java 35 38 2 28433.1 0.0 23.5X [info] [info] sgemm[T,T]: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] f2j 900 918 15 1111.6 0.9 1.0X [info] java 36 38 2 28073.0 0.0 25.3X ``` [2] https://github.com/luhenry/netlib/tree/master/blas/src/test/java/dev/ludovic/netlib/blas Closes #32253 from luhenry/master. Authored-by: Ludovic Henry <git@ludovic.dev> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-04-27 14:00:59 -05:00
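The `Relative` column in the benchmark output above appears consistent with each row's `Rate(M/s)` divided by the `f2j` baseline rate. A minimal sketch that checks this reading against three JDK16 rows copied from the output; the helper name `relative` and the dictionary layout are illustrative assumptions, not part of the benchmark code:

```python
# (f2j rate, java rate) in M/s, copied from the JDK16 results above.
jdk16_rates = {
    "dspmv[U]": (917.0, 8147.2),
    "dgemm[T,T]": (747.3, 17356.6),
    "sgemm[T,T]": (1111.6, 28073.0),
}


def relative(baseline_rate, rate):
    """Speedup over the baseline, rounded to one decimal like the report."""
    return round(rate / baseline_rate, 1)


speedups = {name: relative(f2j, java) for name, (f2j, java) in jdk16_rates.items()}
for name, value in speedups.items():
    print(f"{name}: {value}X")
```

The computed values match the reported 8.9X, 23.2X and 25.3X for these rows.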
Julien Lafaye	592230e47b	[MINOR][DOCS][ML] Explicit return type of array_to_vector utility function There are two types of dense vectors: * pyspark.ml.linalg.DenseVector * pyspark.mllib.linalg.DenseVector In spark-3.1.1, array_to_vector returns instances of pyspark.ml.linalg.DenseVector. The documentation is ambiguous & can lead to the false conclusion that instances of pyspark.mllib.linalg.DenseVector will be returned. Conversion from ml versions to mllib versions can easily be achieved with the MLUtils.convertVectorColumnsFromML helper. ### What changes were proposed in this pull request? Make documentation more explicit ### Why are the changes needed? The documentation is a bit misleading and users can lose time investigating & realizing there are two DenseVector types. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No tests were run as only the documentation was changed Closes #32255 from jlafaye/master. Authored-by: Julien Lafaye <jlafaye@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-04-27 09:08:26 -05:00
Ruifeng Zheng	1f150b9392	[SPARK-35024][ML] Refactor LinearSVC - support virtual centering ### What changes were proposed in this pull request? 1. Remove the existing agg and use a new agg supporting virtual centering. 2. Add related test suites. ### Why are the changes needed? Centering vectors should accelerate convergence and produce a solution closer to R's. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Updated existing test suites and added new ones. Closes #32124 from zhengruifeng/svc_agg_refactor. Authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2021-04-25 13:16:46 +08:00
Xinrong Meng	4fcbf59079	[SPARK-35040][PYTHON] Remove Spark-version related codes from test codes ### What changes were proposed in this pull request? Removes PySpark version dependent codes from pyspark.pandas test codes. ### Why are the changes needed? There are several places to check the PySpark version and switch the logic, but now those are not necessary. We should remove them. We will do the same thing after we finish porting tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32300 from xinrong-databricks/port.rmv_spark_version_chk_in_tests. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-04-22 18:01:07 -07:00
Xinrong Meng	4d2b559d92	[SPARK-34999][PYTHON] Consolidate PySpark testing utils ### What changes were proposed in this pull request? Consolidate PySpark testing utils by removing `python/pyspark/pandas/testing`, and then creating a file `pandasutils` under `python/pyspark/testing` for test utilities used in `pyspark/pandas`. ### Why are the changes needed? `python/pyspark/pandas/testing` holds test utilities for pandas-on-spark, and `python/pyspark/testing` contains test utilities for pyspark. Consolidating them makes code cleaner and easier to maintain. Updated import statements are as shown below: - from pyspark.testing.sqlutils import SQLTestUtils - from pyspark.testing.pandasutils import PandasOnSparkTestCase, TestUtils (PandasOnSparkTestCase is the original ReusedSQLTestCase in `python/pyspark/pandas/testing/utils.py`) Minor improvements include: - Usage of missing library's requirement_message - `except ImportError` rather than `except` - import pyspark.pandas alias as `ps` rather than `pp` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit tests under python/pyspark/pandas/tests. Closes #32177 from xinrong-databricks/port.merge_utils. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-04-22 13:07:35 -07:00
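The `except ImportError` rather than `except` improvement in the commit above can be illustrated with a small, self-contained sketch. The helper `check_optional_dependency` and the module name `hypothetical_missing_module_xyz` are made up for illustration and are not taken from `pyspark.testing`; only the general pattern (catch the import failure specifically and keep a requirement message for skipped tests) follows the commit description:

```python
import importlib


def check_optional_dependency(module_name):
    """Return (available, requirement_message) for an optional module."""
    try:
        importlib.import_module(module_name)
    except ImportError as e:
        # Only a failed import counts as "dependency missing". Any other
        # exception propagates instead of being hidden by a bare `except`.
        return False, f"{module_name} is required for these tests: {e}"
    return True, None


# `json` ships with Python, so it is always importable.
have_json, json_requirement_message = check_optional_dependency("json")

# A deliberately non-existent module stands in for a missing optional library.
have_missing, missing_requirement_message = check_optional_dependency(
    "hypothetical_missing_module_xyz"
)
```

A test suite could then pass `missing_requirement_message` as the reason to `unittest.skipIf(not have_missing, ...)`, so the skip output states which library is absent.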
harupy	b6350f5bb0	[SPARK-35142][PYTHON][ML] Fix incorrect return type for `rawPredictionUDF` in `OneVsRestModel` ### What changes were proposed in this pull request? Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`. ### Why are the changes needed? Bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #32245 from harupy/SPARK-35142. Authored-by: harupy <17039389+harupy@users.noreply.github.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2021-04-21 16:29:10 +08:00
itholic	91bd38467e	[SPARK-34995] Port/integrate Koalas remaining codes into PySpark ### What changes were proposed in this pull request? There are some more changes in Koalas such as [databricks/koalas#2141](`c8f803d6be`), [databricks/koalas#2143](`913d68868d`) after the main code porting, this PR is to synchronize those changes with the `pyspark.pandas`. ### Why are the changes needed? We should port the whole Koalas codes into PySpark and synchronize them. ### Does this PR introduce _any_ user-facing change? Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring. ### How was this patch tested? Manually tested in local. Closes #32197 from itholic/SPARK-34995-fix. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-16 17:42:03 +09:00
Xinrong Meng	4aee19efb4	[SPARK-35032][PYTHON] Port Koalas Index unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Index unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable the Index unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable Index unit tests. Closes #32139 from xinrong-databricks/port.indexes_tests. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-16 08:53:30 +09:00
HyukjinKwon	637f59360b	Revert "[SPARK-34995] Port/integrate Koalas remaining codes into PySpark" This reverts commit `9689c44b60`.	2021-04-15 21:01:47 +09:00
itholic	9689c44b60	[SPARK-34995] Port/integrate Koalas remaining codes into PySpark ### What changes were proposed in this pull request? There are some more changes in Koalas such as [databricks/koalas#2141](`c8f803d6be`), [databricks/koalas#2143](`913d68868d`) after the main code porting, this PR is to synchronize those changes with the `pyspark.pandas`. ### Why are the changes needed? We should port the whole Koalas codes into PySpark and synchronize them. ### Does this PR introduce _any_ user-facing change? Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring. ### How was this patch tested? Manually tested in local. Closes #32154 from itholic/SPARK-34995. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-15 19:13:08 +09:00
HyukjinKwon	7ff9d2e3ee	[SPARK-35071][PYTHON] Rename Koalas to pandas-on-Spark in main codes ### What changes were proposed in this pull request? This PR proposes to rename Koalas to pandas-on-Spark in main codes ### Why are the changes needed? To have the correct name in PySpark. NOTE that the official name in the main documentation will be pandas APIs on Spark to be extra clear. pandas-on-Spark is not the official term. ### Does this PR introduce _any_ user-facing change? No, it's master-only change. It changes the docstring and class names. ### How was this patch tested? Manually tested via: ```bash ./python/run-tests --python-executable=python3 --modules pyspark-pandas ``` Closes #32166 from HyukjinKwon/rename-koalas. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-15 12:48:59 +09:00
xinrong-databricks	58feb85145	[SPARK-35034][PYTHON] Port Koalas miscellaneous unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas miscellaneous unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable miscellaneous unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable miscellaneous unit tests. Closes #32152 from xinrong-databricks/port.misc_tests. Lead-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Co-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-15 11:45:15 +09:00
Yikun Jiang	31555f7779	[SPARK-34630][PYTHON][FOLLOWUP] Add __version__ into pyspark init __all__ ### What changes were proposed in this pull request? This patch adds `__version__` to pyspark.__init__.__all__ so that `__version__` is exported explicitly; see more in https://github.com/apache/spark/pull/32110#issuecomment-817331896 ### Why are the changes needed? 1. Make `__version__` exported explicitly 2. Clean up `noqa: F401` on `__version__` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Python related CI passed Closes #32125 from Yikun/SPARK-34629-Follow. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: zero323 <mszymkiewicz@gmail.com>	2021-04-14 23:36:25 +02:00
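Why listing `__version__` in `__all__` matters can be shown without PySpark: a star import skips underscore-prefixed names unless `__all__` names them, and linters such as pyflakes treat names listed in `__all__` as used. The sketch below builds two throwaway modules in memory; the module names `demo_without_version` and `demo_with_version` and the helper `make_module` are hypothetical and exist only for this illustration:

```python
import sys
import types


def make_module(name, export_version):
    """Register an in-memory module that optionally exports __version__."""
    module = types.ModuleType(name)
    module.__version__ = "0.0.0"
    module.SparkLike = object  # stand-in for a public class
    module.__all__ = ["SparkLike"] + (["__version__"] if export_version else [])
    sys.modules[name] = module
    return module


make_module("demo_without_version", export_version=False)
make_module("demo_with_version", export_version=True)

# Run the star imports in fresh namespaces to see which names they bind.
ns_without = {}
exec("from demo_without_version import *", ns_without)
ns_with = {}
exec("from demo_with_version import *", ns_with)

print("__version__" in ns_without)  # False: not listed in __all__
print("__version__" in ns_with)     # True: exported explicitly
```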
Takuya UESHIN	4ae57d5b3a	[SPARK-35039][PYTHON] Remove PySpark version dependent codes ### What changes were proposed in this pull request? Removes PySpark version dependent codes from `pyspark.pandas` main codes. ### Why are the changes needed? There are several places to check the PySpark version and switch the logic, but now those are not necessary. We should remove them. We will do the same thing after we finish porting tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32138 from ueshin/issues/SPARK-35039/pyspark_version. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 14:30:48 +09:00
Xinrong Meng	47d62af2a9	[SPARK-35035][PYTHON] Port Koalas internal implementation unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas internal implementation unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable the internal implementation unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable internal implementation unit tests. Closes #32137 from xinrong-databricks/port.test_internal_impl. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 13:59:33 +09:00
Xinrong Meng	cd1e8e8158	[SPARK-35033][PYTHON] Port Koalas plot unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas plot unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable the plot unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable plot unit tests. Closes #32151 from xinrong-databricks/port.plot_tests. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 13:20:16 +09:00
Alex Mooney	faa928cefc	[MINOR][PYTHON][DOCS] Fix docstring for pyspark.sql.DataFrameWriter.json lineSep param ### What changes were proposed in this pull request? Add a new line to the `lineSep` parameter so that the doc renders correctly. ### Why are the changes needed? > <img width="608" alt="image" src="https://user-images.githubusercontent.com/8269566/114631408-5c608900-9c71-11eb-8ded-ae1e21ae48b2.png"> The first line of the description is part of the signature and is bolded. ### Does this PR introduce _any_ user-facing change? Yes, it changes how the docs for `pyspark.sql.DataFrameWriter.json` are rendered. ### How was this patch tested? I didn't test it; I don't have the doc rendering tool chain on my machine, but the change is obvious. Closes #32153 from AlexMooney/patch-1. Authored-by: Alex Mooney <alexmooney@fastmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-14 13:14:51 +09:00
Xinrong Meng	8ebc3fca8c	[SPARK-35012][PYTHON] Port Koalas DataFrame-related unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame-related unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not fully tested. We should enable the DataFrame-related unit tests first. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable DataFrame-related unit tests. Closes #32131 from xinrong-databricks/port.test_dataframe_related. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: Takuya UESHIN <ueshin@databricks.com>	2021-04-13 14:24:08 -07:00
Xinrong Meng	a392633566	[SPARK-34996][PYTHON] Port Koalas Series-related unit tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Series related unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not fully tested. We should enable the Series related unit tests first. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable Series-related unit tests. Closes #32117 from xinrong-databricks/port.test_series_related. Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com> Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-13 13:03:35 +09:00
Xinrong Meng	9c1f807549	[SPARK-35031][PYTHON] Port Koalas operations on different frames tests into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas operations on different frames unit tests to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested fully. We should enable the operations on different frames unit tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable operations on different frames unit tests. Closes #32133 from xinrong-databricks/port.test_ops_on_diff_frames. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-13 11:22:51 +09:00
Yikun Jiang	b43f7e6a97	[SPARK-35019][PYTHON][SQL] Fix type hints mismatches in pyspark.sql.* ### What changes were proposed in this pull request? Fix type hints mismatches in pyspark.sql.* ### Why are the changes needed? There were some mismatches in pyspark.sql.* ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? dev/lint-python passed. Closes #32122 from Yikun/SPARK-35019. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-13 11:21:13 +09:00
Luka Sturtewagen	fd8081cd27	[SPARK-34983][PYTHON] Renaming the package alias from pp to ps ### What changes were proposed in this pull request? This PR proposes to fix: ```python import pyspark.pandas as pp ``` to ```python import pyspark.pandas as ps ``` ### Why are the changes needed? `pp` might sound offensive in some contexts. ### Does this PR introduce _any_ user-facing change? The change is in master only. We'll use `ps` as the short name instead of `pp`. ### How was this patch tested? The CI in this PR will test it out. Closes #32108 from LSturtew/renaming_pyspark.pandas. Authored-by: Luka Sturtewagen <luka.sturtewagen@linkit.nl> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-12 11:18:08 +09:00
Takuya UESHIN	ff1fc5ed4b	[SPARK-34972][PYTHON][TEST][FOLLOWUP] Fix pyspark.pandas doctests which could be flaky ### What changes were proposed in this pull request? This is a follow-up of #32069. It skips some doctests that could be flaky. ### Why are the changes needed? Some doctests in the `pyspark.pandas` module enabled in #32069 could be flaky because the result row order is nondeterministic. - groupby-apply with a UDF that has a return type annotation will lose its index. - `Index.symmetric_difference` uses `DataFrame.intersect` and `subtract` internally. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #32116 from ueshin/issues/SPARK-34972/fix_flaky_tests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-11 10:42:00 +09:00
Yikun Jiang	4c1ccdabe8	[SPARK-34630][PYTHON] Add typehint for pyspark.__version__ ### What changes were proposed in this pull request? This PR adds the typehint of pyspark.__version__, which was mentioned in [SPARK-34630](https://issues.apache.org/jira/browse/SPARK-34630). ### Why are the changes needed? There was a short discussion in https://github.com/apache/spark/pull/31823#discussion_r593830911 . After further investigation of [1][2], we can see that `pyspark.__version__` is added by [setup.py](`c06758834e/python/setup.py (L201)`), which embeds `__version__` into the pyspark module; that means `__init__.pyi` is the right place to add the typehint for `__version__`. So, this patch adds the type hint for `__version__` in pyspark/__init__.pyi. [1] [PEP-396 Module Version Numbers](https://www.python.org/dev/peps/pep-0396/) [2] https://packaging.python.org/guides/single-sourcing-package-version/ ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? 1. Disable the ignore_error on `ee7bf7d962/python/mypy.ini (L132)` 2. Run mypy: - Before fix ```shell (venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark \| grep version python/pyspark/pandas/spark/accessors.py:884: error: Module has no attribute "__version__" ``` - After fix ```shell (venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark \| grep version ``` no output Closes #32110 from Yikun/SPARK-34629. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-11 10:40:08 +09:00
Xinrong Meng	3af2c1bb9c	[SPARK-34886][PYTHON] Port/integrate Koalas DataFrame unit test into PySpark ### What changes were proposed in this pull request? Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame unit test to PySpark. ### Why are the changes needed? Currently, the pandas-on-Spark modules are not tested at all. We should enable the DataFrame unit test first. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enable the DataFrame unit test. Closes #32083 from xinrong-databricks/port.test_dataframe. Authored-by: Xinrong Meng <xinrong.meng@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-09 15:48:13 +09:00
Takuya UESHIN	2635c3894f	[SPARK-34972][PYTHON] Make pandas-on-Spark doctests work ### What changes were proposed in this pull request? Now that we merged the Koalas main code into PySpark code base (#32036), we should enable doctests on the Spark's infrastructure. ### Why are the changes needed? Currently the pandas-on-Spark modules are not tested at all. We should enable doctests first, and we will port other unit tests separately later. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Enabled the whole doctests. Closes #32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-07 20:50:41 +09:00
Yikun Jiang	390d5bde81	[SPARK-34968][TEST][PYTHON] Add the `-fr` argument to xargs rm ### What changes were proposed in this pull request? This patch adds the `-fr` argument to `xargs rm`. ### Why are the changes needed? The command fails in the basic case: if the find command does not return any search results, the rm command is invoked with an empty argument list, fails with `rm: missing operand`, and the coverage report is not generated. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? python/run-tests-with-coverage --testnames pyspark.sql.tests.test_arrow --python-executables=python The coverage report is generated without interruption. Closes #32064 from Yikun/patch-1. Authored-by: Yikun Jiang <yikunkero@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-04-06 15:20:55 -07:00
itholic	caf04f9b77	[SPARK-34890][PYTHON] Port/integrate Koalas main codes into PySpark ### What changes were proposed in this pull request? As a first step of [SPARK-34849](https://issues.apache.org/jira/browse/SPARK-34849), this PR proposes porting the Koalas main code into PySpark. This PR contains minimal changes to the existing Koalas code as follows: 1. `databricks.koalas` -> `pyspark.pandas` 2. `from databricks import koalas as ks` -> `from pyspark import pandas as pp` 3. `ks.xxx -> pp.xxx` Other than them: 1. Added a line to `python/mypy.ini` in order to ignore the mypy test. See related issue at [SPARK-34941](https://issues.apache.org/jira/browse/SPARK-34941). 2. Added a comment to several lines in several files to ignore the flake8 F401. See related issue at [SPARK-34943](https://issues.apache.org/jira/browse/SPARK-34943). When this PR is merged, all the features that were previously used in [Koalas](https://github.com/databricks/koalas) will be available in PySpark as well. Users can access to the pandas API in PySpark as below: ```python >>> from pyspark import pandas as pp >>> ppdf = pp.DataFrame({"A": [1, 2, 3], "B": [15, 20, 25]}) >>> ppdf A B 0 1 15 1 2 20 2 3 25 ``` The existing "options and settings" in Koalas are also available in the same way: ```python >>> from pyspark.pandas.config import set_option, reset_option, get_option >>> ppser1 = pp.Series([1, 2, 3]) >>> ppser2 = pp.Series([3, 4, 5]) >>> ppser1 + ppser2 Traceback (most recent call last): ... ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option. >>> set_option("compute.ops_on_diff_frames", True) >>> ppser1 + ppser2 0 4 1 6 2 8 dtype: int64 ``` Please also refer to the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html) and [Options and Settings](https://koalas.readthedocs.io/en/latest/user_guide/options.html) for more detail. 
NOTE that this PR intentionally ports the main codes of Koalas first almost as are with minimal changes because: - Koalas project is fairly large. Making some changes together for PySpark will make it difficult to review the individual change. Koalas dev includes multiple Spark committers who will review. By doing this, the committers will be able to more easily and effectively review and drive the development. - Koalas tests and documentation require major changes to make it look great together with PySpark whereas main codes do not require. - We lately froze the Koalas codebase, and plan to work together on the initial porting. By porting the main codes first as are, it unblocks the Koalas dev to work on other items in parallel. I promise and will make sure on: - Rename Koalas to PySpark pandas APIs and/or pandas-on-Spark accordingly in documentation, and the docstrings and comments in the main codes. - Triage APIs to remove that don’t make sense when Koalas is in PySpark The documentation changes will be tracked in [SPARK-34885](https://issues.apache.org/jira/browse/SPARK-34885), the test code changes will be tracked in [SPARK-34886](https://issues.apache.org/jira/browse/SPARK-34886). ### Why are the changes needed? Please refer to: - [[DISCUSS] Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html) - [[VOTE] SPIP: Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html) ### Does this PR introduce _any_ user-facing change? Yes, now users can use the pandas APIs on Spark ### How was this patch tested? Manually tested for exposed major APIs and options as described above. 
### Koalas contributors Koalas would not have been possible without the following contributors: ueshin HyukjinKwon rxin xinrong-databricks RainFung charlesdong1991 harupy floscha beobest2 thunterdb garawalid LucasG0 shril deepyaman gioa fwani 90jam thoo AbdealiJK abishekganesh72 gliptak DumbMachine dvgodoy stbof nitlev hjoo gatorsmile tomspur icexelloss awdavidson guyao akhilputhiry scook12 patryk-oleniuk tracek dennyglee athena15 gstaubli WeichenXu123 hsubbaraj lfdversluis ktksq shengjh margaret-databricks LSturtew sllynn manuzhang jijosg sadikovi Closes #32036 from itholic/SPARK-34890. Authored-by: itholic <haejoon.lee@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-04-06 12:42:39 +09:00
HyukjinKwon	2ca76a57be	[MINOR][DOCS] Use ASCII characters when possible in PySpark documentation ### What changes were proposed in this pull request? This PR replaces non-ASCII characters with ASCII characters where possible in the PySpark documentation. ### Why are the changes needed? To avoid unnecessary non-ASCII characters, which could lead to issues such as https://github.com/apache/spark/pull/32047 or https://github.com/apache/spark/pull/22782 ### Does this PR introduce _any_ user-facing change? Virtually no. ### How was this patch tested? Found via (Mac OS): ```bash # In Spark root directory cd python pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .` ``` Closes #32048 from HyukjinKwon/minor-fix. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Max Gekk <max.gekk@gmail.com>	2021-04-04 09:49:36 +03:00
David Li	1237124062	[SPARK-34463][PYSPARK][DOCS] Document caveats of Arrow selfDestruct ### What changes were proposed in this pull request? As a followup for #29818, document caveats of using the Arrow selfDestruct option in toPandas, which include: - toPandas() may be slower; - the resulting dataframe may not support some Pandas operations due to immutable backing arrays. ### Why are the changes needed? This will hopefully reduce user confusion as with SPARK-34463. ### Does this PR introduce _any_ user-facing change? Yes - documentation is updated and a config setting description is updated to clearly indicate the config is experimental. ### How was this patch tested? This is a documentation-only change. Closes #31738 from lidavidm/spark-34463. Authored-by: David Li <li.davidm96@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-30 13:30:27 +09:00
Kousuke Saruta	14c7bb877d	[SPARK-34872][SQL] quoteIfNeeded should quote a name which contains non-word characters ### What changes were proposed in this pull request? This PR fixes an issue that `quoteIfNeeded` quotes a name only if it contains `.` or ``` ` ```. This method should quote it if it contains non-word characters. ### Why are the changes needed? It's a potential bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #31964 from sarutak/fix-quoteIfNeeded. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-03-29 09:31:24 +00:00
Danny Meijer	ad211ccd9d	[SPARK-34630][PYTHON][SQL] Added typehint for pyspark.sql.Column.contains ### What changes were proposed in this pull request? This PR implements the missing typehints as per SPARK-34630. ### Why are the changes needed? To satisfy the aforementioned Jira ticket ### Does this PR introduce _any_ user-facing change? No, just adding a missing typehint for Project Zen ### How was this patch tested? No tests needed (just adding a typehint) Closes #31823 from dannymeijer/feature/SPARK-34630. Authored-by: Danny Meijer <danny.meijer@nike.com> Signed-off-by: zero323 <mszymkiewicz@gmail.com>	2021-03-24 15:21:19 +01:00
John Ayad	ddfc75ec64	[SPARK-34803][PYSPARK] Pass the raised ImportError if pandas or pyarrow fail to import ### What changes were proposed in this pull request? Pass the raised `ImportError` on failing to import pandas/pyarrow. This will help the user identify whether pandas/pyarrow are indeed not in the environment or if they threw a different `ImportError`. ### Why are the changes needed? This can already happen in Pandas for example where it could throw an `ImportError` on its initialisation path if `dateutil` doesn't satisfy a certain version requirement https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438 ### Does this PR introduce _any_ user-facing change? Yes, it will now show the root cause of the exception when pandas or arrow is missing during import. ### How was this patch tested? Manually tested. ```python from pyspark.sql.functions import pandas_udf spark.range(1).select(pandas_udf(lambda x: x)) ``` Before: ``` Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf require_minimum_pyarrow_version() File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version raise ImportError("PyArrow >= %s must be installed; however, " ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found. 
``` After: ``` Traceback (most recent call last): File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version import pyarrow ModuleNotFoundError: No module named 'pyarrow' The above exception was the direct cause of the following exception: Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf require_minimum_pyarrow_version() File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version raise ImportError("PyArrow >= %s must be installed; however, " ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found. ``` Closes #31902 from johnhany97/jayad/spark-34803. Lead-authored-by: John Ayad <johnhany97@gmail.com> Co-authored-by: John H. Ayad <johnhany97@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-22 23:29:28 +09:00
HyukjinKwon	c7bf8adc38	[SPARK-34818][PYTHON][DOCS] Reorder the items in User Guide at PySpark documentation ### What changes were proposed in this pull request? This PR proposes to reorder the items in the User Guide in the PySpark documentation in order to place general guides first and advanced ones later. ### Why are the changes needed? To make the guide easier for users to follow. ### Does this PR introduce _any_ user-facing change? Yes, it changes the order of the items in the documentation. ### How was this patch tested? Manually verified the documentation after building: <img width="768" alt="Screen Shot 2021-03-22 at 2 38 41 PM" src="https://user-images.githubusercontent.com/6477701/111945072-5537d680-8b1c-11eb-9f43-02f3ad63a509.png"> FWIW, the current page: https://spark.apache.org/docs/latest/api/python/user_guide/index.html Closes #31922 from HyukjinKwon/SPARK-34818. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-22 15:53:39 +09:00
Sean Owen	ed641fbad6	[MINOR][DOCS][ML] Doc 'mode' as a supported Imputer strategy in Pyspark ### What changes were proposed in this pull request? Document `mode` as a supported Imputer strategy in Pyspark docs. ### Why are the changes needed? Support was added in 3.1, and documented in Scala, but some Python docs were missed. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. Closes #31883 from srowen/ImputerModeDocs. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-03-20 01:16:49 -05:00
Kousuke Saruta	03dd33cc98	[SPARK-25769][SPARK-34636][SPARK-34626][SQL] sql method in UnresolvedAttribute, AttributeReference and Alias don't quote qualified names properly ### What changes were proposed in this pull request? This PR fixes an issue where the `sql` method in the following classes, which take qualified names, doesn't quote the qualified names properly. * UnresolvedAttribute * AttributeReference * Alias One instance caused by this issue is reported in SPARK-34626. ``` UnresolvedAttribute("a" :: "b" :: Nil).sql `a.b` // expected: `a`.`b` ``` Other instances are as follows. ``` UnresolvedAttribute("a`b"::"c.d"::Nil).sql a`b.`c.d` // expected: `a``b`.`c.d` AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql c.d.`a.b` // expected: `c.d`.`a.b` Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql `a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c` ``` ### Why are the changes needed? This is a bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. Closes #31754 from sarutak/fix-qualified-names. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-03-12 02:58:46 +00:00
wankunde	60e324aa9f	[SPARK-34688][PYTHON] Upgrade to Py4J 0.10.9.2 ### What changes were proposed in this pull request? This PR upgrades Py4J from 0.10.9.1 to 0.10.9.2, which contains some bug fixes and improvements. * expose shell parameter in Popen inside launch_gateway. ([bartdag/py4j220efc3](`220efc3716`)) * fixed Flake8 errors ([bartdag/py4j6c6ee9a](`6c6ee9aedc`)) ### Why are the changes needed? To leverage fixes from the upstream in Py4J. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Jenkins build and GitHub Actions will test it out. Closes #31796 from wankunde/py4j. Authored-by: wankunde <wankunde@163.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-03-11 09:51:41 -06:00
HyukjinKwon	2526fdea48	[SPARK-34657][PYTHON][DOCS] Replace the tag of release to the hash to hide RC tags in Binder ### What changes were proposed in this pull request? Currently the Binder link at Spark 3.1.1 (https://mybinder.org/v2/gh/apache/spark/v3.1.1-rc3?filepath=python%2Fdocs%2Fsource%2Fgetting_started%2Fquickstart.ipynb) shows `v3.1.1-rc3` like: ![Screen Shot 2021-03-08 at 10 10 55 AM](https://user-images.githubusercontent.com/6477701/110262729-ecb70880-7ff7-11eb-92ba-f151d74985a6.png) After the fix, it will show the explicit hash: ![Screen Shot 2021-03-08 at 10 17 25 AM](https://user-images.githubusercontent.com/6477701/110262740-f476ad00-7ff7-11eb-8632-5b418ff87024.png) In addition, this also fixes the examples URL. For example: https://github.com/apache/spark/tree/v3.1.1-rc3/examples/src/main/python -> https://github.com/apache/spark/tree/1d550c4e902/examples/src/main/python Note that it is a hash in order to make both dev and release easier. ### Why are the changes needed? To hide RC tags. ### Does this PR introduce _any_ user-facing change? It will just change the URL shown when Binder is being loaded. ### How was this patch tested? Manually tested: ```bash make clean html ``` ![Screen Shot 2021-03-08 at 10 17 06 AM](https://user-images.githubusercontent.com/6477701/110262813-2ee04a00-7ff8-11eb-9983-c4484f7832c4.png) ```bash git_hash=`git rev-parse --short HEAD` export GIT_HASH=$git_hash make clean html ``` ![Screen Shot 2021-03-08 at 10 17 25 AM](https://user-images.githubusercontent.com/6477701/110262805-2982ff80-7ff8-11eb-8560-e1e2aa7b263a.png) Closes #31773 from HyukjinKwon/SPARK-34657. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-08 10:48:17 +09:00
Peter Toth	ab8a9a0ceb	[SPARK-34545][SQL] Fix issues with valueCompare feature of pyrolite ### What changes were proposed in this pull request? pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122 This change has undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that ``` new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType)))) ``` and ``` new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType)))) ``` are currently equal and the second instance is replaced to the short code of the first one during serialization. ### Why are the changes needed? 
The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description: ``` >>> from pyspark.sql.functions import udf >>> from pyspark.sql.types import * >>> >>> def udf1(data_type): def u1(e): return e[0] return udf(u1, data_type) >>> >>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2']) >>> >>> df = df.withColumn("c3", udf1(DoubleType())("c1")) >>> df = df.withColumn("c4", udf1(IntegerType())("c2")) >>> >>> df.select("c3").show() +---+ \| c3\| +---+ \|1.0\| +---+ >>> df.select("c4").show() +---+ \| c4\| +---+ \| 1\| +---+ >>> df.select("c3", "c4").show() +---+----+ \| c3\| c4\| +---+----+ \|1.0\|null\| +---+----+ ``` This is because during serialization from JVM to Python `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced to some short code of the `c1` (instead of serializing `c2` out) as they are `equal()`. The python functions then runs but the return type of `c4` is expected to be `IntegerType` and if a different type (`DoubleType`) comes back from python then it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113 After this PR: ``` >>> df.select("c3", "c4").show() +---+---+ \| c3\| c4\| +---+---+ \|1.0\| 1\| +---+---+ ``` ### Does this PR introduce _any_ user-facing change? Yes, fixes a correctness issue. ### How was this patch tested? Added new UT + manual tests. Closes #31682 from peter-toth/SPARK-34545-fix-row-comparison. Authored-by: Peter Toth <peter.toth@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-03-07 19:12:42 -06:00
Sean Owen	2f30cdebb1	[SPARK-34642][DOCS][ML] Fix TypeError in Pyspark Linear Regression docs ### What changes were proposed in this pull request? Fix a call to setParams in the Linear Regression docs example in Pyspark to avoid a TypeError. ### Why are the changes needed? The example is slightly wrong and we should not show an error in the docs. ### Does this PR introduce _any_ user-facing change? None ### How was this patch tested? Existing tests Closes #31760 from srowen/SPARK-34642. Authored-by: Sean Owen <srowen@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-03-06 07:32:01 -08:00
Takuya UESHIN	331d459ee7	[SPARK-34610][PYTHON][TEST] Fix Python UDF used in GroupedAggPandasUDFTests ### What changes were proposed in this pull request? Fixes a Python UDF `plus_one` used in `GroupedAggPandasUDFTests` to always return float (double) values. ### Why are the changes needed? The Python UDF `plus_one` used in `GroupedAggPandasUDFTests` always returns `v + 1` regardless of the input type. The return type of the UDF is 'double', so if the input is int, the result will be `null`. ```py >>> df = spark.range(10).toDF('id') \ ... .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \ ... .withColumn("v", explode(col('vs'))) \ ... .drop('vs') \ ... .withColumn('w', lit(1.0)) >>> @udf('double') ... def plus_one(v): ... assert isinstance(v, (int, float)) ... return v + 1 ... >>> @pandas_udf('double', PandasUDFType.GROUPED_AGG) ... def sum_udf(v): ... return v.sum() ... >>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).show() +------------+----------+ \|plus_one(id)\|sum_udf(v)\| +------------+----------+ \| null\| 2900.0\| +------------+----------+ ``` This is meaningless and should be: ```py >>> @udf('double') ... def plus_one(v): ... assert isinstance(v, (int, float)) ... return float(v + 1) ... >>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).sort('plus_one(id)').show() +------------+----------+ \|plus_one(id)\|sum_udf(v)\| +------------+----------+ \| 1.0\| 245.0\| \| 2.0\| 255.0\| \| 3.0\| 265.0\| \| 4.0\| 275.0\| \| 5.0\| 285.0\| \| 6.0\| 295.0\| \| 7.0\| 305.0\| \| 8.0\| 315.0\| \| 9.0\| 325.0\| \| 10.0\| 335.0\| +------------+----------+ ``` ### Does this PR introduce _any_ user-facing change? No, test-only. ### How was this patch tested? Fixed the test. Closes #31730 from ueshin/issues/SPARK-34610/test_pandas_udf_grouped_agg. Authored-by: Takuya UESHIN <ueshin@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-03-04 10:03:54 +09:00
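
Illustration for the SPARK-34803 entry in the log above: that commit passes the originally raised `ImportError` along when pandas or pyarrow fails to import, so the traceback shows the root cause. The following is a minimal sketch of the same idea using standard Python exception chaining. The helper name `require_optional_module` and its message text are hypothetical and are not PySpark's actual code; only `importlib.import_module` and `raise ... from ...` are standard Python.

```python
import importlib


def require_optional_module(name: str):
    """Import `name`, chaining the original error if the import fails.

    Hypothetical helper for illustration only; not part of PySpark.
    """
    try:
        return importlib.import_module(name)
    except ImportError as error:
        # `from error` stores the original exception in __cause__, so the
        # traceback prints the root failure first, followed by "The above
        # exception was the direct cause of the following exception".
        raise ImportError(
            "%s must be installed; however, it was not found." % name
        ) from error


if __name__ == "__main__":
    try:
        require_optional_module("surely_missing_module_for_demo")
    except ImportError as exc:
        # The root cause stays reachable for callers and for the traceback.
        print(type(exc.__cause__).__name__)
```

Without `from error`, a caller would see only the generic "must be installed" message and could not tell a missing package from a package that failed during its own initialisation, which is the confusion the commit describes.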

2919 commits