### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/window.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more `disallow_untyped_defs` mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in the pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32886 from pingsutw/SPARK-35478.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
- Introduce a DecimalOps for DecimalType
- Make `isnull` method data-type-based
### Why are the changes needed?
Now DecimalType, DoubleType, and FloatType data share the FractionalOps class, but DecimalType behaves differently from FloatType and DoubleType (see https://github.com/apache/spark/blob/master/python/pyspark/pandas/base.py#L987-L990), so we propose to introduce DecimalOps. The behavior difference here is caused by the fact that DecimalType cannot have NaN.
https://issues.apache.org/jira/browse/SPARK-35342
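For illustration, a minimal sketch of the difference (class and accessor names follow the `pyspark.pandas.data_type_ops` layout, but the bodies here are simplified assumptions, not the actual implementation):
```python
from pyspark.sql import functions as F

class NumericOps:  # stand-in for the real base class
    pass

class FractionalOps(NumericOps):
    def isnull(self, index_ops):
        # float/double columns can hold both null and NaN
        scol = index_ops.spark.column
        return index_ops._with_new_scol(scol.isNull() | F.isnan(scol))

class DecimalOps(FractionalOps):
    def isnull(self, index_ops):
        # DecimalType cannot represent NaN, so a plain null check suffices
        return index_ops._with_new_scol(index_ops.spark.column.isNull())
```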
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Newly added DecimalOpsTest passed
- Existing NumOpsTest passed
Closes #32821 from Yikun/SPARK-35342.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/accessors.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more `disallow_untyped_defs` mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in the pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32956 from ueshin/issues/SPARK-35469/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
PySpark added pinned thread mode at https://github.com/apache/spark/pull/24898 to sync Python threads to JVM threads. Previously, one JVM thread could be reused by multiple Python threads, which ended up with a messed-up inheritance hierarchy (such as thread locals), especially when multiple jobs run in parallel. To completely fix this, we should enable this mode by default.
### Why are the changes needed?
To correctly support parallel job submission and management.
### Does this PR introduce _any_ user-facing change?
Yes, each Python thread is now mapped to a JVM thread one-to-one.
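For example (a hedged sketch; `setJobGroup` is one of the thread-local properties affected), per-thread properties set in one Python thread no longer leak into jobs submitted from another:
```python
import threading
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def run_job(group: str) -> None:
    # under pinned thread mode, this job group stays local to this thread
    sc.setJobGroup(group, "job group " + group)
    sc.parallelize(range(10)).count()

threads = [threading.Thread(target=run_job, args=("group-%d" % i,)) for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```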
### How was this patch tested?
Existing tests should cover it.
Closes #32429 from HyukjinKwon/SPARK-35303.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the wrong behavior of `Index.difference` in pandas APIs on Spark, based on the comments https://github.com/databricks/koalas/pull/1325#discussion_r647889901 and https://github.com/databricks/koalas/pull/1325#discussion_r647890007:
- It couldn't handle the case properly when `self` is an `Index`/`MultiIndex` and `other` is a `MultiIndex`/`Index` of the other kind.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1.difference(idx1)
pyspark.pandas.exceptions.PandasNotImplementedError: The method `pd.Index.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
```
- It collects all the data onto the driver side when `other` is a list-like object, which is very dangerous especially when `other` is a distributed object such as a Series.
And added the related test cases.
### Why are the changes needed?
To correct behavior incompatible with pandas, and to prevent cases that could easily cause an OOM.
```python
>>> midx1 = ps.MultiIndex.from_tuples([('a', 'x', 1), ('b', 'z', 2), ('k', 'z', 3)])
>>> idx1 = ps.Index([1, 2, 3])
>>> midx1.difference(idx1)
MultiIndex([('a', 'x', 1),
            ('b', 'z', 2),
            ('k', 'z', 3)],
           )
```
And now it only uses a for loop when `other` is a `list`, `set`, or `dict`.
### Does this PR introduce _any_ user-facing change?
Yes, the previous bug is fixed as described in the above code examples.
### How was this patch tested?
Manually tested with the linter and unit tests locally, and it should pass on CI.
Closes #32853 from itholic/SPARK-35683.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Modify the `pandas_udf` usage to use the type-annotation based `pandas_udf` or avoid specifying UDF types, to suppress warnings.
### Why are the changes needed?
The usage of `pandas_udf` in pandas-on-Spark is outdated and shows warnings.
We should use type-annotation based `pandas_udf` or avoid specifying udf types.
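A sketch of the migration (the `plus_one` function is a made-up example):
```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.getOrCreate()

# outdated style that now emits a deprecation warning:
# @pandas_udf("double", PandasUDFType.SCALAR)
# def plus_one(v):
#     return v + 1

# type-annotation based style:
@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1

spark.range(3).select(plus_one("id")).show()
```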
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #32913 from ueshin/issues/SPARK-35761/suppress_warnings.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to rename "pandas APIs on Spark" to "pandas API on Spark", which is more natural (since API stands for Application Programming Interface).
### Why are the changes needed?
To make it sound more natural.
### Does this PR introduce _any_ user-facing change?
It fixes a typo in the unreleased changes.
### How was this patch tested?
N/A
Closes #32903 from HyukjinKwon/SPARK-34885.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make `astype` method data-type-based.
**Non-goal: Match pandas' `astype` TypeErrors.**
Currently, `astype` raises a TypeError only when the destination type is not recognized. However, for some destination types that don't make sense for the specific type of Series/Index, for example, `numeric Series/Index → bytes`, we don't have proper TypeError messages.
Since the goal of the PR is refactoring mainly, the above issue might be resolved later if needed.
### Why are the changes needed?
There are many type checks in the `astype` method. Since `DataTypeOps` and its subclasses are introduced, we should refactor `astype` to make it data-type-based. In this way, code is cleaner, more maintainable, and more flexible.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32847 from xinrong-databricks/datatypeops_astype.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to port the fix https://github.com/databricks/koalas/pull/2172.
```python
ks.DataFrame({'a': [1, 2, 3], 'b':["a", "b", "c"], 'c': [4, 5, 6]}).plot(kind='hist', x='a', y='c', bins=200)
```
**Before:**
```
pyspark.sql.utils.AnalysisException: cannot resolve 'least(min(a), min(b), min(c))' due to data type mismatch: The expressions should all have the same type, got LEAST(bigint, string, bigint).;
'Aggregate [unresolvedalias(least(min(a#1L), min(b#2), min(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1)), unresolvedalias(greatest(max(a#1L), max(b#2), max(c#3L)), Some(org.apache.spark.sql.Column$$Lambda$1556/0x0000000800d9484042fb0cc1))]
+- Project [a#1L, b#2, c#3L]
+- Project [__index_level_0__#0L, a#1L, b#2, c#3L, monotonically_increasing_id() AS __natural_order__#8L]
+- LogicalRDD [__index_level_0__#0L, a#1L, b#2, c#3L], false
```
**After:**
```python
Figure({
'data': [{'hovertemplate': 'variable=a<br>value=%{text}<br>count=%{y}',
'name': 'a',
...
```
### Why are the changes needed?
To match the behaviour with pandas' and allow users to set `x` and `y` in a DataFrame with non-numeric columns.
### Does this PR introduce _any_ user-facing change?
No to end users, since the change is not released yet. Yes to devs, as described above.
### How was this patch tested?
Manually tested, added a test and tested in notebooks:
![Screen Shot 2021-06-11 at 9 11 25 PM](https://user-images.githubusercontent.com/6477701/121686038-a47a1b80-cafb-11eb-8f8e-8d968db7ebef.png)
![Screen Shot 2021-06-11 at 9 48 58 PM](https://user-images.githubusercontent.com/6477701/121688858-e22c7380-cafe-11eb-9d0a-adcbe560030f.png)
Closes #32884 from HyukjinKwon/fix-hist-plot.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the file `python/pyspark/pandas/namespace.py` and fixes the mypy check failures.
### Why are the changes needed?
We should enable more `disallow_untyped_defs` mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in the pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32871 from ueshin/issues/SPARK-35475/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adds more type annotations in the file:
`python/pyspark/pandas/spark/indexing.py`
and fixes the mypy check failures.
### Why are the changes needed?
We should enable more `disallow_untyped_defs` mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in the pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
`./dev/lint-python`
Closes #32738 from pingsutw/SPARK-35474.
Authored-by: Kevin Su <pingsutw@apache.org>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adjust the pandas-on-Spark `test_groupby_multiindex_columns` test so that it passes with different pandas versions.
### Why are the changes needed?
pandas had introduced bugs, as below:
- For pandas 1.1.3 and 1.1.4:
  `TypeError: only integer scalar arrays can be converted to a scalar index`
- For pandas < 1.0.4:
  `TypeError: Can only tuple-index with a MultiIndex`
We ought to adjust `test_groupby_multiindex_columns` tests by comparing with a predefined return value, rather than comparing with the pandas return value in the pandas versions above.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32851 from xinrong-databricks/SPARK-35705.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Limit the batch size for `add_shuffle_key` in the `partitionBy` function to fix `OverflowError: cannot convert float infinity to integer`.
### Why are the changes needed?
It's not easy to write a UT, but I can use some simple code to explain the bug.
* Original code
```python
def add_shuffle_key(split, iterator):
    buckets = defaultdict(list)
    c, batch = 0, min(10 * numPartitions, 1000)
    for k, v in iterator:
        buckets[partitionFunc(k) % numPartitions].append((k, v))
        c += 1
        # check used memory and avg size of chunk of objects
        if (c % 1000 == 0 and get_used_memory() > limit
                or c > batch):
            n, size = len(buckets), 0
            for split in list(buckets.keys()):
                yield pack_long(split)
                d = outputSerializer.dumps(buckets[split])
                del buckets[split]
                yield d
                size += len(d)
            avg = int(size / n) >> 20
            # let 1M < avg < 10M
            if avg < 1:
                batch *= 1.5
            elif avg > 10:
                batch = max(int(batch / 1.5), 1)
            c = 0
```
If `get_used_memory() > limit` is always `True` and `avg < 1` is always `True`, the variable `batch` grows toward infinity (note that `batch *= 1.5` turns it into a float, which can become `inf`); then `batch = max(int(batch / 1.5), 1)` may raise `OverflowError` once `avg > 10` at some point.
* sample code to reproduce the bug
```python
import sys
limit = 100
used_memory = 200
numPartitions = 64
c, batch = 0, min(10 * numPartitions, 1000)
while True:
    c += 1
    if (c % 1000 == 0 and used_memory > limit or c > batch):
        batch = batch * 1.5
        d = max(int(batch / 1.5), 1)
        print(c, batch)
```
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
It's not easy to write a unit test, but here is sample code to verify the fix:
```python
import sys
limit = 100
used_memory = 200
numPartitions = 64
c, batch = 0, min(10 * numPartitions, 1000)
while True:
    c += 1
    if (c % 1000 == 0 and used_memory > limit or c > batch):
        batch = min(sys.maxsize, batch * 1.5)
        d = max(int(batch / 1.5), 1)
        print(c, batch)
```
Closes #32667 from nolanliou/fix_partitionby.
Authored-by: liuqi <nolan.liou@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduces `InternalField` to manage dtypes and `StructField`s.
`InternalFrame` is already managing dtypes, but when it checks Spark's data types, column names, and nullabilities, it tries to run the analysis phase each time it needs them, which causes a performance issue.
It will use the `InternalField` class, which stores the retrieved Spark data types, column names, and nullabilities, and reuses them. Also, in case those are already known, it just updates and reuses them without asking Spark.
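Conceptually (an illustrative sketch, not the actual class definition), `InternalField` caches a pandas dtype together with the corresponding Spark `StructField`:
```python
from dataclasses import dataclass
from typing import Any
from pyspark.sql.types import StructField

@dataclass
class InternalField:
    dtype: Any                 # the pandas dtype
    struct_field: StructField  # cached Spark name, data type, and nullability
```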
### Why are the changes needed?
Currently there are some performance issues in the pandas-on-Spark layer.
One of them is accessing the Java DataFrame and running the analysis phase too many times, especially just to retrieve the current column names or data types.
We should reduce the amount of unnecessary access.
### Does this PR introduce _any_ user-facing change?
Improves the performance in the pandas-on-Spark layer:
```py
df = ps.read_parquet("/path/to/test.parquet") # contains ~75 columns
df = df[(df["col"] > 0) & (df["col"] < 10000)]
```
Before the PR, it took about **2.15 sec** and after **1.15 sec**.
### How was this patch tested?
Existing tests.
Closes #32775 from ueshin/issues/SPARK-35638/field.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Introduce BooleanExtensionOps in order to make boolean operators `and` and `or` data-type-based.
- Improve error messages for operators `and` and `or`.
### Why are the changes needed?
Boolean operators `__and__`, `__or__`, `__rand__`, and `__ror__` should be data-type-based.
BooleanExtensionDtypes process these boolean operators differently from `bool`, so BooleanExtensionOps is introduced.
These boolean operators themselves are also bitwise operators, which should be applicable to other data type classes later. However, that is not the goal of this PR.
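For reference, the special casing comes from pandas' Kleene logic for the nullable boolean dtype, which plain `bool` columns don't follow:
```python
import pandas as pd

s = pd.Series([True, False, None], dtype="boolean")
print(s | True)  # True, True, True   (NA | True is True under Kleene logic)
print(s & True)  # True, False, <NA>  (NA & True stays unknown)
```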
### Does this PR introduce _any_ user-facing change?
Yes. Error messages for operators `and` and `or` are improved.
Before:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
pyspark.sql.utils.AnalysisException: cannot resolve '(`0` OR true)' due to data type mismatch: differing types in '(`0` OR true)' (tinyint and boolean).;
'Project [unresolvedalias(CASE WHEN (isnull(0#9) OR isnull((0#9 OR true))) THEN false ELSE (0#9 OR true) END, Some(org.apache.spark.sql.Column$$Lambda$1442/17254916406fb8afba))]
+- Project [__index_level_0__#8L, 0#9, monotonically_increasing_id() AS __natural_order__#12L]
+- LogicalRDD [__index_level_0__#8L, 0#9], false
```
After:
```
>>> psser = ps.Series([1, "x", "y"], dtype="category")
>>> psser | True
Traceback (most recent call last):
...
TypeError: Bitwise or can not be applied to categoricals.
```
### How was this patch tested?
Unit tests.
Closes #32698 from xinrong-databricks/datatypeops_extension.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Make the conversion from/to pandas (for non-ExtensionDtype) data-type-based.
NOTE: Ops class per ExtensionDtype and its data-type-based from/to pandas will be implemented in a separate PR as https://issues.apache.org/jira/browse/SPARK-35614.
### Why are the changes needed?
The conversion from/to pandas includes logic for checking data types and behaving accordingly.
That makes code hard to change or maintain.
Since we have introduced the Ops class per non-ExtensionDtype data type, we ought to make the conversion from/to pandas data-type-based for non-ExtensionDtypes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32592 from xinrong-databricks/datatypeop_pd_conversion.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Adjust the `check_exact` parameter for non-numeric columns to ensure pandas-on-Spark tests passed with all pandas versions.
### Why are the changes needed?
`pd.testing` utils are utilized in pandas-on-Spark tests.
Due to https://github.com/pandas-dev/pandas/issues/35446, `check_exact=True` for non-numeric columns doesn't work for older pd.testing utils, e.g. `assert_series_equal`. We wanted to adjust that to ensure pandas-on-Spark tests pass for all pandas versions.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes #32772 from xinrong-databricks/test_util.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes applying `black` to pandas API on Spark codes, for improving static analysis.
By executing `./dev/reformat-python` in the Spark home directory, all the pandas API on Spark code is reformatted according to the static analysis rules.
### Why are the changes needed?
This can reduce the cost of static analysis during development.
It has been used continuously for about a year in the Koalas project and its convenience has been proven.
### Does this PR introduce _any_ user-facing change?
No, it's dev-only.
### How was this patch tested?
Manually reformatted the pandas API on Spark codes by running `./dev/reformat-python`, and checked that `./dev/lint-python` passes.
Closes #32779 from itholic/SPARK-35499.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
In `functions.py`, there is a function `def column(col)`. There is also another function in the same file, `def col(col)`. This leads to ambiguity over whether the parameter or the function is being referred to. In PySpark 3.1.2, this leads to `TypeError: 'str' object is not callable` when the function `column(col)` is called: the highest preference is given to the string variable in scope, as opposed to the function `col` in the file as intended.
This PR fixes that ambiguity by changing the variable name to `col_like`. I have filed this as an issue on JIRA here: https://issues.apache.org/jira/browse/SPARK-35643.
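A minimal standalone repro of the shadowing issue (plain Python, not the actual `functions.py` code):
```python
def col(name):
    # stand-in for pyspark.sql.functions.col
    return "Column<%s>" % name

def column(col):
    # BUG: the str parameter `col` shadows the module-level function `col`
    return col(col)  # TypeError: 'str' object is not callable

def column_fixed(col_like):
    return col(col_like)  # resolves to the module-level function again

print(column_fixed("x"))  # Column<x>
```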
### Why are the changes needed?
In PySpark 3.1.2, we see `TypeError: 'str' object is not callable` when the `column()` function is called. This PR fixes that error.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
I don't believe this patch needs additional testing.
Closes #32771 from keerthanvasist/col.
Lead-authored-by: Keerthan Vasist <kvasist@amazon.com>
Co-authored-by: keerthanvasist <kvasist@amazon.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to port Koalas documentation to PySpark documentation as its initial step.
It ports almost as-is, except for these differences:
- Renamed import from `databricks.koalas` to `pyspark.pandas`.
- Renamed `to_koalas` -> `to_pandas_on_spark`
- Renamed `(Series|DataFrame).koalas` -> `(Series|DataFrame).pandas_on_spark`
- Added a `ps_` prefix in the RST file names of Koalas documentation
Other than that,
- Excluded `python/docs/build/html` in the linter
- Fixed GA dependency installation
### Why are the changes needed?
To document pandas APIs on Spark.
### Does this PR introduce _any_ user-facing change?
Yes, it adds new documentation.
### How was this patch tested?
Manually built the docs and checked the output.
Closes #32726 from HyukjinKwon/SPARK-35587.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes adding the missing link to Data Source Option page, for related functions such as `to_csv`, `to_json`, `from_csv`, `from_json`, `schema_of_csv`, `schema_of_json`.
- Before
<img width="797" alt="Screen Shot 2021-06-03 at 11 39 17 AM" src="https://user-images.githubusercontent.com/44108233/120578877-7b092200-c461-11eb-9e24-bd5349445c66.png">
- After
<img width="776" alt="Screen Shot 2021-06-03 at 11 59 14 AM" src="https://user-images.githubusercontent.com/44108233/120579868-29fa2d80-c463-11eb-9329-bd6c8f068f5b.png">
### Why are the changes needed?
To provide users available options in detail with the proper documentation link.
### Does this PR introduce _any_ user-facing change?
Yes, the link to the Data Source Options page is added to the API documentation, as shown in the above screen captures.
### How was this patch tested?
Manually built the docs and checked one by one.
Closes #32762 from itholic/SPARK-35081.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes restoring `to_koalas` to keep backward compatibility, while throwing a deprecation warning.
### Why are the changes needed?
If we remove `to_koalas`, existing Koalas code that includes `to_koalas` wouldn't work.
### Does this PR introduce _any_ user-facing change?
No. It's restoring the existing functionality.
### How was this patch tested?
Manually tested locally.
```shell
>>> sdf.to_koalas()
.../spark/python/pyspark/pandas/frame.py:4550: FutureWarning: DataFrame.to_koalas is deprecated as of DataFrame.to_pandas_on_spark. Please use the API instead.
warnings.warn(
```
Closes #32729 from itholic/SPARK-35539.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support arithmetic operations against bool IndexOpsMixin.
### Why are the changes needed?
Existing binary operations of bool IndexOpsMixin in Koalas do not match pandas’ behaviors.
pandas takes True as 1 and False as 0 when dealing with numeric values, numeric collections, and numeric Series/Index, whereas Koalas raises an AnalysisException no matter what the binary operation is.
We aim to match pandas' behaviors.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> 1 + psser
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
Traceback (most recent call last):
...
TypeError: Addition can not be applied to booleans.
>>> ps.Series([1, 2, 3]) + psser
Traceback (most recent call last):
...
TypeError: addition can not be applied to given types.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([True, True, False])
>>> psser + 1
0 2
1 2
2 1
dtype: int64
>>> 1 + psser
0 2
1 2
2 1
dtype: int64
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> psser + ps.Series([1, 2, 3])
0 2
1 3
2 3
dtype: int64
>>> ps.Series([1, 2, 3]) + psser
0 2
1 3
2 3
dtype: int64
```
### How was this patch tested?
Unit tests.
Closes #32611 from xinrong-databricks/datatypeop_arith_bool.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR aims to move `# noqa` in Python docstrings to the proper place so that they are hidden from the official documents.
### Why are the changes needed?
If we don't move `# noqa` to the proper place, it is exposed in the middle of the docstring, and it looks a bit weird, as below:
<img width="613" alt="Screen Shot 2021-06-01 at 3 17 52 PM" src="https://user-images.githubusercontent.com/44108233/120275617-91da3800-c2ec-11eb-9778-16c5fe789418.png">
### Does this PR introduce _any_ user-facing change?
Yes, the `# noqa` is no longer shown in the documents, as below:
<img width="609" alt="Screen Shot 2021-06-01 at 3 21 00 PM" src="https://user-images.githubusercontent.com/44108233/120275927-fbf2dd00-c2ec-11eb-950d-346af2745711.png">
### How was this patch tested?
Manually built the docs and checked.
Closes #32728 from itholic/SPARK-35582.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes renaming the existing "Koalas Accessor" to "Pandas API on Spark Accessor".
### Why are the changes needed?
Because we don't use the name "Koalas" anymore, but rather "Pandas API on Spark",
the related code bases all need to be changed.
### Does this PR introduce _any_ user-facing change?
Yes, the usage of the pandas API on Spark accessor is changed from `df.koalas.[...]` to `df.pandas_on_spark.[...]`.
**Note:** `df.koalas.[...]` is still available but with deprecated warnings.
### How was this patch tested?
Manually tested locally and checked one by one.
Closes #32674 from itholic/SPARK-35453.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix and reenable `test_stats_on_non_numeric_columns_should_be_discarded_if_numeric_only_is_true` that was disabled when we upgrade Python 3.9 in CI at https://github.com/apache/spark/pull/32657.
This seems to be because of the latest NumPy's behaviour change; see also https://github.com/numpy/numpy/pull/16273#discussion_r641264085.
pandas inherits this behaviour but it doesn't make sense when `numeric_only` is set to `True` in pandas. I will track and follow the status of the issue between pandas and NumPy.
For the time being, I propose to exclude the boolean case alone in the percentile/quartile test case.
### Why are the changes needed?
To keep the test coverage.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
I roughly tested locally, but it should pass in CI.
Closes #32690 from HyukjinKwon/SPARK-35510.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Re-enable some pandas-on-Spark test cases.
### Why are the changes needed?
The pandas version in GitHub Actions is upgraded now, so we can re-enable some pandas-on-Spark test cases.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests.
Closes #32682 from xinrong-databricks/enable_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Introduce a util function `spark_column_equals` to check whether the underlying expressions of two columns are the same or not.
### Why are the changes needed?
In pandas on Spark, there are some places that check whether the underlying expressions of columns are the same or not, but it's done one-by-one.
We should introduce a util function for it.
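A hedged usage sketch, assuming the function lives in `pyspark.pandas.utils` and has the signature `spark_column_equals(left, right) -> bool` suggested by the description:
```python
from pyspark.sql import SparkSession, functions as F
from pyspark.pandas.utils import spark_column_equals  # assumed location

spark = SparkSession.builder.getOrCreate()

# compares the underlying expressions, not Python object identity
assert spark_column_equals(F.col("a") + 1, F.col("a") + 1)
assert not spark_column_equals(F.col("a") + 1, F.col("b") + 1)
```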
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
The existing tests.
Closes #32680 from ueshin/issues/SPARK-35537/spark_column_equals.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
BinaryType, which represents byte sequence values in Spark, doesn't support data-type-based operations yet. We are going to introduce BinaryOps for it.
### Why are the changes needed?
The data-type-based-operations class should be set for each individual data type, including BinaryType.
In addition, BinaryType has its own special form of addition, namely concatenation.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> psser + b'1'
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> psser = ps.Series([b'1', b'2', b'3'])
>>> psser + psser
0 [49, 49]
1 [50, 50]
2 [51, 51]
dtype: object
>>> psser + b'1'
0 [49, 49]
1 [50, 49]
2 [51, 49]
dtype: object
```
### How was this patch tested?
Unit tests.
Closes #32665 from xinrong-databricks/datatypeops_binary.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
The PR is proposed to introduce ArrayOps, MapOps and StructOps to handle data-type-based operations for StructType, ArrayType, and MapType separately.
### Why are the changes needed?
StructType, ArrayType, and MapType are not accepted by DataTypeOps now.
We should handle these complex types. Among them:
- ArrayType supports concatenation: for example, `ps.Series([[1,2,3]]) + ps.Series([[4,5,6]])` should work the same as `pd.Series([[1,2,3]]) + pd.Series([[4,5,6]])`, i.e., as concatenation.
- StructOps will be helpful to make to/from pandas conversion data-type-based.
### Does this PR introduce _any_ user-facing change?
Yes.
Before the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Type object was not understood.
```
After the change:
```py
>>> import pyspark.pandas as ps
>>> from pyspark.pandas.config import set_option
>>> set_option("compute.ops_on_diff_frames", True)
>>> ps.Series([[1, 2, 3]]) + ps.Series([[0.4, 0.5]])
0 [1.0, 2.0, 3.0, 0.4, 0.5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([[4, 5]])
0 [1, 2, 3, 4, 5]
dtype: object
>>> ps.Series([[1, 2, 3]]) + ps.Series([['x']])
Traceback (most recent call last):
...
TypeError: Concatenation can only be applied to arrays of the same type
```
### How was this patch tested?
Unit tests.
Closes #32626 from xinrong-databricks/datatypeop_complex.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to use proper built-in exceptions instead of the plain `Exception` in Python.
While I was here, I fixed another minor issue in `DataFrame.schema` together:
```diff
- except AttributeError as e:
- raise Exception(
- "Unable to parse datatype from schema. %s" % e)
+ except Exception as e:
+ raise ValueError(
+ "Unable to parse datatype from schema. %s" % e) from e
```
Now it catches all exceptions during schema parsing and chains them with a `ValueError`. Previously it only caught `AttributeError`, which does not cover all cases.
### Why are the changes needed?
For users to expect the proper exceptions.
### Does this PR introduce _any_ user-facing change?
Yeah, the exception classes became different, but they should be compatible because the previous exception was the plain `Exception`, which the new exceptions inherit from.
### How was this patch tested?
Existing unit tests should cover it.
Closes #31238
Closes #32650 from HyukjinKwon/SPARK-32194.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR enables GitHub Actions to test PySpark with Python 3.9.
### Why are the changes needed?
To verify the support of Python 3.9.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Existing tests should cover it.
Closes #32657 from HyukjinKwon/SPARK-35506.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Removes APIs which have been deprecated in Koalas.
### Why are the changes needed?
There are some APIs that have been deprecated in Koalas. We shouldn't have those in pandas APIs on Spark.
### Does this PR introduce _any_ user-facing change?
Yes, the APIs deprecated in Koalas will no longer be available.
### How was this patch tested?
Modified some tests which use the deprecated APIs, and the other existing tests should pass.
Closes #32656 from ueshin/issues/SPARK-35505/remove_deprecated.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR enables plot tests with plotly
```bash
./python/run-tests --python-executables=python3 --modules=pyspark-pandas
```
**Before**:
```
Traceback (most recent call last):
File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/.../miniconda3/envs/python3.8/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/.../pyspark/pandas/tests/plot/test_frame_plot_plotly.py", line 42, in <module>
plotly_requirement_message + " Or pandas<1.0; pandas<1.0 does not support latest plotly "
TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
```
**After**:
```
...
Starting test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly
...
Finished test(python3): pyspark.pandas.tests.plot.test_series_plot_plotly (23s)
...
Tests passed in 1296 seconds
```
### Why are the changes needed?
For test coverage.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
By running the tests.
Closes #32649 from HyukjinKwon/SPARK-35497.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a thread target wrapper API for PySpark pinned thread mode.
### Why are the changes needed?
A helper method that makes it easier for users to write threading code under pinned thread mode.
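A hedged usage sketch, assuming the wrapper is exposed as `pyspark.inheritable_thread_target`:
```python
import threading
from pyspark import inheritable_thread_target

def work():
    # runs with the JVM thread-local properties (e.g., job group)
    # inherited from the thread that created the wrapper
    pass

thread = threading.Thread(target=inheritable_thread_target(work))
thread.start()
thread.join()
```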
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual.
Closes #32644 from WeichenXu123/add_thread_target_wrapper_api.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Adds more type annotations in the files:
- `python/pyspark/pandas/spark/accessors.py`
- `python/pyspark/pandas/typedef/typehints.py`
- `python/pyspark/pandas/utils.py`
and fixes the mypy check failures.
### Why are the changes needed?
We should enable more `disallow_untyped_defs` mypy checks.
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in the pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32627 from ueshin/issues/SPARK-35467_35468_35477/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Sets up the `mypy` configuration to enable `disallow_untyped_defs` check for pandas APIs on Spark module.
### Why are the changes needed?
Currently many functions in the main code of the pandas APIs on Spark module are still missing type annotations, and the `mypy` check `disallow_untyped_defs` is disabled for them.
We should add more type annotations and enable the mypy check.
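To illustrate what the check enforces (a toy example, not Spark code):
```python
# with disallow_untyped_defs enabled, mypy rejects the first definition
# ("Function is missing a type annotation") and accepts the second.
def add(a, b):
    return a + b

def add_typed(a: int, b: int) -> int:
    return a + b
```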
### Does this PR introduce _any_ user-facing change?
Yes.
This PR adds more type annotations in the pandas APIs on Spark module, which can impact interaction with development tools for users.
### How was this patch tested?
The mypy check with a new configuration and existing tests should pass.
Closes #32614 from ueshin/issues/SPARK-35465/disallow_untyped_defs.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
There is still naming related to Koalas in test and function names. This PR addresses them to fit pandas-on-Spark:
- kdf -> psdf
- kser -> psser
- kidx -> psidx
- kmidx -> psmidx
- to_koalas() -> to_pandas_on_spark()
### Why are the changes needed?
This is because the name Koalas is no longer used in PySpark.
### Does this PR introduce _any_ user-facing change?
The `to_koalas()` function is renamed to `to_pandas_on_spark()`.
### How was this patch tested?
Tested locally and manually.
After changing the related naming, I checked them one by one.
Closes #32516 from itholic/SPARK-35364.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
The PR is proposed for **pandas APIs on Spark**, in order to separate the arithmetic operations shown below into data-type-based structures:
`__add__`, `__sub__`, `__mul__`, `__truediv__`, `__floordiv__`, `__pow__`, `__mod__`,
`__radd__`, `__rsub__`, `__rmul__`, `__rtruediv__`, `__rfloordiv__`, `__rpow__`, `__rmod__`
DataTypeOps and subclasses are introduced.
The existing behaviors of each arithmetic operation should be preserved.
### Why are the changes needed?
Currently, the same arithmetic operation for all data types is defined in one function, so it's difficult to extend or change the behavior based on the data types.
Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).
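A simplified sketch of the dispatch idea (names and bodies here are illustrative, not the actual implementation):
```python
from pyspark.sql.types import DataType, NumericType

class DataTypeOps:
    def add(self, left, right):
        raise TypeError("addition can not be applied to given types.")

class NumericOps(DataTypeOps):
    def add(self, left, right):
        # numeric types delegate to Spark Column arithmetic
        return left + right

def ops_for(spark_type: DataType) -> DataTypeOps:
    # per-data-type behavior is selected here instead of in one big function
    return NumericOps() if isinstance(spark_type, NumericType) else DataTypeOps()
```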
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.
Closes #32596 from xinrong-databricks/datatypeop_arith_fix.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR avoids using the f-string format, which is a new feature in Python 3.6. Although it's legitimate to use this syntax because Apache Spark supports Python 3.6+, it breaks the unofficial support of Python 3.5.
This specific f-string usage looks unnecessary, and it doesn't seem worth removing such unofficial support just because of one string format in an error message.
**NOTE** that this PR doesn't mean we're maintaining Python 3.5, since we dropped it. It just looks like too much to remove that unofficial support only because of one string format in an error message.
### Why are the changes needed?
To keep unofficial Python 3.5 support.
### Does this PR introduce _any_ user-facing change?
Officially nope.
### How was this patch tested?
Ran the linters.
Closes #32598 from HyukjinKwon/SPARK-35408=followup.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The PR is proposed for **pandas APIs on Spark**, in order to separate the arithmetic operations shown below into data-type-based structures:
`__add__`, `__sub__`, `__mul__`, `__truediv__`, `__floordiv__`, `__pow__`, `__mod__`,
`__radd__`, `__rsub__`, `__rmul__`, `__rtruediv__`, `__rfloordiv__`, `__rpow__`, `__rmod__`
DataTypeOps and subclasses are introduced.
The existing behaviors of each arithmetic operation should be preserved.
### Why are the changes needed?
Currently, the same arithmetic operation for all data types is defined in one function, so it's difficult to extend or change the behavior based on the data types.
Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.
Closes #32469 from xinrong-databricks/datatypeop_arith.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`.
### Why are the changes needed?
This function can only be used from SQL for now.
It would be good if we could use this function from Scala/Python code as well as from SQL, as in the sketch below.
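A usage sketch (mirroring the SQL built-in `sentences(str[, language, country])`):
```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Hi there! Good morning.",)], ["s"])
df.select(F.sentences(df.s).alias("r")).show(truncate=False)
# [[Hi, there], [Good, morning]]
```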
### Does this PR introduce _any_ user-facing change?
Yes. Users can use this function from Scala and Python.
### How was this patch tested?
New test.
Closes #32566 from sarutak/sentences-function.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Fixes `mypy` errors and enables `mypy` check for pandas-on-Spark.
### Why are the changes needed?
The `mypy` check for pandas-on-Spark was disabled during the initial porting.
It should be enabled again; otherwise we will miss type checking errors.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The enabled `mypy` check and existing unit tests should pass.
Closes #32540 from ueshin/issues/SPARK-34941/pandas_mypy.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Provide a clearer error message tied to the user's Python code if incorrect parameters are passed to `DataFrame.show`, rather than the message about a missing JVM method that the user is not calling directly:
```
py4j.Py4JException: Method showString([class java.lang.Boolean, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
```
### Why are the changes needed?
For faster debugging through actionable error message.
### Does this PR introduce _any_ user-facing change?
No change for the correct parameters but different error messages for the parameters triggering an exception.
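For instance (a hedged illustration; the exact message wording may differ), a bad argument now fails fast in Python:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1)

try:
    df.show(n="5")  # wrong type for n
except TypeError as e:
    print(e)  # actionable message about 'n' instead of a py4j error
```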
### How was this patch tested?
- unit test
- manually in PySpark REPL
Closes #32555 from gerashegalov/df_show_validation.
Authored-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add the required imports to the PySpark ML examples in CrossValidator and TrainValidationSplit.
### Why are the changes needed?
The examples pass doctests because of previous imports, but as they appear in the PySpark documentation, they are incomplete. The additional imports are required to make the examples work.
### Does this PR introduce _any_ user-facing change?
No, docs only change.
### How was this patch tested?
Existing tests.
Closes #32554 from srowen/TuningImports.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR removes the check of `summary.logLikelihood` in ml/clustering.py, as this GMM test is quite flaky. It fails easily, e.g., if we:
- change the number of partitions;
- just change the way the sum of weights is computed;
- change the underlying BLAS implementation.
Also uses a more permissive precision on the `Word2Vec` test case.
### Why are the changes needed?
To recover the build and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test cases.
Closes #32533 from zhengruifeng/SPARK_35392_disable_flaky_gmm_test.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the same issue as #32424.
```py
from pyspark.sql.functions import flatten, struct, transform
df = spark.sql("SELECT array(1, 2, 3) as numbers, array('a', 'b', 'c') as letters")
df.select(flatten(
transform(
"numbers",
lambda number: transform(
"letters",
lambda letter: struct(number.alias("n"), letter.alias("l"))
)
)
).alias("zipped")).show(truncate=False)
```
**Before:**
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
**After:**
```
+------------------------------------------------------------------------+
|zipped |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```
### Why are the changes needed?
To produce the correct results.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the results to be correct as mentioned above.
### How was this patch tested?
Added a unit test as well as manually.
Closes #32523 from ueshin/issues/SPARK-35382/nested_higher_order_functions.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.
### Why are the changes needed?
To correctly document.
### Does this PR introduce _any_ user-facing change?
Yes, it fixes the user-facing documentation.
### How was this patch tested?
I checked them via running linters.
Closes #32423 from HyukjinKwon/SPARK-35250.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR corrects some exception types used when function input params fail to validate due to a type error.
To make it convenient to review, there are 3 commits in this PR:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas
### Why are the changes needed?
As the Python exception doc [1] suggests, a TypeError is "Raised when an operation or function is applied to an object of inappropriate type.", but a ValueError is raised instead in some PySpark code; this patch fixes them.
[1] https://docs.python.org/3/library/exceptions.html#TypeError
Note that this patch only addresses the existing wrong raise types for input validation; the input validation decorator/framework mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176) will be submitted in a separate patch.
### Does this PR introduce _any_ user-facing change?
Yes, the code now raises the right `TypeError` instead of `ValueError`.
### How was this patch tested?
Existing test cases and unit tests.
Closes #32368 from Yikun/SPARK-35176.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
There are two types of dense vectors:
* pyspark.ml.linalg.DenseVector
* pyspark.mllib.linalg.DenseVector
In spark-3.1.1, `array_to_vector` returns instances of `pyspark.ml.linalg.DenseVector`.
The documentation is ambiguous and can lead to the false conclusion that instances of
`pyspark.mllib.linalg.DenseVector` will be returned.
Conversion between the ml and mllib versions can easily be achieved with the
`MLUtils.convertVectorColumnsToML` / `convertVectorColumnsFromML` helpers, as sketched below.
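For example (a hedged sketch; `convertVectorColumnsToML` goes mllib → ml and `convertVectorColumnsFromML` goes the other way):
```python
from pyspark.sql import SparkSession
from pyspark.mllib.linalg import Vectors as MLlibVectors
from pyspark.mllib.util import MLUtils

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(MLlibVectors.dense([1.0, 2.0]),)], ["features"])

df_ml = MLUtils.convertVectorColumnsToML(df, "features")          # mllib -> ml
df_mllib = MLUtils.convertVectorColumnsFromML(df_ml, "features")  # ml -> mllib
```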
### What changes were proposed in this pull request?
Make documentation more explicit
### Why are the changes needed?
The documentation is a bit misleading, and users can lose time investigating and realizing there are two DenseVector types.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No tests were run, as only the documentation was changed.
Closes #32255 from jlafaye/master.
Authored-by: Julien Lafaye <jlafaye@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Remove the existing agg and use a new agg supporting virtual centering.
2. Add related test suites.
### Why are the changes needed?
Centering vectors should accelerate convergence and generate solutions closer to R's.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Updated and added test suites.
Closes #32124 from zhengruifeng/svc_agg_refactor.
Authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Removes PySpark version dependent codes from pyspark.pandas test codes.
### Why are the changes needed?
There are several places that check the PySpark version and switch the logic, but those are no longer necessary.
We should remove them.
We will do the same thing after we finish porting tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #32300 from xinrong-databricks/port.rmv_spark_version_chk_in_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Consolidate PySpark testing utils by removing `python/pyspark/pandas/testing`, and then creating a file `pandasutils` under `python/pyspark/testing` for test utilities used in `pyspark/pandas`.
### Why are the changes needed?
`python/pyspark/pandas/testing` holds test utilities for pandas-on-Spark, and `python/pyspark/testing` contains test utilities for PySpark. Consolidating them makes the code cleaner and easier to maintain.
Updated import statements are as shown below:
- `from pyspark.testing.sqlutils import SQLTestUtils`
- `from pyspark.testing.pandasutils import PandasOnSparkTestCase, TestUtils`
(`PandasOnSparkTestCase` is the original `ReusedSQLTestCase` in `python/pyspark/pandas/testing/utils.py`)
Minor improvements include:
- Use the missing library's `requirement_message`.
- Use `except ImportError` rather than a bare `except`.
- Import `pyspark.pandas` with the alias `ps` rather than `pp`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit tests under python/pyspark/pandas/tests.
Closes #32177 from xinrong-databricks/port.merge_utils.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Fixes incorrect return type for `rawPredictionUDF` in `OneVsRestModel`.
### Why are the changes needed?
Bugfix
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #32245 from harupy/SPARK-35142.
Authored-by: harupy <17039389+harupy@users.noreply.github.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
There are some more changes in Koalas, such as [databricks/koalas#2141](c8f803d6be) and [databricks/koalas#2143](913d68868d), after the main code porting; this PR synchronizes those changes with `pyspark.pandas`.
### Why are the changes needed?
We should port the whole Koalas codes into PySpark and synchronize them.
### Does this PR introduce _any_ user-facing change?
Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring.
### How was this patch tested?
Manually tested locally.
Closes #32197 from itholic/SPARK-34995-fix.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Index unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the Index unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable Index unit tests.
Closes #32139 from xinrong-databricks/port.indexes_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
There are some more changes in Koalas, such as [databricks/koalas#2141](c8f803d6be) and [databricks/koalas#2143](913d68868d), after the main code porting; this PR synchronizes those changes with `pyspark.pandas`.
### Why are the changes needed?
We should port the whole Koalas codes into PySpark and synchronize them.
### Does this PR introduce _any_ user-facing change?
Fixed some incompatible behavior with pandas 1.2.0 and added more to the `to_markdown` docstring.
### How was this patch tested?
Manually tested locally.
Closes #32154 from itholic/SPARK-34995.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to rename Koalas to pandas-on-Spark in the main codes.
### Why are the changes needed?
To have the correct name in PySpark. NOTE that the official name in the main documentation will be pandas APIs on Spark to be extra clear. pandas-on-Spark is not the official term.
### Does this PR introduce _any_ user-facing change?
No, it's master-only change. It changes the docstring and class names.
### How was this patch tested?
Manually tested via:
```bash
./python/run-tests --python-executable=python3 --modules pyspark-pandas
```
Closes #32166 from HyukjinKwon/rename-koalas.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas miscellaneous unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable miscellaneous unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable miscellaneous unit tests.
Closes #32152 from xinrong-databricks/port.misc_tests.
Lead-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Co-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch adds `__version__` to `pyspark.__init__.__all__` to make `__version__` exported explicitly; see more in https://github.com/apache/spark/pull/32110#issuecomment-817331896.
### Why are the changes needed?
1. Make `__version__` exported explicitly.
2. Clean up `noqa: F401` on `__version__`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Python-related CI passed.
Closes#32125 from Yikun/SPARK-34629-Follow.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
### What changes were proposed in this pull request?
Removes PySpark-version-dependent code from the `pyspark.pandas` main code.
### Why are the changes needed?
There are several places that check the PySpark version and switch the logic, but those checks are no longer necessary.
We should remove them.
We will do the same thing after we finish porting tests.
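The removed code paths look roughly like the following hypothetical sketch (the actual branches vary per call site):
```python
# Hypothetical example of a PySpark-version switch that is no longer needed,
# since pyspark.pandas now always ships with the matching PySpark release.
from distutils.version import LooseVersion

import pyspark

if LooseVersion(pyspark.__version__) < LooseVersion("3.0"):
    ...  # legacy code path for older PySpark
else:
    ...  # current code path
```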
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32138 from ueshin/issues/SPARK-35039/pyspark_version.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas internal implementation unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the internal implementation unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable internal implementation unit tests.
Closes#32137 from xinrong-databricks/port.test_internal_impl.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas plot unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the plot unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable plot unit tests.
Closes#32151 from xinrong-databricks/port.plot_tests.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a newline to the `lineSep` parameter description so that the doc renders correctly.
### Why are the changes needed?
> <img width="608" alt="image" src="https://user-images.githubusercontent.com/8269566/114631408-5c608900-9c71-11eb-8ded-ae1e21ae48b2.png">
The first line of the description is part of the signature and is **bolded**.
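A hedged before/after sketch of the docstring layout (the exact wording is illustrative):
```python
# Before: the description continued on the same line as the parameter
# signature, so Sphinx rendered it as part of the bolded signature:
#
#     lineSep : str, optional defines the line separator that should
#         be used for writing.
#
# After: a line break separates the signature from the description:
#
#     lineSep : str, optional
#         defines the line separator that should be used for writing.
```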
### Does this PR introduce _any_ user-facing change?
Yes, it changes how the docs for `pyspark.sql.DataFrameWriter.json` are rendered.
### How was this patch tested?
I didn't test it; I don't have the doc rendering toolchain on my machine, but the change is obvious.
Closes#32153 from AlexMooney/patch-1.
Authored-by: Alex Mooney <alexmooney@fastmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame-related unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not fully tested. We should enable the DataFrame-related unit tests first.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable DataFrame-related unit tests.
Closes#32131 from xinrong-databricks/port.test_dataframe_related.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas Series related unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not fully tested. We should enable the Series related unit tests first.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable Series-related unit tests.
Closes#32117 from xinrong-databricks/port.test_series_related.
Lead-authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Co-authored-by: xinrong-databricks <47337188+xinrong-databricks@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas operations on different frames unit tests to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested fully. We should enable the operations on different frames unit tests.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable operations on different frames unit tests.
Closes#32133 from xinrong-databricks/port.test_ops_on_diff_frames.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix type hint mismatches in pyspark.sql.*
### Why are the changes needed?
There were some mismatches in pyspark.sql.*
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
dev/lint-python passed.
Closes#32122 from Yikun/SPARK-35019.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix:
```python
import pyspark.pandas as pp
```
to
```python
import pyspark.pandas as ps
```
### Why are the changes needed?
`pp` might sound offensive in some contexts.
### Does this PR introduce _any_ user-facing change?
The change is in master only. We'll use `ps` as the short name instead of `pp`.
### How was this patch tested?
The CI in this PR will test it out.
Closes#32108 from LSturtew/renaming_pyspark.pandas.
Authored-by: Luka Sturtewagen <luka.sturtewagen@linkit.nl>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of #32069.
Skips some doctests that could be flaky.
### Why are the changes needed?
Some doctests in the `pyspark.pandas` module enabled at #32069 could be flaky because the result row order is nondeterministic.
- groupby-apply with a UDF that has a return type annotation will lose its index.
- `Index.symmetric_difference` uses `DataFrame.intersect` and `subtract` internally.
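For context, skipping a flaky doctest uses the standard doctest directive; a minimal sketch (the example values are made up):
```python
>>> import pyspark.pandas as ps
>>> ps.Index([1, 2, 3]).symmetric_difference(ps.Index([3, 4, 5]))  # doctest: +SKIP
Int64Index([1, 2, 4, 5], dtype='int64')
```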
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32116 from ueshin/issues/SPARK-34972/fix_flaky_tests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds the type hint for `pyspark.__version__`, which was mentioned in [SPARK-34630](https://issues.apache.org/jira/browse/SPARK-34630).
### Why are the changes needed?
There was some short discussion in https://github.com/apache/spark/pull/31823#discussion_r593830911 .
After further investigation of [1][2], we can see that `pyspark.__version__` is added by [setup.py](c06758834e/python/setup.py (L201)); that embeds `__version__` into the pyspark module, which means `__init__.pyi` is the right place to add the type hint for `__version__`.
So this patch adds the type hint for `__version__` in `pyspark/__init__.pyi`.
[1] [PEP-396 Module Version Numbers](https://www.python.org/dev/peps/pep-0396/)
[2] https://packaging.python.org/guides/single-sourcing-package-version/
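The addition itself is a one-line stub entry; a sketch of the relevant part of `python/pyspark/__init__.pyi`:
```python
# Declare the attribute that setup.py injects at build time, so mypy
# knows it exists and what type it has.
__version__: str
```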
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
1. Disable `ignore_errors` on
ee7bf7d962/python/mypy.ini (L132)
2. Run mypy:
- Before fix
```shell
(venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark | grep version
python/pyspark/pandas/spark/accessors.py:884: error: Module has no attribute "__version__"
```
- After fix
```shell
(venv) ➜ spark git:(SPARK-34629) ✗ mypy --config-file python/mypy.ini python/pyspark | grep version
```
no output
Closes#32110 from Yikun/SPARK-34629.
Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should port the Koalas DataFrame unit test to PySpark.
### Why are the changes needed?
Currently, the pandas-on-Spark modules are not tested at all. We should enable the DataFrame unit test first.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enable the DataFrame unit test.
Closes#32083 from xinrong-databricks/port.test_dataframe.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Now that we merged the Koalas main code into the PySpark code base (#32036), we should enable doctests on Spark's infrastructure.
### Why are the changes needed?
Currently the pandas-on-Spark modules are not tested at all.
We should enable doctests first, and we will port other unit tests separately later.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Enabled the whole doctests.
Closes#32069 from ueshin/issues/SPARK-34972/pyspark-pandas_doctests.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As a first step of [SPARK-34849](https://issues.apache.org/jira/browse/SPARK-34849), this PR proposes porting the Koalas main code into PySpark.
This PR contains minimal changes to the existing Koalas code as follows:
1. `databricks.koalas` -> `pyspark.pandas`
2. `from databricks import koalas as ks` -> `from pyspark import pandas as pp`
3. `ks.xxx -> pp.xxx`
Other than them:
1. Added a line to `python/mypy.ini` in order to ignore the mypy test. See related issue at [SPARK-34941](https://issues.apache.org/jira/browse/SPARK-34941).
2. Added a comment to several lines in several files to ignore the flake8 F401. See related issue at [SPARK-34943](https://issues.apache.org/jira/browse/SPARK-34943).
When this PR is merged, all the features that were previously used in [Koalas](https://github.com/databricks/koalas) will be available in PySpark as well.
Users can access the pandas API in PySpark as below:
```python
>>> from pyspark import pandas as pp
>>> ppdf = pp.DataFrame({"A": [1, 2, 3], "B": [15, 20, 25]})
>>> ppdf
A B
0 1 15
1 2 20
2 3 25
```
The existing "options and settings" in Koalas are also available in the same way:
```python
>>> from pyspark.pandas.config import set_option, reset_option, get_option
>>> ppser1 = pp.Series([1, 2, 3])
>>> ppser2 = pp.Series([3, 4, 5])
>>> ppser1 + ppser2
Traceback (most recent call last):
...
ValueError: Cannot combine the series or dataframe because it comes from a different dataframe. In order to allow this operation, enable 'compute.ops_on_diff_frames' option.
>>> set_option("compute.ops_on_diff_frames", True)
>>> ppser1 + ppser2
0 4
1 6
2 8
dtype: int64
```
Please also refer to the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html) and [Options and Settings](https://koalas.readthedocs.io/en/latest/user_guide/options.html) for more detail.
**NOTE** that this PR intentionally ports the main code of Koalas first, almost as-is with minimal changes, because:
- The Koalas project is fairly large. Making other changes for PySpark at the same time would make the individual changes difficult to review.
The Koalas dev team includes multiple Spark committers who will review. By doing this, the committers will be able to review and drive the development more easily and effectively.
- The Koalas tests and documentation require major changes to fit well with PySpark, whereas the main code does not.
- We recently froze the Koalas codebase and plan to work together on the initial porting. Porting the main code first, almost as-is, unblocks the Koalas devs to work on other items in parallel.
I promise and will make sure to:
- Rename Koalas to PySpark pandas APIs and/or pandas-on-Spark accordingly in the documentation, and in the docstrings and comments of the main code.
- Triage and remove APIs that don't make sense once Koalas is in PySpark.
The documentation changes will be tracked in [SPARK-34885](https://issues.apache.org/jira/browse/SPARK-34885), the test code changes will be tracked in [SPARK-34886](https://issues.apache.org/jira/browse/SPARK-34886).
### Why are the changes needed?
Please refer to:
- [[DISCUSS] Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Support-pandas-API-layer-on-PySpark-td30945.html)
- [[VOTE] SPIP: Support pandas API layer on PySpark](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-SPIP-Support-pandas-API-layer-on-PySpark-td30996.html)
### Does this PR introduce _any_ user-facing change?
Yes, now users can use the pandas APIs on Spark
### How was this patch tested?
Manually tested for exposed major APIs and options as described above.
### Koalas contributors
Koalas would not have been possible without the following contributors:
ueshin
HyukjinKwon
rxin
xinrong-databricks
RainFung
charlesdong1991
harupy
floscha
beobest2
thunterdb
garawalid
LucasG0
shril
deepyaman
gioa
fwani
90jam
thoo
AbdealiJK
abishekganesh72
gliptak
DumbMachine
dvgodoy
stbof
nitlev
hjoo
gatorsmile
tomspur
icexelloss
awdavidson
guyao
akhilputhiry
scook12
patryk-oleniuk
tracek
dennyglee
athena15
gstaubli
WeichenXu123
hsubbaraj
lfdversluis
ktksq
shengjh
margaret-databricks
LSturtew
sllynn
manuzhang
jijosg
sadikovi
Closes#32036 from itholic/SPARK-34890.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR replaces non-ASCII characters with ASCII characters where possible in the PySpark documentation.
### Why are the changes needed?
To avoid unnecessarily using non-ASCII characters, which could lead to issues such as https://github.com/apache/spark/pull/32047 or https://github.com/apache/spark/pull/22782
### Does this PR introduce _any_ user-facing change?
Virtually no.
### How was this patch tested?
Found via (Mac OS):
```bash
# In Spark root directory
cd python
pcregrep --color='auto' -n "[\x80-\xFF]" `git ls-files .`
```
Closes#32048 from HyukjinKwon/minor-fix.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue where `quoteIfNeeded` quotes a name only if it contains `.` or ``` ` ```.
The method should quote a name whenever it contains non-word characters.
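The actual method is Scala, but the intended rule is easy to state; a rough Python rendering (names and regex are illustrative):
```python
import re

def quote_if_needed(part: str) -> str:
    # Quote the identifier whenever it contains a non-word character,
    # doubling any embedded backticks, rather than only checking for
    # "." and "`" as the buggy version did.
    if re.fullmatch(r"\w+", part):
        return part
    return "`" + part.replace("`", "``") + "`"
```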
### Why are the changes needed?
It's a potential bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#31964 from sarutak/fix-quoteIfNeeded.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR implements the missing typehints as per SPARK-34630.
### Why are the changes needed?
To satisfy the aforementioned Jira ticket
### Does this PR introduce _any_ user-facing change?
No, just adding a missing typehint for Project Zen
### How was this patch tested?
No tests needed (just adding a typehint)
Closes#31823 from dannymeijer/feature/SPARK-34630.
Authored-by: Danny Meijer <danny.meijer@nike.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
### What changes were proposed in this pull request?
Pass along the raised `ImportError` when failing to import pandas/pyarrow. This will help the user identify whether pandas/pyarrow is indeed missing from the environment or whether the import raised a different `ImportError`.
### Why are the changes needed?
This can already happen with pandas, for example, which can throw an `ImportError` on its initialization path if `dateutil` doesn't satisfy a certain version requirement: https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438
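The mechanics are ordinary exception chaining with `raise ... from`; a minimal sketch of the pattern (structure illustrative, version check omitted; the message text is modeled on the output below):
```python
def require_minimum_pyarrow_version():
    minimum_pyarrow_version = "1.0.0"
    try:
        import pyarrow  # noqa: F401
    except ImportError as error:
        # Chain the original error so the traceback shows the root cause.
        raise ImportError(
            "PyArrow >= %s must be installed; however, "
            "it was not found." % minimum_pyarrow_version
        ) from error
```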
### Does this PR introduce _any_ user-facing change?
Yes, it will now show the root cause of the exception when pandas or arrow is missing during import.
### How was this patch tested?
Manually tested.
```python
from pyspark.sql.functions import pandas_udf
spark.range(1).select(pandas_udf(lambda x: x))
```
Before:
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/...//spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
require_minimum_pyarrow_version()
File "/.../spark/python/pyspark/sql/pandas/utils.py", line 53, in require_minimum_pyarrow_version
raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```
After:
```
Traceback (most recent call last):
File "/.../spark/python/pyspark/sql/pandas/utils.py", line 49, in require_minimum_pyarrow_version
import pyarrow
ModuleNotFoundError: No module named 'pyarrow'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../spark/python/pyspark/sql/pandas/functions.py", line 332, in pandas_udf
require_minimum_pyarrow_version()
File "/.../spark/python/pyspark/sql/pandas/utils.py", line 55, in require_minimum_pyarrow_version
raise ImportError("PyArrow >= %s must be installed; however, "
ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.
```
Closes#31902 from johnhany97/jayad/spark-34803.
Lead-authored-by: John Ayad <johnhany97@gmail.com>
Co-authored-by: John H. Ayad <johnhany97@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Document `mode` as a supported Imputer strategy in the PySpark docs.
### Why are the changes needed?
Support was added in 3.1, and documented in Scala, but some Python docs were missed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#31883 from srowen/ImputerModeDocs.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue where the `sql` method in the following classes, which take qualified names, doesn't quote the qualified names properly.
* UnresolvedAttribute
* AttributeReference
* Alias
One instance caused by this issue is reported in SPARK-34626.
```
UnresolvedAttribute("a" :: "b" :: Nil).sql
`a.b` // expected: `a`.`b`
```
Other instances are as follows.
```
UnresolvedAttribute("a`b"::"c.d"::Nil).sql
a`b.`c.d` // expected: `a``b`.`c.d`
AttributeReference("a.b", IntegerType)(qualifier = "c.d"::Nil).sql
c.d.`a.b` // expected: `c.d`.`a.b`
Alias(AttributeReference("a", IntegerType)(), "b.c")(qualifier = "d.e"::Nil).sql
`a` AS d.e.`b.c` // expected: `a` AS `d.e`.`b.c`
```
### Why are the changes needed?
This is a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#31754 from sarutak/fix-qualified-names.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
pyrolite 4.21 introduced and enabled value comparison by default (`valueCompare=true`) during object memoization and serialization: https://github.com/irmen/Pyrolite/blob/pyrolite-4.21/java/src/main/java/net/razorvine/pickle/Pickler.java#L112-L122
This change has an undesired effect when we serialize a row (actually `GenericRowWithSchema`) to be passed to Python: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L60. A simple example is that
```
new GenericRowWithSchema(Array(1.0, 1.0), StructType(Seq(StructField("_1", DoubleType), StructField("_2", DoubleType))))
```
and
```
new GenericRowWithSchema(Array(1, 1), StructType(Seq(StructField("_1", IntegerType), StructField("_2", IntegerType))))
```
are currently considered equal, and the second instance is replaced with a short reference code to the first during serialization.
### Why are the changes needed?
The above can cause nasty issues like the one in https://issues.apache.org/jira/browse/SPARK-34545 description:
```
>>> from pyspark.sql.functions import udf
>>> from pyspark.sql.types import *
>>>
>>> def udf1(data_type):
def u1(e):
return e[0]
return udf(u1, data_type)
>>>
>>> df = spark.createDataFrame([((1.0, 1.0), (1, 1))], ['c1', 'c2'])
>>>
>>> df = df.withColumn("c3", udf1(DoubleType())("c1"))
>>> df = df.withColumn("c4", udf1(IntegerType())("c2"))
>>>
>>> df.select("c3").show()
+---+
| c3|
+---+
|1.0|
+---+
>>> df.select("c4").show()
+---+
| c4|
+---+
| 1|
+---+
>>> df.select("c3", "c4").show()
+---+----+
| c3| c4|
+---+----+
|1.0|null|
+---+----+
```
This is because, during serialization from the JVM to Python, `GenericRowWithSchema(1.0, 1.0)` (`c1`) is memoized first, and when `GenericRowWithSchema(1, 1)` (`c2`) comes next, it is replaced with a short reference code to `c1` (instead of being serialized out) because the two are `equal()`. The Python function then runs, but the return type of `c4` is expected to be `IntegerType`, so if a different type (`DoubleType`) comes back from Python it is discarded: https://github.com/apache/spark/blob/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala#L108-L113
After this PR:
```
>>> df.select("c3", "c4").show()
+---+---+
| c3| c4|
+---+---+
|1.0| 1|
+---+---+
```
### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Added new UT + manual tests.
Closes#31682 from peter-toth/SPARK-34545-fix-row-comparison.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix a call to setParams in the Linear Regression docs example in PySpark to avoid a TypeError.
### Why are the changes needed?
The example is slightly wrong and we should not show an error in the docs.
### Does this PR introduce _any_ user-facing change?
None
### How was this patch tested?
Existing tests
Closes#31760 from srowen/SPARK-34642.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fixes a Python UDF `plus_one` used in `GroupedAggPandasUDFTests` to always return float (double) values.
### Why are the changes needed?
The Python UDF `plus_one` used in `GroupedAggPandasUDFTests` always returns `v + 1` regardless of its type. The return type of the UDF is `'double'`, so if the input is an int, the result will be `null`.
```py
>>> df = spark.range(10).toDF('id') \
... .withColumn("vs", array([lit(i * 1.0) + col('id') for i in range(20, 30)])) \
... .withColumn("v", explode(col('vs'))) \
... .drop('vs') \
... .withColumn('w', lit(1.0))
>>> udf('double')
... def plus_one(v):
... assert isinstance(v, (int, float))
... return v + 1
...
>>> pandas_udf('double', PandasUDFType.GROUPED_AGG)
... def sum_udf(v):
... return v.sum()
...
>>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).show()
+------------+----------+
|plus_one(id)|sum_udf(v)|
+------------+----------+
| null| 2900.0|
+------------+----------+
```
This is meaningless and should be:
```py
>>> udf('double')
... def plus_one(v):
... assert isinstance(v, (int, float))
... return float(v + 1)
...
>>> df.groupby(plus_one(df.id)).agg(sum_udf(df.v)).sort('plus_one(id)').show()
+------------+----------+
|plus_one(id)|sum_udf(v)|
+------------+----------+
| 1.0| 245.0|
| 2.0| 255.0|
| 3.0| 265.0|
| 4.0| 275.0|
| 5.0| 285.0|
| 6.0| 295.0|
| 7.0| 305.0|
| 8.0| 315.0|
| 9.0| 325.0|
| 10.0| 335.0|
+------------+----------+
```
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Fixed the test.
Closes#31730 from ueshin/issues/SPARK-34610/test_pandas_udf_grouped_agg.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`TaskContextTestsWithWorkerReuse.test_task_context_correct_with_python_worker_reuse` can be flaky and fails sometimes:
```
======================================================================
ERROR [1.798s]: test_task_context_correct_with_python_worker_reuse (pyspark.tests.test_taskcontext.TaskContextTestsWithWorkerReuse)
...
test_task_context_correct_with_python_worker_reuse
self.assertTrue(pid in worker_pids)
AssertionError: False is not true
----------------------------------------------------------------------
```
I suspect that the Python worker was killed for whatever reason and a new attempt created a new Python worker.
This PR fixes the flakiness simply by retrying the test case.
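As a sketch of the approach (the decorator below is hypothetical, not necessarily the helper used in the test):
```python
import functools

def retry(times: int = 3):
    # Re-run a flaky test body a few times before letting the failure surface.
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(times):
                try:
                    return func(*args, **kwargs)
                except AssertionError:
                    if attempt == times - 1:
                        raise
        return wrapper
    return decorator
```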
### Why are the changes needed?
To make the tests more robust.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested it by controlling the conditions manually in the test codes.
Closes#31723 from HyukjinKwon/SPARK-34604.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### Why is this change being proposed?
This patch adds support for a new "product" aggregation function in `sql.functions`, which multiplies together all values in an aggregation group.
This is likely to be useful in statistical applications which involve combining probabilities, or financial applications that involve combining cumulative interest rates, but is also a versatile mathematical operation of similar status to `sum` or `stddev`. Other users [have noted](https://stackoverflow.com/questions/52991640/cumulative-product-in-spark) the absence of such a function in current releases of Spark.
This function is much more concise than an expression of the form `exp(sum(log(...)))`, avoids the awkward edge cases associated with zero or negative values, and is less computationally costly.
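As a quick comparison in PySpark (assuming a DataFrame `df` with a numeric column `x`):
```python
from pyspark.sql import functions as F

# Workaround: equivalent only for strictly positive values, and costlier.
df.agg(F.exp(F.sum(F.log("x"))))

# New aggregate added by this PR: handles zeros and negatives directly.
df.agg(F.product("x"))
```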
### Does this PR introduce _any_ user-facing change?
No, it only adds a new function.
### How was this patch tested?
Built-in tests have been added for the new `catalyst.expressions.aggregate.Product` class and its invocation via the (Scala) `sql.functions.product` function. The latter and the PySpark wrapper have also been manually tested in spark-shell and pyspark sessions. The SparkR wrapper is currently untested and may need separate validation (I'm not an "R" user myself).
An illustration of the new functionality within PySpark is as follows:
```
import pyspark.sql.functions as pf, pyspark.sql.window as pw
df = sqlContext.range(1, 17).toDF("x")
win = pw.Window.partitionBy(pf.lit(1)).orderBy(pf.col("x"))
df.withColumn("factorial", pf.product("x").over(win)).show(20, False)
+---+---------------+
|x |factorial |
+---+---------------+
|1 |1.0 |
|2 |2.0 |
|3 |6.0 |
|4 |24.0 |
|5 |120.0 |
|6 |720.0 |
|7 |5040.0 |
|8 |40320.0 |
|9 |362880.0 |
|10 |3628800.0 |
|11 |3.99168E7 |
|12 |4.790016E8 |
|13 |6.2270208E9 |
|14 |8.71782912E10 |
|15 |1.307674368E12 |
|16 |2.0922789888E13|
+---+---------------+
```
Closes#30745 from rwpenney/feature/agg-product.
Lead-authored-by: Richard Penney <rwp@rwpenney.uk>
Co-authored-by: Richard Penney <rwpenney@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Code in the PR generates random parameters for hyperparameter tuning. A discussion with Sean Owen can be found on the dev mailing list here:
http://apache-spark-developers-list.1001551.n3.nabble.com/Hyperparameter-Optimization-via-Randomization-td30629.html
All code is entirely my own work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
Randomization can be a more effective technique than a grid search, since minima/maxima can fall between the grid points and never be found. Randomization is not so restricted, although the probability of finding minima/maxima depends on the number of attempts.
Alice Zheng has an accessible description on how this technique works at https://www.oreilly.com/library/view/evaluating-machine-learning/9781492048756/ch04.html
Although there are Python libraries with more sophisticated techniques, not every Spark developer is using Python.
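To illustrate the idea in Python (a generic sketch, independent of the Scala `ParamRandomBuilder` API this PR adds):
```python
import math
import random

def sample_log_uniform(lo: float, hi: float) -> float:
    # Sampling in log space spreads attempts evenly across orders of
    # magnitude, which suits parameters like regularization strengths.
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

# Twenty random candidates instead of a fixed grid; optima lying between
# grid points now have a chance of being sampled.
reg_params = [sample_log_uniform(1e-4, 1e1) for _ in range(20)]
```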
### Does this PR introduce _any_ user-facing change?
A new class (`ParamRandomBuilder.scala`) and its tests have been created, but there is no change to existing code. This class offers an alternative to `ParamGridBuilder` and can be dropped into the code wherever `ParamGridBuilder` appears. Indeed, it extends `ParamGridBuilder` and is completely compatible with its interface. It merely adds one method that provides a range over which a hyperparameter will be randomly defined.
### How was this patch tested?
Tests `ParamRandomBuilderSuite.scala` and `RandomRangesSuite.scala` were added.
`ParamRandomBuilderSuite` is the analogue of the already existing `ParamGridBuilderSuite` which tests the user-facing interface.
`RandomRangesSuite` uses ScalaCheck to test the random ranges over which hyperparameters are distributed.
Closes#31535 from PhillHenry/ParamRandomBuilder.
Authored-by: Phillip Henry <PhillHenry@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Move the datetime rebase SQL configs out of the `legacy` namespace by:
1. Renaming the existing rebase configs, e.g. `spark.sql.legacy.parquet.datetimeRebaseModeInRead` -> `spark.sql.parquet.datetimeRebaseModeInRead`.
2. Adding the legacy configs as alternatives.
3. Deprecating the legacy rebase configs.
### Why are the changes needed?
The rebasing SQL configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` can be used not only for migration from previous Spark versions but also to read/write datetime columns saved by other systems/frameworks/libs. So, the configs shouldn't be considered legacy configs.
### Does this PR introduce _any_ user-facing change?
Should not. Users will see a warning if they still use one of the legacy configs.
### How was this patch tested?
1. Manually checking new configs:
```scala
scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res0: String = EXCEPTION
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY")
21/02/17 14:57:10 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead.
scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead")
res2: String = LEGACY
```
2. By running a datetime rebasing test suite:
```
$ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite"
```
Closes#31576 from MaxGekk/rebase-confs-alternatives.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to use `_create_udf` wherever we need to create a `UserDefinedFunction`, to make the code easier to maintain.
### Why are the changes needed?
For better code readability and maintainability.
### Does this PR introduce _any_ user-facing change?
No, refactoring.
### How was this patch tested?
Ran the existing unit tests. The CI in this PR should test it out too.
Closes#31537 from HyukjinKwon/SPARK-34408.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>